Re: [RFC] mm: remove swapcache page early
On 04/08/2013 09:48 AM, Minchan Kim wrote:
> [...]
>
> I believe you talented developers can catch it up by reading the code
> thoroughly and find more bonus knowledge. I think that's why our senior
> developers yell out RTFM, and I follow them.

What's the meaning of RTFM?

Cheers!
Re: [RFC] mm: remove swapcache page early
Hello Simon,

On Sun, Apr 07, 2013 at 03:26:12PM +0800, Simon Jeons wrote:
> Ping Minchan.
>
> On 04/02/2013 09:40 PM, Simon Jeons wrote:
>> Hi Hugh,
>>
>> On 03/28/2013 05:41 AM, Hugh Dickins wrote:
>>> On Wed, 27 Mar 2013, Minchan Kim wrote:
>>>
>>>> Swap subsystem does lazy swap slot free with expecting the page
>>>> would be swapped out again so we can't avoid unnecessary write.
>>>
>>>                           so we can avoid unnecessary write.
>>
>> If the page can be swapped out again, which code avoids the unnecessary
>> write? Could you point it out to me? Thanks in advance. ;-)

Look at shrink_page_list:

1) PageAnon(page) && !PageSwapCache()
2) add_to_swap's SetPageDirty
3) __remove_mapping

P.S. It seems you are misunderstanding. This isn't the proper place to ask
questions for your own understanding of the code. As far as I know, there
are projects (e.g. kernelnewbies) and books for studying and sharing
knowledge of the Linux kernel. I recommend Mel's "Understanding the Linux
Virtual Memory Manager". It's rather outdated, but it will be very helpful
for understanding the VM of the Linux kernel. You can get it freely, but I
hope you pay for it: if the author becomes a billionaire by having the
best-selling book on Amazon, he might print a second edition covering all
of the new VM features, which may resolve all of your curiosity. That would
be another way to contribute to an open source project. :)

I believe you talented developers can catch it up by reading the code
thoroughly and find more bonus knowledge. I think that's why our senior
developers yell out RTFM, and I follow them.

Cheers!

--
Kind regards,
Minchan Kim
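To make those three pointers concrete, the sketch below paraphrases the reclaim logic of that era. It is a heavily condensed illustration, not a verbatim copy of mm/vmscan.c, and the helper name shrink_page_list_sketch is invented for this note. The point is that add_to_swap() marks a freshly allocated swapcache page dirty so it is written exactly once, while a page whose swap slot already holds valid data stays clean on a later reclaim pass and is dropped by __remove_mapping() without another write.

```c
/* Condensed, paraphrased sketch of shrink_page_list() -- illustrative only. */
static void shrink_page_list_sketch(struct page *page,
				    struct list_head *page_list,
				    struct scan_control *sc)
{
	/* 1) Anonymous page not yet in swapcache: allocate a swap slot. */
	if (PageAnon(page) && !PageSwapCache(page)) {
		/*
		 * 2) add_to_swap() does SetPageDirty(page), so the newly
		 *    allocated slot is guaranteed to be written at least once.
		 */
		if (!add_to_swap(page, page_list))
			return;			/* out of swap: keep the page */
	}

	/* Only dirty pages are written; a clean swapcache page skips pageout(). */
	if (PageDirty(page))
		pageout(page, page_mapping(page), sc);

	/*
	 * 3) __remove_mapping() frees a clean swapcache page while leaving the
	 *    swap slot (and the copy it already points at) alone, so reclaiming
	 *    the same data again later needs no further write.
	 */
	__remove_mapping(page_mapping(page), page);
}
```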
Re: [RFC] mm: remove swapcache page early
Ping Minchan.

On 04/02/2013 09:40 PM, Simon Jeons wrote:
> Hi Hugh,
>
> [...]
Re: [RFC] mm: remove swapcache page early
Hi Hugh,

On 03/28/2013 05:41 AM, Hugh Dickins wrote:
> On Wed, 27 Mar 2013, Minchan Kim wrote:
>
>> Swap subsystem does lazy swap slot free with expecting the page
>> would be swapped out again so we can't avoid unnecessary write.
>
>                           so we can avoid unnecessary write.

If the page can be swapped out again, which code avoids the unnecessary
write? Could you point it out to me? Thanks in advance. ;-)

>> But the problem with in-memory swap is that it consumes memory space
>> until the vm_swap_full() condition (ie, half of all the swap device
>> used) is met. It could be bad if we use multiple swap devices (a small
>> in-memory swap plus a big storage swap) or in-memory swap alone.
>
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.
>
> And I guess swap readahead is utterly unhelpful in this case too.
>
>> This patch changes the vm_swap_full logic slightly so it could free
>> the swap slot early if the backing device is really fast. For it, I
>> used SWP_SOLIDSTATE, but it might be controversial.
>
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that it
> can be addressed by device, I disagree that it has anything to do
> with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The
> memory set aside may be wasted, but that is accepted upfront.
>
> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.
>
> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it?
>
> We are accustomed to using swap to free up memory by transferring
> its data to some other, cheaper but slower resource. But in the case
> of frontswap and zmem (I'll say that to avoid thinking through which
> backends are actually involved), it is not a cheaper and slower
> resource, but the very same memory we are trying to save: swap is
> stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).
>
> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.
>
> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).
>
> We cannot rely on freeing the swap itself, because in general there
> may be multiple references to the swap, and we only satisfy the one
> which has faulted. It may or may not be a good idea to use rmap to
> locate the other places to insert pte in place of swap entry, to
> resolve them all at once; but we have chosen not to do so in the
> past, and there's no need for that, if the zmem gets invalidated and
> the swapcache page set dirty.
>
> Hugh
>
>> So let me Cc Shaohua and Hugh.
>>
>> If it's a problem for SSD, I'd like to create a new type, SWP_INMEMORY
>> or something, for the z* family. The other problem is that zram is a
>> block device, so it can set SWP_INMEMORY or SWP_SOLIDSTATE easily
>> (ie, actually, zram already does), but I have no idea how to use that
>> for frontswap. Any idea?
>>
>> Another optimization point: we could remove it unconditionally when we
>> find it's exclusive at swap-in time. It could help the frontswap
>> family, too.
>>
>> What do you think about it?
>>
>> Cc: Hugh Dickins <hu...@google.com>
>> Cc: Dan Magenheimer <dan.magenhei...@oracle.com>
>> Cc: Seth Jennings <sjenn...@linux.vnet.ibm.com>
>> Cc: Nitin Gupta <ngu...@vflare.org>
>> Cc: Konrad Rzeszutek Wilk <kon...@darnok.org>
>> Cc: Shaohua Li <s...@kernel.org>
>> Signed-off-by: Minchan Kim <minc...@kernel.org>
>> ---
>>  include/linux/swap.h | 11 ---
>>  mm/memory.c          |  3 ++-
>>  mm/swapfile.c        | 11 +++
>>  mm/vmscan.c          |  2 +-
>>  4 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 2818a12..1f4df66 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
>>  extern atomic_long_t nr_swap_pages;
>>  extern long total_swap_pages;
>>
>> -/* Swap 50% full? Release swapcache more aggressively.. */
>> -static inline bool vm_swap_full(void)
>> +/*
>> + * Swap 50% full or fast backed device?
>> + * Release swapcache more aggressively.
>> + */
>> +static inline bool vm_swap_full(struct swap_info_struct *si)
>>  {
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +
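The archived copy of the hunk above is cut off mid-function. Going only by the patch description ("free the swap slot early if the backing device is really fast") and the unmodified 50%-full test that the removed lines contained, the completed helper was presumably something close to the reconstruction below; it is a guess for readability, not the hunk that was actually posted.

```c
/* Reconstruction of the proposed helper; the posted hunk is truncated above. */
static inline bool vm_swap_full(struct swap_info_struct *si)
{
	/* Fast backing device: release swapcache duplicates aggressively. */
	if (si->flags & SWP_SOLIDSTATE)
		return true;

	/* Otherwise keep the historical "swap more than 50% full" test. */
	return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
}
```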
Re: [RFC] mm: remove swapcache page early
On Mon, Apr 01, 2013 at 10:13:58PM -0700, Hugh Dickins wrote:
> [...]
>
> Looking at it again, I do believe you and Dan are perfectly correct,
> and I was again the confused one. Though I'd be happier if I could
> see just how I was misreading it: makes me wonder if I had a great
> insight that I can no longer grasp hold of! I think I was paranoid
> about a swp_entry_t getting recycled prematurely: but swap_entry_free
> remains in control of that - freeing a swap entry is no part of what
> notify_free gets up to. Sorry for wasting your time.

Hey Hugh,

Please don't apologize. It gave me a chance to look into that part in
detail, so it never wasted my time. And your deep insight and kind
advice always make everybody happier.

Looking forward to seeing you soon at LSF/MM.

Thanks!

--
Kind regards,
Minchan Kim
Re: [RFC] mm: remove swapcache page early
On Tue, 2 Apr 2013, Minchan Kim wrote:
> [...]
>
> But the current implementation of zram_slot_free_notify could cover
> both cases properly, with luck.
>
> zram_free_page caused by end_swap_bio_read will free the compressed
> copy of the page, and zram_free_page caused by swap_entry_free later
> won't find the right index in zram->table and will just return.
> So I think there is no problem.
>
> The remaining problem is zram->stats.notify_free, which could be
> counted redundantly, but I'm not sure it's valuable to count it
> exactly.
>
> If I miss your point, please pinpoint your concern. :)

Looking at it again, I do believe you and Dan are perfectly correct,
and I was again the confused one. Though I'd be happier if I could
see just how I was misreading it: makes me wonder if I had a great
insight that I can no longer grasp hold of! I think I was paranoid
about a swp_entry_t getting recycled prematurely: but swap_entry_free
remains in control of that - freeing a swap entry is no part of what
notify_free gets up to. Sorry for wasting your time.

Hugh
Re: [RFC] mm: remove swapcache page early
Hi Hugh,

On Fri, Mar 29, 2013 at 01:01:14PM -0700, Hugh Dickins wrote:
> On Fri, 29 Mar 2013, Minchan Kim wrote:
>> On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
>>> [...]
>>>
>>> I wonder if something like this would have a similar result for zram?
>>> (Completely untested... snippet stolen from swap_entry_free with
>>> SetPageDirty added... doesn't compile yet, but should give you the
>>> idea.)
>
> Be careful, although Dan is right that something like this can be
> done for zram, I believe you will find that it needs a little more:
> either a separate new entry point (not my preference) or a flags arg
> (or boolean) added to swap_slot_free_notify.
>
> Because this is a different operation: end_swap_bio_read() wants
> to free up zram's compressed copy of the page, but the swp_entry_t
> must remain valid until swap_entry_free() can clear up the rest.
> Precisely how much of the work each should do, you will discover.

First of all, thanks for pointing it out to me!

If I parse your concern correctly, you are worried about the different
semantics of the two callers (end_swap_bio_read's swap_slot_free_notify
vs. swap_entry_free's).

But the current implementation of zram_slot_free_notify could cover
both cases properly, with luck.

zram_free_page caused by end_swap_bio_read will free the compressed
copy of the page, and zram_free_page caused by swap_entry_free later
won't find the right index in zram->table and will just return.
So I think there is no problem.

The remaining problem is zram->stats.notify_free, which could be
counted redundantly, but I'm not sure it's valuable to count it
exactly.

If I miss your point, please pinpoint your concern. :)

Thanks!

--
Kind regards,
Minchan Kim
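To make the double-call argument easier to follow, here is a loose sketch of the zram side as it looked in that era. It is a simplified paraphrase of the drivers/staging/zram code, not a verbatim copy, and the field names are approximate rather than checked against a specific tree. The property Minchan relies on is simply that zram_free_page() is a no-op once the slot's handle has already been cleared.

```c
/* Loose paraphrase of the era's zram code -- field names approximate. */
static void zram_free_page(struct zram *zram, size_t index)
{
	unsigned long handle = zram->table[index].handle;

	/* Slot already freed (or never written): a second call is a no-op. */
	if (!handle)
		return;

	zs_free(zram->mem_pool, handle);
	zram->table[index].handle = 0;
}

static void zram_slot_free_notify(struct block_device *bdev,
				  unsigned long index)
{
	struct zram *zram = bdev->bd_disk->private_data;

	zram_free_page(zram, index);
	zram->stats.notify_free++;	/* may be bumped twice, as noted above */
}
```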
Re: [RFC] mm: remove swapcache page early
On Fri, 29 Mar 2013, Minchan Kim wrote:
> On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
>> [...]
>>
>> I wonder if something like this would have a similar result for zram?
>> (Completely untested... snippet stolen from swap_entry_free with
>> SetPageDirty added... doesn't compile yet, but should give you the
>> idea.)

Thanks for correcting me on zram (in earlier mail of this thread), yes,
I was forgetting about the swap_slot_free_notify entry point which lets
that memory be freed.

> Nice idea!
>
> After I saw your patch, I realized it was Hugh's suggestion and you
> implemented it in the proper place.
>
> Will resend it after testing. Maybe next week.
> Thanks!

Be careful, although Dan is right that something like this can be
done for zram, I believe you will find that it needs a little more:
either a separate new entry point (not my preference) or a flags arg
(or boolean) added to swap_slot_free_notify.

Because this is a different operation: end_swap_bio_read() wants
to free up zram's compressed copy of the page, but the swp_entry_t
must remain valid until swap_entry_free() can clear up the rest.
Precisely how much of the work each should do, you will discover.

Hugh
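For illustration only, one shape Hugh's "flags arg (or boolean)" suggestion could take is sketched below. The parameter name is invented here and this was never merged; later in the thread (his "Looking at it again" reply, shown earlier in this archive) Hugh agrees that the existing single entry point is in fact sufficient for zram.

```c
/*
 * Hypothetical sketch of the suggested boolean argument (name invented
 * here); never merged.  The swap_slot_free_notify member of
 * struct block_device_operations would gain a third parameter:
 */
void (*swap_slot_free_notify)(struct block_device *bdev,
			      unsigned long offset,
			      bool entry_still_in_use);

/* end_swap_bio_read(): only the backend's compressed copy should go. */
disk->fops->swap_slot_free_notify(sis->bdev, offset, true);

/* swap_entry_free(): the swp_entry_t itself is being freed. */
disk->fops->swap_slot_free_notify(sis->bdev, offset, false);
```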
Re: [RFC] mm: remove swapcache page early
On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
>> From: Minchan Kim [mailto:minc...@kernel.org]
>> [...]
>>
>> I am blind on zcache so I didn't see it. Anyway, I'd like to address
>> it on zram and zswap.
>
> Zswap can enable it trivially by adding a function call in init_zswap.
> (Note that it is not enabled by default for all frontswap backends
> because it is another complicated tradeoff of cpu time vs memory space
> that needs more study on a broad set of workloads.)
>
> I wonder if something like this would have a similar result for zram?
> (Completely untested... snippet stolen from swap_entry_free with
> SetPageDirty added... doesn't compile yet, but should give you the
> idea.)
>
> [...]

Nice idea!

After I saw your patch, I realized it was Hugh's suggestion and you
implemented it in the proper place.

Will resend it after testing. Maybe next week.
Thanks!

--
Kind regards,
Minchan Kim
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> Hi Dan,
>
> On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
>>> From: Hugh Dickins [mailto:hu...@google.com]
>>> Subject: Re: [RFC] mm: remove swapcache page early
>>>
>>> I believe the answer is for frontswap/zmem to invalidate the frontswap
>>> copy of the page (to free up the compressed memory when possible) and
>>> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
>>> (setting page dirty so nothing will later go to read it from the
>>> unfreed location on backing swap disk, which was never written).
>>
>> There are two duplication issues: (1) When can the page be removed
>> from the swap cache after a call to frontswap_store; and (2) When
>> can the page be removed from the frontswap storage after it
>> has been brought back into memory via frontswap_load.
>>
>> This patch from Minchan addresses (1). The issue you are raising
>
> No. I am addressing (2).
>
>> here is (2). You may not know that (2) has recently been solved
>> in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
>> If this is enabled (and it is for zcache but not yet for zswap),
>> what you suggest (SetPageDirty) is what happens.
>
> I am blind on zcache so I didn't see it. Anyway, I'd like to address
> it on zram and zswap.

Zswap can enable it trivially by adding a function call in init_zswap.
(Note that it is not enabled by default for all frontswap backends
because it is another complicated tradeoff of cpu time vs memory space
that needs more study on a broad set of workloads.)

I wonder if something like this would have a similar result for zram?
(Completely untested... snippet stolen from swap_entry_free with
SetPageDirty added... doesn't compile yet, but should give you the idea.)

diff --git a/mm/page_io.c b/mm/page_io.c
index 56276fe..2d10988 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -81,7 +81,17 @@ void end_swap_bio_read(struct bio *bio, int err)
 			iminor(bio->bi_bdev->bd_inode),
 			(unsigned long long)bio->bi_sector);
 	} else {
+		struct swap_info_struct *sis;
+
 		SetPageUptodate(page);
+		sis = page_swap_info(page);
+		if (sis->flags & SWP_BLKDEV) {
+			struct gendisk *disk = sis->bdev->bd_disk;
+			if (disk->fops->swap_slot_free_notify) {
+				SetPageDirty(page);
+				disk->fops->swap_slot_free_notify(sis->bdev,
+								  offset);
+			}
+		}
 	}
 	unlock_page(page);
 	bio_put(bio);
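Two follow-ups on the above, both illustrative sketches rather than anything posted in this thread. The zswap enablement Dan mentions would be a single call in init_zswap() to the frontswap setter behind frontswap_exclusive_gets_enabled. And a variant of Dan's zram snippet that would at least compile might recover the missing `offset` from page_private(page), the way the swap code does elsewhere; the body below is untested and is not the change that was eventually merged.

```c
/* Untested sketch only -- not the patch that was eventually merged. */
void end_swap_bio_read(struct bio *bio, int err)
{
	struct page *page = bio->bi_io_vec[0].bv_page;

	if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
		SetPageError(page);
		ClearPageUptodate(page);
	} else {
		struct swap_info_struct *sis = page_swap_info(page);

		SetPageUptodate(page);
		if (sis->flags & SWP_BLKDEV) {
			struct gendisk *disk = sis->bdev->bd_disk;

			if (disk->fops->swap_slot_free_notify) {
				/* A swapcache page keeps its swp_entry_t
				 * in page_private(page). */
				swp_entry_t entry = { .val = page_private(page) };

				SetPageDirty(page);
				disk->fops->swap_slot_free_notify(sis->bdev,
							swp_offset(entry));
			}
		}
	}
	unlock_page(page);
	bio_put(bio);
}
```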
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com] > Subject: RE: [RFC] mm: remove swapcache page early > > On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > > From: Hugh Dickins [mailto:hu...@google.com] > > > Subject: Re: [RFC] mm: remove swapcache page early > > > > > The issue you are raising > > here is (2). You may not know that (2) has recently been solved > > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled. > > If this is enabled (and it is for zcache but not yet for zswap), > > what you suggest (SetPageDirty) is what happens. > > Ah, and I have a dim, perhaps mistaken, memory that I gave you > input on that before, suggesting the SetPageDirty. Good, sounds > like the solution is already in place, if not actually activated. > > Thanks, must dash, > Hugh Hi Hugh -- Credit where it is due... Yes, I do recall now that the idea was originally yours. It went on a to-do list where I eventually tried it and it worked... I'm sorry I had forgotten and neglected to give you credit! (BTW, it is activated for zcache in 3.9.) Thanks, Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] mm: remove swapcache page early
On Thu, Mar 28, 2013 at 10:18:24AM +0900, Minchan Kim wrote: > On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: > > On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > > > From: Hugh Dickins [mailto:hu...@google.com] > > > > Subject: Re: [RFC] mm: remove swapcache page early > > > > > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > > > would be swapped out again so we can't avoid unnecessary write. > > > > so we can avoid unnecessary write. > > > > > > > > > > But the problem in in-memory swap is that it consumes memory space > > > > > until vm_swap_full(ie, used half of all of swap device) condition > > > > > meet. It could be bad if we use multiple swap device, small in-memory > > > > > swap > > > > > and big storage swap or in-memory swap alone. > > > > > > > > That is a very good realization: it's surprising that none of us > > > > thought of it before - no disrespect to you, well done, thank you. > > > > > > Yes, my compliments also Minchan. This problem has been thought of before > > > but this patch is the first to identify a possible solution. > > > > > > > And I guess swap readahead is utterly unhelpful in this case too. > > > > > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > > > think this is not done in the swap subsystem but instead the kernel > > > assumes write-coalescing will be done in the block I/O subsystem, > > > which means swap writeahead would affect zram but not zcache/zswap > > > (since frontswap subverts the block I/O subsystem). > > > > I don't know what swap writeahead is; but write coalescing, yes. > > I don't see any problem with it in this context. > > > > > > > > However I think a swap-readahead solution would be helpful to > > > zram as well as zcache/zswap. > > > > Whereas swap readahead on zmem is uncompressing zmem to pagecache > > which may never be needed, and may take a circuit of the inactive > > LRU before it gets reclaimed (if it turns out not to be needed, > > at least it will remain clean and be easily reclaimed). > > But it could evict more important pages before reaching out the tail. > That's thing we really want to avoid if possible. > > > > > > > > > > > This patch changes vm_swap_full logic slightly so it could free > > > > > swap slot early if the backed device is really fast. > > > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > > > > > But I strongly disagree with almost everything in your patch :) > > > > I disagree with addressing it in vm_swap_full(), I disagree that > > > > it can be addressed by device, I disagree that it has anything to > > > > do with SWP_SOLIDSTATE. > > > > > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > > > is it? In those cases, a fixed amount of memory has been set aside > > > > for swap, and it works out just like with disk block devices. The > > > > memory set aside may be wasted, but that is accepted upfront. > > > > > > It is (I believe) also a problem with swapping to ram. Two > > > copies of the same page are kept in memory in different places, > > > right? Fixed vs variable size is irrelevant I think. Or am > > > I misunderstanding something about swap-to-ram? > > > > I may be misrembering how /dev/ram0 works, or simply assuming that > > if you want to use it for swap (interesting for testing, but probably > > not for general use), then you make sure to allocate each page of it > > in advance. 
> > > > The pages of /dev/ram0 don't get freed, or not before it's closed > > (swapoff'ed) anyway. Yes, swapcache would be duplicating data from > > other memory into /dev/ram0 memory; but that /dev/ram0 memory has > > been set aside for this purpose, and removing from swapcache won't > > free any more memory. > > > > > > > > > Similarly, this is not a problem with swapping to SSD. There might > > > > or might not be other reasons for adjusting the vm_swap_full() logic > > > > for SSD or generally, but those have nothing to do with this issue. > > > > > > I think it is at least highly related. The key is
Re: [RFC] mm: remove swapcache page early
Hi Seth, On Wed, Mar 27, 2013 at 12:19:11PM -0500, Seth Jennings wrote: > On 03/26/2013 09:22 PM, Minchan Kim wrote: > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > Great idea! Thanks! > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > The comment for SWP_SOLIDSTATE is that "blkdev seeks are cheap". Just > because seeks are cheap doesn't mean the read itself is also cheap. The "read" isn't the concern here; the "write" is. > For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of > them can be pretty slow. Yeb. > > > So let's add Ccing Shaohua and Hugh. > > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > > or something for z* family. > > Afaict, setting SWP_SOLIDSTATE depends on characteristics of the > underlying block device (i.e. blk_queue_nonrot()). zram is a block > device but zcache and zswap are not. > > Any idea by what criteria SWP_INMEMORY would be set? Just in-memory swap: zram, zswap and zcache at the moment. :) > > Also, frontswap backends (zcache and zswap) are a caching layer on top > of the real swap device, which might actually be rotating media. So > you have the issue of two different characteristics, in-memory caching > on top of rotating media, present in a single swap device. Please read my patch completely. I already pointed out that problem, and Hugh and Dan are suggesting ideas. Thanks! > > Thanks, > Seth -- Kind regards, Minchan Kim
Re: [RFC] mm: remove swapcache page early
On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: > On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > > From: Hugh Dickins [mailto:hu...@google.com] > > > Subject: Re: [RFC] mm: remove swapcache page early > > > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > > would be swapped out again so we can't avoid unnecessary write. > > > so we can avoid unnecessary write. > > > > > > > > But the problem in in-memory swap is that it consumes memory space > > > > until vm_swap_full(ie, used half of all of swap device) condition > > > > meet. It could be bad if we use multiple swap device, small in-memory > > > > swap > > > > and big storage swap or in-memory swap alone. > > > > > > That is a very good realization: it's surprising that none of us > > > thought of it before - no disrespect to you, well done, thank you. > > > > Yes, my compliments also Minchan. This problem has been thought of before > > but this patch is the first to identify a possible solution. > > > > > And I guess swap readahead is utterly unhelpful in this case too. > > > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > > think this is not done in the swap subsystem but instead the kernel > > assumes write-coalescing will be done in the block I/O subsystem, > > which means swap writeahead would affect zram but not zcache/zswap > > (since frontswap subverts the block I/O subsystem). > > I don't know what swap writeahead is; but write coalescing, yes. > I don't see any problem with it in this context. > > > > > However I think a swap-readahead solution would be helpful to > > zram as well as zcache/zswap. > > Whereas swap readahead on zmem is uncompressing zmem to pagecache > which may never be needed, and may take a circuit of the inactive > LRU before it gets reclaimed (if it turns out not to be needed, > at least it will remain clean and be easily reclaimed). But it could evict more important pages before reaching out the tail. That's thing we really want to avoid if possible. > > > > > > > This patch changes vm_swap_full logic slightly so it could free > > > > swap slot early if the backed device is really fast. > > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > > > But I strongly disagree with almost everything in your patch :) > > > I disagree with addressing it in vm_swap_full(), I disagree that > > > it can be addressed by device, I disagree that it has anything to > > > do with SWP_SOLIDSTATE. > > > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > > is it? In those cases, a fixed amount of memory has been set aside > > > for swap, and it works out just like with disk block devices. The > > > memory set aside may be wasted, but that is accepted upfront. > > > > It is (I believe) also a problem with swapping to ram. Two > > copies of the same page are kept in memory in different places, > > right? Fixed vs variable size is irrelevant I think. Or am > > I misunderstanding something about swap-to-ram? > > I may be misrembering how /dev/ram0 works, or simply assuming that > if you want to use it for swap (interesting for testing, but probably > not for general use), then you make sure to allocate each page of it > in advance. > > The pages of /dev/ram0 don't get freed, or not before it's closed > (swapoff'ed) anyway. 
Yes, swapcache would be duplicating data from > other memory into /dev/ram0 memory; but that /dev/ram0 memory has > been set aside for this purpose, and removing from swapcache won't > free any more memory. > > > > > > Similarly, this is not a problem with swapping to SSD. There might > > > or might not be other reasons for adjusting the vm_swap_full() logic > > > for SSD or generally, but those have nothing to do with this issue. > > > > I think it is at least highly related. The key issue is the > > tradeoff of the likelihood that the page will soon be read/written > > again while it is in swap cache vs the time/resource-usage necessary > > to "reconstitute" the page into swap cache. Reconstituting from disk > > requires a LOT of elapsed time. Reconstituting from > > an SSD likely takes much less time. Reconstituting from > > zcache/zram takes thousands of CPU cycles. > > I acknowledge my complete ignorance of
Re: [RFC] mm: remove swapcache page early
Hi Dan, On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote: > > From: Hugh Dickins [mailto:hu...@google.com] > > Subject: Re: [RFC] mm: remove swapcache page early > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > would be swapped out again so we can't avoid unnecessary write. > > so we can avoid unnecessary write. > > > > > > But the problem in in-memory swap is that it consumes memory space > > > until vm_swap_full(ie, used half of all of swap device) condition > > > meet. It could be bad if we use multiple swap device, small in-memory swap > > > and big storage swap or in-memory swap alone. > > > > That is a very good realization: it's surprising that none of us > > thought of it before - no disrespect to you, well done, thank you. > > Yes, my compliments also Minchan. This problem has been thought of before > but this patch is the first to identify a possible solution. Thanks! > > > And I guess swap readahead is utterly unhelpful in this case too. > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > think this is not done in the swap subsystem but instead the kernel > assumes write-coalescing will be done in the block I/O subsystem, > which means swap writeahead would affect zram but not zcache/zswap > (since frontswap subverts the block I/O subsystem). Frankly speaking, I don't know why you mentioned "swap writeahead" in this point. Anyway, I dobut how it effect zram, too. A gain I can have a mind is compress ratio would be high thorough multiple page compression all at once. > > However I think a swap-readahead solution would be helpful to > zram as well as zcache/zswap. Hmm, why? swap-readahead is just hint to reduce big stall time to reduce on big seek overhead storage. But in-memory swap is no cost for seeking. So unnecessary swap-readahead can make memory pressure high and it could cause another page swap out so it could be swap-thrashing. And for good swap-readahead hit ratio, swap device shouldn't be fragmented. But as you know, there are many factor to prevent it in the kernel now and Shaohua is tackling on it. > > > > This patch changes vm_swap_full logic slightly so it could free > > > swap slot early if the backed device is really fast. > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > But I strongly disagree with almost everything in your patch :) > > I disagree with addressing it in vm_swap_full(), I disagree that > > it can be addressed by device, I disagree that it has anything to > > do with SWP_SOLIDSTATE. > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > is it? In those cases, a fixed amount of memory has been set aside > > for swap, and it works out just like with disk block devices. The > > memory set aside may be wasted, but that is accepted upfront. > > It is (I believe) also a problem with swapping to ram. Two > copies of the same page are kept in memory in different places, > right? Fixed vs variable size is irrelevant I think. Or am > I misunderstanding something about swap-to-ram? > > > Similarly, this is not a problem with swapping to SSD. There might > > or might not be other reasons for adjusting the vm_swap_full() logic > > for SSD or generally, but those have nothing to do with this issue. > > I think it is at least highly related. 
The key issue is the > tradeoff of the likelihood that the page will soon be read/written > again while it is in swap cache vs the time/resource-usage necessary > to "reconstitute" the page into swap cache. Reconstituting from disk > requires a LOT of elapsed time. Reconstituting from > an SSD likely takes much less time. Reconstituting from > zcache/zram takes thousands of CPU cycles. Yeb. That's why I wanted to use SWP_SOLIDSTATE. > > > The problem here is peculiar to frontswap, and the variably sized > > memory behind it, isn't it? We are accustomed to using swap to free > > up memory by transferring its data to some other, cheaper but slower > > resource. > > Frontswap does make the problem more complex because some pages > are in "fairly fast" storage (zcache, needs decompression) and > some are on the actual (usually) rotating media. Fortunately, > differentiating between these two cases is just a table lookup > (see frontswap_test). Yeb, I thouht it could be a last resort because I'd like to avoid lookup every swapin if possible. > > > But in the case of frontswap and zmem (I'll say that to avoid thinkin
Re: [RFC] mm: remove swapcache page early
Hi Hugh, On Wed, Mar 27, 2013 at 02:41:07PM -0700, Hugh Dickins wrote: > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > so we can avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > That is a very good realization: it's surprising that none of us > thought of it before - no disrespect to you, well done, thank you. > > And I guess swap readahead is utterly unhelpful in this case too. > > > > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > But I strongly disagree with almost everything in your patch :) > I disagree with addressing it in vm_swap_full(), I disagree that > it can be addressed by device, I disagree that it has anything to > do with SWP_SOLIDSTATE. > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > is it? In those cases, a fixed amount of memory has been set aside > for swap, and it works out just like with disk block devices. The Brd is okay but it seems you are miunderstanding zram. The zram doesn't reserve any memory and allocate dynamic memory when swap out happens so it can make duplicate space in pusdo block device and memory. > memory set aside may be wasted, but that is accepted upfront. > > Similarly, this is not a problem with swapping to SSD. There might > or might not be other reasons for adjusting the vm_swap_full() logic > for SSD or generally, but those have nothing to do with this issue. Yes. > > The problem here is peculiar to frontswap, and the variably sized > memory behind it, isn't it? We are accustomed to using swap to free Zram, too. > up memory by transferring its data to some other, cheaper but slower > resource. > > But in the case of frontswap and zmem (I'll say that to avoid thinking Frankly speaking, I couldn't understand what you means, frontswap and zmem. The frontswap is just layer for hook the swap subsystem. Real instance of frontswap is zcache and zswap at the moment. I will understand them as zcache and zswap. Okay? > through which backends are actually involved), it is not a cheaper and > slower resource, but the very same memory we are trying to save: swap > is stolen from the memory under reclaim, so any duplication becomes > counter-productive (if we ignore cpu compression/decompression costs: > I have no idea how fair it is to do so, but anyone who chooses zmem > is prepared to pay some cpu price for that). Agree. > > And because it's a frontswap thing, we cannot decide this by device: > frontswap may or may not stand in front of each device. There is no > problem with swapcache duplicated on disk (until that area approaches > being full or fragmented), but at the higher level we cannot see what > is in zmem and what is on disk: we only want to free up the zmem dup. That's what I really have a concern and why I begged idea. 
> > I believe the answer is for frontswap/zmem to invalidate the frontswap > copy of the page (to free up the compressed memory when possible) and > SetPageDirty on the PageUptodate PageSwapCache page when swapping in > (setting page dirty so nothing will later go to read it from the > unfreed location on backing swap disk, which was never written). You mean that zcache and zswap have to do garbage collection by some policy? It could be, but how about zram? It's just a pseudo block device and it doesn't have any knowledge of what sits on top of it. It could be swap or a normal block device. I mean, zram has no information about swap with which to handle it. > > We cannot rely on freeing the swap itself, because in general there > may be multiple references to the swap, and we only satisfy the one > which has faulted. It may or may not be a good idea to use rmap to > locate the other places to insert pte in place of swap entry, to > resolve them all at once; but we have chosen not to do so in the > past, and there's no need for that, if the zmem gets invalidated > and the swapcache page set dirty. Yes, it could be better, but as I mentioned above, it couldn't handle the zram case. If there is a solution for zram, I will be happy. :) And another point: frontswap is already percolated into the swap subsystem very tightly, so I doubt adding one more hook is really a problem. Thanks for the great comment, Hugh! > > Hugh > > > So let's add Ccing Shaohua and Hugh. > > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > > or something for z* family. > > > > Other problem is zram is block device so that it can set SWP_INMEMORY > > or SWP_SOLIDSTATE easily(ie,
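For context on the hook being discussed: swap_slot_free_notify() in block_device_operations is the one narrow channel through which the swap layer already tells a block driver that a slot has been freed, and zram implements it to drop its compressed copy; it is the same hook Dan's page_io.c snippet above calls on swap-in. Roughly, paraphrased from the staging-era driver with locking and statistics elided (a sketch, not the literal source):

	static void zram_slot_free_notify(struct block_device *bdev,
					  unsigned long index)
	{
		struct zram *zram = bdev->bd_disk->private_data;

		/* Swap says slot 'index' is unused: free the compressed page. */
		zram_free_page(zram, index);
	}

This is why the end_swap_bio_read() approach can cover zram without zram needing any further knowledge of what sits above it.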
RE: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > From: Hugh Dickins [mailto:hu...@google.com] > > Subject: Re: [RFC] mm: remove swapcache page early > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > would be swapped out again so we can't avoid unnecessary write. > > so we can avoid unnecessary write. > > > > > > But the problem in in-memory swap is that it consumes memory space > > > until vm_swap_full(ie, used half of all of swap device) condition > > > meet. It could be bad if we use multiple swap device, small in-memory swap > > > and big storage swap or in-memory swap alone. > > > > That is a very good realization: it's surprising that none of us > > thought of it before - no disrespect to you, well done, thank you. > > Yes, my compliments also Minchan. This problem has been thought of before > but this patch is the first to identify a possible solution. > > > And I guess swap readahead is utterly unhelpful in this case too. > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > think this is not done in the swap subsystem but instead the kernel > assumes write-coalescing will be done in the block I/O subsystem, > which means swap writeahead would affect zram but not zcache/zswap > (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. > > However I think a swap-readahead solution would be helpful to > zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). > > > > This patch changes vm_swap_full logic slightly so it could free > > > swap slot early if the backed device is really fast. > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > But I strongly disagree with almost everything in your patch :) > > I disagree with addressing it in vm_swap_full(), I disagree that > > it can be addressed by device, I disagree that it has anything to > > do with SWP_SOLIDSTATE. > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > is it? In those cases, a fixed amount of memory has been set aside > > for swap, and it works out just like with disk block devices. The > > memory set aside may be wasted, but that is accepted upfront. > > It is (I believe) also a problem with swapping to ram. Two > copies of the same page are kept in memory in different places, > right? Fixed vs variable size is irrelevant I think. Or am > I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. > > > Similarly, this is not a problem with swapping to SSD. There might > > or might not be other reasons for adjusting the vm_swap_full() logic > > for SSD or generally, but those have nothing to do with this issue. > > I think it is at least highly related. 
The key issue is the > tradeoff of the likelihood that the page will soon be read/written > again while it is in swap cache vs the time/resource-usage necessary > to "reconstitute" the page into swap cache. Reconstituting from disk > requires a LOT of elapsed time. Reconstituting from > an SSD likely takes much less time. Reconstituting from > zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. > > > The problem here is peculiar to frontswap, and the variably sized > > memory behind it, isn't it? We are accustomed to using swap to free > > up memory by transferring its data to some other, cheaper but slower > > resource. > > Frontswap does make the problem more complex because some pages > are in "fairly fast" storage (zcache, needs decompression) and > some are on the actual (usually) rotating med
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com] > Subject: Re: [RFC] mm: remove swapcache page early > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > so we can avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > That is a very good realization: it's surprising that none of us > thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. > And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any "swap writeahead". Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > But I strongly disagree with almost everything in your patch :) > I disagree with addressing it in vm_swap_full(), I disagree that > it can be addressed by device, I disagree that it has anything to > do with SWP_SOLIDSTATE. > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > is it? In those cases, a fixed amount of memory has been set aside > for swap, and it works out just like with disk block devices. The > memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? > Similarly, this is not a problem with swapping to SSD. There might > or might not be other reasons for adjusting the vm_swap_full() logic > for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to "reconstitute" the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. > The problem here is peculiar to frontswap, and the variably sized > memory behind it, isn't it? We are accustomed to using swap to free > up memory by transferring its data to some other, cheaper but slower > resource. Frontswap does make the problem more complex because some pages are in "fairly fast" storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). 
> But in the case of frontswap and zmem (I'll say that to avoid thinking > through which backends are actually involved), it is not a cheaper and > slower resource, but the very same memory we are trying to save: swap > is stolen from the memory under reclaim, so any duplication becomes > counter-productive (if we ignore cpu compression/decompression costs: > I have no idea how fair it is to do so, but anyone who chooses zmem > is prepared to pay some cpu price for that). Exactly. There is some "robbing of Peter to pay Paul" and other complex resource tradeoffs. Presumably, though, it is not "the very same memory we are trying to save" but a fraction of it, saving the same page of data more efficiently in memory, using less than a page, at some CPU cost. > And because it's a frontswap thing, we cannot decide this by device: > frontswap may or may not stand in front of each device. There is no > problem with swapcache duplicated on disk (until that area approaches > being full or fragmented), but at the higher level we cannot see what > is in zmem and what is on disk: we only want to free up the zmem dup. I *think* frontswap_test(page) resolves this problem, as long as we have a specific page available to use as a parameter. > I believe the answer is for frontswap/zmem to invalidate the frontswa
Re: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. And I guess swap readahead is utterly unhelpful in this case too. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). We cannot rely on freeing the swap itself, because in general there may be multiple references to the swap, and we only satisfy the one which has faulted. It may or may not be a good idea to use rmap to locate the other places to insert pte in place of swap entry, to resolve them all at once; but we have chosen not to do so in the past, and there's no need for that, if the zmem gets invalidated and the swapcache page set dirty. Hugh > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. 
> > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? > > Other optimize point is we remove it unconditionally when we > found it's exclusive when swap in happen. > It could help frontswap family, too. > What do you think about it? > > Cc: Hugh Dickins <hu...@google.com> > Cc: Dan Magenheimer <dan.magenhei...@oracle.com> > Cc: Seth Jennings <sjenn...@linux.vnet.ibm.com> > Cc: Nitin Gupta <ngu...@vflare.org> > Cc: Konrad Rzeszutek Wilk <kon...@darnok.org> > Cc: Shaohua Li <s...@kernel.org> > Signed-off-by: Minchan Kim <minc...@kernel.org> > --- > include/linux/swap.h | 11 --- > mm/memory.c | 3 ++- > mm/swapfile.c | 11 +++ > mm/vmscan.c | 2 +- > 4 files changed, 18 insertions(+), 9 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2818a12..1f4df66 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, > extern atomic_long_t nr_swap_pages; > extern long total_swap_pages; > > -/* Swap 50% full? Release swapcache more aggressively.. */ > -static inline bool vm_swap_full(void) > +/* > + * Swap 50% full or fast backed device? > + * Release swapcache more aggressively. > + */ > +static inline bool vm_swap_full(struct swap_info_struct *si) > { > + if (si->flags & SWP_SOLIDSTATE) > + return true; > return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; > }
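To make the signature change concrete for callers (the mm/memory.c, mm/swapfile.c and mm/vmscan.c hunks are not visible above), here is a hedged sketch of how the do_swap_page() call site might look afterwards, assuming the existing test there is the familiar vm_swap_full() || VM_LOCKED || PageMlocked() one:

	/* mm/memory.c, do_swap_page(): sketch of the adjusted call site */
	swap_free(entry);
	if (vm_swap_full(page_swap_info(page)) ||
	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
		try_to_free_swap(page);

With SWP_SOLIDSTATE set on the backing device the condition is now always true, so try_to_free_swap() drops the slot and the swapcache copy as soon as the page is swapped back in (provided nothing else still references the slot).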
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: [RFC] mm: remove swapcache page early > > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. > > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? > > Other optimize point is we remove it unconditionally when we > found it's exclusive when swap in happen. > It could help frontswap family, too. By passing a struct page * to vm_swap_full() you can then call frontswap_test()... if it returns true, then vm_swap_full can return true. Note that this precisely checks whether the page is in zcache/zswap or not, so Seth's concern that some pages may be in-memory and some may be in rotating storage is no longer an issue. > What do you think about it? By removing the page from swapcache, you are now increasing the risk that pages will "thrash" between uncompressed state (in swapcache) and compressed state (in z*). I think this is a better tradeoff though than keeping a copy of both the compressed page AND the uncompressed page in memory. You should probably rename vm_swap_full() because you are now overloading it with other meanings. Maybe vm_swap_reclaimable()? Do you have any measurements? I think you are correct that it may help a LOT. Thanks, Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
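A sketch of what this suggestion might look like, purely hypothetical and not from any tree, assuming frontswap_test() keeps its 3.9-era form taking a swap_info_struct and an offset rather than a page, and that the page passed in is a swapcache page so page_private() holds its swp_entry_t:

	/* Hypothetical replacement for vm_swap_full(), per Dan's suggestion. */
	static inline bool vm_swap_reclaimable(struct page *page)
	{
		swp_entry_t entry = { .val = page_private(page) };
		struct swap_info_struct *sis = page_swap_info(page);

		/* Data is held compressed by zcache/zswap: drop the dup early. */
		if (frontswap_test(sis, swp_offset(entry)))
			return true;

		/* Otherwise keep the historical "swap more than half full" rule. */
		return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
	}

Whether the extra lookup on every call is acceptable is exactly the cost Minchan worries about elsewhere in the thread.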
Re: [RFC] mm: remove swapcache page early
On 03/26/2013 09:22 PM, Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. Great idea! > For it, I used SWP_SOLIDSTATE but It might be controversial. The comment for SWP_SOLIDSTATE is that "blkdev seeks are cheap". Just because seeks are cheap doesn't mean the read itself is also cheap. For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of them can be pretty slow. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. Afaict, setting SWP_SOLIDSTATE depends on characteristics of the underlying block device (i.e. blk_queue_nonrot()). zram is a block device but zcache and zswap are not. Any idea by what criteria SWP_INMEMORY would be set? Also, frontswap backends (zcache and zswap) are a caching layer on top of the real swap device, which might actually be rotating media. So you have the issue of two different characteristics, in-memory caching on top of rotating media, present in a single swap device. Thanks, Seth
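For reference, the flag Seth describes is derived at swapon time from the queue's non-rotational bit, roughly like this (simplified from the mm/swapfile.c of this era; treat the surrounding details as assumptions):

	/* sys_swapon(): SWP_SOLIDSTATE simply mirrors QUEUE_FLAG_NONROT. */
	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev)))
		p->flags |= SWP_SOLIDSTATE;

	/*
	 * A hypothetical SWP_INMEMORY, as proposed in this thread, would need
	 * a different signal here: "non-rotational" is true for SSDs and slow
	 * MMC parts as well as for zram, which is exactly the objection above,
	 * and zcache/zswap never reach this code at all since they are not
	 * block devices.
	 */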
Re: [RFC] mm: remove swapcache page early
Hi Kame, On Wed, Mar 27, 2013 at 02:15:41PM +0900, Kamezawa Hiroyuki wrote: > (2013/03/27 11:22), Minchan Kim wrote: > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > So let's add Ccing Shaohua and Hugh. > > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > > or something for z* family. > > > > Other problem is zram is block device so that it can set SWP_INMEMORY > > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > > I have no idea to use it for frontswap. > > > > Any idea? > > > Another thinking: in what case, in what system configuration, > vm_swap_full() should return false and delay swp_entry freeing ? It's a really good question I have had in mind for a long time. If I catch your point properly, your question is "Couldn't we remove the vm_swap_full logic entirely?" If so, the answer is "I have no idea and would like to ask Hugh". Academically, it does make sense: a swapped-out page is unlikely to be part of the working set, so it may well be swapped out again unchanged, and I believe the logic was merged because some workload at the time was helped by it. And I think it's not easy to prove it useless these days, because I don't have every recent workload in the world at hand, so I'd like to avoid such an adventure. :) Thanks. > > Thanks, > -Kame -- Kind regards, Minchan Kim
Re: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. And I guess swap readahead is utterly unhelpful in this case too. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). We cannot rely on freeing the swap itself, because in general there may be multiple references to the swap, and we only satisfy the one which has faulted. It may or may not be a good idea to use rmap to locate the other places to insert pte in place of swap entry, to resolve them all at once; but we have chosen not to do so in the past, and there's no need for that, if the zmem gets invalidated and the swapcache page set dirty. Hugh So let's add Ccing Shaohua and Hugh. If it's a problem for SSD, I'd like to create new type SWP_INMEMORY or something for z* family. 
Other problem is zram is block device so that it can set SWP_INMEMORY or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but I have no idea to use it for frontswap. Any idea? Other optimize point is we remove it unconditionally when we found it's exclusive when swap in happen. It could help frontswap family, too. What do you think about it? Cc: Hugh Dickins hu...@google.com Cc: Dan Magenheimer dan.magenhei...@oracle.com Cc: Seth Jennings sjenn...@linux.vnet.ibm.com Cc: Nitin Gupta ngu...@vflare.org Cc: Konrad Rzeszutek Wilk kon...@darnok.org Cc: Shaohua Li s...@kernel.org Signed-off-by: Minchan Kim minc...@kernel.org --- include/linux/swap.h | 11 --- mm/memory.c | 3 ++- mm/swapfile.c | 11 +++ mm/vmscan.c | 2 +- 4 files changed, 18 insertions(+), 9 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 2818a12..1f4df66 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, extern atomic_long_t nr_swap_pages; extern long total_swap_pages; -/* Swap 50% full? Release swapcache more aggressively.. */ -static inline bool vm_swap_full(void) +/* + * Swap 50% full or fast backed device? + * Release swapcache more aggressively. + */ +static inline bool vm_swap_full(struct swap_info_struct *si) { + if (si->flags & SWP_SOLIDSTATE) + return
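For reference, a minimal sketch of the swap-in handling Hugh suggests above: drop the zmem copy and dirty the swapcache page so nothing later tries to read the backing slot that was never written. This is hypothetical placement, not a posted patch; frontswap_invalidate_page() and SetPageDirty() are the existing helpers, everything else is illustrative.

	/*
	 * On the swapin path, once the page is in swapcache and uptodate.
	 * Here page is the PageSwapCache page just read, entry its
	 * swp_entry_t, si/offset its swap device and slot.
	 */
	if (frontswap_test(si, offset)) {
		/* free the compressed copy held by zcache/zswap */
		frontswap_invalidate_page(swp_type(entry), offset);
		/* never read back from the on-disk slot, which was never written */
		SetPageDirty(page);
	}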
RE: [RFC] mm: remove swapcache page early
From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more complex because some pages are in fairly fast storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). 
But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). Exactly. There is some robbing of Peter to pay Paul and other complex resource tradeoffs. Presumably, though, it is not the very same memory we are trying to save but a fraction of it, saving the same page of data more efficiently in memory, using less than a page, at some CPU cost. And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. I *think* frontswap_test(page) resolves this problem, as long as we have a specific page available to use as a parameter. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). There are two duplication
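The "table lookup" Dan points to is essentially a test_bit() on the per-device frontswap bitmap; the helper in include/linux/frontswap.h of that period looks roughly like the following (reproduced from memory, so treat it as approximate):

	static inline int frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
	{
		int ret = 0;

		if (frontswap_enabled && sis->frontswap_map)
			ret = test_bit(offset, sis->frontswap_map);
		return ret;
	}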
RE: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. 
Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more complex because some pages are in fairly fast storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do
Re: [RFC] mm: remove swapcache page early
Hi Hugh, On Wed, Mar 27, 2013 at 02:41:07PM -0700, Hugh Dickins wrote: On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. And I guess swap readahead is utterly unhelpful in this case too. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The Brd is okay but it seems you are miunderstanding zram. The zram doesn't reserve any memory and allocate dynamic memory when swap out happens so it can make duplicate space in pusdo block device and memory. memory set aside may be wasted, but that is accepted upfront. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. Yes. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free Zram, too. up memory by transferring its data to some other, cheaper but slower resource. But in the case of frontswap and zmem (I'll say that to avoid thinking Frankly speaking, I couldn't understand what you means, frontswap and zmem. The frontswap is just layer for hook the swap subsystem. Real instance of frontswap is zcache and zswap at the moment. I will understand them as zcache and zswap. Okay? through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). Agree. And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. That's what I really have a concern and why I begged idea. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). 
You mean that zcache and zswap have to do garbage collection by some policy? It could be, but how about zram? It's just a pseudo block device and has no knowledge of what sits on top of it; that could be swap or any ordinary block device user. I mean zram has no swap-specific information to act on. We cannot rely on freeing the swap itself, because in general there may be multiple references to the swap, and we only satisfy the one which has faulted. It may or may not be a good idea to use rmap to locate the other places to insert pte in place of swap entry, to resolve them all at once; but we have chosen not to do so in the past, and there's no need for that, if the zmem gets invalidated and the swapcache page set dirty. Yes, it could be better, but as I mentioned above, it can't handle the zram case. If there is a solution for zram, I will be happy. :) And another point: frontswap is already percolated into the swap subsystem very tightly, so I doubt adding one more hook is really a problem. Thanks for the great comments, Hugh! Hugh So let's add Ccing Shaohua and Hugh. If it's a problem for SSD, I'd like to create new type SWP_INMEMORY or something for z* family. Other problem is zram is block device so that it can set SWP_INMEMORY or SWP_SOLIDSTATE easily (ie, actually, zram is already done) but I have no idea to use it for frontswap. Any idea?
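To make the "zram is already done" point concrete: zram marks its request queue non-rotational, and swapon() translates that into a per-device SWP_SOLIDSTATE, roughly as below (paraphrased from the zram driver and mm/swapfile.c of that period; exact lines may differ). zcache and zswap sit behind whatever the real swap device is, so they get no equivalent per-device flag, which is the gap Minchan is pointing at.

	/* drivers/staging/zram/zram_drv.c: zram advertises non-rotational media */
	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue);

	/* mm/swapfile.c, swapon path: the swap layer picks the hint up */
	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev)))
		p->flags |= SWP_SOLIDSTATE;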
Re: [RFC] mm: remove swapcache page early
Hi Dan, On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. Thanks! And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). Frankly speaking, I don't know why you mentioned swap writeahead in this point. Anyway, I dobut how it effect zram, too. A gain I can have a mind is compress ratio would be high thorough multiple page compression all at once. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Hmm, why? swap-readahead is just hint to reduce big stall time to reduce on big seek overhead storage. But in-memory swap is no cost for seeking. So unnecessary swap-readahead can make memory pressure high and it could cause another page swap out so it could be swap-thrashing. And for good swap-readahead hit ratio, swap device shouldn't be fragmented. But as you know, there are many factor to prevent it in the kernel now and Shaohua is tackling on it. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. 
Reconstituting from zcache/zram takes thousands of CPU cycles. Yeb. That's why I wanted to use SWP_SOLIDSTATE. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more complex because some pages are in fairly fast storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). Yeb, I thouht it could be a last resort because I'd like to avoid lookup every swapin if possible. But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). Exactly. There is some robbing of Peter to pay Paul and other complex resource tradeoffs. Presumably, though, it is not the very
Re: [RFC] mm: remove swapcache page early
On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: On Wed, 27 Mar 2013, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). But it could evict more important pages before reaching out the tail. That's thing we really want to avoid if possible. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. 
The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. Hmm, It seems I misunderstood Dan's opinion in previous thread. You're right, Hugh. My main concern is memory usage but the rationale I used SWP_SOLIDSTATE is writing on SSD could be cheap rather than storage. Yeb, it depends on SSD's internal's FTL algorith and fragment ratio due to wear-leveling. That's why I said It might be controversial. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more
Re: [RFC] mm: remove swapcache page early
Hi Seth, On Wed, Mar 27, 2013 at 12:19:11PM -0500, Seth Jennings wrote: On 03/26/2013 09:22 PM, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. Great idea! Thanks! For it, I used SWP_SOLIDSTATE but It might be controversial. The comment for SWP_SOLIDSTATE is that blkdev seeks are cheap. Just because seeks are cheap doesn't mean the read itself is also cheap. The read isn't the concern; the write is. For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of them can be pretty slow. Yeb. So let's add Ccing Shaohua and Hugh. If it's a problem for SSD, I'd like to create new type SWP_INMEMORY or something for z* family. Afaict, setting SWP_SOLIDSTATE depends on characteristics of the underlying block device (i.e. blk_queue_nonrot()). zram is a block device but zcache and zswap are not. Any idea by what criteria SWP_INMEMORY would be set? Just in-memory swap, zram, zswap and zcache at the moment. :) Also, frontswap backends (zcache and zswap) are a caching layer on top of the real swap device, which might actually be rotating media. So you have the issue of two different characteristics, in-memory caching on top of rotating media, present in a single swap device. Please read my patch completely. I already pointed out the problem, and Hugh and Dan are suggesting ideas. Thanks! Thanks, Seth -- Kind regards, Minchan Kim
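For the record, SWP_INMEMORY never landed upstream. A purely hypothetical sketch of what is being discussed here (the flag name comes from this thread, the bit value is invented for illustration):

	/* include/linux/swap.h: a new entry alongside the existing SWP_* flags */
	SWP_INMEMORY	= (1 << 10),	/* backing store is itself RAM: zram, zcache, zswap */

	/* ...and the early-release test keyed on it instead of SWP_SOLIDSTATE */
	static inline bool vm_swap_full(struct swap_info_struct *si)
	{
		if (si->flags & SWP_INMEMORY)
			return true;
		return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
	}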
Re: [RFC] mm: remove swapcache page early
On Thu, Mar 28, 2013 at 10:18:24AM +0900, Minchan Kim wrote: On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: On Wed, 27 Mar 2013, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). But it could evict more important pages before reaching out the tail. That's thing we really want to avoid if possible. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. 
I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. Hmm, It seems I misunderstood Dan's opinion in previous thread. You're right, Hugh. My main concern is memory usage but the rationale I used SWP_SOLIDSTATE is writing on SSD could be cheap rather than storage. Yeb, it depends on SSD's internal's FTL algorith and fragment ratio due to wear-leveling. That's why I said It might be controversial. Even SSD is fast, there is tradeoff. And unncessary write to SSD should be avoided if possible, because write makes
Re: [RFC] mm: remove swapcache page early
(2013/03/27 11:22), Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. > > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? Another thought: in what case, in what system configuration, should vm_swap_full() return false and delay swp_entry freeing? Thanks, -Kame
Re: [RFC] mm: remove swapcache page early
Hi, On Wed, Mar 27, 2013 at 11:22 AM, Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. I perfer to add new SWP_INMEMORY for z* family. as you know SSD and memory is different characteristics. and if new type is added, it doesn't need to modify lots of codes. Do you have any data for it? do you get meaningful performance gain or efficiency of z* family? If yes, please share it. Thank you, Kyungmin Park > > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? > > Other optimize point is we remove it unconditionally when we > found it's exclusive when swap in happen. > It could help frontswap family, too. > What do you think about it? > > Cc: Hugh Dickins > Cc: Dan Magenheimer > Cc: Seth Jennings > Cc: Nitin Gupta > Cc: Konrad Rzeszutek Wilk > Cc: Shaohua Li > Signed-off-by: Minchan Kim > --- > include/linux/swap.h | 11 --- > mm/memory.c | 3 ++- > mm/swapfile.c| 11 +++ > mm/vmscan.c | 2 +- > 4 files changed, 18 insertions(+), 9 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2818a12..1f4df66 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, > extern atomic_long_t nr_swap_pages; > extern long total_swap_pages; > > -/* Swap 50% full? Release swapcache more aggressively.. */ > -static inline bool vm_swap_full(void) > +/* > + * Swap 50% full or fast backed device? > + * Release swapcache more aggressively. 
> + */ > +static inline bool vm_swap_full(struct swap_info_struct *si) > { > + if (si->flags & SWP_SOLIDSTATE) > + return true; > return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; > } > > @@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, > swp_entry_t ent, bool swapout) > #define get_nr_swap_pages() 0L > #define total_swap_pages 0L > #define total_swapcache_pages() 0UL > -#define vm_swap_full() 0 > +#define vm_swap_full(si) 0 > > #define si_swapinfo(val) \ > do { (val)->freeswap = (val)->totalswap = 0; } while (0) > diff --git a/mm/memory.c b/mm/memory.c > index 705473a..1ca21a9 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm, struct > vm_area_struct *vma, > mem_cgroup_commit_charge_swapin(page, ptr); > > swap_free(entry); > - if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || > PageMlocked(page)) > + if (likely(PageSwapCache(page)) && (vm_swap_full(page_swap_info(page)) > + || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))) > try_to_free_swap(page); > unlock_page(page); > if (page != swapcache) { > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 1bee6fa..f9cc701 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -293,7 +293,7 @@ checks: > scan_base = offset = si->lowest_bit; > > /* reuse swap entry of cache-only swap if not busy. */ > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { > + if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) { > int swap_was_freed; > spin_unlock(&si->lock); > swap_was_freed = __try_to_reclaim_swap(si, offset); > @@ -382,7 +382,8 @@ scan: > spin_lock(&si->lock); > goto checks; > } > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) > { > + if (vm_swap_full(si) && > + si->swap_map[offset] == SWAP_HAS_CACHE) { > spin_lock(&si->lock); > goto checks; > } > @@ -397,7 +398,8 @@ scan: > spin_lock(&si->lock); > goto checks; > } > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) > { > + if (vm_swap_full(si) && > +
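A note on the do_swap_page() hunk above: vm_swap_full(page_swap_info(page)) is only meaningful while the page is still in swapcache, which is why the patch adds the likely(PageSwapCache(page)) guard. page_swap_info() resolves the owning swap device roughly as follows (paraphrased from mm/swapfile.c; details may differ slightly):

	struct swap_info_struct *page_swap_info(struct page *page)
	{
		swp_entry_t swap = { .val = page_private(page) };

		BUG_ON(!PageSwapCache(page));
		return swap_info[swp_type(swap)];
	}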