Re: [RFC] mm: remove swapcache page early
On 04/08/2013 09:48 AM, Minchan Kim wrote:
> [...]
>
> I believe you talented developers can catch it up by reading the code
> thoroughly and find more bonus knowledge. I think that's why our senior
> developers yell out RTFM, and I follow them.

What's the meaning of RTFM?

Cheers!
Re: [RFC] mm: remove swapcache page early
Hello Simon,

On Sun, Apr 07, 2013 at 03:26:12PM +0800, Simon Jeons wrote:
> Ping Minchan.
>
> On 04/02/2013 09:40 PM, Simon Jeons wrote:
>> Hi Hugh,
>>
>> On 03/28/2013 05:41 AM, Hugh Dickins wrote:
>>> On Wed, 27 Mar 2013, Minchan Kim wrote:
>>>
>>>> Swap subsystem does lazy swap slot free with expecting the page
>>>> would be swapped out again so we can't avoid unnecessary write.
>>>
>>>                           so we can avoid unnecessary write.
>>
>> If the page can be swapped out again, which code avoids the unnecessary
>> write? Could you point it out to me? Thanks in advance. ;-)

Look at shrink_page_list:

1) PageAnon(page) && !PageSwapCache()
2) add_to_swap's SetPageDirty
3) __remove_mapping

P.S. It seems you are misunderstanding. This isn't the proper place to ask
questions for your own understanding of the code. As far as I know, there
are projects (e.g. kernelnewbies) and books for studying and sharing
knowledge of the Linux kernel. I recommend Mel's "Understanding the Linux
Virtual Memory Manager". It's rather outdated, but it will be very helpful
for understanding the VM of the Linux kernel. You can get it freely, but I
hope you pay for it: if the author becomes a billionaire by having the
best-selling book on Amazon, he might print a second edition covering all
of the new VM features, which may resolve all of your curiosity. That would
be another way to contribute to an open source project. :)

I believe you talented developers can catch it up by reading the code
thoroughly and find more bonus knowledge. I think that's why our senior
developers yell out RTFM, and I follow them.

Cheers!

--
Kind regards,
Minchan Kim
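To make those three pointers concrete, the sketch below paraphrases the reclaim logic of that era. It is a heavily condensed illustration, not a verbatim copy of mm/vmscan.c, and the helper name shrink_page_list_sketch is invented for this note. The point is that add_to_swap() marks a freshly allocated swapcache page dirty so it is written exactly once, while a page whose swap slot already holds valid data stays clean on a later reclaim pass and is dropped by __remove_mapping() without another write.

```c
/* Condensed, paraphrased sketch of shrink_page_list() -- illustrative only. */
static void shrink_page_list_sketch(struct page *page,
				    struct list_head *page_list,
				    struct scan_control *sc)
{
	/* 1) Anonymous page not yet in swapcache: allocate a swap slot. */
	if (PageAnon(page) && !PageSwapCache(page)) {
		/*
		 * 2) add_to_swap() does SetPageDirty(page), so the newly
		 *    allocated slot is guaranteed to be written at least once.
		 */
		if (!add_to_swap(page, page_list))
			return;			/* out of swap: keep the page */
	}

	/* Only dirty pages are written; a clean swapcache page skips pageout(). */
	if (PageDirty(page))
		pageout(page, page_mapping(page), sc);

	/*
	 * 3) __remove_mapping() frees a clean swapcache page while leaving the
	 *    swap slot (and the copy it already points at) alone, so reclaiming
	 *    the same data again later needs no further write.
	 */
	__remove_mapping(page_mapping(page), page);
}
```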
Re: [RFC] mm: remove swapcache page early
Ping Minchan.

On 04/02/2013 09:40 PM, Simon Jeons wrote:
> Hi Hugh,
>
> [...]
Re: [RFC] mm: remove swapcache page early
Hi Hugh,

On 03/28/2013 05:41 AM, Hugh Dickins wrote:
> On Wed, 27 Mar 2013, Minchan Kim wrote:
>
>> Swap subsystem does lazy swap slot free with expecting the page
>> would be swapped out again so we can't avoid unnecessary write.
>
>                           so we can avoid unnecessary write.

If the page can be swapped out again, which code avoids the unnecessary
write? Could you point it out to me? Thanks in advance. ;-)

>> But the problem with in-memory swap is that it consumes memory space
>> until the vm_swap_full() condition (ie, half of all the swap device
>> used) is met. It could be bad if we use multiple swap devices (a small
>> in-memory swap plus a big storage swap) or in-memory swap alone.
>
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.
>
> And I guess swap readahead is utterly unhelpful in this case too.
>
>> This patch changes the vm_swap_full logic slightly so it could free
>> the swap slot early if the backing device is really fast. For it, I
>> used SWP_SOLIDSTATE, but it might be controversial.
>
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that it
> can be addressed by device, I disagree that it has anything to do
> with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The
> memory set aside may be wasted, but that is accepted upfront.
>
> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.
>
> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it?
>
> We are accustomed to using swap to free up memory by transferring
> its data to some other, cheaper but slower resource. But in the case
> of frontswap and zmem (I'll say that to avoid thinking through which
> backends are actually involved), it is not a cheaper and slower
> resource, but the very same memory we are trying to save: swap is
> stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).
>
> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.
>
> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).
>
> We cannot rely on freeing the swap itself, because in general there
> may be multiple references to the swap, and we only satisfy the one
> which has faulted. It may or may not be a good idea to use rmap to
> locate the other places to insert pte in place of swap entry, to
> resolve them all at once; but we have chosen not to do so in the
> past, and there's no need for that, if the zmem gets invalidated and
> the swapcache page set dirty.
>
> Hugh
>
>> So let me Cc Shaohua and Hugh.
>>
>> If it's a problem for SSD, I'd like to create a new type, SWP_INMEMORY
>> or something, for the z* family. The other problem is that zram is a
>> block device, so it can set SWP_INMEMORY or SWP_SOLIDSTATE easily
>> (ie, actually, zram already does), but I have no idea how to use that
>> for frontswap. Any idea?
>>
>> Another optimization point: we could remove it unconditionally when we
>> find it's exclusive at swap-in time. It could help the frontswap
>> family, too.
>>
>> What do you think about it?
>>
>> Cc: Hugh Dickins <hu...@google.com>
>> Cc: Dan Magenheimer <dan.magenhei...@oracle.com>
>> Cc: Seth Jennings <sjenn...@linux.vnet.ibm.com>
>> Cc: Nitin Gupta <ngu...@vflare.org>
>> Cc: Konrad Rzeszutek Wilk <kon...@darnok.org>
>> Cc: Shaohua Li <s...@kernel.org>
>> Signed-off-by: Minchan Kim <minc...@kernel.org>
>> ---
>>  include/linux/swap.h | 11 ---
>>  mm/memory.c          |  3 ++-
>>  mm/swapfile.c        | 11 +++
>>  mm/vmscan.c          |  2 +-
>>  4 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 2818a12..1f4df66 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
>>  extern atomic_long_t nr_swap_pages;
>>  extern long total_swap_pages;
>>
>> -/* Swap 50% full? Release swapcache more aggressively.. */
>> -static inline bool vm_swap_full(void)
>> +/*
>> + * Swap 50% full or fast backed device?
>> + * Release swapcache more aggressively.
>> + */
>> +static inline bool vm_swap_full(struct swap_info_struct *si)
>>  {
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +
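The archived copy of the hunk above is cut off mid-function. Going only by the patch description ("free the swap slot early if the backing device is really fast") and the unmodified 50%-full test that the removed lines contained, the completed helper was presumably something close to the reconstruction below; it is a guess for readability, not the hunk that was actually posted.

```c
/* Reconstruction of the proposed helper; the posted hunk is truncated above. */
static inline bool vm_swap_full(struct swap_info_struct *si)
{
	/* Fast backing device: release swapcache duplicates aggressively. */
	if (si->flags & SWP_SOLIDSTATE)
		return true;

	/* Otherwise keep the historical "swap more than 50% full" test. */
	return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
}
```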
Re: [RFC] mm: remove swapcache page early
On Mon, Apr 01, 2013 at 10:13:58PM -0700, Hugh Dickins wrote:
> [...]
>
> Looking at it again, I do believe you and Dan are perfectly correct,
> and I was again the confused one. Though I'd be happier if I could
> see just how I was misreading it: makes me wonder if I had a great
> insight that I can no longer grasp hold of! I think I was paranoid
> about a swp_entry_t getting recycled prematurely: but swap_entry_free
> remains in control of that - freeing a swap entry is no part of what
> notify_free gets up to. Sorry for wasting your time.

Hey Hugh,

Please don't apologize. It gave me a chance to look into that part in
detail, so it never wasted my time. And your deep insight and kind
advice always make everybody happier.

Looking forward to seeing you soon at LSF/MM.

Thanks!

--
Kind regards,
Minchan Kim
Re: [RFC] mm: remove swapcache page early
On Tue, 2 Apr 2013, Minchan Kim wrote:
> [...]
>
> But the current implementation of zram_slot_free_notify could cover
> both cases properly, with luck.
>
> zram_free_page caused by end_swap_bio_read will free the compressed
> copy of the page, and zram_free_page caused by swap_entry_free later
> won't find the right index in zram->table and will just return.
> So I think there is no problem.
>
> The remaining problem is zram->stats.notify_free, which could be
> counted redundantly, but I'm not sure it's valuable to count it
> exactly.
>
> If I miss your point, please pinpoint your concern. :)

Looking at it again, I do believe you and Dan are perfectly correct,
and I was again the confused one. Though I'd be happier if I could
see just how I was misreading it: makes me wonder if I had a great
insight that I can no longer grasp hold of! I think I was paranoid
about a swp_entry_t getting recycled prematurely: but swap_entry_free
remains in control of that - freeing a swap entry is no part of what
notify_free gets up to. Sorry for wasting your time.

Hugh
Re: [RFC] mm: remove swapcache page early
Hi Hugh,

On Fri, Mar 29, 2013 at 01:01:14PM -0700, Hugh Dickins wrote:
> On Fri, 29 Mar 2013, Minchan Kim wrote:
>> On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
>>> [...]
>>>
>>> I wonder if something like this would have a similar result for zram?
>>> (Completely untested... snippet stolen from swap_entry_free with
>>> SetPageDirty added... doesn't compile yet, but should give you the
>>> idea.)
>
> Be careful, although Dan is right that something like this can be
> done for zram, I believe you will find that it needs a little more:
> either a separate new entry point (not my preference) or a flags arg
> (or boolean) added to swap_slot_free_notify.
>
> Because this is a different operation: end_swap_bio_read() wants
> to free up zram's compressed copy of the page, but the swp_entry_t
> must remain valid until swap_entry_free() can clear up the rest.
> Precisely how much of the work each should do, you will discover.

First of all, thanks for pointing it out to me!

If I parse your concern correctly, you are worried about the different
semantics of the two callers (end_swap_bio_read's swap_slot_free_notify
vs. swap_entry_free's).

But the current implementation of zram_slot_free_notify could cover
both cases properly, with luck.

zram_free_page caused by end_swap_bio_read will free the compressed
copy of the page, and zram_free_page caused by swap_entry_free later
won't find the right index in zram->table and will just return.
So I think there is no problem.

The remaining problem is zram->stats.notify_free, which could be
counted redundantly, but I'm not sure it's valuable to count it
exactly.

If I miss your point, please pinpoint your concern. :)

Thanks!

--
Kind regards,
Minchan Kim
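To make the double-call argument easier to follow, here is a loose sketch of the zram side as it looked in that era. It is a simplified paraphrase of the drivers/staging/zram code, not a verbatim copy, and the field names are approximate rather than checked against a specific tree. The property Minchan relies on is simply that zram_free_page() is a no-op once the slot's handle has already been cleared.

```c
/* Loose paraphrase of the era's zram code -- field names approximate. */
static void zram_free_page(struct zram *zram, size_t index)
{
	unsigned long handle = zram->table[index].handle;

	/* Slot already freed (or never written): a second call is a no-op. */
	if (!handle)
		return;

	zs_free(zram->mem_pool, handle);
	zram->table[index].handle = 0;
}

static void zram_slot_free_notify(struct block_device *bdev,
				  unsigned long index)
{
	struct zram *zram = bdev->bd_disk->private_data;

	zram_free_page(zram, index);
	zram->stats.notify_free++;	/* may be bumped twice, as noted above */
}
```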
Re: [RFC] mm: remove swapcache page early
On Fri, 29 Mar 2013, Minchan Kim wrote:
> On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
>> [...]
>>
>> I wonder if something like this would have a similar result for zram?
>> (Completely untested... snippet stolen from swap_entry_free with
>> SetPageDirty added... doesn't compile yet, but should give you the
>> idea.)

Thanks for correcting me on zram (in earlier mail of this thread), yes,
I was forgetting about the swap_slot_free_notify entry point which lets
that memory be freed.

> Nice idea!
>
> After I saw your patch, I realized it was Hugh's suggestion and you
> implemented it in the proper place.
>
> Will resend it after testing. Maybe next week.
> Thanks!

Be careful, although Dan is right that something like this can be
done for zram, I believe you will find that it needs a little more:
either a separate new entry point (not my preference) or a flags arg
(or boolean) added to swap_slot_free_notify.

Because this is a different operation: end_swap_bio_read() wants
to free up zram's compressed copy of the page, but the swp_entry_t
must remain valid until swap_entry_free() can clear up the rest.
Precisely how much of the work each should do, you will discover.

Hugh
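For illustration only, one shape Hugh's "flags arg (or boolean)" suggestion could take is sketched below. The parameter name is invented here and this was never merged; later in the thread (his "Looking at it again" reply, shown earlier in this archive) Hugh agrees that the existing single entry point is in fact sufficient for zram.

```c
/*
 * Hypothetical sketch of the suggested boolean argument (name invented
 * here); never merged.  The swap_slot_free_notify member of
 * struct block_device_operations would gain a third parameter:
 */
void (*swap_slot_free_notify)(struct block_device *bdev,
			      unsigned long offset,
			      bool entry_still_in_use);

/* end_swap_bio_read(): only the backend's compressed copy should go. */
disk->fops->swap_slot_free_notify(sis->bdev, offset, true);

/* swap_entry_free(): the swp_entry_t itself is being freed. */
disk->fops->swap_slot_free_notify(sis->bdev, offset, false);
```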
Re: [RFC] mm: remove swapcache page early
On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
>> From: Minchan Kim [mailto:minc...@kernel.org]
>> [...]
>>
>> I am blind on zcache so I didn't see it. Anyway, I'd like to address
>> it on zram and zswap.
>
> Zswap can enable it trivially by adding a function call in init_zswap.
> (Note that it is not enabled by default for all frontswap backends
> because it is another complicated tradeoff of cpu time vs memory space
> that needs more study on a broad set of workloads.)
>
> I wonder if something like this would have a similar result for zram?
> (Completely untested... snippet stolen from swap_entry_free with
> SetPageDirty added... doesn't compile yet, but should give you the
> idea.)
>
> [...]

Nice idea!

After I saw your patch, I realized it was Hugh's suggestion and you
implemented it in the proper place.

Will resend it after testing. Maybe next week.
Thanks!

--
Kind regards,
Minchan Kim
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> Hi Dan,
>
> On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
>>> From: Hugh Dickins [mailto:hu...@google.com]
>>> Subject: Re: [RFC] mm: remove swapcache page early
>>>
>>> I believe the answer is for frontswap/zmem to invalidate the frontswap
>>> copy of the page (to free up the compressed memory when possible) and
>>> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
>>> (setting page dirty so nothing will later go to read it from the
>>> unfreed location on backing swap disk, which was never written).
>>
>> There are two duplication issues: (1) When can the page be removed
>> from the swap cache after a call to frontswap_store; and (2) When
>> can the page be removed from the frontswap storage after it
>> has been brought back into memory via frontswap_load.
>>
>> This patch from Minchan addresses (1). The issue you are raising
>
> No. I am addressing (2).
>
>> here is (2). You may not know that (2) has recently been solved
>> in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
>> If this is enabled (and it is for zcache but not yet for zswap),
>> what you suggest (SetPageDirty) is what happens.
>
> I am blind on zcache so I didn't see it. Anyway, I'd like to address
> it on zram and zswap.

Zswap can enable it trivially by adding a function call in init_zswap.
(Note that it is not enabled by default for all frontswap backends
because it is another complicated tradeoff of cpu time vs memory space
that needs more study on a broad set of workloads.)

I wonder if something like this would have a similar result for zram?
(Completely untested... snippet stolen from swap_entry_free with
SetPageDirty added... doesn't compile yet, but should give you the idea.)

diff --git a/mm/page_io.c b/mm/page_io.c
index 56276fe..2d10988 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -81,7 +81,17 @@ void end_swap_bio_read(struct bio *bio, int err)
 			iminor(bio->bi_bdev->bd_inode),
 			(unsigned long long)bio->bi_sector);
 	} else {
+		struct swap_info_struct *sis;
+
 		SetPageUptodate(page);
+		sis = page_swap_info(page);
+		if (sis->flags & SWP_BLKDEV) {
+			struct gendisk *disk = sis->bdev->bd_disk;
+			if (disk->fops->swap_slot_free_notify) {
+				SetPageDirty(page);
+				disk->fops->swap_slot_free_notify(sis->bdev,
+								  offset);
+			}
+		}
 	}
 	unlock_page(page);
 	bio_put(bio);
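Two follow-ups on the above, both illustrative sketches rather than anything posted in this thread. The zswap enablement Dan mentions would be a single call in init_zswap() to the frontswap setter behind frontswap_exclusive_gets_enabled. And a variant of Dan's zram snippet that would at least compile might recover the missing `offset` from page_private(page), the way the swap code does elsewhere; the body below is untested and is not the change that was eventually merged.

```c
/* Untested sketch only -- not the patch that was eventually merged. */
void end_swap_bio_read(struct bio *bio, int err)
{
	struct page *page = bio->bi_io_vec[0].bv_page;

	if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
		SetPageError(page);
		ClearPageUptodate(page);
	} else {
		struct swap_info_struct *sis = page_swap_info(page);

		SetPageUptodate(page);
		if (sis->flags & SWP_BLKDEV) {
			struct gendisk *disk = sis->bdev->bd_disk;

			if (disk->fops->swap_slot_free_notify) {
				/* A swapcache page keeps its swp_entry_t
				 * in page_private(page). */
				swp_entry_t entry = { .val = page_private(page) };

				SetPageDirty(page);
				disk->fops->swap_slot_free_notify(sis->bdev,
							swp_offset(entry));
			}
		}
	}
	unlock_page(page);
	bio_put(bio);
}
```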
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com] > Subject: RE: [RFC] mm: remove swapcache page early > > On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > > From: Hugh Dickins [mailto:hu...@google.com] > > > Subject: Re: [RFC] mm: remove swapcache page early > > > > > The issue you are raising > > here is (2). You may not know that (2) has recently been solved > > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled. > > If this is enabled (and it is for zcache but not yet for zswap), > > what you suggest (SetPageDirty) is what happens. > > Ah, and I have a dim, perhaps mistaken, memory that I gave you > input on that before, suggesting the SetPageDirty. Good, sounds > like the solution is already in place, if not actually activated. > > Thanks, must dash, > Hugh Hi Hugh -- Credit where it is due... Yes, I do recall now that the idea was originally yours. It went on a to-do list where I eventually tried it and it worked... I'm sorry I had forgotten and neglected to give you credit! (BTW, it is activated for zcache in 3.9.) Thanks, Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] mm: remove swapcache page early
On Thu, Mar 28, 2013 at 10:18:24AM +0900, Minchan Kim wrote: > On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: > > On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > > > From: Hugh Dickins [mailto:hu...@google.com] > > > > Subject: Re: [RFC] mm: remove swapcache page early > > > > > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > > > would be swapped out again so we can't avoid unnecessary write. > > > > so we can avoid unnecessary write. > > > > > > > > > > But the problem in in-memory swap is that it consumes memory space > > > > > until vm_swap_full(ie, used half of all of swap device) condition > > > > > meet. It could be bad if we use multiple swap device, small in-memory > > > > > swap > > > > > and big storage swap or in-memory swap alone. > > > > > > > > That is a very good realization: it's surprising that none of us > > > > thought of it before - no disrespect to you, well done, thank you. > > > > > > Yes, my compliments also Minchan. This problem has been thought of before > > > but this patch is the first to identify a possible solution. > > > > > > > And I guess swap readahead is utterly unhelpful in this case too. > > > > > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > > > think this is not done in the swap subsystem but instead the kernel > > > assumes write-coalescing will be done in the block I/O subsystem, > > > which means swap writeahead would affect zram but not zcache/zswap > > > (since frontswap subverts the block I/O subsystem). > > > > I don't know what swap writeahead is; but write coalescing, yes. > > I don't see any problem with it in this context. > > > > > > > > However I think a swap-readahead solution would be helpful to > > > zram as well as zcache/zswap. > > > > Whereas swap readahead on zmem is uncompressing zmem to pagecache > > which may never be needed, and may take a circuit of the inactive > > LRU before it gets reclaimed (if it turns out not to be needed, > > at least it will remain clean and be easily reclaimed). > > But it could evict more important pages before reaching out the tail. > That's thing we really want to avoid if possible. > > > > > > > > > > > This patch changes vm_swap_full logic slightly so it could free > > > > > swap slot early if the backed device is really fast. > > > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > > > > > But I strongly disagree with almost everything in your patch :) > > > > I disagree with addressing it in vm_swap_full(), I disagree that > > > > it can be addressed by device, I disagree that it has anything to > > > > do with SWP_SOLIDSTATE. > > > > > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > > > is it? In those cases, a fixed amount of memory has been set aside > > > > for swap, and it works out just like with disk block devices. The > > > > memory set aside may be wasted, but that is accepted upfront. > > > > > > It is (I believe) also a problem with swapping to ram. Two > > > copies of the same page are kept in memory in different places, > > > right? Fixed vs variable size is irrelevant I think. Or am > > > I misunderstanding something about swap-to-ram? > > > > I may be misrembering how /dev/ram0 works, or simply assuming that > > if you want to use it for swap (interesting for testing, but probably > > not for general use), then you make sure to allocate each page of it > > in advance. 
> > > > The pages of /dev/ram0 don't get freed, or not before it's closed > > (swapoff'ed) anyway. Yes, swapcache would be duplicating data from > > other memory into /dev/ram0 memory; but that /dev/ram0 memory has > > been set aside for this purpose, and removing from swapcache won't > > free any more memory. > > > > > > > > > Similarly, this is not a problem with swapping to SSD. There might > > > > or might not be other reasons for adjusting the vm_swap_full() logic > > > > for SSD or generally, but those have nothing to do with this issue. > > > > > > I think it is at least highly related. The key is
Re: [RFC] mm: remove swapcache page early
Hi Seth, On Wed, Mar 27, 2013 at 12:19:11PM -0500, Seth Jennings wrote: > On 03/26/2013 09:22 PM, Minchan Kim wrote: > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > Great idea! Thanks! > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > The comment for SWP_SOLIDSTATE is that "blkdev seeks are cheap". Just > because seeks are cheap doesn't mean the read itself is also cheap. The "read" isn't the concern here; the "write" is. > For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of > them can be pretty slow. Yeb. > > > So let's add Ccing Shaohua and Hugh. > > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > > or something for z* family. > > Afaict, setting SWP_SOLIDSTATE depends on characteristics of the > underlying block device (i.e. blk_queue_nonrot()). zram is a block > device but zcache and zswap are not. > > Any idea by what criteria SWP_INMEMORY would be set? Just in-memory swap: zram, zswap and zcache at the moment. :) > > Also, frontswap backends (zcache and zswap) are a caching layer on top > of the real swap device, which might actually be rotating media. So > you have the issue of two different characteristics, in-memory caching > on top of rotating media, present in a single swap device. Please read my patch completely. I already pointed out that problem, and Hugh and Dan are suggesting ideas. Thanks! > > Thanks, > Seth -- Kind regards, Minchan Kim
Re: [RFC] mm: remove swapcache page early
On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: > On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > > From: Hugh Dickins [mailto:hu...@google.com] > > > Subject: Re: [RFC] mm: remove swapcache page early > > > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > > would be swapped out again so we can't avoid unnecessary write. > > > so we can avoid unnecessary write. > > > > > > > > But the problem in in-memory swap is that it consumes memory space > > > > until vm_swap_full(ie, used half of all of swap device) condition > > > > meet. It could be bad if we use multiple swap device, small in-memory > > > > swap > > > > and big storage swap or in-memory swap alone. > > > > > > That is a very good realization: it's surprising that none of us > > > thought of it before - no disrespect to you, well done, thank you. > > > > Yes, my compliments also Minchan. This problem has been thought of before > > but this patch is the first to identify a possible solution. > > > > > And I guess swap readahead is utterly unhelpful in this case too. > > > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > > think this is not done in the swap subsystem but instead the kernel > > assumes write-coalescing will be done in the block I/O subsystem, > > which means swap writeahead would affect zram but not zcache/zswap > > (since frontswap subverts the block I/O subsystem). > > I don't know what swap writeahead is; but write coalescing, yes. > I don't see any problem with it in this context. > > > > > However I think a swap-readahead solution would be helpful to > > zram as well as zcache/zswap. > > Whereas swap readahead on zmem is uncompressing zmem to pagecache > which may never be needed, and may take a circuit of the inactive > LRU before it gets reclaimed (if it turns out not to be needed, > at least it will remain clean and be easily reclaimed). But it could evict more important pages before reaching out the tail. That's thing we really want to avoid if possible. > > > > > > > This patch changes vm_swap_full logic slightly so it could free > > > > swap slot early if the backed device is really fast. > > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > > > But I strongly disagree with almost everything in your patch :) > > > I disagree with addressing it in vm_swap_full(), I disagree that > > > it can be addressed by device, I disagree that it has anything to > > > do with SWP_SOLIDSTATE. > > > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > > is it? In those cases, a fixed amount of memory has been set aside > > > for swap, and it works out just like with disk block devices. The > > > memory set aside may be wasted, but that is accepted upfront. > > > > It is (I believe) also a problem with swapping to ram. Two > > copies of the same page are kept in memory in different places, > > right? Fixed vs variable size is irrelevant I think. Or am > > I misunderstanding something about swap-to-ram? > > I may be misrembering how /dev/ram0 works, or simply assuming that > if you want to use it for swap (interesting for testing, but probably > not for general use), then you make sure to allocate each page of it > in advance. > > The pages of /dev/ram0 don't get freed, or not before it's closed > (swapoff'ed) anyway. 
Yes, swapcache would be duplicating data from > other memory into /dev/ram0 memory; but that /dev/ram0 memory has > been set aside for this purpose, and removing from swapcache won't > free any more memory. > > > > > > Similarly, this is not a problem with swapping to SSD. There might > > > or might not be other reasons for adjusting the vm_swap_full() logic > > > for SSD or generally, but those have nothing to do with this issue. > > > > I think it is at least highly related. The key issue is the > > tradeoff of the likelihood that the page will soon be read/written > > again while it is in swap cache vs the time/resource-usage necessary > > to "reconstitute" the page into swap cache. Reconstituting from disk > > requires a LOT of elapsed time. Reconstituting from > > an SSD likely takes much less time. Reconstituting from > > zcache/zram takes thousands of CPU cycles. > > I acknowledge my complete ignorance of
Re: [RFC] mm: remove swapcache page early
Hi Dan, On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote: > > From: Hugh Dickins [mailto:hu...@google.com] > > Subject: Re: [RFC] mm: remove swapcache page early > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > would be swapped out again so we can't avoid unnecessary write. > > so we can avoid unnecessary write. > > > > > > But the problem in in-memory swap is that it consumes memory space > > > until vm_swap_full(ie, used half of all of swap device) condition > > > meet. It could be bad if we use multiple swap device, small in-memory swap > > > and big storage swap or in-memory swap alone. > > > > That is a very good realization: it's surprising that none of us > > thought of it before - no disrespect to you, well done, thank you. > > Yes, my compliments also Minchan. This problem has been thought of before > but this patch is the first to identify a possible solution. Thanks! > > > And I guess swap readahead is utterly unhelpful in this case too. > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > think this is not done in the swap subsystem but instead the kernel > assumes write-coalescing will be done in the block I/O subsystem, > which means swap writeahead would affect zram but not zcache/zswap > (since frontswap subverts the block I/O subsystem). Frankly speaking, I don't know why you mentioned "swap writeahead" in this point. Anyway, I dobut how it effect zram, too. A gain I can have a mind is compress ratio would be high thorough multiple page compression all at once. > > However I think a swap-readahead solution would be helpful to > zram as well as zcache/zswap. Hmm, why? swap-readahead is just hint to reduce big stall time to reduce on big seek overhead storage. But in-memory swap is no cost for seeking. So unnecessary swap-readahead can make memory pressure high and it could cause another page swap out so it could be swap-thrashing. And for good swap-readahead hit ratio, swap device shouldn't be fragmented. But as you know, there are many factor to prevent it in the kernel now and Shaohua is tackling on it. > > > > This patch changes vm_swap_full logic slightly so it could free > > > swap slot early if the backed device is really fast. > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > But I strongly disagree with almost everything in your patch :) > > I disagree with addressing it in vm_swap_full(), I disagree that > > it can be addressed by device, I disagree that it has anything to > > do with SWP_SOLIDSTATE. > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > is it? In those cases, a fixed amount of memory has been set aside > > for swap, and it works out just like with disk block devices. The > > memory set aside may be wasted, but that is accepted upfront. > > It is (I believe) also a problem with swapping to ram. Two > copies of the same page are kept in memory in different places, > right? Fixed vs variable size is irrelevant I think. Or am > I misunderstanding something about swap-to-ram? > > > Similarly, this is not a problem with swapping to SSD. There might > > or might not be other reasons for adjusting the vm_swap_full() logic > > for SSD or generally, but those have nothing to do with this issue. > > I think it is at least highly related. 
The key issue is the > tradeoff of the likelihood that the page will soon be read/written > again while it is in swap cache vs the time/resource-usage necessary > to "reconstitute" the page into swap cache. Reconstituting from disk > requires a LOT of elapsed time. Reconstituting from > an SSD likely takes much less time. Reconstituting from > zcache/zram takes thousands of CPU cycles. Yeb. That's why I wanted to use SWP_SOLIDSTATE. > > > The problem here is peculiar to frontswap, and the variably sized > > memory behind it, isn't it? We are accustomed to using swap to free > > up memory by transferring its data to some other, cheaper but slower > > resource. > > Frontswap does make the problem more complex because some pages > are in "fairly fast" storage (zcache, needs decompression) and > some are on the actual (usually) rotating media. Fortunately, > differentiating between these two cases is just a table lookup > (see frontswap_test). Yeb, I thouht it could be a last resort because I'd like to avoid lookup every swapin if possible. > > > But in the case of frontswap and zmem (I'll say that to avoid thinkin
Re: [RFC] mm: remove swapcache page early
Hi Hugh, On Wed, Mar 27, 2013 at 02:41:07PM -0700, Hugh Dickins wrote: > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > so we can avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > That is a very good realization: it's surprising that none of us > thought of it before - no disrespect to you, well done, thank you. > > And I guess swap readahead is utterly unhelpful in this case too. > > > > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > But I strongly disagree with almost everything in your patch :) > I disagree with addressing it in vm_swap_full(), I disagree that > it can be addressed by device, I disagree that it has anything to > do with SWP_SOLIDSTATE. > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > is it? In those cases, a fixed amount of memory has been set aside > for swap, and it works out just like with disk block devices. The Brd is okay but it seems you are miunderstanding zram. The zram doesn't reserve any memory and allocate dynamic memory when swap out happens so it can make duplicate space in pusdo block device and memory. > memory set aside may be wasted, but that is accepted upfront. > > Similarly, this is not a problem with swapping to SSD. There might > or might not be other reasons for adjusting the vm_swap_full() logic > for SSD or generally, but those have nothing to do with this issue. Yes. > > The problem here is peculiar to frontswap, and the variably sized > memory behind it, isn't it? We are accustomed to using swap to free Zram, too. > up memory by transferring its data to some other, cheaper but slower > resource. > > But in the case of frontswap and zmem (I'll say that to avoid thinking Frankly speaking, I couldn't understand what you means, frontswap and zmem. The frontswap is just layer for hook the swap subsystem. Real instance of frontswap is zcache and zswap at the moment. I will understand them as zcache and zswap. Okay? > through which backends are actually involved), it is not a cheaper and > slower resource, but the very same memory we are trying to save: swap > is stolen from the memory under reclaim, so any duplication becomes > counter-productive (if we ignore cpu compression/decompression costs: > I have no idea how fair it is to do so, but anyone who chooses zmem > is prepared to pay some cpu price for that). Agree. > > And because it's a frontswap thing, we cannot decide this by device: > frontswap may or may not stand in front of each device. There is no > problem with swapcache duplicated on disk (until that area approaches > being full or fragmented), but at the higher level we cannot see what > is in zmem and what is on disk: we only want to free up the zmem dup. That's what I really have a concern and why I begged idea. 
> > I believe the answer is for frontswap/zmem to invalidate the frontswap > copy of the page (to free up the compressed memory when possible) and > SetPageDirty on the PageUptodate PageSwapCache page when swapping in > (setting page dirty so nothing will later go to read it from the > unfreed location on backing swap disk, which was never written). You mean that zcache and zswap have to do garbage collection by some policy? It could be, but how about zram? It's just a pseudo block device and it doesn't have any knowledge of what sits on top of it. It could be swap or a normal block device. I mean, zram has no information about swap with which to handle it. > > We cannot rely on freeing the swap itself, because in general there > may be multiple references to the swap, and we only satisfy the one > which has faulted. It may or may not be a good idea to use rmap to > locate the other places to insert pte in place of swap entry, to > resolve them all at once; but we have chosen not to do so in the > past, and there's no need for that, if the zmem gets invalidated > and the swapcache page set dirty. Yes, it could be better, but as I mentioned above, it couldn't handle the zram case. If there is a solution for zram, I will be happy. :) And another point: frontswap is already percolated into the swap subsystem very tightly, so I doubt adding one more hook is really a problem. Thanks for the great comment, Hugh! > > Hugh > > > So let's add Ccing Shaohua and Hugh. > > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > > or something for z* family. > > > > Other problem is zram is block device so that it can set SWP_INMEMORY > > or SWP_SOLIDSTATE easily(ie,
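For context on the hook being discussed: swap_slot_free_notify() in block_device_operations is the one narrow channel through which the swap layer already tells a block driver that a slot has been freed, and zram implements it to drop its compressed copy; it is the same hook Dan's page_io.c snippet above calls on swap-in. Roughly, paraphrased from the staging-era driver with locking and statistics elided (a sketch, not the literal source):

	static void zram_slot_free_notify(struct block_device *bdev,
					  unsigned long index)
	{
		struct zram *zram = bdev->bd_disk->private_data;

		/* Swap says slot 'index' is unused: free the compressed page. */
		zram_free_page(zram, index);
	}

This is why the end_swap_bio_read() approach can cover zram without zram needing any further knowledge of what sits above it.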
RE: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Dan Magenheimer wrote: > > From: Hugh Dickins [mailto:hu...@google.com] > > Subject: Re: [RFC] mm: remove swapcache page early > > > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > > > Swap subsystem does lazy swap slot free with expecting the page > > > would be swapped out again so we can't avoid unnecessary write. > > so we can avoid unnecessary write. > > > > > > But the problem in in-memory swap is that it consumes memory space > > > until vm_swap_full(ie, used half of all of swap device) condition > > > meet. It could be bad if we use multiple swap device, small in-memory swap > > > and big storage swap or in-memory swap alone. > > > > That is a very good realization: it's surprising that none of us > > thought of it before - no disrespect to you, well done, thank you. > > Yes, my compliments also Minchan. This problem has been thought of before > but this patch is the first to identify a possible solution. > > > And I guess swap readahead is utterly unhelpful in this case too. > > Yes... as is any "swap writeahead". Excuse my ignorance, but I > think this is not done in the swap subsystem but instead the kernel > assumes write-coalescing will be done in the block I/O subsystem, > which means swap writeahead would affect zram but not zcache/zswap > (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. > > However I think a swap-readahead solution would be helpful to > zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). > > > > This patch changes vm_swap_full logic slightly so it could free > > > swap slot early if the backed device is really fast. > > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > > > But I strongly disagree with almost everything in your patch :) > > I disagree with addressing it in vm_swap_full(), I disagree that > > it can be addressed by device, I disagree that it has anything to > > do with SWP_SOLIDSTATE. > > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > > is it? In those cases, a fixed amount of memory has been set aside > > for swap, and it works out just like with disk block devices. The > > memory set aside may be wasted, but that is accepted upfront. > > It is (I believe) also a problem with swapping to ram. Two > copies of the same page are kept in memory in different places, > right? Fixed vs variable size is irrelevant I think. Or am > I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. > > > Similarly, this is not a problem with swapping to SSD. There might > > or might not be other reasons for adjusting the vm_swap_full() logic > > for SSD or generally, but those have nothing to do with this issue. > > I think it is at least highly related. 
The key issue is the > tradeoff of the likelihood that the page will soon be read/written > again while it is in swap cache vs the time/resource-usage necessary > to "reconstitute" the page into swap cache. Reconstituting from disk > requires a LOT of elapsed time. Reconstituting from > an SSD likely takes much less time. Reconstituting from > zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. > > > The problem here is peculiar to frontswap, and the variably sized > > memory behind it, isn't it? We are accustomed to using swap to free > > up memory by transferring its data to some other, cheaper but slower > > resource. > > Frontswap does make the problem more complex because some pages > are in "fairly fast" storage (zcache, needs decompression) and > some are on the actual (usually) rotating med
RE: [RFC] mm: remove swapcache page early
> From: Hugh Dickins [mailto:hu...@google.com] > Subject: Re: [RFC] mm: remove swapcache page early > > On Wed, 27 Mar 2013, Minchan Kim wrote: > > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > so we can avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > That is a very good realization: it's surprising that none of us > thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. > And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any "swap writeahead". Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > But I strongly disagree with almost everything in your patch :) > I disagree with addressing it in vm_swap_full(), I disagree that > it can be addressed by device, I disagree that it has anything to > do with SWP_SOLIDSTATE. > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0, > is it? In those cases, a fixed amount of memory has been set aside > for swap, and it works out just like with disk block devices. The > memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? > Similarly, this is not a problem with swapping to SSD. There might > or might not be other reasons for adjusting the vm_swap_full() logic > for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to "reconstitute" the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. > The problem here is peculiar to frontswap, and the variably sized > memory behind it, isn't it? We are accustomed to using swap to free > up memory by transferring its data to some other, cheaper but slower > resource. Frontswap does make the problem more complex because some pages are in "fairly fast" storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). 
> But in the case of frontswap and zmem (I'll say that to avoid thinking > through which backends are actually involved), it is not a cheaper and > slower resource, but the very same memory we are trying to save: swap > is stolen from the memory under reclaim, so any duplication becomes > counter-productive (if we ignore cpu compression/decompression costs: > I have no idea how fair it is to do so, but anyone who chooses zmem > is prepared to pay some cpu price for that). Exactly. There is some "robbing of Peter to pay Paul" and other complex resource tradeoffs. Presumably, though, it is not "the very same memory we are trying to save" but a fraction of it, saving the same page of data more efficiently in memory, using less than a page, at some CPU cost. > And because it's a frontswap thing, we cannot decide this by device: > frontswap may or may not stand in front of each device. There is no > problem with swapcache duplicated on disk (until that area approaches > being full or fragmented), but at the higher level we cannot see what > is in zmem and what is on disk: we only want to free up the zmem dup. I *think* frontswap_test(page) resolves this problem, as long as we have a specific page available to use as a parameter. > I believe the answer is for frontswap/zmem to invalidate the frontswa
Re: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. And I guess swap readahead is utterly unhelpful in this case too. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). We cannot rely on freeing the swap itself, because in general there may be multiple references to the swap, and we only satisfy the one which has faulted. It may or may not be a good idea to use rmap to locate the other places to insert pte in place of swap entry, to resolve them all at once; but we have chosen not to do so in the past, and there's no need for that, if the zmem gets invalidated and the swapcache page set dirty. Hugh > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. 
> > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? > > Other optimize point is we remove it unconditionally when we > found it's exclusive when swap in happen. > It could help frontswap family, too. > What do you think about it? > > Cc: Hugh Dickins <hu...@google.com> > Cc: Dan Magenheimer <dan.magenhei...@oracle.com> > Cc: Seth Jennings <sjenn...@linux.vnet.ibm.com> > Cc: Nitin Gupta <ngu...@vflare.org> > Cc: Konrad Rzeszutek Wilk <kon...@darnok.org> > Cc: Shaohua Li <s...@kernel.org> > Signed-off-by: Minchan Kim <minc...@kernel.org> > --- > include/linux/swap.h | 11 --- > mm/memory.c | 3 ++- > mm/swapfile.c | 11 +++ > mm/vmscan.c | 2 +- > 4 files changed, 18 insertions(+), 9 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2818a12..1f4df66 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, > extern atomic_long_t nr_swap_pages; > extern long total_swap_pages; > > -/* Swap 50% full? Release swapcache more aggressively.. */ > -static inline bool vm_swap_full(void) > +/* > + * Swap 50% full or fast backed device? > + * Release swapcache more aggressively. > + */ > +static inline bool vm_swap_full(struct swap_info_struct *si) > { > + if (si->flags & SWP_SOLIDSTATE) > + return true; > return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; > }
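To make the signature change concrete for callers (the mm/memory.c, mm/swapfile.c and mm/vmscan.c hunks are not visible above), here is a hedged sketch of how the do_swap_page() call site might look afterwards, assuming the existing test there is the familiar vm_swap_full() || VM_LOCKED || PageMlocked() one:

	/* mm/memory.c, do_swap_page(): sketch of the adjusted call site */
	swap_free(entry);
	if (vm_swap_full(page_swap_info(page)) ||
	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
		try_to_free_swap(page);

With SWP_SOLIDSTATE set on the backing device the condition is now always true, so try_to_free_swap() drops the slot and the swapcache copy as soon as the page is swapped back in (provided nothing else still references the slot).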
RE: [RFC] mm: remove swapcache page early
> From: Minchan Kim [mailto:minc...@kernel.org] > Subject: [RFC] mm: remove swapcache page early > > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. > > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? > > Other optimize point is we remove it unconditionally when we > found it's exclusive when swap in happen. > It could help frontswap family, too. By passing a struct page * to vm_swap_full() you can then call frontswap_test()... if it returns true, then vm_swap_full can return true. Note that this precisely checks whether the page is in zcache/zswap or not, so Seth's concern that some pages may be in-memory and some may be in rotating storage is no longer an issue. > What do you think about it? By removing the page from swapcache, you are now increasing the risk that pages will "thrash" between uncompressed state (in swapcache) and compressed state (in z*). I think this is a better tradeoff though than keeping a copy of both the compressed page AND the uncompressed page in memory. You should probably rename vm_swap_full() because you are now overloading it with other meanings. Maybe vm_swap_reclaimable()? Do you have any measurements? I think you are correct that it may help a LOT. Thanks, Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
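A sketch of what this suggestion might look like, purely hypothetical and not from any tree, assuming frontswap_test() keeps its 3.9-era form taking a swap_info_struct and an offset rather than a page, and that the page passed in is a swapcache page so page_private() holds its swp_entry_t:

	/* Hypothetical replacement for vm_swap_full(), per Dan's suggestion. */
	static inline bool vm_swap_reclaimable(struct page *page)
	{
		swp_entry_t entry = { .val = page_private(page) };
		struct swap_info_struct *sis = page_swap_info(page);

		/* Data is held compressed by zcache/zswap: drop the dup early. */
		if (frontswap_test(sis, swp_offset(entry)))
			return true;

		/* Otherwise keep the historical "swap more than half full" rule. */
		return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
	}

Whether the extra lookup on every call is acceptable is exactly the cost Minchan worries about elsewhere in the thread.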
Re: [RFC] mm: remove swapcache page early
On 03/26/2013 09:22 PM, Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. Great idea! > For it, I used SWP_SOLIDSTATE but It might be controversial. The comment for SWP_SOLIDSTATE is that "blkdev seeks are cheap". Just because seeks are cheap doesn't mean the read itself is also cheap. For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of them can be pretty slow. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. Afaict, setting SWP_SOLIDSTATE depends on characteristics of the underlying block device (i.e. blk_queue_nonrot()). zram is a block device but zcache and zswap are not. Any idea by what criteria SWP_INMEMORY would be set? Also, frontswap backends (zcache and zswap) are a caching layer on top of the real swap device, which might actually be rotating media. So you have the issue of two different characteristics, in-memory caching on top of rotating media, present in a single swap device. Thanks, Seth
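For reference, the flag Seth describes is derived at swapon time from the queue's non-rotational bit, roughly like this (simplified from the mm/swapfile.c of this era; treat the surrounding details as assumptions):

	/* sys_swapon(): SWP_SOLIDSTATE simply mirrors QUEUE_FLAG_NONROT. */
	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev)))
		p->flags |= SWP_SOLIDSTATE;

	/*
	 * A hypothetical SWP_INMEMORY, as proposed in this thread, would need
	 * a different signal here: "non-rotational" is true for SSDs and slow
	 * MMC parts as well as for zram, which is exactly the objection above,
	 * and zcache/zswap never reach this code at all since they are not
	 * block devices.
	 */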
Re: [RFC] mm: remove swapcache page early
Hi Kame, On Wed, Mar 27, 2013 at 02:15:41PM +0900, Kamezawa Hiroyuki wrote: > (2013/03/27 11:22), Minchan Kim wrote: > > Swap subsystem does lazy swap slot free with expecting the page > > would be swapped out again so we can't avoid unnecessary write. > > > > But the problem in in-memory swap is that it consumes memory space > > until vm_swap_full(ie, used half of all of swap device) condition > > meet. It could be bad if we use multiple swap device, small in-memory swap > > and big storage swap or in-memory swap alone. > > > > This patch changes vm_swap_full logic slightly so it could free > > swap slot early if the backed device is really fast. > > For it, I used SWP_SOLIDSTATE but It might be controversial. > > So let's add Ccing Shaohua and Hugh. > > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > > or something for z* family. > > > > Other problem is zram is block device so that it can set SWP_INMEMORY > > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > > I have no idea to use it for frontswap. > > > > Any idea? > > > Another thinking: in what case, in what system configuration, > vm_swap_full() should return false and delay swp_entry freeing ? It's a really good question I have had in mind for a long time. If I catch your point properly, your question is "Couldn't we remove the vm_swap_full logic entirely?" If so, the answer is "I have no idea and would like to ask Hugh". Academically, it does make sense: a swapped-out page is unlikely to be part of the working set, so it may well be swapped out again unchanged, and I believe the logic was merged because some workload at the time was helped by it. And I think it's not easy to prove it useless these days, because I don't have every recent workload in the world at hand, so I'd like to avoid such an adventure. :) Thanks. > > Thanks, > -Kame -- Kind regards, Minchan Kim
Re: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. And I guess swap readahead is utterly unhelpful in this case too. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). We cannot rely on freeing the swap itself, because in general there may be multiple references to the swap, and we only satisfy the one which has faulted. It may or may not be a good idea to use rmap to locate the other places to insert pte in place of swap entry, to resolve them all at once; but we have chosen not to do so in the past, and there's no need for that, if the zmem gets invalidated and the swapcache page set dirty. Hugh So let's add Ccing Shaohua and Hugh. If it's a problem for SSD, I'd like to create new type SWP_INMEMORY or something for z* family. 
Other problem is zram is block device so that it can set SWP_INMEMORY or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but I have no idea to use it for frontswap. Any idea? Other optimize point is we remove it unconditionally when we found it's exclusive when swap in happen. It could help frontswap family, too. What do you think about it? Cc: Hugh Dickins hu...@google.com Cc: Dan Magenheimer dan.magenhei...@oracle.com Cc: Seth Jennings sjenn...@linux.vnet.ibm.com Cc: Nitin Gupta ngu...@vflare.org Cc: Konrad Rzeszutek Wilk kon...@darnok.org Cc: Shaohua Li s...@kernel.org Signed-off-by: Minchan Kim minc...@kernel.org --- include/linux/swap.h | 11 --- mm/memory.c | 3 ++- mm/swapfile.c | 11 +++ mm/vmscan.c | 2 +- 4 files changed, 18 insertions(+), 9 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 2818a12..1f4df66 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, extern atomic_long_t nr_swap_pages; extern long total_swap_pages; -/* Swap 50% full? Release swapcache more aggressively.. */ -static inline bool vm_swap_full(void) +/* + * Swap 50% full or fast backed device? + * Release swapcache more aggressively. + */ +static inline bool vm_swap_full(struct swap_info_struct *si) { + if (si->flags & SWP_SOLIDSTATE) + return
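For reference, a minimal sketch of the swap-in handling Hugh suggests above: drop the zmem copy and dirty the swapcache page so nothing later tries to read the backing slot that was never written. This is hypothetical placement, not a posted patch; frontswap_invalidate_page() and SetPageDirty() are the existing helpers, everything else is illustrative.

	/*
	 * On the swapin path, once the page is in swapcache and uptodate.
	 * Here page is the PageSwapCache page just read, entry its
	 * swp_entry_t, si/offset its swap device and slot.
	 */
	if (frontswap_test(si, offset)) {
		/* free the compressed copy held by zcache/zswap */
		frontswap_invalidate_page(swp_type(entry), offset);
		/* never read back from the on-disk slot, which was never written */
		SetPageDirty(page);
	}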
RE: [RFC] mm: remove swapcache page early
From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more complex because some pages are in fairly fast storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). 
But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). Exactly. There is some robbing of Peter to pay Paul and other complex resource tradeoffs. Presumably, though, it is not the very same memory we are trying to save but a fraction of it, saving the same page of data more efficiently in memory, using less than a page, at some CPU cost. And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. I *think* frontswap_test(page) resolves this problem, as long as we have a specific page available to use as a parameter. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). There are two duplication
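The "table lookup" Dan points to is essentially a test_bit() on the per-device frontswap bitmap; the helper in include/linux/frontswap.h of that period looks roughly like the following (reproduced from memory, so treat it as approximate):

	static inline int frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
	{
		int ret = 0;

		if (frontswap_enabled && sis->frontswap_map)
			ret = test_bit(offset, sis->frontswap_map);
		return ret;
	}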
RE: [RFC] mm: remove swapcache page early
On Wed, 27 Mar 2013, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. 
Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more complex because some pages are in fairly fast storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do
Re: [RFC] mm: remove swapcache page early
Hi Hugh, On Wed, Mar 27, 2013 at 02:41:07PM -0700, Hugh Dickins wrote: On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. And I guess swap readahead is utterly unhelpful in this case too. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The Brd is okay but it seems you are miunderstanding zram. The zram doesn't reserve any memory and allocate dynamic memory when swap out happens so it can make duplicate space in pusdo block device and memory. memory set aside may be wasted, but that is accepted upfront. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. Yes. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free Zram, too. up memory by transferring its data to some other, cheaper but slower resource. But in the case of frontswap and zmem (I'll say that to avoid thinking Frankly speaking, I couldn't understand what you means, frontswap and zmem. The frontswap is just layer for hook the swap subsystem. Real instance of frontswap is zcache and zswap at the moment. I will understand them as zcache and zswap. Okay? through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). Agree. And because it's a frontswap thing, we cannot decide this by device: frontswap may or may not stand in front of each device. There is no problem with swapcache duplicated on disk (until that area approaches being full or fragmented), but at the higher level we cannot see what is in zmem and what is on disk: we only want to free up the zmem dup. That's what I really have a concern and why I begged idea. I believe the answer is for frontswap/zmem to invalidate the frontswap copy of the page (to free up the compressed memory when possible) and SetPageDirty on the PageUptodate PageSwapCache page when swapping in (setting page dirty so nothing will later go to read it from the unfreed location on backing swap disk, which was never written). 
You mean that zcache and zswap have to do garbage collection by some policy? It could be, but how about zram? It's just a pseudo block device and has no knowledge of what sits on top of it; that could be swap or any ordinary block device user. I mean zram has no swap-specific information to act on. We cannot rely on freeing the swap itself, because in general there may be multiple references to the swap, and we only satisfy the one which has faulted. It may or may not be a good idea to use rmap to locate the other places to insert pte in place of swap entry, to resolve them all at once; but we have chosen not to do so in the past, and there's no need for that, if the zmem gets invalidated and the swapcache page set dirty. Yes, it could be better, but as I mentioned above, it can't handle the zram case. If there is a solution for zram, I will be happy. :) And another point: frontswap is already percolated into the swap subsystem very tightly, so I doubt adding one more hook is really a problem. Thanks for the great comments, Hugh! Hugh So let's add Ccing Shaohua and Hugh. If it's a problem for SSD, I'd like to create new type SWP_INMEMORY or something for z* family. Other problem is zram is block device so that it can set SWP_INMEMORY or SWP_SOLIDSTATE easily (ie, actually, zram is already done) but I have no idea to use it for frontswap. Any idea?
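To make the "zram is already done" point concrete: zram marks its request queue non-rotational, and swapon() translates that into a per-device SWP_SOLIDSTATE, roughly as below (paraphrased from the zram driver and mm/swapfile.c of that period; exact lines may differ). zcache and zswap sit behind whatever the real swap device is, so they get no equivalent per-device flag, which is the gap Minchan is pointing at.

	/* drivers/staging/zram/zram_drv.c: zram advertises non-rotational media */
	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue);

	/* mm/swapfile.c, swapon path: the swap layer picks the hint up */
	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev)))
		p->flags |= SWP_SOLIDSTATE;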
Re: [RFC] mm: remove swapcache page early
Hi Dan, On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. Thanks! And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). Frankly speaking, I don't know why you mentioned swap writeahead in this point. Anyway, I dobut how it effect zram, too. A gain I can have a mind is compress ratio would be high thorough multiple page compression all at once. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Hmm, why? swap-readahead is just hint to reduce big stall time to reduce on big seek overhead storage. But in-memory swap is no cost for seeking. So unnecessary swap-readahead can make memory pressure high and it could cause another page swap out so it could be swap-thrashing. And for good swap-readahead hit ratio, swap device shouldn't be fragmented. But as you know, there are many factor to prevent it in the kernel now and Shaohua is tackling on it. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. 
Reconstituting from zcache/zram takes thousands of CPU cycles. Yeb. That's why I wanted to use SWP_SOLIDSTATE. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more complex because some pages are in fairly fast storage (zcache, needs decompression) and some are on the actual (usually) rotating media. Fortunately, differentiating between these two cases is just a table lookup (see frontswap_test). Yeb, I thouht it could be a last resort because I'd like to avoid lookup every swapin if possible. But in the case of frontswap and zmem (I'll say that to avoid thinking through which backends are actually involved), it is not a cheaper and slower resource, but the very same memory we are trying to save: swap is stolen from the memory under reclaim, so any duplication becomes counter-productive (if we ignore cpu compression/decompression costs: I have no idea how fair it is to do so, but anyone who chooses zmem is prepared to pay some cpu price for that). Exactly. There is some robbing of Peter to pay Paul and other complex resource tradeoffs. Presumably, though, it is not the very
Re: [RFC] mm: remove swapcache page early
On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: On Wed, 27 Mar 2013, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). But it could evict more important pages before reaching out the tail. That's thing we really want to avoid if possible. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. I think it is at least highly related. 
The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. Hmm, It seems I misunderstood Dan's opinion in previous thread. You're right, Hugh. My main concern is memory usage but the rationale I used SWP_SOLIDSTATE is writing on SSD could be cheap rather than storage. Yeb, it depends on SSD's internal's FTL algorith and fragment ratio due to wear-leveling. That's why I said It might be controversial. The problem here is peculiar to frontswap, and the variably sized memory behind it, isn't it? We are accustomed to using swap to free up memory by transferring its data to some other, cheaper but slower resource. Frontswap does make the problem more
Re: [RFC] mm: remove swapcache page early
Hi Seth, On Wed, Mar 27, 2013 at 12:19:11PM -0500, Seth Jennings wrote: On 03/26/2013 09:22 PM, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. Great idea! Thanks! For it, I used SWP_SOLIDSTATE but It might be controversial. The comment for SWP_SOLIDSTATE is that blkdev seeks are cheap. Just because seeks are cheap doesn't mean the read itself is also cheap. The read isn't the concern; the write is. For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of them can be pretty slow. Yeb. So let's add Ccing Shaohua and Hugh. If it's a problem for SSD, I'd like to create new type SWP_INMEMORY or something for z* family. Afaict, setting SWP_SOLIDSTATE depends on characteristics of the underlying block device (i.e. blk_queue_nonrot()). zram is a block device but zcache and zswap are not. Any idea by what criteria SWP_INMEMORY would be set? Just in-memory swap, zram, zswap and zcache at the moment. :) Also, frontswap backends (zcache and zswap) are a caching layer on top of the real swap device, which might actually be rotating media. So you have the issue of two different characteristics, in-memory caching on top of rotating media, present in a single swap device. Please read my patch completely. I already pointed out the problem, and Hugh and Dan are suggesting ideas. Thanks! Thanks, Seth -- Kind regards, Minchan Kim
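For the record, SWP_INMEMORY never landed upstream. A purely hypothetical sketch of what is being discussed here (the flag name comes from this thread, the bit value is invented for illustration):

	/* include/linux/swap.h: a new entry alongside the existing SWP_* flags */
	SWP_INMEMORY	= (1 << 10),	/* backing store is itself RAM: zram, zcache, zswap */

	/* ...and the early-release test keyed on it instead of SWP_SOLIDSTATE */
	static inline bool vm_swap_full(struct swap_info_struct *si)
	{
		if (si->flags & SWP_INMEMORY)
			return true;
		return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
	}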
Re: [RFC] mm: remove swapcache page early
On Thu, Mar 28, 2013 at 10:18:24AM +0900, Minchan Kim wrote: On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote: On Wed, 27 Mar 2013, Dan Magenheimer wrote: From: Hugh Dickins [mailto:hu...@google.com] Subject: Re: [RFC] mm: remove swapcache page early On Wed, 27 Mar 2013, Minchan Kim wrote: Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can't avoid unnecessary write. so we can avoid unnecessary write. But the problem in in-memory swap is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. That is a very good realization: it's surprising that none of us thought of it before - no disrespect to you, well done, thank you. Yes, my compliments also Minchan. This problem has been thought of before but this patch is the first to identify a possible solution. And I guess swap readahead is utterly unhelpful in this case too. Yes... as is any swap writeahead. Excuse my ignorance, but I think this is not done in the swap subsystem but instead the kernel assumes write-coalescing will be done in the block I/O subsystem, which means swap writeahead would affect zram but not zcache/zswap (since frontswap subverts the block I/O subsystem). I don't know what swap writeahead is; but write coalescing, yes. I don't see any problem with it in this context. However I think a swap-readahead solution would be helpful to zram as well as zcache/zswap. Whereas swap readahead on zmem is uncompressing zmem to pagecache which may never be needed, and may take a circuit of the inactive LRU before it gets reclaimed (if it turns out not to be needed, at least it will remain clean and be easily reclaimed). But it could evict more important pages before reaching out the tail. That's thing we really want to avoid if possible. This patch changes vm_swap_full logic slightly so it could free swap slot early if the backed device is really fast. For it, I used SWP_SOLIDSTATE but It might be controversial. But I strongly disagree with almost everything in your patch :) I disagree with addressing it in vm_swap_full(), I disagree that it can be addressed by device, I disagree that it has anything to do with SWP_SOLIDSTATE. This is not a problem with swapping to /dev/ram0 or to /dev/zram0, is it? In those cases, a fixed amount of memory has been set aside for swap, and it works out just like with disk block devices. The memory set aside may be wasted, but that is accepted upfront. It is (I believe) also a problem with swapping to ram. Two copies of the same page are kept in memory in different places, right? Fixed vs variable size is irrelevant I think. Or am I misunderstanding something about swap-to-ram? I may be misrembering how /dev/ram0 works, or simply assuming that if you want to use it for swap (interesting for testing, but probably not for general use), then you make sure to allocate each page of it in advance. The pages of /dev/ram0 don't get freed, or not before it's closed (swapoff'ed) anyway. Yes, swapcache would be duplicating data from other memory into /dev/ram0 memory; but that /dev/ram0 memory has been set aside for this purpose, and removing from swapcache won't free any more memory. Similarly, this is not a problem with swapping to SSD. There might or might not be other reasons for adjusting the vm_swap_full() logic for SSD or generally, but those have nothing to do with this issue. 
I think it is at least highly related. The key issue is the tradeoff of the likelihood that the page will soon be read/written again while it is in swap cache vs the time/resource-usage necessary to reconstitute the page into swap cache. Reconstituting from disk requires a LOT of elapsed time. Reconstituting from an SSD likely takes much less time. Reconstituting from zcache/zram takes thousands of CPU cycles. I acknowledge my complete ignorance of how to judge the tradeoff between memory usage and cpu usage, but I think Minchan's main concern was with the memory usage. Neither hard disk nor SSD is occupying memory. Hmm, It seems I misunderstood Dan's opinion in previous thread. You're right, Hugh. My main concern is memory usage but the rationale I used SWP_SOLIDSTATE is writing on SSD could be cheap rather than storage. Yeb, it depends on SSD's internal's FTL algorith and fragment ratio due to wear-leveling. That's why I said It might be controversial. Even SSD is fast, there is tradeoff. And unncessary write to SSD should be avoided if possible, because write makes
Re: [RFC] mm: remove swapcache page early
(2013/03/27 11:22), Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. > > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? Another thought: in what case, in what system configuration, should vm_swap_full() return false and delay swp_entry freeing? Thanks, -Kame
Re: [RFC] mm: remove swapcache page early
Hi, On Wed, Mar 27, 2013 at 11:22 AM, Minchan Kim wrote: > Swap subsystem does lazy swap slot free with expecting the page > would be swapped out again so we can't avoid unnecessary write. > > But the problem in in-memory swap is that it consumes memory space > until vm_swap_full(ie, used half of all of swap device) condition > meet. It could be bad if we use multiple swap device, small in-memory swap > and big storage swap or in-memory swap alone. > > This patch changes vm_swap_full logic slightly so it could free > swap slot early if the backed device is really fast. > For it, I used SWP_SOLIDSTATE but It might be controversial. > So let's add Ccing Shaohua and Hugh. > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY > or something for z* family. I perfer to add new SWP_INMEMORY for z* family. as you know SSD and memory is different characteristics. and if new type is added, it doesn't need to modify lots of codes. Do you have any data for it? do you get meaningful performance gain or efficiency of z* family? If yes, please share it. Thank you, Kyungmin Park > > Other problem is zram is block device so that it can set SWP_INMEMORY > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but > I have no idea to use it for frontswap. > > Any idea? > > Other optimize point is we remove it unconditionally when we > found it's exclusive when swap in happen. > It could help frontswap family, too. > What do you think about it? > > Cc: Hugh Dickins > Cc: Dan Magenheimer > Cc: Seth Jennings > Cc: Nitin Gupta > Cc: Konrad Rzeszutek Wilk > Cc: Shaohua Li > Signed-off-by: Minchan Kim > --- > include/linux/swap.h | 11 --- > mm/memory.c | 3 ++- > mm/swapfile.c| 11 +++ > mm/vmscan.c | 2 +- > 4 files changed, 18 insertions(+), 9 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2818a12..1f4df66 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, > extern atomic_long_t nr_swap_pages; > extern long total_swap_pages; > > -/* Swap 50% full? Release swapcache more aggressively.. */ > -static inline bool vm_swap_full(void) > +/* > + * Swap 50% full or fast backed device? > + * Release swapcache more aggressively. 
> + */ > +static inline bool vm_swap_full(struct swap_info_struct *si) > { > + if (si->flags & SWP_SOLIDSTATE) > + return true; > return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; > } > > @@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, > swp_entry_t ent, bool swapout) > #define get_nr_swap_pages() 0L > #define total_swap_pages 0L > #define total_swapcache_pages() 0UL > -#define vm_swap_full() 0 > +#define vm_swap_full(si) 0 > > #define si_swapinfo(val) \ > do { (val)->freeswap = (val)->totalswap = 0; } while (0) > diff --git a/mm/memory.c b/mm/memory.c > index 705473a..1ca21a9 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm, struct > vm_area_struct *vma, > mem_cgroup_commit_charge_swapin(page, ptr); > > swap_free(entry); > - if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || > PageMlocked(page)) > + if (likely(PageSwapCache(page)) && (vm_swap_full(page_swap_info(page)) > + || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))) > try_to_free_swap(page); > unlock_page(page); > if (page != swapcache) { > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 1bee6fa..f9cc701 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -293,7 +293,7 @@ checks: > scan_base = offset = si->lowest_bit; > > /* reuse swap entry of cache-only swap if not busy. */ > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { > + if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) { > int swap_was_freed; > spin_unlock(&si->lock); > swap_was_freed = __try_to_reclaim_swap(si, offset); > @@ -382,7 +382,8 @@ scan: > spin_lock(&si->lock); > goto checks; > } > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) > { > + if (vm_swap_full(si) && > + si->swap_map[offset] == SWAP_HAS_CACHE) { > spin_lock(&si->lock); > goto checks; > } > @@ -397,7 +398,8 @@ scan: > spin_lock(&si->lock); > goto checks; > } > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) > { > + if (vm_swap_full(si) && > +
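A note on the do_swap_page() hunk above: vm_swap_full(page_swap_info(page)) is only meaningful while the page is still in swapcache, which is why the patch adds the likely(PageSwapCache(page)) guard. page_swap_info() resolves the owning swap device roughly as follows (paraphrased from mm/swapfile.c; details may differ slightly):

	struct swap_info_struct *page_swap_info(struct page *page)
	{
		swp_entry_t swap = { .val = page_private(page) };

		BUG_ON(!PageSwapCache(page));
		return swap_info[swp_type(swap)];
	}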