Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-02-02 Thread Johannes Weiner
On Thu, Feb 02, 2017 at 02:14:10PM +0900, Minchan Kim wrote:
> Hi Johannes,
> 
> On Tue, Jan 31, 2017 at 04:38:10PM -0500, Johannes Weiner wrote:
> > On Tue, Jan 31, 2017 at 11:45:47AM -0800, Shaohua Li wrote:
> > > On Tue, Jan 31, 2017 at 01:59:49PM -0500, Johannes Weiner wrote:
> > > > Hi Shaohua,
> > > > 
> > > > On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> > > > > > We are trying to use MADV_FREE in jemalloc. Several issues were
> > > > > > found. Without solving them, jemalloc can't use the MADV_FREE
> > > > > > feature.
> > > > > > - It doesn't support systems without swap enabled. If swap is off,
> > > > > >   we can't, or can't efficiently, age anonymous pages. And since
> > > > > >   MADV_FREE pages are mixed with other anonymous pages, we can't
> > > > > >   reclaim MADV_FREE pages. In the current implementation, MADV_FREE
> > > > > >   falls back to MADV_DONTNEED when swap is not enabled. But in our
> > > > > >   environment, a lot of machines don't enable swap. This prevents
> > > > > >   our setup from using MADV_FREE.
> > > > > > - It increases memory pressure. Page reclaim is biased toward
> > > > > >   reclaiming file pages over anonymous pages. This doesn't make
> > > > > >   sense for MADV_FREE pages, because those pages could be freed
> > > > > >   easily and refilled with a very slight penalty. Even if page
> > > > > >   reclaim didn't favor file pages, there would still be an issue,
> > > > > >   because MADV_FREE pages and other anonymous pages are mixed
> > > > > >   together. To reclaim a MADV_FREE page, we probably must scan a
> > > > > >   lot of other anonymous pages, which is inefficient. In our test,
> > > > > >   we usually see OOM with MADV_FREE enabled and nothing without it.
> > > > 
> > > > Fully agreed, the anon LRU is a bad place for these pages.
> > > > 
> > > > > > For the first two issues, introducing a new LRU list for MADV_FREE
> > > > > > pages could solve them. We can directly reclaim MADV_FREE pages
> > > > > > without writing them out to swap, so the first issue could be
> > > > > > fixed. If only MADV_FREE pages are in the new list, page reclaim
> > > > > > can easily reclaim such pages without interference from file or
> > > > > > anonymous pages. The memory pressure issue will disappear.
> > > > 
> > > > Do we actually need a new page flag and a special LRU for them? These
> > > > pages are basically like clean cache pages at that point. What do you
> > > > think about clearing their PG_swapbacked flag on MADV_FREE and moving
> > > > them to the inactive file list? The way isolate+putback works should
> > > > not even need much modification, something like clear_page_mlock().
> > > > 
> > > > When the reclaim scanner finds anon && dirty && !swapbacked, it can
> > > > again set PG_swapbacked and goto keep_locked to move the page back
> > > > into the anon LRU to get reclaimed according to swapping rules.
> > > 
> > > > Interesting idea! Not sure though: the MADV_FREE pages are actually
> > > > anonymous pages, so this will introduce confusion. On the other
> > > > hand, if the MADV_FREE pages are mixed with inactive file pages,
> > > > page reclaim needs to reclaim a lot of file pages before
> > > > reclaiming the MADV_FREE pages. This doesn't look good. The point
> > > > of a separate LRU is to avoid scanning other anon/file pages.
> > 
> > The LRU code and the rest of VM already use independent page type
> > distinctions. That's because shmem pages are !PageAnon - they have a
> > page->mapping that points to a real address space, not an anon_vma -
> > but they are swapbacked and thus go through the anon LRU. This would
> > just do the reverse: put PageAnon pages on the file LRU when they
> > don't contain valid data and are thus not swapbacked.
> > 
> > As far as mixing with inactive file pages goes, it'd be possible to
> > link the MADV_FREE pages to the tail of the inactive list, rather than
> > the head. That said, I'm not sure reclaiming use-once filesystem cache
> > before MADV_FREE is such a bad policy. MADV_FREE retains the vmas for
> > the sole purpose of reusing them in the (near) future. That is
> > actually a stronger reuse signal than we have for use-once file pages.
> > If somebody does continuous writes to a logfile or a one-off search
> > through one or more files, we should actually reclaim that cache
> > before we go after MADV_FREE pages that are temporarily invalidated.
> 
> Yes, we should be careful on this issue. It was the main arguable
> point. How about moving them to the head of the inactive file list,
> not the tail, if we want to go with the inactive file LRU?
> 
> With that, the VM tries to reclaim file pages first from the tail
> of the list, and if the reclaimed pages were part of the workingset,
> they could be activated by workingset_refault. Otherwise, we can
> discard use-once pages without purging *madv_free* pages, so I
> think it's a good compromise.
> 
> What do you think?

That's what I tried to 

Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-02-01 Thread Minchan Kim
Hi Johannes,

On Tue, Jan 31, 2017 at 04:38:10PM -0500, Johannes Weiner wrote:
> On Tue, Jan 31, 2017 at 11:45:47AM -0800, Shaohua Li wrote:
> > On Tue, Jan 31, 2017 at 01:59:49PM -0500, Johannes Weiner wrote:
> > > Hi Shaohua,
> > > 
> > > On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> > > > We are trying to use MADV_FREE in jemalloc. Several issues were
> > > > found. Without solving them, jemalloc can't use the MADV_FREE
> > > > feature.
> > > > - It doesn't support systems without swap enabled. If swap is off,
> > > >   we can't, or can't efficiently, age anonymous pages. And since
> > > >   MADV_FREE pages are mixed with other anonymous pages, we can't
> > > >   reclaim MADV_FREE pages. In the current implementation, MADV_FREE
> > > >   falls back to MADV_DONTNEED when swap is not enabled. But in our
> > > >   environment, a lot of machines don't enable swap. This prevents
> > > >   our setup from using MADV_FREE.
> > > > - It increases memory pressure. Page reclaim is biased toward
> > > >   reclaiming file pages over anonymous pages. This doesn't make
> > > >   sense for MADV_FREE pages, because those pages could be freed
> > > >   easily and refilled with a very slight penalty. Even if page
> > > >   reclaim didn't favor file pages, there would still be an issue,
> > > >   because MADV_FREE pages and other anonymous pages are mixed
> > > >   together. To reclaim a MADV_FREE page, we probably must scan a
> > > >   lot of other anonymous pages, which is inefficient. In our test,
> > > >   we usually see OOM with MADV_FREE enabled and nothing without it.
> > > 
> > > Fully agreed, the anon LRU is a bad place for these pages.
> > > 
> > > > For the first two issues, introducing a new LRU list for MADV_FREE
> > > > pages could solve them. We can directly reclaim MADV_FREE pages
> > > > without writing them out to swap, so the first issue could be
> > > > fixed. If only MADV_FREE pages are in the new list, page reclaim
> > > > can easily reclaim such pages without interference from file or
> > > > anonymous pages. The memory pressure issue will disappear.
> > > 
> > > Do we actually need a new page flag and a special LRU for them? These
> > > pages are basically like clean cache pages at that point. What do you
> > > think about clearing their PG_swapbacked flag on MADV_FREE and moving
> > > them to the inactive file list? The way isolate+putback works should
> > > not even need much modification, something like clear_page_mlock().
> > > 
> > > When the reclaim scanner finds anon && dirty && !swapbacked, it can
> > > again set PG_swapbacked and goto keep_locked to move the page back
> > > into the anon LRU to get reclaimed according to swapping rules.
> > 
> > Interesting idea! Not sure though: the MADV_FREE pages are actually
> > anonymous pages, so this will introduce confusion. On the other
> > hand, if the MADV_FREE pages are mixed with inactive file pages,
> > page reclaim needs to reclaim a lot of file pages before
> > reclaiming the MADV_FREE pages. This doesn't look good. The point
> > of a separate LRU is to avoid scanning other anon/file pages.
> 
> The LRU code and the rest of VM already use independent page type
> distinctions. That's because shmem pages are !PageAnon - they have a
> page->mapping that points to a real address space, not an anon_vma -
> but they are swapbacked and thus go through the anon LRU. This would
> just do the reverse: put PageAnon pages on the file LRU when they
> don't contain valid data and are thus not swapbacked.
> 
> As far as mixing with inactive file pages goes, it'd be possible to
> link the MADV_FREE pages to the tail of the inactive list, rather than
> the head. That said, I'm not sure reclaiming use-once filesystem cache
> before MADV_FREE is such a bad policy. MADV_FREE retains the vmas for
> the sole purpose of reusing them in the (near) future. That is
> actually a stronger reuse signal than we have for use-once file pages.
> If somebody does continuous writes to a logfile or a one-off search
> through one or more files, we should actually reclaim that cache
> before we go after MADV_FREE pages that are temporarily invalidated.

Yes, we should be careful on this issue. It was the main arguable
point. How about moving them to the head of the inactive file list,
not the tail, if we want to go with the inactive file LRU?

With that, the VM tries to reclaim file pages first from the tail
of the list, and if the reclaimed pages were part of the workingset,
they could be activated by workingset_refault. Otherwise, we can
discard use-once pages without purging *madv_free* pages, so I
think it's a good compromise.

What do you think?


Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-02-01 Thread Shaohua Li
On Tue, Jan 31, 2017 at 04:38:10PM -0500, Johannes Weiner wrote:
> On Tue, Jan 31, 2017 at 11:45:47AM -0800, Shaohua Li wrote:
> > On Tue, Jan 31, 2017 at 01:59:49PM -0500, Johannes Weiner wrote:
> > > Hi Shaohua,
> > > 
> > > On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> > > > We are trying to use MADV_FREE in jemalloc. Several issues were
> > > > found. Without solving them, jemalloc can't use the MADV_FREE
> > > > feature.
> > > > - It doesn't support systems without swap enabled. If swap is off,
> > > >   we can't, or can't efficiently, age anonymous pages. And since
> > > >   MADV_FREE pages are mixed with other anonymous pages, we can't
> > > >   reclaim MADV_FREE pages. In the current implementation, MADV_FREE
> > > >   falls back to MADV_DONTNEED when swap is not enabled. But in our
> > > >   environment, a lot of machines don't enable swap. This prevents
> > > >   our setup from using MADV_FREE.
> > > > - It increases memory pressure. Page reclaim is biased toward
> > > >   reclaiming file pages over anonymous pages. This doesn't make
> > > >   sense for MADV_FREE pages, because those pages could be freed
> > > >   easily and refilled with a very slight penalty. Even if page
> > > >   reclaim didn't favor file pages, there would still be an issue,
> > > >   because MADV_FREE pages and other anonymous pages are mixed
> > > >   together. To reclaim a MADV_FREE page, we probably must scan a
> > > >   lot of other anonymous pages, which is inefficient. In our test,
> > > >   we usually see OOM with MADV_FREE enabled and nothing without it.
> > > 
> > > Fully agreed, the anon LRU is a bad place for these pages.
> > > 
> > > > For the first two issues, introducing a new LRU list for MADV_FREE
> > > > pages could solve them. We can directly reclaim MADV_FREE pages
> > > > without writing them out to swap, so the first issue could be
> > > > fixed. If only MADV_FREE pages are in the new list, page reclaim
> > > > can easily reclaim such pages without interference from file or
> > > > anonymous pages. The memory pressure issue will disappear.
> > > 
> > > Do we actually need a new page flag and a special LRU for them? These
> > > pages are basically like clean cache pages at that point. What do you
> > > think about clearing their PG_swapbacked flag on MADV_FREE and moving
> > > them to the inactive file list? The way isolate+putback works should
> > > not even need much modification, something like clear_page_mlock().
> > > 
> > > When the reclaim scanner finds anon && dirty && !swapbacked, it can
> > > again set PG_swapbacked and goto keep_locked to move the page back
> > > into the anon LRU to get reclaimed according to swapping rules.
> > 
> > Interesting idea! Not sure though: the MADV_FREE pages are actually
> > anonymous pages, so this will introduce confusion. On the other
> > hand, if the MADV_FREE pages are mixed with inactive file pages,
> > page reclaim needs to reclaim a lot of file pages before
> > reclaiming the MADV_FREE pages. This doesn't look good. The point
> > of a separate LRU is to avoid scanning other anon/file pages.
> 
> The LRU code and the rest of VM already use independent page type
> distinctions. That's because shmem pages are !PageAnon - they have a
> page->mapping that points to a real address space, not an anon_vma -
> but they are swapbacked and thus go through the anon LRU. This would
> just do the reverse: put PageAnon pages on the file LRU when they
> don't contain valid data and are thus not swapbacked.
> 
> As far as mixing with inactive file pages goes, it'd be possible to
> link the MADV_FREE pages to the tail of the inactive list, rather than
> the head. That said, I'm not sure reclaiming use-once filesystem cache
> before MADV_FREE is such a bad policy. MADV_FREE retains the vmas for
> the sole purpose of reusing them in the (near) future. That is
> actually a stronger reuse signal than we have for use-once file pages.
> If somebody does continuous writes to a logfile or a one-off search
> through one or more files, we should actually reclaim that cache
> before we go after MADV_FREE pages that are temporarily invalidated.

Thanks, I'll try this idea.

Thanks,
Shaohua


Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-02-01 Thread Michal Hocko
On Tue 31-01-17 16:38:10, Johannes Weiner wrote:
> On Tue, Jan 31, 2017 at 11:45:47AM -0800, Shaohua Li wrote:
> > On Tue, Jan 31, 2017 at 01:59:49PM -0500, Johannes Weiner wrote:
> > > Hi Shaohua,
> > > 
> > > On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> > > > We are trying to use MADV_FREE in jemalloc. Several issues were
> > > > found. Without solving them, jemalloc can't use the MADV_FREE
> > > > feature.
> > > > - It doesn't support systems without swap enabled. If swap is off,
> > > >   we can't, or can't efficiently, age anonymous pages. And since
> > > >   MADV_FREE pages are mixed with other anonymous pages, we can't
> > > >   reclaim MADV_FREE pages. In the current implementation, MADV_FREE
> > > >   falls back to MADV_DONTNEED when swap is not enabled. But in our
> > > >   environment, a lot of machines don't enable swap. This prevents
> > > >   our setup from using MADV_FREE.
> > > > - It increases memory pressure. Page reclaim is biased toward
> > > >   reclaiming file pages over anonymous pages. This doesn't make
> > > >   sense for MADV_FREE pages, because those pages could be freed
> > > >   easily and refilled with a very slight penalty. Even if page
> > > >   reclaim didn't favor file pages, there would still be an issue,
> > > >   because MADV_FREE pages and other anonymous pages are mixed
> > > >   together. To reclaim a MADV_FREE page, we probably must scan a
> > > >   lot of other anonymous pages, which is inefficient. In our test,
> > > >   we usually see OOM with MADV_FREE enabled and nothing without it.
> > > 
> > > Fully agreed, the anon LRU is a bad place for these pages.
> > > 
> > > > For the first two issues, introducing a new LRU list for MADV_FREE
> > > > pages could solve them. We can directly reclaim MADV_FREE pages
> > > > without writing them out to swap, so the first issue could be
> > > > fixed. If only MADV_FREE pages are in the new list, page reclaim
> > > > can easily reclaim such pages without interference from file or
> > > > anonymous pages. The memory pressure issue will disappear.
> > > 
> > > Do we actually need a new page flag and a special LRU for them? These
> > > pages are basically like clean cache pages at that point. What do you
> > > think about clearing their PG_swapbacked flag on MADV_FREE and moving
> > > them to the inactive file list? The way isolate+putback works should
> > > not even need much modification, something like clear_page_mlock().
> > > 
> > > When the reclaim scanner finds anon && dirty && !swapbacked, it can
> > > again set PG_swapbacked and goto keep_locked to move the page back
> > > into the anon LRU to get reclaimed according to swapping rules.
> > 
> > Interesting idea! Not sure though: the MADV_FREE pages are actually
> > anonymous pages, so this will introduce confusion. On the other
> > hand, if the MADV_FREE pages are mixed with inactive file pages,
> > page reclaim needs to reclaim a lot of file pages before
> > reclaiming the MADV_FREE pages. This doesn't look good. The point
> > of a separate LRU is to avoid scanning other anon/file pages.
> 
> The LRU code and the rest of VM already use independent page type
> distinctions. That's because shmem pages are !PageAnon - they have a
> page->mapping that points to a real address space, not an anon_vma -
> but they are swapbacked and thus go through the anon LRU. This would
> just do the reverse: put PageAnon pages on the file LRU when they
> don't contain valid data and are thus not swapbacked.
> 
> As far as mixing with inactive file pages goes, it'd be possible to
> link the MADV_FREE pages to the tail of the inactive list, rather than
> the head. That said, I'm not sure reclaiming use-once filesystem cache
> before MADV_FREE is such a bad policy. MADV_FREE retains the vmas for
> the sole purpose of reusing them in the (near) future. That is
> actually a stronger reuse signal than we have for use-once file pages.
> If somebody does continuous writes to a logfile or a one-off search
> through one or more files, we should actually reclaim that cache
> before we go after MADV_FREE pages that are temporarily invalidated.

I completely agree here. LRU_*_FILE will be a bit of a misnomer (LRU_*_CACHE
would sound more appropriate). I expect there would be a few places which
account based on the LRU list, but those shouldn't be that hard to fix.
-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-01-31 Thread Minchan Kim
Hi Shaohua,

On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> Hi,
> 
> We are trying to use MADV_FREE in jemalloc. Several issues are found. Without
> solving the issues, jemalloc can't use the MADV_FREE feature.
> - Doesn't support system without swap enabled. Because if swap is off, we can't
>   or can't efficiently age anonymous pages. And since MADV_FREE pages are mixed
>   with other anonymous pages, we can't reclaim MADV_FREE pages. In current
>   implementation, MADV_FREE will fallback to MADV_DONTNEED without swap enabled.
>   But in our environment, a lot of machines don't enable swap. This will prevent
>   our setup using MADV_FREE.
> - Increases memory pressure. page reclaim bias file pages reclaim against
>   anonymous pages. This doesn't make sense for MADV_FREE pages, because those
>   pages could be freed easily and refilled with very slight penalty. Even page
>   reclaim doesn't bias file pages, there is still an issue, because MADV_FREE
>   pages and other anonymous pages are mixed together. To reclaim a MADV_FREE
>   page, we probably must scan a lot of other anonymous pages, which is
>   inefficient. In our test, we usually see oom with MADV_FREE enabled and nothing
>   without it.
> - RSS accounting. MADV_FREE pages are accounted as normal anon pages and
>   reclaimed lazily, so application's RSS becomes bigger. This confuses our
>   workloads. We have monitoring daemon running and if it finds applications' RSS
>   becomes abnormal, the daemon will kill the applications even kernel can reclaim
>   the memory easily. Currently we don't export separate RSS accounting for
>   MADV_FREE pages. This will prevent our setup using MADV_FREE too.
> 
> For the first two issues, introducing a new LRU list for MADV_FREE pages could
> solve the issues. We can directly reclaim MADV_FREE pages without writing them
> out to swap, so the first issue could be fixed. If only MADV_FREE pages are in
> the new list, page reclaim can easily reclaim such pages without interference
> of file or anonymous pages. The memory pressure issue will disappear.
> 
> Actually Minchan posted patches to add the LRU list before, but he didn't
> pursue. So I picked up them and the patches are based on Minchan's previous
> patches. The main difference between my patches and Minchan previous patches is
> page reclaim policy. Minchan's patches introduces a knob to balance the reclaim
> of MADV_FREE pages and anon/file pages, while the patches always reclaim
> MADV_FREE pages first if there are. I described the reason in patch 5.
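
For readers less familiar with the user-space side, here is a minimal sketch
of how an allocator could issue the hint. It is not jemalloc's code. The helper
names lazy_release() and demo_lazy_release() are made up for illustration, and
the fallback shown is a user-space one for kernels that reject MADV_FREE with
EINVAL, which is a different thing from the in-kernel swapless fallback
described in the quoted cover letter.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* Linux value; older libc headers may not define it */
#endif

/*
 * Hypothetical helper: tell the kernel the range is no longer needed.
 * Returns 0 if MADV_FREE was accepted, 1 if we fell back to
 * MADV_DONTNEED, and -1 on any other error.
 */
static int lazy_release(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
	/* Kernel does not know MADV_FREE: drop the pages eagerly instead. */
	return madvise(addr, len, MADV_DONTNEED) == 0 ? 1 : -1;
}

/* Hypothetical demo: map npages, dirty them, then release them lazily. */
static int demo_lazy_release(size_t npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len;
	void *buf;
	int ret;

	assert(npages > 0 && psz > 0);
	len = npages * (size_t)psz;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xaa, len);		/* make the pages resident and dirty */
	ret = lazy_release(buf, len);
	munmap(buf, len);
	return ret;
}
```

After a successful MADV_FREE the pages stay mapped and keep counting toward
RSS until reclaim actually takes them, which is the accounting problem raised
in the third bullet above.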

First of all, thanks for the effort to support MADV_FREE on swapless systems,
Shaohua!

CCing Daniel,

The reason I postponed it is the controversial part about balancing
used-once vs. madv_freed pages. I thought it doesn't make sense to reclaim
madv_freed pages first even if there are lots of used-once pages.

Recently, Johannes posted patches for balancing file/anon and it was based
on the cost model, IIRC. I wanted to base this work on it.

The idea is that the VM reclaims file-backed pages and, if a refault happens,
we compare the refault distance with the size of the LRU_LAZYFREE list. If the
refault distance is smaller than the lazyfree LRU list's size, the file-backed
page would have been kept in memory had we discarded lazyfree pages instead, so
that adds cost and makes us reclaim the lazyfree LRU list more aggressively.
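
The comparison described above can be sketched in plain C as follows. This is
only an illustration of the rule, not the hacky patch below and not kernel
code: struct reclaim_feedback, lazyfree_bias and both function names are
hypothetical, and a real implementation would feed the result into the
existing scan balancing rather than into a bare counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node bookkeeping; not a kernel structure. */
struct reclaim_feedback {
	size_t lazyfree_pages;		/* current size of the lazyfree LRU */
	unsigned int lazyfree_bias;	/* grows when refaults blame lazyfree */
};

/*
 * A refaulting file page would still be in memory if the lazyfree pages
 * had been discarded first, provided its refault distance is smaller
 * than the number of lazyfree pages.
 */
static bool refault_blames_lazyfree(size_t refault_distance,
				    size_t lazyfree_pages)
{
	return refault_distance < lazyfree_pages;
}

/* Record one file refault and shift pressure toward lazyfree if needed. */
static void note_file_refault(struct reclaim_feedback *fb,
			      size_t refault_distance)
{
	assert(fb != NULL);
	if (refault_blames_lazyfree(refault_distance, fb->lazyfree_pages))
		fb->lazyfree_bias++;
}
```

With 1000 lazyfree pages, a refault at distance 200 raises the bias, while a
refault at distance 5000 leaves it alone because purging lazyfree pages would
not have saved that file page anyway.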

I tested your patch with a simple MADV_FREE workload (just alloc and then
repeated touch/madv_free) with a background stream-read process. In that case,
the MADV_FREE workload regressed by half without any gain for the stream-read
process.

I tested hacky code to simulate the feedback loop I suggested and it recovers
the lost performance. I'm not saying the hacky patch below should be merged,
but I think we should have used-once reclaim feedback logic to prevent
unnecessary purging of madv_freed pages.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 589a165..39d4bba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -703,6 +703,7 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
+   bool lazyfree;
 } pg_data_t;
 
 #define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f809f04..cf54b81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2364,22 +2364,25 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	struct blk_plug plug;
 	bool scan_adjusted;
 
-   /* reclaim all lazyfree pages so don't apply priority  */
-   nr[LRU_LAZYFREE] = lruvec_lru_size(lruvec, LRU_LAZYFREE, sc->reclaim_idx);
-   while (nr[LRU_LAZYFREE]) {
-   nr_to_scan = min(nr[LRU_LAZYFREE], SWAP_CLUSTER_MAX);
-   nr[LRU_LAZYFREE] -= nr_to_scan;
-   nr_reclaimed += 

Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-01-31 Thread Johannes Weiner
On Tue, Jan 31, 2017 at 11:45:47AM -0800, Shaohua Li wrote:
> On Tue, Jan 31, 2017 at 01:59:49PM -0500, Johannes Weiner wrote:
> > Hi Shaohua,
> > 
> > On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> > > We are trying to use MADV_FREE in jemalloc. Several issues are found. Without
> > > solving the issues, jemalloc can't use the MADV_FREE feature.
> > > - Doesn't support system without swap enabled. Because if swap is off, we can't
> > >   or can't efficiently age anonymous pages. And since MADV_FREE pages are mixed
> > >   with other anonymous pages, we can't reclaim MADV_FREE pages. In current
> > >   implementation, MADV_FREE will fallback to MADV_DONTNEED without swap enabled.
> > >   But in our environment, a lot of machines don't enable swap. This will prevent
> > >   our setup using MADV_FREE.
> > > - Increases memory pressure. page reclaim bias file pages reclaim against
> > >   anonymous pages. This doesn't make sense for MADV_FREE pages, because those
> > >   pages could be freed easily and refilled with very slight penalty. Even page
> > >   reclaim doesn't bias file pages, there is still an issue, because MADV_FREE
> > >   pages and other anonymous pages are mixed together. To reclaim a MADV_FREE
> > >   page, we probably must scan a lot of other anonymous pages, which is
> > >   inefficient. In our test, we usually see oom with MADV_FREE enabled and nothing
> > >   without it.
> > 
> > Fully agreed, the anon LRU is a bad place for these pages.
> > 
> > > For the first two issues, introducing a new LRU list for MADV_FREE pages could
> > > solve the issues. We can directly reclaim MADV_FREE pages without writing them
> > > out to swap, so the first issue could be fixed. If only MADV_FREE pages are in
> > > the new list, page reclaim can easily reclaim such pages without interference
> > > of file or anonymous pages. The memory pressure issue will disappear.
> > 
> > Do we actually need a new page flag and a special LRU for them? These
> > pages are basically like clean cache pages at that point. What do you
> > think about clearing their PG_swapbacked flag on MADV_FREE and moving
> > them to the inactive file list? The way isolate+putback works should
> > not even need much modification, something like clear_page_mlock().
> > 
> > When the reclaim scanner finds anon && dirty && !swapbacked, it can
> > again set PG_swapbacked and goto keep_locked to move the page back
> > into the anon LRU to get reclaimed according to swapping rules.
> 
> Interesting idea! Not sure though, the MADV_FREE pages are actually anonymous
> pages, this will introduce confusion. On the other hand, if the MADV_FREE pages
> are mixed with inactive file pages, page reclaim need to reclaim a lot of file
> pages first before reclaim the MADV_FREE pages. This doesn't look good. The
> point of a separate LRU is to avoid scan other anon/file pages.

The LRU code and the rest of VM already use independent page type
distinctions. That's because shmem pages are !PageAnon - they have a
page->mapping that points to a real address space, not an anon_vma -
but they are swapbacked and thus go through the anon LRU. This would
just do the reverse: put PageAnon pages on the file LRU when they
don't contain valid data and are thus not swapbacked.
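
To make the two independent distinctions concrete, here is a small user-space
model of the idea. It is a sketch under assumptions, not kernel code: struct
page_model, enum lru_kind and the model_*() helpers are invented for
illustration, and "dirty" stands for "written to again after MADV_FREE". The
only rule taken from the kernel is that the LRU type follows the swapbacked
bit, not PageAnon.

```c
#include <assert.h>
#include <stdbool.h>

enum lru_kind { LRU_KIND_ANON, LRU_KIND_FILE };	/* hypothetical names */

/* Minimal stand-in for the page state discussed above. */
struct page_model {
	bool anon;		/* mapping points to an anon_vma */
	bool swapbacked;	/* contents must go to swap to be reclaimed */
	bool dirty;		/* written to since MADV_FREE */
};

/* The LRU type depends on swapbacked only, so shmem lands on anon. */
static enum lru_kind lru_for(const struct page_model *p)
{
	return p->swapbacked ? LRU_KIND_ANON : LRU_KIND_FILE;
}

/* MADV_FREE: contents are no longer valid, so drop swapbacked. */
static void model_madv_free(struct page_model *p)
{
	assert(p->anon);
	p->swapbacked = false;
	p->dirty = false;
}

/*
 * Reclaim scanner step for a lazyfree candidate. Returns true if the
 * page can simply be discarded. A redirtied page gets swapbacked set
 * again and is kept, which moves it back to the anon LRU. Pages that
 * are not lazyfree candidates are not modeled and are always kept.
 */
static bool model_reclaim(struct page_model *p)
{
	if (p->anon && p->dirty && !p->swapbacked) {
		p->swapbacked = true;
		return false;
	}
	return p->anon && !p->swapbacked;
}
```

In this model a shmem page (!anon, swapbacked) maps to the anon LRU, a
lazyfreed anon page maps to the file LRU, and a lazyfreed page that was
written again is handed back to the anon LRU on the next scan.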

As far as mixing with inactive file pages goes, it'd be possible to
link the MADV_FREE pages to the tail of the inactive list, rather than
the head. That said, I'm not sure reclaiming use-once filesystem cache
before MADV_FREE is such a bad policy. MADV_FREE retains the vmas for
the sole purpose of reusing them in the (near) future. That is
actually a stronger reuse signal than we have for use-once file pages.
If somebody does continuous writes to a logfile or a one-off search
through one or more files, we should actually reclaim that cache
before we go after MADV_FREE pages that are temporarily invalidated.

> > > For the third issue, we can add a separate RSS count for MADV_FREE pages. The
> > > count will be increased in madvise syscall and decreased in page reclaim (eg,
> > > unmap). One issue is activate_page(). A MADV_FREE page can be promoted to
> > > active page there. But there isn't mm_struct context at that place. Iterating
> > > vma there sounds too silly. The patchset don't fix this issue yet. Hopefully
> > > somebody can share a hint how to fix this issue.
> > 
> > This problem also goes away if we use the file LRUs.
> 
> Can you elaborate this please? Maybe you mean charge them to MM_FILEPAGES? But
> that doesn't solve the problem. 'statm' proc file will still report a big RSS.

Sorry, I was just referring to the activate_page(). If we use the file
LRUs, then page activation has a clear target. And we wouldn't have to
adjust any RSS counters when a lazyfreed page is activated.

If we have MM context everywhere else, can we add 

Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-01-31 Thread Shaohua Li
On Tue, Jan 31, 2017 at 01:59:49PM -0500, Johannes Weiner wrote:
> Hi Shaohua,
> 
> On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> > We are trying to use MADV_FREE in jemalloc. Several issues are found. Without
> > solving the issues, jemalloc can't use the MADV_FREE feature.
> > - Doesn't support system without swap enabled. Because if swap is off, we can't
> >   or can't efficiently age anonymous pages. And since MADV_FREE pages are mixed
> >   with other anonymous pages, we can't reclaim MADV_FREE pages. In current
> >   implementation, MADV_FREE will fallback to MADV_DONTNEED without swap enabled.
> >   But in our environment, a lot of machines don't enable swap. This will prevent
> >   our setup using MADV_FREE.
> > - Increases memory pressure. page reclaim bias file pages reclaim against
> >   anonymous pages. This doesn't make sense for MADV_FREE pages, because those
> >   pages could be freed easily and refilled with very slight penalty. Even page
> >   reclaim doesn't bias file pages, there is still an issue, because MADV_FREE
> >   pages and other anonymous pages are mixed together. To reclaim a MADV_FREE
> >   page, we probably must scan a lot of other anonymous pages, which is
> >   inefficient. In our test, we usually see oom with MADV_FREE enabled and nothing
> >   without it.
> 
> Fully agreed, the anon LRU is a bad place for these pages.
> 
> > For the first two issues, introducing a new LRU list for MADV_FREE pages could
> > solve the issues. We can directly reclaim MADV_FREE pages without writing them
> > out to swap, so the first issue could be fixed. If only MADV_FREE pages are in
> > the new list, page reclaim can easily reclaim such pages without interference
> > of file or anonymous pages. The memory pressure issue will disappear.
> 
> Do we actually need a new page flag and a special LRU for them? These
> pages are basically like clean cache pages at that point. What do you
> think about clearing their PG_swapbacked flag on MADV_FREE and moving
> them to the inactive file list? The way isolate+putback works should
> not even need much modification, something like clear_page_mlock().
> 
> When the reclaim scanner finds anon && dirty && !swapbacked, it can
> again set PG_swapbacked and goto keep_locked to move the page back
> into the anon LRU to get reclaimed according to swapping rules.

Interesting idea! I'm not sure though: MADV_FREE pages are actually anonymous
pages, so this will introduce confusion. On the other hand, if the MADV_FREE
pages are mixed with inactive file pages, page reclaim needs to reclaim a lot
of file pages before it reclaims the MADV_FREE pages. This doesn't look good.
The point of a separate LRU is to avoid scanning other anon/file pages.
 
> > For the third issue, we can add a separate RSS count for MADV_FREE pages. The
> > count will be increased in madvise syscall and decreased in page reclaim (eg,
> > unmap). One issue is activate_page(). A MADV_FREE page can be promoted to
> > active page there. But there isn't mm_struct context at that place. Iterating
> > vma there sounds too silly. The patchset don't fix this issue yet. Hopefully
> > somebody can share a hint how to fix this issue.
> 
> This problem also goes away if we use the file LRUs.

Can you elaborate on this please? Maybe you mean charging them to MM_FILEPAGES?
But that doesn't solve the problem. The 'statm' proc file will still report a
big RSS.

Thanks,
Shaohua
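
As an illustration of what a monitoring daemon sees, the sketch below reads
the resident page count from a statm line, where the second field is the
number of resident pages. The function names are hypothetical and the parser
is deliberately minimal. It only shows that the figure is a single total, with
no way to tell lazily freed pages apart from pages still in use.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the resident page count (second field) from one statm line.
 * Returns -1 if the line does not start with two unsigned numbers.
 */
static long statm_resident_pages(const char *line)
{
	unsigned long size, resident;

	assert(line != NULL);
	if (sscanf(line, "%lu %lu", &size, &resident) != 2)
		return -1;
	return (long)resident;
}

/* Read the calling process's own resident page count, or -1 on error. */
static long self_resident_pages(void)
{
	char buf[256];
	long pages = -1;
	FILE *f = fopen("/proc/self/statm", "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		pages = statm_resident_pages(buf);
	fclose(f);
	return pages;
}
```

Multiplying the result by the page size gives the RSS in bytes that a daemon
would compare against its threshold.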


Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-01-31 Thread Shaohua Li
On Tue, Jan 31, 2017 at 01:59:49PM -0500, Johannes Weiner wrote:
> Hi Shaohua,
> 
> On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> > We are trying to use MADV_FREE in jemalloc. Several issues are found. 
> > Without
> > solving the issues, jemalloc can't use the MADV_FREE feature.
> > - Doesn't support system without swap enabled. Because if swap is off, we 
> > can't
> >   or can't efficiently age anonymous pages. And since MADV_FREE pages are 
> > mixed
> >   with other anonymous pages, we can't reclaim MADV_FREE pages. In current
> >   implementation, MADV_FREE will fallback to MADV_DONTNEED without swap 
> > enabled.
> >   But in our environment, a lot of machines don't enable swap. This will 
> > prevent
> >   our setup using MADV_FREE.
> > - Increases memory pressure. page reclaim bias file pages reclaim against
> >   anonymous pages. This doesn't make sense for MADV_FREE pages, because 
> > those
> >   pages could be freed easily and refilled with very slight penality. Even 
> > page
> >   reclaim doesn't bias file pages, there is still an issue, because 
> > MADV_FREE
> >   pages and other anonymous pages are mixed together. To reclaim a MADV_FREE
> >   page, we probably must scan a lot of other anonymous pages, which is
> >   inefficient. In our test, we usually see oom with MADV_FREE enabled and 
> > nothing
> >   without it.
> 
> Fully agreed, the anon LRU is a bad place for these pages.
> 
> > For the first two issues, introducing a new LRU list for MADV_FREE pages 
> > could
> > solve the issues. We can directly reclaim MADV_FREE pages without writting 
> > them
> > out to swap, so the first issue could be fixed. If only MADV_FREE pages are 
> > in
> > the new list, page reclaim can easily reclaim such pages without 
> > interference
> > of file or anonymous pages. The memory pressure issue will disappear.
> 
> Do we actually need a new page flag and a special LRU for them? These
> pages are basically like clean cache pages at that point. What do you
> think about clearing their PG_swapbacked flag on MADV_FREE and moving
> them to the inactive file list? The way isolate+putback works should
> not even need much modification, something like clear_page_mlock().
> 
> When the reclaim scanner finds anon && dirty && !swapbacked, it can
> again set PG_swapbacked and goto keep_locked to move the page back
> into the anon LRU to get reclaimed according to swapping rules.

Interesting idea! I'm not sure, though: MADV_FREE pages are actually anonymous
pages, so this will introduce confusion. On the other hand, if MADV_FREE pages
are mixed with inactive file pages, page reclaim needs to reclaim a lot of file
pages before it reclaims the MADV_FREE pages. This doesn't look good. The
point of a separate LRU is to avoid scanning other anon/file pages.
 
> > For the third issue, we can add a separate RSS count for MADV_FREE pages.
> > The count will be increased in madvise syscall and decreased in page
> > reclaim (eg, unmap). One issue is activate_page(). A MADV_FREE page can be
> > promoted to active page there. But there isn't mm_struct context at that
> > place. Iterating vma there sounds too silly. The patchset doesn't fix this
> > issue yet. Hopefully somebody can share a hint how to fix this issue.
> 
> This problem also goes away if we use the file LRUs.

Can you elaborate on this, please? Maybe you mean charging them to
MM_FILEPAGES? But that doesn't solve the problem: the 'statm' proc file will
still report a big RSS.

Thanks,
Shaohua


Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-01-31 Thread Johannes Weiner
Hi Shaohua,

On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> We are trying to use MADV_FREE in jemalloc. Several issues are found. Without
> solving the issues, jemalloc can't use the MADV_FREE feature.
> - Doesn't support system without swap enabled. Because if swap is off, we
>   can't or can't efficiently age anonymous pages. And since MADV_FREE pages
>   are mixed with other anonymous pages, we can't reclaim MADV_FREE pages. In
>   current implementation, MADV_FREE will fallback to MADV_DONTNEED without
>   swap enabled. But in our environment, a lot of machines don't enable swap.
>   This will prevent our setup using MADV_FREE.
> - Increases memory pressure. page reclaim bias file pages reclaim against
>   anonymous pages. This doesn't make sense for MADV_FREE pages, because those
>   pages could be freed easily and refilled with very slight penalty. Even
>   page reclaim doesn't bias file pages, there is still an issue, because
>   MADV_FREE pages and other anonymous pages are mixed together. To reclaim a
>   MADV_FREE page, we probably must scan a lot of other anonymous pages, which
>   is inefficient. In our test, we usually see oom with MADV_FREE enabled and
>   nothing without it.

Fully agreed, the anon LRU is a bad place for these pages.

> For the first two issues, introducing a new LRU list for MADV_FREE pages could
> solve the issues. We can directly reclaim MADV_FREE pages without writing them
> out to swap, so the first issue could be fixed. If only MADV_FREE pages are in
> the new list, page reclaim can easily reclaim such pages without interference
> of file or anonymous pages. The memory pressure issue will disappear.

Do we actually need a new page flag and a special LRU for them? These
pages are basically like clean cache pages at that point. What do you
think about clearing their PG_swapbacked flag on MADV_FREE and moving
them to the inactive file list? The way isolate+putback works should
not even need much modification, something like clear_page_mlock().

When the reclaim scanner finds anon && dirty && !swapbacked, it can
again set PG_swapbacked and goto keep_locked to move the page back
into the anon LRU to get reclaimed according to swapping rules.
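
To make the proposed transitions concrete, here is a minimal userspace model
of them. It is only a sketch: struct model_page, the enums and the model_*()
helpers are hypothetical names invented for illustration, not kernel
structures or functions, and the assumption that MADV_FREE leaves the page
clean is part of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these are kernel types. */
enum model_lru { MODEL_LRU_ANON, MODEL_LRU_INACTIVE_FILE };

enum model_action {
	MODEL_DISCARD,	/* drop the page without writing it anywhere */
	MODEL_KEEP,	/* put the page back on an LRU, do not reclaim it now */
	MODEL_NORMAL,	/* not a lazily freed page: usual reclaim rules apply */
};

struct model_page {
	bool anon;		/* page is anonymous memory */
	bool dirty;		/* page was written after MADV_FREE */
	bool swapbacked;	/* stands in for PG_swapbacked */
	enum model_lru lru;
};

/* MADV_FREE in this model: clear the flag, move to the inactive file list. */
static void model_madv_free(struct model_page *p)
{
	assert(p->anon);
	p->dirty = false;	/* assumption: the page is clean at this point */
	p->swapbacked = false;
	p->lru = MODEL_LRU_INACTIVE_FILE;
}

/* The application reuses the page by writing to it. */
static void model_write(struct model_page *p)
{
	p->dirty = true;
}

/* What the reclaim scanner would decide for one page. */
static enum model_action model_reclaim(struct model_page *p)
{
	if (!p->anon || p->swapbacked)
		return MODEL_NORMAL;

	if (p->dirty) {
		/* anon && dirty && !swapbacked: back to the anon LRU */
		p->swapbacked = true;
		p->lru = MODEL_LRU_ANON;
		return MODEL_KEEP;
	}

	/* Still clean since MADV_FREE: discard like clean cache. */
	return MODEL_DISCARD;
}
```

In this model a page that is never touched again is discarded on the first
scan, while a page that was rewritten goes back to the anon LRU and is
afterwards handled by the usual swapping rules.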

> For the third issue, we can add a separate RSS count for MADV_FREE pages. The
> count will be increased in madvise syscall and decreased in page reclaim (eg,
> unmap). One issue is activate_page(). A MADV_FREE page can be promoted to
> active page there. But there isn't mm_struct context at that place. Iterating
> vma there sounds too silly. The patchset doesn't fix this issue yet. Hopefully
> somebody can share a hint how to fix this issue.

This problem also goes away if we use the file LRUs.


[RFC 0/6]mm: add new LRU list for MADV_FREE pages

2017-01-29 Thread Shaohua Li
Hi,

We are trying to use MADV_FREE in jemalloc. Several issues are found. Without
solving the issues, jemalloc can't use the MADV_FREE feature.
- Doesn't support systems without swap enabled. If swap is off, we can't age
  anonymous pages, or can't age them efficiently. And since MADV_FREE pages are
  mixed with other anonymous pages, we can't reclaim MADV_FREE pages. In the
  current implementation, MADV_FREE falls back to MADV_DONTNEED without swap
  enabled. But in our environment, a lot of machines don't enable swap. This
  will prevent our setup from using MADV_FREE.
- Increases memory pressure. Page reclaim biases reclaim toward file pages over
  anonymous pages. This doesn't make sense for MADV_FREE pages, because those
  pages could be freed easily and refilled with very slight penalty. Even if
  page reclaim didn't bias toward file pages, there is still an issue, because
  MADV_FREE pages and other anonymous pages are mixed together. To reclaim a
  MADV_FREE page, we probably must scan a lot of other anonymous pages, which
  is inefficient. In our test, we usually see oom with MADV_FREE enabled and
  none without it.
- RSS accounting. MADV_FREE pages are accounted as normal anon pages and
  reclaimed lazily, so the application's RSS becomes bigger. This confuses our
  workloads. We have a monitoring daemon running, and if it finds that an
  application's RSS has become abnormal, the daemon will kill the application
  even though the kernel can reclaim the memory easily. Currently we don't
  export separate RSS accounting for MADV_FREE pages. This will prevent our
  setup from using MADV_FREE too.
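
For readers who want to watch the RSS effect from userspace, the following is
a rough sketch, not jemalloc's code. resident_pages(), lazy_free() and
lazy_free_demo() are hypothetical helper names; the sketch assumes a Linux
system with /proc mounted, and it issues the MADV_DONTNEED fallback itself
when the kernel rejects MADV_FREE with EINVAL.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and MADV_FREE */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Resident pages of this process: second field of /proc/self/statm. */
static long resident_pages(void)
{
	long size, resident;
	int n;
	FILE *f = fopen("/proc/self/statm", "r");

	if (!f)
		return -1;
	n = fscanf(f, "%ld %ld", &size, &resident);
	fclose(f);
	return n == 2 ? resident : -1;
}

/* Try MADV_FREE; use MADV_DONTNEED if the kernel rejects it with EINVAL. */
static int lazy_free(void *addr, size_t len)
{
	assert(len > 0);
#ifdef MADV_FREE
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
#endif
	return madvise(addr, len, MADV_DONTNEED);
}

/* Touch 16 MiB of anonymous memory, free it lazily, sample RSS around it. */
static int lazy_free_demo(long *before, long *after)
{
	size_t len = 16UL << 20;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int rc;

	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 1, len);		/* fault every page in */
	*before = resident_pages();
	rc = lazy_free(buf, len);
	*after = resident_pages();
	munmap(buf, len);
	return rc;
}
```

Where MADV_FREE is honoured and there is no memory pressure, the second sample
is expected to stay close to the first, which is what a monitoring daemon
would observe; where the call behaves like MADV_DONTNEED, the pages are
dropped immediately and the second sample is lower.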

For the first two issues, introducing a new LRU list for MADV_FREE pages could
solve the issues. We can directly reclaim MADV_FREE pages without writing them
out to swap, so the first issue could be fixed. If only MADV_FREE pages are in
the new list, page reclaim can easily reclaim such pages without interference
of file or anonymous pages. The memory pressure issue will disappear.

Actually Minchan posted patches to add the LRU list before, but he didn't
pursue them. So I picked them up, and these patches are based on Minchan's
previous patches. The main difference between my patches and Minchan's
previous patches is the page reclaim policy. Minchan's patches introduce a
knob to balance the reclaim of MADV_FREE pages and anon/file pages, while
these patches always reclaim MADV_FREE pages first if there are any. I
described the reason in patch 5.
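
As a rough illustration of the "always reclaim MADV_FREE pages first" policy,
and not the code from patch 5, a scan target could be split like this.
struct model_scan_plan and model_plan_scan() are hypothetical names, and the
knob-based alternative is not modelled.

```c
/* Hypothetical model of a reclaim pass; not the kernel's data structures. */
struct model_scan_plan {
	unsigned long lazyfree;	/* pages to take from the lazyfree list */
	unsigned long other;	/* pages left for the anon/file lists */
};

/*
 * Satisfy the reclaim target from the lazyfree list first and only fall
 * back to the anon/file lists for whatever is still missing.
 */
static struct model_scan_plan
model_plan_scan(unsigned long nr_lazyfree, unsigned long nr_to_reclaim)
{
	struct model_scan_plan plan;

	plan.lazyfree = nr_lazyfree < nr_to_reclaim ? nr_lazyfree
						    : nr_to_reclaim;
	plan.other = nr_to_reclaim - plan.lazyfree;
	return plan;
}
```

With 10 lazyfree pages and a target of 32, the model takes all 10 lazyfree
pages and leaves 22 for the other lists; with 100 lazyfree pages the other
lists are not touched at all.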

For the third issue, we can add a separate RSS count for MADV_FREE pages. The
count will be increased in the madvise syscall and decreased in page reclaim
(eg, unmap). One issue is activate_page(). A MADV_FREE page can be promoted to
an active page there, but there isn't mm_struct context at that place.
Iterating vmas there sounds too silly. The patchset doesn't fix this issue
yet. Hopefully somebody can share a hint on how to fix it.

Thanks,
Shaohua

Minchan's previous patches:
http://marc.info/?l=linux-mm&m=144800657002763&w=2

Shaohua Li (6):
  mm: add wrap for page accouting index
  mm: add lazyfree page flag
  mm: add LRU_LAZYFREE lru list
  mm: move MADV_FREE pages into LRU_LAZYFREE list
  mm: reclaim lazyfree pages
  mm: enable MADV_FREE for swapless system

 drivers/base/node.c                       |  2 +
 drivers/staging/android/lowmemorykiller.c |  3 +-
 fs/proc/meminfo.c                         |  1 +
 fs/proc/task_mmu.c                        |  8 ++-
 include/linux/mm_inline.h                 | 41 +
 include/linux/mmzone.h                    |  9 +++
 include/linux/page-flags.h                |  6 ++
 include/linux/swap.h                      |  2 +-
 include/linux/vm_event_item.h             |  2 +-
 include/trace/events/mmflags.h            |  1 +
 include/trace/events/vmscan.h             | 31 +-
 kernel/power/snapshot.c                   |  1 +
 mm/compaction.c                           | 11 ++--
 mm/huge_memory.c                          |  6 +-
 mm/khugepaged.c                           |  6 +-
 mm/madvise.c                              | 11 +---
 mm/memcontrol.c                           |  4 ++
 mm/memory-failure.c                       |  3 +-
 mm/memory_hotplug.c                       |  3 +-
 mm/mempolicy.c                            |  3 +-
 mm/migrate.c                              | 29 --
 mm/page_alloc.c                           | 10
 mm/rmap.c                                 |  7 ++-
 mm/swap.c                                 | 51 +---
 mm/vmscan.c                               | 96 +++
 mm/vmstat.c                               |  4 ++
 26 files changed, 242 insertions(+), 109 deletions(-)

-- 
2.9.3


