Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-29 Thread Minchan Kim
On Thu, Dec 29, 2016 at 10:04:32AM +0100, Michal Hocko wrote:
> On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> > On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > > Hi,
> > > could you try to run with the following patch on top of the previous
> > > one? I do not think it will make a large change in your workload but
> > > I think we need something like that, so some testing under a workload which is known
> > > to generate high lowmem pressure would be really appreciated. If you have
> > > more time to play with it then running with and without the patch with
> > > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > > whether it makes any difference at all.
> > > 
> > > I would also appreciate it if Mel and Johannes had a look at it. I am not
> > > yet sure whether we need the same thing for anon/file balancing in
> > > get_scan_count. I suspect we do, but I need to think more about that.
> > > 
> > > Thanks a lot again!
> > > ---
> > > From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mho...@suse.com>
> > > Date: Tue, 27 Dec 2016 16:28:44 +0100
> > > Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> > > 
> > > get_scan_count considers the whole node LRU size when
> > > - doing SCAN_FILE due to many page cache inactive pages
> > > - calculating the number of pages to scan
> > > 
> > > in both cases this might lead to unexpected behavior especially on 32b
> > > systems where we can expect lowmem memory pressure very often.
> > > 
> > > A large highmem zone can easily distort the SCAN_FILE heuristic because
> > > there might be only a few file pages from the eligible zones on the node
> > > lru and we would still enforce file lru scanning, which can lead to
> > > thrashing while we could still scan anonymous pages.
> > 
> > Nit:
> > It doesn't cause thrashing because isolate_lru_pages filters them out,
> > but I agree it burns CPU pointlessly finding eligible pages.
> 
> This is not about isolate_lru_pages. The thrashing could happen if we had a
> lowmem pagecache user which would constantly reclaim recently faulted-in
> pages while there is anonymous memory in lowmem which could be
> reclaimed instead.
>  
> [...]
> > >  /*
> > > + * Return the number of pages on the given lru which are eligibne for the
> > eligible
> 
> fixed
> 
> > > + * given zone_idx
> > > + */
> > > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > > + enum lru_list lru, int zone_idx)
> > 
> > Nit:
> > 
> > Although there is a comment, the function name is rather confusing when
> > compared with lruvec_zone_lru_size.
> 
> I am all for a better name.
> 
> > lruvec_eligible_zones_lru_size is better?
> 
> this would be too easy to confuse with lruvec_eligible_zone_lru_size.
> What about lruvec_lru_size_eligible_zones?

Don't mind.

>  
> > Nit:
> > 
> > With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx
> > rather than its own custom calculation to filter out non-eligible pages.
> 
> Yes, that would be possible and I was considering that. But then I found it
> useful to see both the total and the reduced numbers in the tracepoint
> http://lkml.kernel.org/r/20161228153032.10821-8-mho...@kernel.org
> and didn't want to call lruvec_lru_size 2 times. But if you insist then
> I can just do that.

I don't mind either but I think we need to describe the reason if you want to
go with your open-coded version. Otherwise, someone will try to fix it.
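
For reference, a rough sketch of what the nit above would amount to if
inactive_list_is_low() reused the eligible-size helper instead of its own
filtering. This is an illustration, not the actual patch: the helper name is
taken from the quoted code, the inactive-ratio handling is simplified, and the
tracepoint counters Michal mentions are left out.

/*
 * Sketch only: inactive_list_is_low() expressed in terms of the
 * eligible-size helper from the quoted patch.
 */
static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
				 struct scan_control *sc)
{
	enum lru_list inactive_lru = file ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
	enum lru_list active_lru = file ? LRU_ACTIVE_FILE : LRU_ACTIVE_ANON;
	unsigned long inactive, active, gb, inactive_ratio;

	/* count only pages from zones eligible for this reclaim request */
	inactive = lruvec_lru_size_zone_idx(lruvec, inactive_lru, sc->reclaim_idx);
	active = lruvec_lru_size_zone_idx(lruvec, active_lru, sc->reclaim_idx);

	/* larger lists tolerate a larger active:inactive ratio */
	gb = (inactive + active) >> (30 - PAGE_SHIFT);
	inactive_ratio = gb ? int_sqrt(10 * gb) : 1;

	return inactive * inactive_ratio < active;
}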


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> Hi,
> could you try to run with the following patch on top of the previous
> one? I do not think it will make a large change in your workload but
> I think we need something like that, so some testing under a workload which is known
> to generate high lowmem pressure would be really appreciated. If you have
> more time to play with it then running with and without the patch with
> mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> whether it makes any difference at all.
> 
> I would also appreciate it if Mel and Johannes had a look at it. I am not
> yet sure whether we need the same thing for anon/file balancing in
> get_scan_count. I suspect we do, but I need to think more about that.
> 
> Thanks a lot again!
> ---
> From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mho...@suse.com>
> Date: Tue, 27 Dec 2016 16:28:44 +0100
> Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> 
> get_scan_count considers the whole node LRU size when
> - doing SCAN_FILE due to many page cache inactive pages
> - calculating the number of pages to scan
> 
> in both cases this might lead to unexpected behavior especially on 32b
> systems where we can expect lowmem memory pressure very often.
> 
> A large highmem zone can easily distort the SCAN_FILE heuristic because
> there might be only a few file pages from the eligible zones on the node
> lru and we would still enforce file lru scanning, which can lead to
> thrashing while we could still scan anonymous pages.

Nit:
It doesn't cause thrashing because isolate_lru_pages filters them out,
but I agree it burns CPU pointlessly finding eligible pages.

> 
> The latter use of lruvec_lru_size can be problematic as well, especially
> when there are not many pages from the eligible zones. We would have to
> skip over many pages to find anything to reclaim, but shrink_node_memcg
> would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
> at maximum. Therefore we can end up going over a large LRU many times
> without actually having a chance to reclaim much, if anything at all. The
> closer the lowmem zone is to running out of memory, the worse the problem
> will be.
> 
> Signed-off-by: Michal Hocko <mho...@suse.com>
> ---
>  mm/vmscan.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c98b1a585992..785b4d7fb8a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec 
> *lruvec, enum lru_list lru, int
>  }
>  
>  /*
> + * Return the number of pages on the given lru which are eligibne for the
eligible
> + * given zone_idx
> + */
> +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> + enum lru_list lru, int zone_idx)

Nit:

Although there is a comment, the function name is rather confusing when compared
with lruvec_zone_lru_size.

lruvec_eligible_zones_lru_size is better?


> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + unsigned long lru_size;
> + int zid;
> +
> + lru_size = lruvec_lru_size(lruvec, lru);
> + for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> > + struct zone *zone = &pgdat->node_zones[zid];
> + unsigned long size;
> +
> + if (!managed_zone(zone))
> + continue;
> +
> + size = lruvec_zone_lru_size(lruvec, lru, zid);
> + lru_size -= min(size, lru_size);
> + }
> +
> + return lru_size;
> +}
> +
> +/*
>   * Add a shrinker callback to be called from the vm.
>   */
>  int register_shrinker(struct shrinker *shrinker)
> @@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, 
> struct mem_cgroup *memcg,
>* system is under heavy pressure.
>*/
>   if (!inactive_list_is_low(lruvec, true, sc) &&
> - lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
> + lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, 
> sc->reclaim_idx) >> sc->priority) {
>   scan_balance = SCAN_FILE;
>   goto out;
>   }
> @@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, 
> struct mem_cgroup *memcg,
>   unsigned long size;
>   unsigned long scan;
>  
> - size = lruvec_lru_size(lruvec, lru);
> + size = lruvec_lru_size_zone_idx(lruvec, lru, 
> sc->reclaim_idx);
>  

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per node zone will not match per
> > zone statistics. The updated patch is below. So I hope my brain already
> > works after it's been mostly off last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mho...@suse.com>
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests 
> > when
> >  memcg is enabled
> > 
> > Nils Holland has reported unexpected OOM killer invocations with 32b
> > kernel starting with 4.8 kernels
> > 
> > kworker/u4:5 invoked oom-killer: 
> > gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> > oom_score_adj=0
> > kworker/u4:5 cpuset=/ mems_allowed=0
> > CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> > [...]
> > Mem-Info:
> > active_anon:58685 inactive_anon:90 isolated_anon:0
> >  active_file:274324 inactive_file:281962 isolated_file:0
> >  unevictable:0 dirty:649 writeback:0 unstable:0
> >  slab_reclaimable:40662 slab_unreclaimable:17754
> >  mapped:7382 shmem:202 pagetables:351 bounce:0
> >  free:206736 free_pcp:332 free_cma:0
> > Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
> > inactive_file:1127848kB unevictable:0kB isolated(anon):0kB 
> > isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB 
> > shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB 
> > unstable:0kB pages_scanned:0 all_unreclaimable? no
> > DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
> > inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
> > writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> > slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> > pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 813 3474 3474
> > Normal free:41332kB min:41368kB low:51708kB high:62048kB 
> > active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
> > unevictable:0kB writepending:24kB present:897016kB managed:836248kB 
> > mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB 
> > kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB 
> > local_pcp:340kB free_cma:0kB
> > lowmem_reserve[]: 0 0 21292 21292
> > HighMem free:781660kB min:512kB low:34356kB high:68200kB 
> > active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> > inactive_file:1127804kB unevictable:0kB writepending:2592kB 
> > present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB 
> > slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB 
> > free_pcp:800kB local_pcp:608kB free_cma:0kB
> > 
> > the oom killer is clearly premature because there is still a
> > lot of page cache in the zone Normal which should satisfy this lowmem
> > request. Further debugging has shown that the reclaim cannot make any
> > forward progress because the page cache is hidden in the active list
> > which doesn't get rotated because inactive_list_is_low is not memcg
> > aware.
> > It simply subtracts per-zone highmem counters from the respective

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
g per zone lru page counts in mem_cgroup_per_node
> and subtract per-memcg highmem counts when memcg is enabled. Introduce
> helper lruvec_zone_lru_size which redirects to either zone counters or
> mem_cgroup_get_zone_lru_size when appropriate.
> 
> > We are losing the empty-LRU-but-non-zero-lru_size detection introduced by
> ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
> of the inherent zone vs. node discrepancy.
> 
> Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible 
> zones inactive ratio")
> Cc: stable # 4.8+
> Reported-by: Nils Holland <nholl...@tisys.org>
> Signed-off-by: Michal Hocko <mho...@suse.com>
Acked-by: Minchan Kim <minc...@kernel.org>
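
For reference, a minimal sketch of the lruvec_zone_lru_size() helper the quoted
changelog describes: redirect to the memcg's per-zone counts when memcg is
enabled, otherwise read the node's per-zone vmstat counters. The exact stat
item name and the mem_cgroup_get_zone_lru_size() accessor are assumptions based
on the changelog, not the patch itself.

/*
 * Sketch only: per-zone LRU size, memcg-aware as described above.
 */
static unsigned long lruvec_zone_lru_size(struct lruvec *lruvec,
					  enum lru_list lru, int zone_idx)
{
	if (!mem_cgroup_disabled())
		return mem_cgroup_get_zone_lru_size(lruvec, lru, zone_idx);

	return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
			       NR_ZONE_LRU_BASE + lru);
}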



Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory

2011-10-01 Thread Minchan Kim
On Fri, Sep 30, 2011 at 09:17:20AM +0200, Johannes Weiner wrote:
 The amount of dirtyable pages should not include the full number of
 free pages: there is a number of reserved pages that the page
 allocator and kswapd always try to keep free.
 
 The closer (reclaimable pages - dirty pages) is to the number of
 reserved pages, the more likely it becomes for reclaim to run into
 dirty pages:
 
+--+ ---
|   anon   |  |
+--+  |
|  |  |
|  |  -- dirty limit new-- flusher new
|   file   |  | |
|  |  | |
|  |  -- dirty limit old-- flusher old
|  ||
+--+   --- reclaim
| reserved |
+--+
|  kernel  |
+--+
 
 This patch introduces a per-zone dirty reserve that takes both the
 lowmem reserve as well as the high watermark of the zone into account,
 and a global sum of those per-zone values that is subtracted from the
 global amount of dirtyable pages.  The lowmem reserve is unavailable
 to page cache allocations and kswapd tries to keep the high watermark
 free.  We don't want to end up in a situation where reclaim has to
 clean pages in order to balance zones.
 
 Not treating reserved pages as dirtyable on a global level is only a
 conceptual fix.  In reality, dirty pages are not distributed equally
 across zones and reclaim runs into dirty pages on a regular basis.
 
 But it is important to get this right before tackling the problem on a
 per-zone level, where the distance between reclaim and the dirty pages
 is mostly much smaller in absolute numbers.
 
 Signed-off-by: Johannes Weiner jwei...@redhat.com
 Reviewed-by: Rik van Riel r...@redhat.com
Reviewed-by: Minchan Kim minchan@gmail.com

-- 
Kind regards,
Minchan Kim
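
To make the accounting above concrete, a rough sketch of how the per-zone dirty
reserve could be summed: each zone sets aside its largest lowmem_reserve[]
entry plus its high watermark, and the global sum of those reserves is what is
subtracted from dirtyable memory. The standalone function below is
hypothetical; the patch itself folds this into calculate_totalreserve_pages().

static unsigned long sum_dirty_balance_reserve(void)
{
	unsigned long reserve_pages = 0;
	struct zone *zone;
	int i;

	for_each_zone(zone) {
		unsigned long max = 0;

		/* pages unavailable to page cache allocations ... */
		for (i = 0; i < MAX_NR_ZONES; i++)
			if (zone->lowmem_reserve[i] > max)
				max = zone->lowmem_reserve[i];

		/* ... plus what kswapd tries to keep free */
		max += high_wmark_pages(zone);

		zone->dirty_balance_reserve = max;	/* per-zone reserve */
		reserve_pages += max;			/* global sum */
	}
	return reserve_pages;
}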


Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

2011-09-28 Thread Minchan Kim
On Tue, Sep 20, 2011 at 03:45:14PM +0200, Johannes Weiner wrote:
 Tell the page allocator that pages allocated through
 grab_cache_page_write_begin() are expected to become dirty soon.
 
 Signed-off-by: Johannes Weiner jwei...@redhat.com
Reviewed-by: Minchan Kim minchan@gmail.com

-- 
Kind regards,
Minchan Kim


Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Minchan Kim
On Wed, Sep 28, 2011 at 09:11:54AM +0200, Johannes Weiner wrote:
 On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
  On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
   The maximum number of dirty pages that exist in the system at any time
   is determined by a number of pages considered dirtyable and a
   user-configured percentage of those, or an absolute number in bytes.
  
  It's an explanation of the old approach.
 
 What do you mean?  This does not change with this patch.  We still
 have a number of dirtyable pages and a limit that is applied
 relatively to this number.
 
   This number of dirtyable pages is the sum of memory provided by all
   the zones in the system minus their lowmem reserves and high
   watermarks, so that the system can retain a healthy number of free
   pages without having to reclaim dirty pages.
  
  It's an explanation of the new approach.
 
 Same here, this aspect is also not changed with this patch!
 
   But there is a flaw in that we have a zoned page allocator which does
   not care about the global state but rather the state of individual
   memory zones.  And right now there is nothing that prevents one zone
   from filling up with dirty pages while other zones are spared, which
   frequently leads to situations where kswapd, in order to restore the
   watermark of free pages, does indeed have to write pages from that
   zone's LRU list.  This can interfere so badly with IO from the flusher
   threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
   requests from reclaim already, taking away the VM's only possibility
   to keep such a zone balanced, aside from hoping the flushers will soon
   clean pages from that zone.
  
  It's an explanation of the old approach, again!
  Shouldn't we move the new-approach paragraph above down below this?
 
 Everything above describes the current behaviour (at the point of this
 patch, so respecting lowmem_reserve e.g. is part of the current
 behaviour by now) and its problems.  And below follows a description
 of how the patch tries to fix it.

It seems that using the old and new terms was not a good choice.
Hannes, please ignore it; it's not a biggie.

-- 
Kind regards,
Minchan Kim


Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

2011-09-28 Thread Minchan Kim
On Wed, Sep 28, 2011 at 09:50:54AM +0200, Johannes Weiner wrote:
 On Wed, Sep 28, 2011 at 01:55:51PM +0900, Minchan Kim wrote:
  Hi Hannes,
  
  On Fri, Sep 23, 2011 at 04:38:17PM +0200, Johannes Weiner wrote:
   The amount of dirtyable pages should not include the full number of
   free pages: there is a number of reserved pages that the page
   allocator and kswapd always try to keep free.
   
   The closer (reclaimable pages - dirty pages) is to the number of
   reserved pages, the more likely it becomes for reclaim to run into
   dirty pages:
   
  +--+ ---
  |   anon   |  |
  +--+  |
  |  |  |
  |  |  -- dirty limit new-- flusher new
  |   file   |  | |
  |  |  | |
  |  |  -- dirty limit old-- flusher old
  |  ||
  +--+   --- reclaim
  | reserved |
  +--+
  |  kernel  |
  +--+
   
   This patch introduces a per-zone dirty reserve that takes both the
   lowmem reserve as well as the high watermark of the zone into account,
   and a global sum of those per-zone values that is subtracted from the
   global amount of dirtyable pages.  The lowmem reserve is unavailable
   to page cache allocations and kswapd tries to keep the high watermark
   free.  We don't want to end up in a situation where reclaim has to
   clean pages in order to balance zones.
   
   Not treating reserved pages as dirtyable on a global level is only a
   conceptual fix.  In reality, dirty pages are not distributed equally
   across zones and reclaim runs into dirty pages on a regular basis.
   
   But it is important to get this right before tackling the problem on a
   per-zone level, where the distance between reclaim and the dirty pages
   is mostly much smaller in absolute numbers.
   
   Signed-off-by: Johannes Weiner jwei...@redhat.com
   ---
    include/linux/mmzone.h |    6 ++++++
    include/linux/swap.h   |    1 +
    mm/page-writeback.c    |    6 ++++--
    mm/page_alloc.c        |   19 +++++++++++++++++++
4 files changed, 30 insertions(+), 2 deletions(-)
   
   diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
   index 1ed4116..37a61e7 100644
   --- a/include/linux/mmzone.h
   +++ b/include/linux/mmzone.h
   @@ -317,6 +317,12 @@ struct zone {
  */
 unsigned long   lowmem_reserve[MAX_NR_ZONES];

   + /*
   +  * This is a per-zone reserve of pages that should not be
   +  * considered dirtyable memory.
   +  */
   + unsigned long   dirty_balance_reserve;
   +
#ifdef CONFIG_NUMA
 int node;
 /*
   diff --git a/include/linux/swap.h b/include/linux/swap.h
   index b156e80..9021453 100644
   --- a/include/linux/swap.h
   +++ b/include/linux/swap.h
   @@ -209,6 +209,7 @@ struct swap_list_t {
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalreserve_pages;
   +extern unsigned long dirty_balance_reserve;
extern unsigned int nr_free_buffer_pages(void);
extern unsigned int nr_free_pagecache_pages(void);

   diff --git a/mm/page-writeback.c b/mm/page-writeback.c
   index da6d263..c8acf8a 100644
   --- a/mm/page-writeback.c
   +++ b/mm/page-writeback.c
   @@ -170,7 +170,8 @@ static unsigned long 
   highmem_dirtyable_memory(unsigned long total)
 NODE_DATA(node)->node_zones[ZONE_HIGHMEM];

 x += zone_page_state(z, NR_FREE_PAGES) +
   -  zone_reclaimable_pages(z);
   +  zone_reclaimable_pages(z) -
    +  zone->dirty_balance_reserve;
 }
 /*
  * Make sure that the number of highmem pages is never larger
   @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
{
 unsigned long x;

   - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
   + x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
   + dirty_balance_reserve;

 if (!vm_highmem_is_dirtyable)
 x -= highmem_dirtyable_memory(x);
   diff --git a/mm/page_alloc.c b/mm/page_alloc.c
   index 1dba05e..f8cba89 100644
   --- a/mm/page_alloc.c
   +++ b/mm/page_alloc.c
   @@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);

unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
   +/*
   + * When calculating the number of globally allowed dirty pages, there
   + * is a certain number of per-zone reserves that should not be
   + * considered dirtyable memory.  This is the sum of those reserves
   + * over all existing zones that contribute dirtyable memory.
   + */
   +unsigned long dirty_balance_reserve __read_mostly;
   +
int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

   @@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages

Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

2011-09-27 Thread Minchan Kim
 to
 +  * GFP_HIGHUSER page cache allocations and
 +  * kswapd tries to balance zones to their high
 +  * watermark.  As a result, neither should be
 +  * regarded as dirtyable memory, to prevent a
 +  * situation where reclaim has to clean pages
 +  * in order to balance the zones.
 +  */

Could you use Mel's description instead, if you don't mind?
If I hadn't seen Mel's version, maybe I wouldn't suggest it, but it looks
easier to understand.

 + zone->dirty_balance_reserve = max;
   }
   }
 + dirty_balance_reserve = reserve_pages;
   totalreserve_pages = reserve_pages;
  }
  
 -- 
 1.7.6.2
 

-- 
Kind regards,
Minchan Kim


Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-27 Thread Minchan Kim
On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
 The maximum number of dirty pages that exist in the system at any time
 is determined by a number of pages considered dirtyable and a
 user-configured percentage of those, or an absolute number in bytes.

It's an explanation of the old approach.

 
 This number of dirtyable pages is the sum of memory provided by all
 the zones in the system minus their lowmem reserves and high
 watermarks, so that the system can retain a healthy number of free
 pages without having to reclaim dirty pages.

It's an explanation of the new approach.

 
 But there is a flaw in that we have a zoned page allocator which does
 not care about the global state but rather the state of individual
 memory zones.  And right now there is nothing that prevents one zone
 from filling up with dirty pages while other zones are spared, which
 frequently leads to situations where kswapd, in order to restore the
 watermark of free pages, does indeed have to write pages from that
 zone's LRU list.  This can interfere so badly with IO from the flusher
 threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
 requests from reclaim already, taking away the VM's only possibility
 to keep such a zone balanced, aside from hoping the flushers will soon
 clean pages from that zone.

It's an explanation of the old approach, again!
Shouldn't we move the new-approach paragraph above down below this?

 
 Enter per-zone dirty limits.  They are to a zone's dirtyable memory
 what the global limit is to the global amount of dirtyable memory, and
 try to make sure that no single zone receives more than its fair share
 of the globally allowed dirty pages in the first place.  As the number
 of pages considered dirtyable exclude the zones' lowmem reserves and
 high watermarks, the maximum number of dirty pages in a zone is such
 that the zone can always be balanced without requiring page cleaning.
 
 As this is a placement decision in the page allocator and pages are
 dirtied only after the allocation, this patch allows allocators to
 pass __GFP_WRITE when they know in advance that the page will be
 written to and become dirty soon.  The page allocator will then
 attempt to allocate from the first zone of the zonelist - which on
 NUMA is determined by the task's NUMA memory policy - that has not
 exceeded its dirty limit.
 
 At first glance, it would appear that the diversion to lower zones can
 increase pressure on them, but this is not the case.  With a full high
 zone, allocations will be diverted to lower zones eventually, so it is
 more of a shift in timing of the lower zone allocations.  Workloads
 that previously could fit their dirty pages completely in the higher
 zone may be forced to allocate from lower zones, but the amount of
 pages that 'spill over' are limited themselves by the lower zones'
 dirty constraints, and thus unlikely to become a problem.

That's a good justification.

 
 For now, the problem of unfair dirty page distribution remains for
 NUMA configurations where the zones allowed for allocation are in sum
 not big enough to trigger the global dirty limits, wake up the flusher
 threads and remedy the situation.  Because of this, an allocation that
 could not succeed on any of the considered zones is allowed to ignore
 the dirty limits before going into direct reclaim or even failing the
 allocation, until a future patch changes the global dirty throttling
 and flusher thread activation so that they take individual zone states
 into account.
 
 Signed-off-by: Johannes Weiner jwei...@redhat.com

Otherwise, looks good to me.
Reviewed-by: Minchan Kim minchan@gmail.com

-- 
Kind regards,
Minchan Kim
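
A rough sketch of the allocation-time check the description above implies for
__GFP_WRITE allocations: skip any zone whose dirty pages have already reached
its per-zone limit. Both the function and the zone_dirty_limit() helper are
assumptions for illustration, not the patch itself.

static bool zone_dirty_ok(struct zone *zone)
{
	/* zone_dirty_limit() is assumed to return this zone's share of the
	 * global dirty limit, proportional to its dirtyable memory */
	unsigned long limit = zone_dirty_limit(zone);

	return zone_page_state(zone, NR_FILE_DIRTY) +
	       zone_page_state(zone, NR_UNSTABLE_NFS) +
	       zone_page_state(zone, NR_WRITEBACK) <= limit;
}

/* in the allocator's zonelist walk (sketch):
 *
 *	if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
 *		continue;	// try the next zone in the zonelist
 */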


Re: [patch 1/2/4] mm: writeback: cleanups in preparation for per-zone dirty limits

2011-09-27 Thread Minchan Kim
On Fri, Sep 23, 2011 at 04:41:07PM +0200, Johannes Weiner wrote:
 On Thu, Sep 22, 2011 at 10:52:42AM +0200, Johannes Weiner wrote:
  On Wed, Sep 21, 2011 at 04:02:26PM -0700, Andrew Morton wrote:
   Should we rename determine_dirtyable_memory() to
   global_dirtyable_memory(), to get some sense of its relationship with
   zone_dirtyable_memory()?
  
  Sounds good.
 
 ---
 
 The next patch will introduce per-zone dirty limiting functions in
 addition to the traditional global dirty limiting.
 
 Rename determine_dirtyable_memory() to global_dirtyable_memory()
 before adding the zone-specific version, and fix up its documentation.
 
 Also, move the functions to determine the dirtyable memory and the
 function to calculate the dirty limit based on that together so that
 their relationship is more apparent and that they can be commented on
 as a group.
 
 Signed-off-by: Johannes Weiner jwei...@redhat.com
Reviewed-by: Minchan Kim minchan@gmail.com
-- 
Kind regards,
Minchan Kim
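
For reference, the zone-local counterpart mentioned above, sketched by analogy
with the renamed global_dirtyable_memory(): free plus reclaimable pages in the
zone, minus the zone's dirty_balance_reserve. An approximation only; the actual
patch may differ in detail.

static unsigned long zone_dirtyable_memory(struct zone *zone)
{
	unsigned long nr_pages;

	nr_pages = zone_page_state(zone, NR_FREE_PAGES) +
		   zone_reclaimable_pages(zone);
	/* don't let the reserve underflow the total */
	nr_pages -= min(nr_pages, zone->dirty_balance_reserve);

	return nr_pages;
}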


Re: [PATCH V8 4/8] mm/fs: add hooks to support cleancache

2011-04-17 Thread Minchan Kim
 that filekey, delete in cleancache any record of
 any matching handle and any data associated with those handles;
 any space used for the data can also be freed.

 The cleancache operation currently known as init fs has
 the following semantics: Create a unique poolid to refer
 to this filesystem and save it in the superblock's
 cleancache_poolid field.

 The cleancache operation currently known as flush fs has
 the following semantics: Get the cleancache_poolid field
 from this superblock.  If cleancache contains ANY handles
 associated with that poolid, delete in cleancache any
 record of any matching handles and any data associated with
 those handles; any space used for the data can also be freed.
 Also, set the superblock's cleancache_poolid to be invalid
 and, in cleancache, recycle the poolid so a subsequent init_fs
 operation can reuse it.

 That's all!

 Thanks,
 Dan


At least, I wasn't confused by your semantics except for flush. That's
why I originally suggested changing only flush, but after seeing your explanation there is
another thing I want to change. get/put is the common naming for
reference counting in the kernel; from the point of view of your semantics it makes sense
to me, but this get has exclusive semantics, so I want the API name to
reflect that. Maybe cleancache_get_page_exclusive.

The summary is that I don't want to change all the API names, just two things.
(I am not sure you and others agree with me; it's just a suggestion.)

1. cleancache_flush_page - cleancache_[invalidate|remove]_page
2. cleancache_get_page - cleancache_get_page_exclusive

BTW, nice description.
Please include it in the documentation if we can't reach a conclusion.
It will help others understand the semantics of cleancache.

Thanks, Dan.

-- 
Kind regards,
Minchan Kim
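
Written out as declarations, the two renames suggested above would look roughly
like this; the names are Minchan's proposal in this thread, not an existing
API, and the signatures follow the current flush/get calls.

/* was cleancache_flush_page(): drop any copy the backend holds */
void cleancache_invalidate_page(struct address_space *mapping,
				struct page *page);

/* was cleancache_get_page(): copy the data back and drop the backend copy,
 * making the exclusive behaviour explicit in the name */
int cleancache_get_page_exclusive(struct page *page);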


Re: [PATCH V8 4/8] mm/fs: add hooks to support cleancache

2011-04-14 Thread Minchan Kim
, otherwise
 +        * invalidate any existing cleancache entries.  We can't leave
 +        * stale data around in the cleancache once our page is gone
 +        */
 +       if (PageUptodate(page) && PageMappedToDisk(page))
 +               cleancache_put_page(page);
 +       else
 +               cleancache_flush_page(mapping, page);
 +

First of all, thanks for resolving conflict with my patch.

Earlier I raised a point about cleancache_flush_page and cleancache_flush_inode.

What is the intended semantic of flush?
I thought it means invalidation.
AFAIC, how about replacing flush with invalidate?


-- 
Kind regards,
Minchan Kim


Re: [PATCH V2 0/3] drivers/staging: zcache: dynamic page cache/swap compression

2011-02-15 Thread Minchan Kim
On Wed, Feb 16, 2011 at 10:27 AM, Dan Magenheimer
dan.magenhei...@oracle.com wrote:
 -Original Message-
 From: Matt [mailto:jackdac...@gmail.com]
 Sent: Tuesday, February 15, 2011 5:12 PM
 To: Minchan Kim
 Cc: Dan Magenheimer; gre...@suse.de; Chris Mason; linux-
 ker...@vger.kernel.org; linux...@kvack.org; ngu...@vflare.org; linux-
 bt...@vger.kernel.org; Josef Bacik; Dan Rosenberg; Yan Zheng;
 mi...@cn.fujitsu.com; Li Zefan
 Subject: Re: [PATCH V2 0/3] drivers/staging: zcache: dynamic page
 cache/swap compression

 On Mon, Feb 14, 2011 at 4:35 AM, Minchan Kim minchan@gmail.com
  Just my guessing. I might be wrong.
 
  __cleancache_flush_inode calls cleancache_get_key with
 cleancache_filekey.
  cleancache_file_key's size is just 6 * u32.
  cleancache_get_key calls btrfs_encode_fh with the key,
  but btrfs_encode_fh casts the key to btrfs_fid, which is a bigger
  size than cleancache_filekey's, so it should not access fields
  beyond the cleancache_filekey.
 
  I think some filesystems use extended fids, so this problem can
  happen there. I don't know why we couldn't find it earlier. Maybe Dan and
  others have been testing it for a long time.
 
  Am I missing something?
 
 
 
  --
  Kind regards,
  Minchan Kim
 

 reposting Minchan's message for reference to the btrfs mailing list
 while also adding

 Li Zefan, Miao Xie, Yan Zheng, Dan Rosenberg and Josef Bacik to CC

 Regards

 Matt

 Hi Matt and Minchan --

 (BTRFS EXPERTS SEE *** BELOW)

 I definitely see a bug in cleancache_get_key in the monolithic
 zcache+cleancache+frontswap patch I posted on oss.oracle.com
 that is corrected in linux-next but I don't see how it could
 get provoked by btrfs.

 The bug is that, in cleancache_get_key, the return value of fhfn should
 be checked against 255.  If the return value is 255, cleancache_get_key
 should return -1.  This should disable cleancache for any filesystem
 where KEY_MAX is too large.

 But cleancache_get_key always calls fhfn with connectable == 0 and
 CLEANCACHE_KEY_MAX==6 should be greater than BTRFS_FID_SIZE_CONNECTABLE
 (which I think should be 5?).  And the elements written into the
 typecast btrfs_fid should be only writing the first 5 32-bit words.

BTRFS_FID_SIZE_NON_CONNECTABLE is 5, not BTRFS_FID_SIZE_CONNECTABLE.
Anyway, you passed connectable as 0, so it should only be writing the
first 5 32-bit words, as you said.
That's one I missed. ;-)

Thanks.
-- 
Kind regards,
Minchan Kim
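
A rough sketch of the check Dan describes above: treat a 255 ("file handle too
long") return from the filesystem's encode_fh hook as a reason to disable
cleancache for that file. The callback type and the helper name below are
simplifications made up for illustration; only the return-value handling is
the point.

#define CLEANCACHE_KEY_MAX 6	/* u32 slots in cleancache_filekey */

/* simplified stand-in for the export_operations encode_fh hook */
typedef int (*encode_fh_t)(void *object, u32 *fh, int *max_len, int connectable);

static int cleancache_fill_key_sketch(encode_fh_t fhfn, void *object, u32 *fh)
{
	int maxlen = CLEANCACHE_KEY_MAX;
	int len = fhfn(object, fh, &maxlen, 0);	/* connectable == 0 */

	if (len <= 0 || len == 255)		/* 255 means "fh too long" */
		return -1;			/* don't use cleancache here */
	if (maxlen > CLEANCACHE_KEY_MAX)	/* key would overflow */
		return -1;
	return 0;
}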


Re: [PATCH V3 0/8] Cleancache: overview

2010-07-23 Thread Minchan Kim
On Fri, Jul 23, 2010 at 4:36 PM, Nitin Gupta ngu...@vflare.org wrote:

 2. I think change in btrfs can be avoided by moving cleancache_get_page()
 from do_mpage_readpage() to filemap_fault() and this should work for all
 filesystems. See:

 handle_pte_fault() -> do_(non)linear_fault() -> __do_fault()
                                                -> vma->vm_ops->fault()

 which is defined as filemap_fault() for all filesystems. If some future
 filesystem uses its own custom function (why?) then it will have to arrange 
 for
 call to cleancache_get_page(), if it wants this feature.


filemap_fault works only for file-backed pages which are mapped,
but it doesn't cover not-mapped cache pages.  So we could miss cached pages
brought in by the read system call if we move it into filemap_fault.


-- 
Kind regards,
Minchan Kim
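
Minchan's point in code form, heavily simplified: the cleancache lookup has to
sit in the filesystem read path (do_mpage_readpage() and friends), not only
behind a page fault, otherwise pages populated via the read system call never
consult cleancache. The wrapper below and its success convention are
hypothetical; the real hook placement in the series may differ.

static int readpage_with_cleancache(struct file *file, struct page *page)
{
	/* assumed convention: 0 means the backend supplied the data */
	if (!PageUptodate(page) && cleancache_get_page(page) == 0) {
		SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}
	/* otherwise fall back to real block I/O via the fs readpage */
	return page->mapping->a_ops->readpage(file, page);
}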


Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-04 Thread Minchan Kim
Hi, Nitin. 

I am happy to hear you started this work. 

On Fri, Jun 04, 2010 at 03:06:49PM +0530, Nitin Gupta wrote:
 On 06/03/2010 09:13 PM, Dan Magenheimer wrote:
  On 06/03/2010 10:23 AM, Andreas Dilger wrote:
  On 2010-06-02, at 20:46, Nitin Gupta wrote:
 
  I was thinking it would be quite clever to do compression in, say,
  64kB or 128kB chunks in a mapping (to get decent compression) and
  then write these compressed chunks directly from the page cache
  to disk in btrfs and/or a revived compressed ext4.
 
  Batching of pages to get good compression ratio seems doable.
  
  Is there evidence that batching a set of random individual 4K
  pages will have a significantly better compression ratio than
  compressing the pages separately?  I certainly understand that
  if the pages are from the same file, compression is likely to
  be better, but pages evicted from the page cache (which is
  the source for all cleancache_puts) are likely to be quite a
  bit more random than that, aren't they?
  
 
 
 Batching of pages from random files may not be so effective but
 it would be interesting to collect some data for this. Still,
 per-inode batching of pages seems doable and this should help
 us get over this problem.

1)
Please consider the system memory pressure case.
In such a case, we have to release compressed cache pages.
Or, better, discard pages that do not compress well at the time
you compress them.

2)
This work is related to page reclaim.
Page reclaim exists to make free memory,
but this work might free less memory than before.
I admit your concept is good in terms of I/O cost,
but we might discard more clean pages than before if you want to
batch pages for good compression.

3)
Test case.

As I mentioned, it could be good in terms of I/O cost,
but it could change the system's behavior due to the backend's page consumption:
a lot of extra page scanning/reclaiming could happen.
That means hot pages can be discarded with this patch.
But that is just guessing,
so we need numbers from a test case where we can measure I/O and system
responsiveness.

 
 Thanks,
 Nitin

-- 
Kind regards,
Minchan Kim
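
A small sketch of point 1) above: if a page compresses badly, don't keep it in
the compressed cache at all, since under memory pressure it would cost nearly a
full page anyway. The threshold and the my_*() helpers are hypothetical names
made up for illustration.

/* hypothetical backend helpers, declared only for the sketch */
extern int my_compress_page(struct page *page, void *workmem, unsigned int *out_len);
extern int my_store_compressed(struct page *page, void *buf, unsigned int len);

#define BAD_COMPRESS_THRESHOLD	(PAGE_SIZE * 3 / 4)	/* assumed policy knob */

static int compressed_cache_put(struct page *page, void *workmem)
{
	unsigned int clen = PAGE_SIZE;

	if (my_compress_page(page, workmem, &clen))
		return -EIO;

	if (clen > BAD_COMPRESS_THRESHOLD)
		return -ENOSPC;	/* discard rather than store a poor ratio */

	return my_store_compressed(page, workmem, clen);
}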


Re: [PATCH V2 3/7] Cleancache (was Transcendent Memory): VFS hooks

2010-06-04 Thread Minchan Kim
On Fri, Jun 04, 2010 at 08:13:14AM -0700, Dan Magenheimer wrote:
  1)
  You mentioned PFRA in your description, and I understood cleancache holds
  cold clean pages which are evicted by the reclaimer.
  But __remove_from_page_cache can be called by other call sites.

  For example, shmem_writepage calls it to move the page from the page cache
  to the swap cache. Although the page is no longer in the page cache, it is in
  the swap cache, so subsequent reads/writes of the shmem object until swapout
  happens can be served from the swap cache.

  I didn't look into all of the call sites, but please review them again.
 
 I think the if (PageUptodate(page)) eliminates all the cases
 where bad things can happen.

I missed it. My first concern is gone. :)

 
 Note that there may be cases where some unnecessary puts/flushes
 occur.  The focus of the patch is on correctness first; it may
 be possible to increase performance (marginally) in the future by
 reducing unnecessary cases.

I think it wouldn't be just marginal; it depends on the implementation
of the backend.
I think it would be better for the frontend to notify the backend in
exactly the right place. As in your description, we could call it in shrink_page_list
with some check, or change __remove_mapping to take an argument which tells
it that it is being called from the reclaim path.

 
  3) Please consider system memory pressure.
  And I hope Nitin consider this, too.
 
 This is definitely very important but remember that cleancache
 provides a great deal of flexibility:  Any page in cleancache
 can be thrown away at any time as every page is clean!  It
 can even accept a page and throw it away immediately.  Clearly
 the backend needs to do this intelligently so this will
 take some policy work.

I admit the design goal of cleancache is to give a great deal of flexibility,
but I think system memory pressure (i.e. direct reclaim and even OOM) is
exceptional. Whenever we implement various backends, every backend (in a
non-virtual environment) has to implement a policy that deals with the system
memory pressure emergency to prevent a system hang, I think.

And a backend might need some hack to know the situation. That's horrible,
so I hope the frontend gives at least a little information to the backend.

If some backend doesn't need it, it can just ignore it,
but for a backend that does need it, it can be a big deal. :)

 
 Since I saw you sent a separate response to Nitin, I'll
 let him answer for his in-kernel page cache compression
 work.  The solution to the similar problem for Xen is
 described in the tmem internals document that I think
 I pointed to earlier here:
 http://oss.oracle.com/projects/tmem/documentation/internals/ 

I will read it when I have time.
Thanks for the quick reply, but I can't read it right now;
it's time for sleep and the weekend.
See you soon, and have a nice weekend.

 
 Thanks,
 Dan
 

-- 
Kind regards,
Minchan Kim


Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-02 Thread Minchan Kim
. by tools that control the pseudo-RAM).  Or a
 pseudo-RAM implementation can simply disable shared_init by always
 returning a negative value.

 If a get_page is successful on a non-shared pool, the page is flushed (thus
 making cleancache an exclusive cache).  On a shared pool, the page

Do you have any reason for forcing exclusive behavior on a non-shared pool?
To free memory in pseudo-RAM?
I want to make it inclusive for a certain reason, but unfortunately I can't
say why I want that yet.

While you mentioned it's exclusive, cleancache_get_page doesn't
flush the page in the code below.
Is that the job of whoever implements cleancache_ops->get_page?

+int __cleancache_get_page(struct page *page)
+{
+   int ret = 0;
+   int pool_id = page->mapping->host->i_sb->cleancache_poolid;
+
+   if (pool_id >= 0) {
+   ret = (*cleancache_ops->get_page)(pool_id,
+ page->mapping->host->i_ino,
+ page->index,
+ page);
+   if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
+   succ_gets++;
+   else
+   failed_gets++;
+   }
+   return ret;
+}
+EXPORT_SYMBOL(__cleancache_get_page);

If the backing device is RAM (for example), could we _move_ the pages from the page cache
to cleancache?
I mean I don't want to copy the page on get/put operations; we could just
move the page when the backing device is RAM. Is that possible?

You sent the patches which are the core of cleancache, but I don't see any use case.
Could you send the use-case patches with this series?
It would help in understanding cleancache's benefit.

-- 
Kind regards,
Minchan Kim


Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-02 Thread Minchan Kim
Hi, Dan. 

On Wed, Jun 02, 2010 at 08:27:48AM -0700, Dan Magenheimer wrote:
 Hi Minchan --
 
  I think cleancache approach is cool. :)
  I have some suggestions and questions.
 
 Thanks for your interest!
 
   If a get_page is successful on a non-shared pool, the page is flushed
  (thus
   making cleancache an exclusive cache).  On a shared pool, the page
  
  Do you have any reason for forcing exclusive behavior on a non-shared pool?
  To free memory in pseudo-RAM?
  I want to make it inclusive for a certain reason, but unfortunately I can't
  say why I want that yet.
 
 The main reason is to free up memory in pseudo-RAM and to
 avoid unnecessary cleancache_flush calls.  If you want
 inclusive, the page can be put immediately following
 the get.  If put-after-get for inclusive becomes common,
 the interface could easily be extended to add a get_no_flush
 call.

Sounds good to me. 
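
Dan's put-after-get idea, sketched: emulate inclusive behaviour by handing the
page straight back to the backend after an exclusive get. The wrapper name is
hypothetical, a __cleancache_put_page() counterpart to the __cleancache_get_page()
quoted earlier is assumed, and a real get_no_flush hook would avoid the extra
round trip.

static int cleancache_get_page_inclusive(struct page *page)
{
	int ret = __cleancache_get_page(page);	/* exclusive: backend drops its copy */

	if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
		__cleancache_put_page(page);	/* give the data back to the backend */

	return ret;
}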

  
  While you mentioned it's exclusive, cleancache_get_page doesn't
  flush the page in the code below.
  Is that the job of whoever implements cleancache_ops->get_page?
 
 Yes, the flush is done by the cleancache implementation.
 
  If the backing device is RAM (for example), could we _move_ the pages from the page cache
  to cleancache?
  I mean I don't want to copy the page on get/put operations; we could just
  move the page when the backing device is RAM. Is that possible?
 
 By move, do you mean changing the virtual mappings?  Yes,
 this could be done as long as the source and destination are
 both directly addressable (that is, true physical RAM), but
 requires TLB manipulation and has some complicated corner
 cases.  The copy semantics simplifies the implementation on
 both the frontend and the backend and also allows the
 backend to do fancy things on-the-fly like page compression
 and page deduplication.

Agreed, but that's not quite what I meant.
If I used brd as the backend, I would want to do it as follows:

put_page :

remove_from_page_cache(page);
brd_insert_page(page);

get_page :

brd_lookup_page(page);
add_to_page_cache(page);

Of course, I know that's impossible without new metadata and modification of
page cache handling, and that the copy approach is what keeps the frontend/backend
layering clean.

What I want is to remove the copy overhead when the backend is RAM that is also
part of main memory (i.e. we have a page descriptor for it).

Do you have any ideas?

 
  You sent the patches which are the core of cleancache, but I don't see any
  use case.
  Could you send the use-case patches with this series?
  It would help in understanding cleancache's benefit.
 
 Do you mean the Xen Transcendent Memory (tmem) implementation?
 If so, this is four files in the Xen source tree (common/tmem.c,
 common/tmem_xen.c, include/xen/tmem.h, include/xen/tmem_xen.h).
 There is also an html document in the Xen source tree, which can
 be viewed here:
 http://oss.oracle.com/projects/tmem/dist/documentation/internals/xen4-internals-v01.html
  
 
 Or did you mean a cleancache_ops backend?  For tmem, there
 is one file linux/drivers/xen/tmem.c and it interfaces between
 the cleancache_ops calls and Xen hypercalls.  It should be in
 a Xenlinux pv_ops tree soon, or I can email it sooner.

I meant the backend. :)

 
 I am also eagerly awaiting Nitin Gupta's cleancache backend
 and implementation to do in-kernel page cache compression.

Did Nitin say he will make a cleancache backend for
page cache compression?

It would be a good feature.
I'm interested in it, too. :)

Thanks, Dan. 

 
 Thanks,
 Dan

-- 
Kind regards,
Minchan Kim