Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-30 Thread Mel Gorman
On Fri, Dec 30, 2016 at 12:05:45PM +0100, Michal Hocko wrote:
> On Fri 30-12-16 10:19:26, Mel Gorman wrote:
> > On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > > 
> > > > > Nils, even though this is still highly experimental, could you give 
> > > > > it a
> > > > > try please?
> > > > 
> > > > Yes, no problem! So I kept the very first patch you sent but had to
> > > > revert the latest version of the debugging patch (the one in
> > > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > > memory cgroups enabled again, and the first thing that strikes the eye
> > > > is that I get this during boot:
> > > > 
> > > > [1.568174] [ cut here ]
> > > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > > mem_cgroup_update_lru_size+0x118/0x130
> > > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 
> > > > but not empty
> > > 
> > > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > > my patch (I double account) and b) the detection for the empty list
> > > cannot work after my change because per-node counters will not match
> > > per-zone statistics. The updated patch is below. So I hope my brain
> > > already works after it's been mostly off the last few days...
> > > ---
> > > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko 
> > > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> > >  memcg is enabled
> > > 
> > > Nils Holland has reported unexpected OOM killer invocations with a 32b
> > > kernel, starting with 4.8 kernels
> > > 
> > 
> > I think it's unfortunate that per-zone stats are reintroduced to the
> > memcg structure.
> 
> The original patch I had didn't add per-zone stats but rather added an
> nr_highmem counter to mem_cgroup_per_node (inside ifdef CONFIG_HIGHMEM).
> This would help for this particular case but it wouldn't work for other
> lowmem requests (e.g. GFP_DMA32), and with the kmem accounting this might
> be a problem in the future.

That did occur to me.

> So I've decided to go with a more generic
> approach which requires per-zone tracking. I cannot say I would be
> overly happy about this at all.
> 
> > I can't help but think that it would have also worked
> > to always rotate a small number of pages if !inactive_list_is_low and
> > reclaiming for memcg even if it distorted page aging.
> 
> I am not really sure how that would work. Do you mean something like the
> following?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fa30010a5277..563ada3c02ac 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2044,6 +2044,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>   inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
>   active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
>  
> + if (!mem_cgroup_disabled())
> + goto out;
> +
>   /*
>* For zone-constrained allocations, it is necessary to check if
>* deactivations are required for lowmem to be reclaimed. This
> @@ -2063,6 +2066,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>   active -= min(active, active_zone);
>   }
>  
> +out:
>   gb = (inactive + active) >> (30 - PAGE_SHIFT);
>   if (gb)
>   inactive_ratio = int_sqrt(10 * gb);
> 
> The problem I see with such an approach is that chances are that this
> would reintroduce what f8d1a31163fc ("mm: consider whether to decivate
> based on eligible zones inactive ratio") tried to fix. But maybe I have
> missed your point.
> 

No, you didn't miss the point. It was something like that I had in mind,
but as I thought about it, I could see some cases where it might not work
and still cause a premature OOM. The per-zone accounting is unfortunate
but it's robust, hence the Ack.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-30 Thread Michal Hocko
On Fri 30-12-16 10:19:26, Mel Gorman wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per-node counters will not match
> > per-zone statistics. The updated patch is below. So I hope my brain
> > already works after it's been mostly off the last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> >  memcg is enabled
> > 
> > Nils Holland has reported unexpected OOM killer invocations with a 32b
> > kernel, starting with 4.8 kernels
> > 
> 
> I think it's unfortunate that per-zone stats are reintroduced to the
> memcg structure.

The original patch I had didn't add per-zone stats but rather added an
nr_highmem counter to mem_cgroup_per_node (inside ifdef CONFIG_HIGHMEM).
This would help for this particular case but it wouldn't work for other
lowmem requests (e.g. GFP_DMA32), and with the kmem accounting this might
be a problem in the future. So I've decided to go with a more generic
approach which requires per-zone tracking. I cannot say I would be
overly happy about this at all.

> I can't help but think that it would have also worked
> to always rotate a small number of pages if !inactive_list_is_low and
> reclaiming for memcg even if it distorted page aging.

I am not really sure how that would work. Do you mean something like the
following?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa30010a5277..563ada3c02ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2044,6 +2044,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
 	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
 
+	if (!mem_cgroup_disabled())
+		goto out;
+
 	/*
 	 * For zone-constrained allocations, it is necessary to check if
 	 * deactivations are required for lowmem to be reclaimed. This
@@ -2063,6 +2066,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 		active -= min(active, active_zone);
 	}
 
+out:
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
 		inactive_ratio = int_sqrt(10 * gb);
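
(To make the formula above concrete, a worked example assuming 4kB pages,
i.e. PAGE_SHIFT == 12, and the fallback of inactive_ratio = 1 for sub-1GB
lists, which sits outside this hunk:

	gb = (inactive + active) >> (30 - 12);	/* 2^18 pages per GB */

	/*  4GB on the LRU -> gb = 4, inactive_ratio = int_sqrt(40) = 6 */
	/* 512MB on the LRU -> gb = 0, inactive_ratio stays 1           */

so on a 4GB file LRU the inactive list only counts as low, and the active
list only gets aged, once active grows past roughly 6x inactive.)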

The problem I see with such an approach is that chances are that this
would reintroduce what f8d1a31163fc ("mm: consider whether to decivate
based on eligible zones inactive ratio") tried to fix. But maybe I have
missed your point.

> However, given that such an approach would be less robust and this has
> been heavily tested;
> 
> Acked-by: Mel Gorman 

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-30 Thread Michal Hocko
On Fri 30-12-16 11:05:22, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 10:04:32AM +0100, Michal Hocko wrote:
> > On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> > > On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
[...]
> > > > + * given zone_idx
> > > > + */
> > > > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > > > +   enum lru_list lru, int zone_idx)
> > > 
> > > Nit:
> > > 
> > > Although there is a comment, the function name is rather confusing when
> > > compared with lruvec_zone_lru_size.
> > 
> > I am all for a better name.
> > 
> > > lruvec_eligible_zones_lru_size is better?
> > 
> > this would be too easy to confuse with lruvec_eligible_zone_lru_size.
> > What about lruvec_lru_size_eligible_zones?
> 
> Don't mind.

I will go with lruvec_lru_size_eligible_zones then.

> > > Nit:
> > > 
> > > With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx
> > > rather than its own custom calculation to filter out non-eligible pages.
> > 
> > Yes, that would be possible and I was considering that. But then I found
> > it useful to see the total and reduced numbers in the tracepoint
> > http://lkml.kernel.org/r/20161228153032.10821-8-mho...@kernel.org
> > and didn't want to call lruvec_lru_size twice. But if you insist then
> > I can just do that.
> 
> I don't mind either but I think we need to describe the reason if you want to
> go with your open-coded version. Otherwise, someone will try to fix it.

OK, I will go with the follow-up patch on top of the tracepoints series.
I was hoping that the macro-heavy way tracing is implemented would allow
us to evaluate the arguments only when the tracepoint is enabled, but this
doesn't seem to be the case. Let's CC Steven. Would it be possible to
define a tracepoint in such a way that all given arguments are evaluated
only when the tracepoint is enabled?
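
(For illustration, one pattern that would get this effect at the call site is
the trace_<name>_enabled() helper that the tracepoint machinery generates for
each event; this is only a sketch, and the argument list below is hypothetical
rather than the event's real signature:

	if (trace_mm_vmscan_inactive_list_is_low_enabled()) {
		/* compute the expensive totals only when the event is on */
		unsigned long total_inactive = lruvec_lru_size(lruvec, inactive_lru);
		unsigned long total_active = lruvec_lru_size(lruvec, active_lru);

		/* hypothetical argument list, for illustration only */
		trace_mm_vmscan_inactive_list_is_low(total_inactive, inactive,
						     total_active, active);
	}

trace_<name>_enabled() is backed by the same static key as the tracepoint
itself, so the disabled case costs only a patched-out branch.)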
---
From 9a561d652f91f3557db22161600f10ca2462c74f Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 30 Dec 2016 11:28:20 +0100
Subject: [PATCH] mm, vmscan: clean up inactive_list_is_low

inactive_list_is_low is effectively duplicating logic implemented by
lruvec_lru_size_eligible_zones. Let's use the dedicated function to
get the number of eligible pages on the lru list and use
lruvec_lru_size to get the total LRU size only when the tracing is
really requested. We are still iterating over all LRUs two times in that
case, but a) inactive_list_is_low is not a hot path and b) this can be
addressed at the tracing layer by evaluating the arguments only when
tracing is enabled in the future, if that ever matters.

Signed-off-by: Michal Hocko 
---
 mm/vmscan.c | 38 ++++++++++----------------------------
 1 file changed, 10 insertions(+), 28 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 137bc85067d3..a9c881f06c0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2054,11 +2054,10 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 					struct scan_control *sc, bool trace)
 {
 	unsigned long inactive_ratio;
-	unsigned long total_inactive, inactive;
-	unsigned long total_active, active;
+	unsigned long inactive, active;
+	enum lru_list inactive_lru = file * LRU_FILE;
+	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
 	unsigned long gb;
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-	int zid;
 
 	/*
 	 * If we don't have swap space, anonymous page deactivation
@@ -2067,27 +2066,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	if (!file && !total_swap_pages)
 		return false;
 
-	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
-	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
-
-	/*
-	 * For zone-constrained allocations, it is necessary to check if
-	 * deactivations are required for lowmem to be reclaimed. This
-	 * calculates the inactive/active pages available in eligible zones.
-	 */
-	for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
-		struct zone *zone = &pgdat->node_zones[zid];
-		unsigned long inactive_zone, active_zone;
-
-		if (!managed_zone(zone))
-			continue;
-
-		inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, zid);
-		active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) + LRU_ACTIVE, zid);
-
-		inactive -= min(inactive, inactive_zone);
-		active -= min(active, active_zone);
-	}
+	inactive = lruvec_lru_size_eligible_zones(lruvec, inactive_lru, sc->reclaim_idx);
+	active = lruvec_lru_size_eligible_zones(lruvec, active_lru, sc->reclaim_idx);
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
@@ -2096,10 +2076,12 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-30 Thread Mel Gorman
On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > 
> > > Nils, even though this is still highly experimental, could you give it a
> > > try please?
> > 
> > Yes, no problem! So I kept the very first patch you sent but had to
> > revert the latest version of the debugging patch (the one in
> > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > memory cgroups enabled again, and the first thing that strikes the eye
> > is that I get this during boot:
> > 
> > [1.568174] [ cut here ]
> > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > mem_cgroup_update_lru_size+0x118/0x130
> > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > not empty
> 
> Ohh, I can see what is wrong! a) there is a bug in the accounting in
> my patch (I double account) and b) the detection for the empty list
> cannot work after my change because per-node counters will not match
> per-zone statistics. The updated patch is below. So I hope my brain
> already works after it's been mostly off the last few days...
> ---
> From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 23 Dec 2016 15:11:54 +0100
> Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
>  memcg is enabled
> 
> Nils Holland has reported unexpected OOM killer invocations with a 32b
> kernel, starting with 4.8 kernels
> 

I think it's unfortunate that per-zone stats are reintroduced to the
memcg structure. I can't help but think that it would have also worked
to always rotate a small number of pages if !inactive_list_is_low and
reclaiming for memcg even if it distorted page aging. However, given
that such an approach would be less robust and this has been heavily
tested;

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-29 Thread Minchan Kim
On Thu, Dec 29, 2016 at 10:04:32AM +0100, Michal Hocko wrote:
> On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> > On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > > Hi,
> > > could you try to run with the following patch on top of the previous
> > > one? I do not think it will make a large change in your workload, but
> > > I think we need something like that, so some testing under a workload
> > > known to create high lowmem pressure would be really appreciated. If you
> > > have more time to play with it, then running with and without the patch
> > > with the mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could
> > > tell us whether it makes any difference at all.
> > > 
> > > I would also appreciate it if Mel and Johannes had a look at it. I am not
> > > yet sure whether we need the same thing for anon/file balancing in
> > > get_scan_count. I suspect we do, but I need to think more about that.
> > > 
> > > Thanks a lot again!
> > > ---
> > > From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko 
> > > Date: Tue, 27 Dec 2016 16:28:44 +0100
> > > Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> > > 
> > > get_scan_count considers the whole node LRU size when
> > > - doing SCAN_FILE due to many page cache inactive pages
> > > - calculating the number of pages to scan
> > > 
> > > in both cases this might lead to unexpected behavior especially on 32b
> > > systems where we can expect lowmem memory pressure very often.
> > > 
> > > A large highmem zone can easily distort the SCAN_FILE heuristic because
> > > there might be only a few file pages from the eligible zones on the node
> > > lru and we would still enforce file lru scanning, which can lead to
> > > thrashing while we could still scan anonymous pages.
> > 
> > Nit:
> > It doesn't cause thrashing because isolate_lru_pages filters them out,
> > but I agree it causes pointless CPU burning to find eligible pages.
> 
> This is not about isolate_lru_pages. The thrashing could happen if we had
> a lowmem pagecache user which would constantly reclaim recently faulted-in
> pages while there is anonymous memory in lowmem which could be
> reclaimed instead.
>  
> [...]
> > >  /*
> > > + * Return the number of pages on the given lru which are eligibne for the
> > eligible
> 
> fixed
> 
> > > + * given zone_idx
> > > + */
> > > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > > + enum lru_list lru, int zone_idx)
> > 
> > Nit:
> > 
> > Although there is a comment, the function name is rather confusing when
> > compared with lruvec_zone_lru_size.
> 
> I am all for a better name.
> 
> > lruvec_eligible_zones_lru_size is better?
> 
> this would be too easy to confuse with lruvec_eligible_zone_lru_size.
> What about lruvec_lru_size_eligible_zones?

Don't mind.

>  
> > Nit:
> > 
> > With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx
> > rather than its own custom calculation to filter out non-eligible pages.
> 
> Yes, that would be possible and I was considering that. But then I found
> it useful to see the total and reduced numbers in the tracepoint
> http://lkml.kernel.org/r/20161228153032.10821-8-mho...@kernel.org
> and didn't want to call lruvec_lru_size twice. But if you insist then
> I can just do that.

I don't mind either but I think we need to describe the reason if you want to
go with your open-coded version. Otherwise, someone will try to fix it.


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-29 Thread Michal Hocko
On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > Hi,
> > could you try to run with the following patch on top of the previous
> > one? I do not think it will make a large change in your workload, but
> > I think we need something like that, so some testing under a workload
> > known to create high lowmem pressure would be really appreciated. If you
> > have more time to play with it, then running with and without the patch
> > with the mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could
> > tell us whether it makes any difference at all.
> > 
> > I would also appreciate it if Mel and Johannes had a look at it. I am not
> > yet sure whether we need the same thing for anon/file balancing in
> > get_scan_count. I suspect we do, but I need to think more about that.
> > 
> > Thanks a lot again!
> > ---
> > From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Tue, 27 Dec 2016 16:28:44 +0100
> > Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> > 
> > get_scan_count considers the whole node LRU size when
> > - doing SCAN_FILE due to many page cache inactive pages
> > - calculating the number of pages to scan
> > 
> > in both cases this might lead to unexpected behavior especially on 32b
> > systems where we can expect lowmem memory pressure very often.
> > 
> > A large highmem zone can easily distort the SCAN_FILE heuristic because
> > there might be only a few file pages from the eligible zones on the node
> > lru and we would still enforce file lru scanning, which can lead to
> > thrashing while we could still scan anonymous pages.
> 
> Nit:
> It doesn't cause thrashing because isolate_lru_pages filters them out,
> but I agree it causes pointless CPU burning to find eligible pages.

This is not about isolate_lru_pages. The thrashing could happen if we had
a lowmem pagecache user which would constantly reclaim recently faulted-in
pages while there is anonymous memory in lowmem which could be
reclaimed instead.
 
[...]
> >  /*
> > + * Return the number of pages on the given lru which are eligibne for the
> eligible

fixed

> > + * given zone_idx
> > + */
> > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > +   enum lru_list lru, int zone_idx)
> 
> Nit:
> 
> Although there is a comment, the function name is rather confusing when
> compared with lruvec_zone_lru_size.

I am all for a better name.

> lruvec_eligible_zones_lru_size is better?

this would be too easy to confuse with lruvec_eligible_zone_lru_size.
What about lruvec_lru_size_eligible_zones?
 
> Nit:
> 
> With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx
> rather than its own custom calculation to filter out non-eligible pages.

Yes, that would be possible and I was considering that. But then I found
it useful to see the total and reduced numbers in the tracepoint
http://lkml.kernel.org/r/20161228153032.10821-8-mho...@kernel.org
and didn't want to call lruvec_lru_size twice. But if you insist then
I can just do that.

> Anyway, I think this patch does the right things, so I support it.
> 
> Acked-by: Minchan Kim 

Thanks for the review!

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-29 Thread Michal Hocko
On Thu 29-12-16 09:48:24, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
[...]
> > Acked-by: Minchan Kim 

Thanks!
 
> Nit:
> 
> WARNING: line over 80 characters
> #53: FILE: include/linux/memcontrol.h:689:
> +unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum 
> lru_list lru,
> 
> WARNING: line over 80 characters
> #147: FILE: mm/vmscan.c:248:
> +unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, 
> int zone_idx)
> 
> WARNING: line over 80 characters
> #177: FILE: mm/vmscan.c:1446:
> +   mem_cgroup_update_lru_size(lruvec, lru, zid, 
> -nr_zone_taken[zid]);

fixed

> WARNING: line over 80 characters
> #201: FILE: mm/vmscan.c:2099:
> +   inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, 
> zid);
> 
> WARNING: line over 80 characters
> #202: FILE: mm/vmscan.c:2100:
> +   active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) 
> + LRU_ACTIVE, zid);

I would prefer to have those on the same line though. It will make them
easier to follow.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> Hi,
> could you try to run with the following patch on top of the previous
> one? I do not think it will make a large change in your workload, but
> I think we need something like that, so some testing under a workload
> known to create high lowmem pressure would be really appreciated. If you
> have more time to play with it, then running with and without the patch
> with the mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could
> tell us whether it makes any difference at all.
> 
> I would also appreciate it if Mel and Johannes had a look at it. I am not
> yet sure whether we need the same thing for anon/file balancing in
> get_scan_count. I suspect we do, but I need to think more about that.
> 
> Thanks a lot again!
> ---
> From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 27 Dec 2016 16:28:44 +0100
> Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> 
> get_scan_count considers the whole node LRU size when
> - doing SCAN_FILE due to many page cache inactive pages
> - calculating the number of pages to scan
> 
> in both cases this might lead to unexpected behavior especially on 32b
> systems where we can expect lowmem memory pressure very often.
> 
> A large highmem zone can easily distort the SCAN_FILE heuristic because
> there might be only a few file pages from the eligible zones on the node
> lru and we would still enforce file lru scanning, which can lead to
> thrashing while we could still scan anonymous pages.

Nit:
It doesn't cause thrashing because isolate_lru_pages filters them out,
but I agree it causes pointless CPU burning to find eligible pages.

> 
> The latter use of lruvec_lru_size can be problematic as well, especially
> when there are not many pages from the eligible zones. We would have to
> skip over many pages to find anything to reclaim, but shrink_node_memcg
> would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
> at maximum. Therefore we can end up going over a large LRU many times
> without actually having a chance to reclaim much, if anything at all.
> The closer we are to running out of memory on the lowmem zone, the worse
> the problem will be.
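
(To put a number on that, a worked example assuming 4kB pages and the default
reclaim priority of 12: a node-wide file LRU of 2GB is 524288 pages, so each
pass computes scan = 524288 >> 12 = 128 pages; that target is derived almost
entirely from pages that may be ineligible for the allocation, pass after
pass.)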
> 
> Signed-off-by: Michal Hocko 
> ---
>  mm/vmscan.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c98b1a585992..785b4d7fb8a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int
>  }
>  
>  /*
> + * Return the number of pages on the given lru which are eligibne for the
eligible
> + * given zone_idx
> + */
> +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> + enum lru_list lru, int zone_idx)

Nit:

Although there is a comment, function name is rather confusing when I compared
it with lruvec_zone_lru_size.

lruvec_eligible_zones_lru_size is better?


> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + unsigned long lru_size;
> + int zid;
> +
> + lru_size = lruvec_lru_size(lruvec, lru);
> + for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> +	struct zone *zone = &pgdat->node_zones[zid];
> + unsigned long size;
> +
> + if (!managed_zone(zone))
> + continue;
> +
> + size = lruvec_zone_lru_size(lruvec, lru, zid);
> + lru_size -= min(size, lru_size);
> + }
> +
> + return lru_size;
> +}
> +
> +/*
>   * Add a shrinker callback to be called from the vm.
>   */
>  int register_shrinker(struct shrinker *shrinker)
> @@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>* system is under heavy pressure.
>*/
>   if (!inactive_list_is_low(lruvec, true, sc) &&
> - lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
> +	    lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
>   scan_balance = SCAN_FILE;
>   goto out;
>   }
> @@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>   unsigned long size;
>   unsigned long scan;
>  
> - size = lruvec_lru_size(lruvec, lru);
> + size = lruvec_lru_size_zone_idx(lruvec, lru, 
> sc->reclaim_idx);
>   scan = size >> sc->priority;
>  
>   if (!scan && pass && force_scan)
> -- 
> 2.10.2

Nit:

With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx
rather than its own custom calculation to filter out non-eligible pages.

Anyway, I think this patch does the right things, so I support it.

Acked-by: Minchan Kim 



Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per-node counters will not match
> > per-zone statistics. The updated patch is below. So I hope my brain
> > already works after it's been mostly off the last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> >  memcg is enabled
> > 
> > Nils Holland has reported unexpected OOM killer invocations with a 32b
> > kernel, starting with 4.8 kernels
> > 
> > kworker/u4:5 invoked oom-killer: 
> > gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> > oom_score_adj=0
> > kworker/u4:5 cpuset=/ mems_allowed=0
> > CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> > [...]
> > Mem-Info:
> > active_anon:58685 inactive_anon:90 isolated_anon:0
> >  active_file:274324 inactive_file:281962 isolated_file:0
> >  unevictable:0 dirty:649 writeback:0 unstable:0
> >  slab_reclaimable:40662 slab_unreclaimable:17754
> >  mapped:7382 shmem:202 pagetables:351 bounce:0
> >  free:206736 free_pcp:332 free_cma:0
> > Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
> > inactive_file:1127848kB unevictable:0kB isolated(anon):0kB 
> > isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB 
> > shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB 
> > unstable:0kB pages_scanned:0 all_unreclaimable? no
> > DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
> > inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
> > writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> > slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> > pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 813 3474 3474
> > Normal free:41332kB min:41368kB low:51708kB high:62048kB 
> > active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
> > unevictable:0kB writepending:24kB present:897016kB managed:836248kB 
> > mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB 
> > kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB 
> > local_pcp:340kB free_cma:0kB
> > lowmem_reserve[]: 0 0 21292 21292
> > HighMem free:781660kB min:512kB low:34356kB high:68200kB 
> > active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> > inactive_file:1127804kB unevictable:0kB writepending:2592kB 
> > present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB 
> > slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB 
> > free_pcp:800kB local_pcp:608kB free_cma:0kB
> > 
> > the oom killer is clearly premature because there is still a
> > lot of page cache in zone Normal which should satisfy this lowmem
> > request. Further debugging has shown that the reclaim cannot make any
> > forward progress because the page cache is hidden in the active list,
> > which doesn't get rotated because inactive_list_is_low is not memcg
> > aware.
> > It simply subtracts per-zone highmem counters from the respective
> > memcg's lru sizes, which doesn't make any sense. We can simply end up
> > always seeing the resulting active and inactive counts as 0 and
> > returning false. This issue is not limited to 32b kernels, but in
> > practice the effect on systems without CONFIG_HIGHMEM would be much
> > harder to notice because we do not invoke the OOM killer for
> > allocation requests targeting < ZONE_NORMAL.
> > 
> > Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
> > and subtract per-memcg highmem counts when memcg is enabled. Introduce
> > helper 

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > 
> > > Nils, even though this is still highly experimental, could you give it a
> > > try please?
> > 
> > Yes, no problem! So I kept the very first patch you sent but had to
> > revert the latest version of the debugging patch (the one in
> > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > memory cgroups enabled again, and the first thing that strikes the eye
> > is that I get this during boot:
> > 
> > [1.568174] [ cut here ]
> > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > mem_cgroup_update_lru_size+0x118/0x130
> > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > not empty
> 
> Ohh, I can see what is wrong! a) there is a bug in the accounting in
> my patch (I double account) and b) the detection for the empty list
> cannot work after my change because per node zone will not match per
> zone statistics. The updated patch is below. So I hope my brain already
> works after it's been mostly off last few days...
> ---
> From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 23 Dec 2016 15:11:54 +0100
> Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
>  memcg is enabled
> 
> Nils Holland has reported unexpected OOM killer invocations with 32b
> kernel starting with 4.8 kernels
> 
>   kworker/u4:5 invoked oom-killer: 
> gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> oom_score_adj=0
>   kworker/u4:5 cpuset=/ mems_allowed=0
>   CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
>   [...]
>   Mem-Info:
>   active_anon:58685 inactive_anon:90 isolated_anon:0
>active_file:274324 inactive_file:281962 isolated_file:0
>unevictable:0 dirty:649 writeback:0 unstable:0
>slab_reclaimable:40662 slab_unreclaimable:17754
>mapped:7382 shmem:202 pagetables:351 bounce:0
>free:206736 free_pcp:332 free_cma:0
>   Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
> inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
> shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
> pages_scanned:0 all_unreclaimable? no
>   DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
> inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
> writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>   lowmem_reserve[]: 0 813 3474 3474
>   Normal free:41332kB min:41368kB low:51708kB high:62048kB 
> active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
> unevictable:0kB writepending:24kB present:897016kB managed:836248kB 
> mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB 
> kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB 
> local_pcp:340kB free_cma:0kB
>   lowmem_reserve[]: 0 0 21292 21292
>   HighMem free:781660kB min:512kB low:34356kB high:68200kB 
> active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
> managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
> free_cma:0kB
> 
> the oom killer is clearly premature because there is still a
> lot of page cache in zone Normal which should satisfy this lowmem
> request. Further debugging has shown that the reclaim cannot make any
> forward progress because the page cache is hidden in the active list,
> which doesn't get rotated because inactive_list_is_low is not memcg
> aware.
> It simply subtracts per-zone highmem counters from the respective
> memcg's lru sizes, which doesn't make any sense. We can simply end up
> always seeing the resulting active and inactive counts as 0 and
> returning false. This issue is not limited to 32b kernels, but in
> practice the effect on systems without CONFIG_HIGHMEM would be much
> harder to notice because we do not invoke the OOM killer for
> allocation requests targeting < ZONE_NORMAL.
> 
> Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
> and subtract per-memcg highmem counts when memcg is enabled. Introduce
> helper lruvec_zone_lru_size which redirects to either zone counters or
> mem_cgroup_get_zone_lru_size when appropriate.
> 
> We are losing empty LRU but non-zero lru size detection introduced by
> ca707239e8a7 ("mm: update_lru_size warn 

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Michal Hocko
On Tue 27-12-16 20:33:09, Nils Holland wrote:
> On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > Hi,
> > could you try to run with the following patch on top of the previous
> > one? I do not think it will make a large change in your workload, but
> > I think we need something like that, so some testing under a workload
> > known to generate high lowmem pressure would be really appreciated. If
> > you have more time to play with it, then running with and without the
> > patch with the mm_vmscan_direct_reclaim_{start,end} tracepoints enabled
> > could tell us whether it makes any difference at all.
> 
> Of course, no problem!
> 
> First, about the events to trace: mm_vmscan_direct_reclaim_start
> doesn't seem to exist, but mm_vmscan_direct_reclaim_begin does. I'm
> sure that's what you meant and so I took that one instead.

yes, sorry about the confusion

> Then I have to admit that in both cases (once without the latest patch,
> once with) very little trace data was actually produced. In the case
> without the patch, reclaim was started more often and reclaimed a
> smaller number of pages each time; in the case with the patch, it was
> invoked less often, and the last time it was invoked it reclaimed a
> rather big number of pages. I have no clue, however, whether that
> happened "by chance" or was actually caused by the patch and is thus an
> expected change.

yes, that seems to be a variation of the workload, I would say, because if
anything the patch should reduce the number of scanned pages.

> In both cases, my test case was: Reboot, setup logging, do "emerge
> firefox" (which unpacks and builds the firefox sources), then, when
> the emerge had come so far that the unpacking was done and the
> building had started, switch to another console and untar the latest
> kernel, libreoffice and (once more) firefox sources there. After that
> had completed, I aborted the emerge build process and stopped tracing.
> 
> Here's the trace data captured without the latest patch applied:
> 
> khugepaged-22[000]    566.123383: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000] .N..   566.165520: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1100
> khugepaged-22[001]    587.515424: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    587.596035: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1029
> khugepaged-22[001]    599.879536: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    601.000812: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1100
> khugepaged-22[001]    601.228137: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    601.309952: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1081
> khugepaged-22[001]    694.935267: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001] .N..   695.081943: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1071
> khugepaged-22[001]    701.370707: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    701.372798: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1089
> khugepaged-22[001]    764.752036: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    771.047905: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1039
> khugepaged-22[000]    781.760515: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    781.826543: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    782.595575: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    782.638591: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    782.930455: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    782.993608: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    783.330378: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    783.369653: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> 
> And this is the same with the patch applied:
> 
> khugepaged-22[001]    523.57: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    523.683110: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1092
> khugepaged-22[001]    535.345477: 

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Nils Holland
On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> Hi,
> could you try to run with the following patch on top of the previous
> one? I do not think it will make a large change in your workload, but
> I think we need something like that, so some testing under a workload
> known to generate high lowmem pressure would be really appreciated. If
> you have more time to play with it, then running with and without the
> patch with the mm_vmscan_direct_reclaim_{start,end} tracepoints enabled
> could tell us whether it makes any difference at all.

Of course, no problem!

First, about the events to trace: mm_vmscan_direct_reclaim_start
doesn't seem to exist, but mm_vmscan_direct_reclaim_begin does. I'm
sure that's what you meant and so I took that one instead.

Then I have to admit that in both cases (once without the latest patch,
once with) very little trace data was actually produced. In the case
without the patch, reclaim was started more often and reclaimed a
smaller number of pages each time; in the case with the patch, it was
invoked less often, and the last time it was invoked it reclaimed a
rather big number of pages. I have no clue, however, whether that
happened "by chance" or was actually caused by the patch and is thus an
expected change.

In both cases, my test case was: Reboot, setup logging, do "emerge
firefox" (which unpacks and builds the firefox sources), then, when
the emerge had come so far that the unpacking was done and the
building had started, switch to another console and untar the latest
kernel, libreoffice and (once more) firefox sources there. After that
had completed, I aborted the emerge build process and stopped tracing.

Here's the trace data captured without the latest patch applied:

khugepaged-22[000]    566.123383: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[000] .N..   566.165520: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1100
khugepaged-22[001]    587.515424: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[000]    587.596035: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1029
khugepaged-22[001]    599.879536: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[000]    601.000812: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1100
khugepaged-22[001]    601.228137: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    601.309952: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1081
khugepaged-22[001]    694.935267: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001] .N..   695.081943: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1071
khugepaged-22[001]    701.370707: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    701.372798: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1089
khugepaged-22[001]    764.752036: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[000]    771.047905: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1039
khugepaged-22[000]    781.760515: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    781.826543: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1040
khugepaged-22[001]    782.595575: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[000]    782.638591: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1040
khugepaged-22[001]    782.930455: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    782.993608: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1040
khugepaged-22[001]    783.330378: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    783.369653: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1040

And this is the same with the patch applied:

khugepaged-22[001]    523.57: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    523.683110: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1092
khugepaged-22[001]    535.345477: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    535.401189: mm_vmscan_direct_reclaim_end: 
nr_reclaimed=1078
khugepaged-22[000]    692.876716: mm_vmscan_direct_reclaim_begin: 
order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22[001]    703.312399: mm_vmscan_direct_reclaim_end: 

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Michal Hocko
Hi,
could you try to run with the following patch on top of the previous
one? I do not think it will make a large change in your workload, but
I think we need something like that, so some testing under a workload
known to generate high lowmem pressure would be really appreciated. If
you have more time to play with it, then running with and without the
patch with the mm_vmscan_direct_reclaim_{start,end} tracepoints enabled
could tell us whether it makes any difference at all.

I would also appreciate it if Mel and Johannes had a look at it. I am not
yet sure whether we need the same thing for anon/file balancing in
get_scan_count. I suspect we do, but I need to think more about that.

Thanks a lot again!
---
From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Tue, 27 Dec 2016 16:28:44 +0100
Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count

get_scan_count considers the whole node LRU size when
- doing SCAN_FILE due to many page cache inactive pages
- calculating the number of pages to scan

In both cases this might lead to unexpected behavior, especially on 32b
systems, where we can expect lowmem memory pressure very often.

A large highmem zone can easily distort the SCAN_FILE heuristic because
there might be only a few file pages from the eligible zones on the node
lru, and we would still enforce file lru scanning, which can lead to
thrashing while we could still scan anonymous pages.
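
As an illustration, with round numbers close to the report earlier in
this thread: the node-wide inactive file list held roughly 280k pages
while the eligible lowmem zones held almost none. The old check

	lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority

evaluates to roughly 280000 >> 12 ~= 68 at the default priority, which is
non-zero, so SCAN_FILE is chosen even though there is effectively no
eligible file cache left to reclaim. With lruvec_lru_size_zone_idx the
same expression becomes 0 and the regular anon/file balancing proceeds.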

The latter use of lruvec_lru_size can be problematic as well, especially
when there are not many pages from the eligible zones. We would have to
skip over many pages to find anything to reclaim, but shrink_node_memcg
would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
at maximum. Therefore we can end up going over a large LRU many times
without actually having a chance to reclaim much, if anything at all. The
closer we are to running out of memory on the lowmem zone, the worse the
problem becomes.
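
Likewise for the scan target, with invented round numbers: a node-wide
file LRU of 2^20 pages yields size >> sc->priority = 2^20 >> 12 = 256 at
the default priority, so shrink_node_memcg keeps walking a list that
consists mostly of ineligible pages, trimming the remaining target by at
most SWAP_CLUSTER_MAX (32 in kernels of that era) per iteration. If only
the, say, 2000 eligible pages are counted instead, the target is
2000 >> 12 = 0, and the force_scan handling visible in the second hunk
below can then apply a small bounded target.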

Signed-off-by: Michal Hocko 
---
 mm/vmscan.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c98b1a585992..785b4d7fb8a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, 
enum lru_list lru, int
 }
 
 /*
+ * Return the number of pages on the given lru which are eligible for the
+ * given zone_idx
+ */
+static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
+   enum lru_list lru, int zone_idx)
+{
+   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+   unsigned long lru_size;
+   int zid;
+
+   lru_size = lruvec_lru_size(lruvec, lru);
+   for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+   struct zone *zone = &pgdat->node_zones[zid];
+   unsigned long size;
+
+   if (!managed_zone(zone))
+   continue;
+
+   size = lruvec_zone_lru_size(lruvec, lru, zid);
+   lru_size -= min(size, lru_size);
+   }
+
+   return lru_size;
+}
+
+/*
  * Add a shrinker callback to be called from the vm.
  */
 int register_shrinker(struct shrinker *shrinker)
@@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
 * system is under heavy pressure.
 */
if (!inactive_list_is_low(lruvec, true, sc) &&
-   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+   lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, 
sc->reclaim_idx) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
unsigned long size;
unsigned long scan;
 
-   size = lruvec_lru_size(lruvec, lru);
+   size = lruvec_lru_size_zone_idx(lruvec, lru, 
sc->reclaim_idx);
scan = size >> sc->priority;
 
if (!scan && pass && force_scan)
-- 
2.10.2

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Michal Hocko
On Tue 27-12-16 12:23:13, Nils Holland wrote:
> On Tue, Dec 27, 2016 at 09:08:38AM +0100, Michal Hocko wrote:
> > On Mon 26-12-16 19:57:03, Nils Holland wrote:
> > > On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > > > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > > > 
> > > > > > Nils, even though this is still highly experimental, could you give 
> > > > > > it a
> > > > > > try please?
> > > > > 
> > > > > Yes, no problem! So I kept the very first patch you sent but had to
> > > > > revert the latest version of the debugging patch (the one in
> > > > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > > > memory cgroups enabled again, and the first thing that strikes the eye
> > > > > is that I get this during boot:
> > > > > 
> > > > > [1.568174] [ cut here ]
> > > > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > > > mem_cgroup_update_lru_size+0x118/0x130
> > > > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 
> > > > > but not empty
> > > > 
> > > > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > > > my patch (I double account) and b) the detection for the empty list
> > > > cannot work after my change because per node zone will not match per
> > > > zone statistics. The updated patch is below. So I hope my brain already
> > > > works after it's been mostly off last few days...
> > > 
> > > I tried the updated patch, and I can confirm that the warning during
> > > boot is gone. Also, I've tried my ordinary procedure to reproduce my
> > > testcase, and I can say that a kernel with this new patch also works
> > > fine and doesn't produce OOMs or similar issues.
> > > 
> > > I had the previous version of the patch in use on a machine non-stop
> > > for the last few days during normal day-to-day workloads and didn't
> > > notice any issues. Now I'll keep a machine running during the next few
> > > days with this patch, and in case I notice something that doesn't look
> > > normal, I'll of course report back!
> > 
> > Thanks for your testing! Can I add your
> > Tested-by: Nils Holland 
> 
> Yes, I think so! The patch has now been running for 16 hours on my two
> machines, and that's an uptime that was hard to achieve since 4.8 for
> me. ;-) So my tests clearly suggest that the patch is good! :-)

OK, thanks a lot for your testing! I will wait a few more days before I
send it to Andrew.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Nils Holland
On Tue, Dec 27, 2016 at 09:08:38AM +0100, Michal Hocko wrote:
> On Mon 26-12-16 19:57:03, Nils Holland wrote:
> > On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > > 
> > > > > Nils, even though this is still highly experimental, could you give 
> > > > > it a
> > > > > try please?
> > > > 
> > > > Yes, no problem! So I kept the very first patch you sent but had to
> > > > revert the latest version of the debugging patch (the one in
> > > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > > memory cgroups enabled again, and the first thing that strikes the eye
> > > > is that I get this during boot:
> > > > 
> > > > [1.568174] [ cut here ]
> > > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > > mem_cgroup_update_lru_size+0x118/0x130
> > > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 
> > > > but not empty
> > > 
> > > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > > my patch (I double account) and b) the detection for the empty list
> > > cannot work after my change because per node zone will not match per
> > > zone statistics. The updated patch is below. So I hope my brain already
> > > works after it's been mostly off last few days...
> > 
> > I tried the updated patch, and I can confirm that the warning during
> > boot is gone. Also, I've tried my ordinary procedure to reproduce my
> > testcase, and I can say that a kernel with this new patch also works
> > fine and doesn't produce OOMs or similar issues.
> > 
> > I had the previous version of the patch in use on a machine non-stop
> > for the last few days during normal day-to-day workloads and didn't
> > notice any issues. Now I'll keep a machine running during the next few
> > days with this patch, and in case I notice something that doesn't look
> > normal, I'll of course report back!
> 
> Thanks for your testing! Can I add your
> Tested-by: Nils Holland 

Yes, I think so! The patch has now been running for 16 hours on my two
machines, and that's an uptime that was hard to achieve since 4.8 for
me. ;-) So my tests clearly suggest that the patch is good! :-)

Greetings
Nils


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Michal Hocko
On Mon 26-12-16 19:57:03, Nils Holland wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection of the empty list
> > cannot work after my change because the per-node counters will not
> > match the per-zone statistics. The updated patch is below. So I hope
> > my brain already works after having been mostly off for the last few
> > days...
> 
> I tried the updated patch, and I can confirm that the warning during
> boot is gone. Also, I've tried my ordinary procedure to reproduce my
> testcase, and I can say that a kernel with this new patch also works
> fine and doesn't produce OOMs or similar issues.
> 
> I had the previous version of the patch in use on a machine non-stop
> for the last few days during normal day-to-day workloads and didn't
> notice any issues. Now I'll keep a machine running during the next few
> days with this patch, and in case I notice something that doesn't look
> normal, I'll of course report back!

Thanks for your testing! Can I add your
Tested-by: Nils Holland 
?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-26 Thread Nils Holland
On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > 
> > > Nils, even though this is still highly experimental, could you give it a
> > > try please?
> > 
> > Yes, no problem! So I kept the very first patch you sent but had to
> > revert the latest version of the debugging patch (the one in
> > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > memory cgroups enabled again, and the first thing that strikes the eye
> > is that I get this during boot:
> > 
> > [1.568174] [ cut here ]
> > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > mem_cgroup_update_lru_size+0x118/0x130
> > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > not empty
> 
> Ohh, I can see what is wrong! a) there is a bug in the accounting in
> my patch (I double account) and b) the detection of the empty list
> cannot work after my change because the per-node counters will not
> match the per-zone statistics. The updated patch is below. So I hope
> my brain already works after having been mostly off for the last few
> days...

I tried the updated patch, and I can confirm that the warning during
boot is gone. Also, I've tried my ordinary procedure to reproduce my
testcase, and I can say that a kernel with this new patch also works
fine and doesn't produce OOMs or similar issues.

I had the previous version of the patch in use on a machine non-stop
for the last few days during normal day-to-day workloads and didn't
notice any issues. Now I'll keep a machine running during the next few
days with this patch, and in case I notice something that doesn't look
normal, I'll of course report back!

Greetings
Nils


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-26 Thread Michal Hocko
On Fri 23-12-16 23:26:00, Nils Holland wrote:
> On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > 
> > Nils, even though this is still highly experimental, could you give it a
> > try please?
> 
> Yes, no problem! So I kept the very first patch you sent but had to
> revert the latest version of the debugging patch (the one in
> which you added the "mm_vmscan_inactive_list_is_low" event) because
> otherwise the patch you just sent wouldn't apply. Then I rebooted with
> memory cgroups enabled again, and the first thing that strikes the eye
> is that I get this during boot:
> 
> [1.568174] [ cut here ]
> [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> mem_cgroup_update_lru_size+0x118/0x130
> [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not 
> empty

Ohh, I can see what is wrong! a) there is a bug in the accounting in
my patch (I double account) and b) the detection of the empty list
cannot work after my change because the per-node counters will not
match the per-zone statistics. The updated patch is below. So I hope
my brain already works after having been mostly off for the last few
days...
---
From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 23 Dec 2016 15:11:54 +0100
Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
 memcg is enabled

Nils Holland has reported unexpected OOM killer invocations on 32b
kernels starting with 4.8:

kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
 active_file:274324 inactive_file:281962 isolated_file:0
 unevictable:0 dirty:649 writeback:0 unstable:0
 slab_reclaimable:40662 slab_unreclaimable:17754
 mapped:7382 shmem:202 pagetables:351 bounce:0
 free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB 
active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB 
slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB 
pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB 
active_anon:234740kB inactive_anon:360kB active_file:557232kB 
inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
free_cma:0kB

The OOM killer is clearly premature because there is still a lot of
page cache in the zone Normal which should satisfy this lowmem request.
Further debugging has shown that the reclaim cannot make any forward
progress because the page cache is hidden in the active list, which
doesn't get rotated because inactive_list_is_low is not memcg aware.
It simply subtracts per-zone highmem counters from the respective
memcg's lru sizes, which doesn't make any sense. We can easily end up
always seeing the resulting active and inactive counts as 0 and
returning false. This issue is not limited to 32b kernels, but in
practice the effect on systems without CONFIG_HIGHMEM would be much
harder to notice because we do not invoke the OOM killer for
allocation requests targeting < ZONE_NORMAL.
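
To illustrate the clamping described above, here is a small standalone
sketch (not kernel code; the memcg LRU sizes are made-up numbers, the
zone counters are page counts taken from the Mem-Info dump above):

#include <stdio.h>

/* Toy model of the broken subtraction in inactive_list_is_low: the
 * per-memcg LRU sizes are much smaller than the global per-zone
 * counters being subtracted, so both results clamp to zero. */
int main(void)
{
	unsigned long memcg_inactive_file = 2048;   /* made-up memcg LRU size */
	unsigned long memcg_active_file   = 4096;   /* made-up memcg LRU size */
	unsigned long zone_inactive_file  = 281962; /* global, from the report */
	unsigned long zone_active_file    = 274324; /* global, from the report */

	unsigned long inactive = memcg_inactive_file -
		(zone_inactive_file < memcg_inactive_file ?
		 zone_inactive_file : memcg_inactive_file);
	unsigned long active = memcg_active_file -
		(zone_active_file < memcg_active_file ?
		 zone_active_file : memcg_active_file);

	/* Prints "inactive=0 active=0": the low/empty checks then never
	 * see any pages, so the active list is never aged. */
	printf("inactive=%lu active=%lu\n", inactive, active);
	return 0;
}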

Fix the issue by tracking per-zone lru page counts in
mem_cgroup_per_node and subtracting the per-memcg highmem counts when
memcg is enabled. Introduce the helper lruvec_zone_lru_size, which
redirects to either the zone counters or mem_cgroup_get_zone_lru_size
as appropriate.
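
A minimal sketch of how such a helper could look, assuming the per-zone
counts live in mem_cgroup_per_node as described above (an illustration
of the idea, not the actual patch hunk):

static unsigned long lruvec_zone_lru_size(struct lruvec *lruvec,
					  enum lru_list lru, int zone_idx)
{
	if (!mem_cgroup_disabled())
		/* read the new per-zone lru counts from mem_cgroup_per_node */
		return mem_cgroup_get_zone_lru_size(lruvec, lru, zone_idx);

	/* global case: the per-zone vmstat counters are already exact */
	return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
			       NR_ZONE_LRU_BASE + lru);
}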

We are losing the empty-LRU-but-non-zero-lru_size detection introduced
by ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size")
because of the inherent zone vs. node discrepancy.

Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
Cc: stable # 4.8+
Reported-by: Nils 

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-23 Thread Nils Holland
On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> 
> Nils, even though this is still highly experimental, could you give it a
> try please?

Yes, no problem! So I kept the very first patch you sent but had to
revert the latest version of the debugging patch (the one in
which you added the "mm_vmscan_inactive_list_is_low" event) because
otherwise the patch you just sent wouldn't apply. Then I rebooted with
memory cgroups enabled again, and the first thing that strikes the eye
is that I get this during boot:

[1.568174] [ cut here ]
[1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
mem_cgroup_update_lru_size+0x118/0x130
[1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not 
empty
[1.568754] Modules linked in:
[1.568922] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-gentoo #6
[1.569052] Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS 
F.22 08/06/2014
[1.571750]  f44e5b84 c142bdee f44e5bc8 c1b5ade0 f44e5bb4 c103ab1d c1b583e4 
f44e5be4
[1.572262]  0001 c1b5ade0 0408 c11603d8 0408  c1b5af73 
0001
[1.572774]  f44e5bd0 c103ab76 0009  f44e5bc8 c1b583e4 f44e5be4 
f44e5c18
[1.573285] Call Trace:
[1.573419]  [] dump_stack+0x47/0x69
[1.573551]  [] __warn+0xed/0x110
[1.573681]  [] ? mem_cgroup_update_lru_size+0x118/0x130
[1.573812]  [] warn_slowpath_fmt+0x36/0x40
[1.573942]  [] mem_cgroup_update_lru_size+0x118/0x130
[1.574076]  [] __pagevec_lru_add_fn+0xd7/0x1b0
[1.574206]  [] ? perf_trace_mm_lru_insertion+0x150/0x150
[1.574336]  [] pagevec_lru_move_fn+0x4d/0x80
[1.574465]  [] ? perf_trace_mm_lru_insertion+0x150/0x150
[1.574595]  [] __lru_cache_add+0x45/0x60
[1.574724]  [] lru_cache_add+0x8/0x10
[1.574852]  [] add_to_page_cache_lru+0x61/0xc0
[1.574982]  [] pagecache_get_page+0xee/0x270
[1.575111]  [] grab_cache_page_write_begin+0x20/0x40
[1.575243]  [] simple_write_begin+0x25/0xd0
[1.575372]  [] generic_perform_write+0xa8/0x1a0
[1.575503]  [] __generic_file_write_iter+0x197/0x1f0
[1.575634]  [] generic_file_write_iter+0x19f/0x2b0
[1.575766]  [] __vfs_write+0xd1/0x140
[1.575897]  [] vfs_write+0x95/0x1b0
[1.576026]  [] SyS_write+0x3f/0x90
[1.576157]  [] xwrite+0x1c/0x4b
[1.576285]  [] do_copy+0x22/0xac
[1.576413]  [] write_buffer+0x1d/0x2c
[1.576540]  [] flush_buffer+0x1e/0x70
[1.576670]  [] unxz+0x149/0x211
[1.576798]  [] ? unlzo+0x359/0x359
[1.576926]  [] unpack_to_rootfs+0x14f/0x246
[1.577054]  [] ? write_buffer+0x2c/0x2c
[1.577183]  [] ? initrd_load+0x3b/0x3b
[1.577312]  [] ? maybe_link.part.3+0xe3/0xe3
[1.577443]  [] populate_rootfs+0x47/0x8f
[1.577573]  [] do_one_initcall+0x36/0x150
[1.577701]  [] ? repair_env_string+0x12/0x54
[1.577832]  [] ? parse_args+0x25d/0x400
[1.577962]  [] ? kernel_init_freeable+0x101/0x19e
[1.578092]  [] kernel_init_freeable+0x121/0x19e
[1.578222]  [] ? rest_init+0x60/0x60
[1.578350]  [] kernel_init+0xb/0x100
[1.578480]  [] ? schedule_tail+0xc/0x50
[1.578608]  [] ? rest_init+0x60/0x60
[1.578737]  [] ret_from_fork+0x1b/0x28
[1.578871] ---[ end trace cf6f1adac9dfe60e ]---

The machine then continued to boot normally, however, so I started my
ordinary tests. And in fact, they were working just fine, i.e. no
OOMing anymore, even during heavy tarball unpacking.

Would it make sense to capture more trace data for you at this point?
As I'm on the go, I don't currently have a second machine for
capturing over the network, but since we're not having OOMs or other
issues now, capturing to file should probably work just fine.

I'll keep the patch applied and see if I notice anything else that
doesn't look normal during day-to-day usage, especially during my
ordinary Gentoo updates, which consist of a lot of fetching /
unpacking / building and which in the recent past had been very
problematic. (In fact, that was where the problem first struck me; the
"heavy tarball unpacking" test was just what I distilled it down to in
order to reproduce the issue manually with the least time and effort
possible.)

Greetings
Nils


[RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-23 Thread Michal Hocko
[Adding Mel, Johannes and Vladimir - the email thread started here:
http://lkml.kernel.org/r/20161215225702.ga27...@boerne.fritz.box
Long story short, the zone->node reclaim change has broken active list
aging for lowmem requests when memory cgroups are enabled. More
details below.]

On Fri 23-12-16 13:57:28, Michal Hocko wrote:
> On Fri 23-12-16 13:18:51, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> > > TL;DR
> > > drop the last patch, check whether memory cgroup is enabled and retest
> > > with cgroup_disable=memory to see whether this is memcg related and if
> > > it is _not_ then try to test with the patch below
> > 
> > Right, it seems we might be looking in the right direction! So I
> > removed the previous patch from my kernel and verified if memory
> > cgroup was enabled, and indeed, it was. So I booted with
> > cgroup_disable=memory and ran my ordinary test again ... and in fact,
> > no ooms!
> 
> OK, thanks for confirmation. I could have figured that earlier. The
> pagecache differences in such a short time should have raised the red
> flag and point towards memcgs...
> 
> [...]
> > > I would appreciate sticking with your setup so as not to pull new
> > > unknowns into the picture.
> > 
> > No problem! It's just likely that I won't be able to test during the
> > following days until Dec 27th, but after that I should be back to
> > normal and thus be able to run further tests in a timely fashion. :-)
> 
> no problem at all. I will try to cook up a patch in the meantime.

So here is my attempt. It is only compile tested, so be careful; it
might eat your kittens or do worse harm. I would appreciate it if the
other guys had a look to see whether this is sane. There are probably
other places which would need similar tweaks. I think that
get_scan_count needs some changes as well, because we should only
consider eligible zones when counting the number of pages to scan.
That will be a separate patch, which I will send later; I just want to
fix this one first.

Nils, even though this is still highly experimental, could you give it a
try please?
---
From a66fd89d43e9fd8ca9afa7e6c7252ab73d22b686 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 23 Dec 2016 15:11:54 +0100
Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
 memcg is enabled

Nils Holland has reported unexpected OOM killer invocations on 32b
kernels starting with 4.8:

kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
 active_file:274324 inactive_file:281962 isolated_file:0
 unevictable:0 dirty:649 writeback:0 unstable:0
 slab_reclaimable:40662 slab_unreclaimable:17754
 mapped:7382 shmem:202 pagetables:351 bounce:0
 free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB 
active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB 
slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB 
pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB 
active_anon:234740kB inactive_anon:360kB active_file:557232kB 
inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
free_cma:0kB

The OOM killer is clearly premature because there is still a lot of
page cache in the zone Normal which should satisfy this lowmem
request. Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list,
which doesn't get rotated because inactive_list_is_low is not memcg
aware.
It simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which 

Re: OOM: Better, but still there on

2016-12-23 Thread Michal Hocko
On Fri 23-12-16 13:18:51, Nils Holland wrote:
> On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> > TL;DR
> > drop the last patch, check whether memory cgroup is enabled and retest
> > with cgroup_disable=memory to see whether this is memcg related and if
> > it is _not_ then try to test with the patch below
> 
> Right, it seems we might be looking in the right direction! So I
> removed the previous patch from my kernel and verified if memory
> cgroup was enabled, and indeed, it was. So I booted with
> cgroup_disable=memory and ran my ordinary test again ... and in fact,
> no ooms!

OK, thanks for the confirmation. I could have figured that out earlier.
The pagecache differences in such a short time should have raised a red
flag and pointed towards memcgs...

[...]
> > I would appreciate sticking with your setup so as not to pull new
> > unknowns into the picture.
> 
> No problem! It's just likely that I won't be able to test during the
> following days until Dec 27th, but after that I should be back to
> normal and thus be able to run further tests in a timely fashion. :-)

no problem at all. I will try to cook up a patch in the meantime.
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-23 Thread Nils Holland
On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> TL;DR
> drop the last patch, check whether memory cgroup is enabled and retest
> with cgroup_disable=memory to see whether this is memcg related and if
> it is _not_ then try to test with the patch below

Right, it seems we might be looking in the right direction! So I
removed the previous patch from my kernel and verified if memory
cgroup was enabled, and indeed, it was. So I booted with
cgroup_disable=memory and ran my ordinary test again ... and in fact,
no ooms! I could have the firefox sources building and unpack half a
dozen big tarballs, which previously would, with 99% certainty, already
trigger an OOM upon unpacking the first tarball. Also, the system
seemed to run noticeably "nicer", in the sense that the other processes
I had running (like htop) would not get delayed / hung. The new patch
you sent has, as per your instructions, NOT been applied.

I've provided a log of this run, it's available at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-23.log.xz

As no OOMs or other bad situations occurred, no memory information was
forcibly logged. However, about three times I triggered a memory info
dump manually via SysRq, because I guess that might be interesting for
you to look at.

I'd like to run the same test on my second machine as well just to
make sure that cgroup_disable=memory has an effect there too. I
should be able to do that later tonight and will report back as soon
as I know more!

> I would appreciate sticking with your setup so as not to pull new
> unknowns into the picture.

No problem! It's just likely that I won't be able to test during the
following days until Dec 27th, but after that I should be back to
normal and thus be able to run further tests in a timely fashion. :-)

Greetings
Nils


Re: OOM: Better, but still there on

2016-12-23 Thread Michal Hocko
TL;DR
drop the last patch, check whether memory cgroup is enabled and retest
with cgroup_disable=memory to see whether this is memcg related and if
it is _not_ then try to test with the patch below

On Thu 22-12-16 22:46:11, Nils Holland wrote:
> On Thu, Dec 22, 2016 at 08:17:19PM +0100, Michal Hocko wrote:
> > TL;DR I still do not see what is going on here and it still smells like
> > multiple issues. Please apply the patch below on _top_ of what you had.
> 
> I've run the usual procedure again with the new patch on top and the
> log is now up at:
> 
> http://ftp.tisys.org/pub/misc/boerne_2016-12-22_2.log.xz

OK, so there are still large page cache fluctuations even with the
locking applied:
472.042409 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=450451 inactive=0 total_active=210056 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042442 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042451 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=12 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042484 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=11944 inactive=0 total_active=117286 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB

One thing that didn't occur to me previously was that this might be an
effect of the memory cgroups. Do you have memory cgroups enabled? If
yes, then rerunning with cgroup_disable=memory would be interesting
as well.

Anyway, now I am looking at get_scan_count, which determines how many
pages we should scan on each LRU list. The problem I can see there is
that it doesn't reflect eligible zones (or at least it doesn't do that
consistently). So it might happen that we simply decide to scan the
whole LRU list (when we get down to prio 0 because we cannot make any
progress) and then _slowly_ scan through it in SWAP_CLUSTER_MAX chunks
each time. This can take a lot of time, and who knows what might happen
if there are many such reclaimers running in parallel.
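
To put a rough number on "_slowly_" (back-of-the-envelope arithmetic,
not kernel code; the LRU size is the total_inactive value from the
traces quoted above, the chunk size is SWAP_CLUSTER_MAX = 32):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

int main(void)
{
	/* total_inactive seen in the mm_vmscan_inactive_list_is_low traces */
	unsigned long lru_pages = 450451;

	/* at priority 0 the scan target is the whole list, processed in
	 * SWAP_CLUSTER_MAX sized batches per shrink pass */
	printf("~%lu shrink passes\n",
	       (lru_pages + SWAP_CLUSTER_MAX - 1) / SWAP_CLUSTER_MAX);
	return 0;
}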

[...]

> This might suggest - although I have to admit, again, that this is
> inconclusive, as I've not used a final 4.9 kernel - that you could
> very easily reproduce the issue yourself by just setting up a 32 bit
> system with a btrfs filesystem and then unpacking a few huge tarballs.
> Of course, I'm more than happy to continue giving any patches sent to
> me a spin, but I thought I'd still mention this in case it makes
> things easier for you. :-)

I would appreciate sticking with your setup so as not to pull new
unknowns into the picture.
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb82913b62bb..533bb591b0be 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -243,6 +243,35 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 }
 
 /*
+ * Return the number of pages on the given lru which are eligible for the
+ * given zone_idx
+ */
+static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
+					      enum lru_list lru, int zone_idx)
+{
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	unsigned long lru_size;
+	int zid;
+
+	if (!mem_cgroup_disabled())
+		return mem_cgroup_get_lru_size(lruvec, lru);
+
+	lru_size = lruvec_lru_size(lruvec, lru);
+	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+		unsigned long size;
+
+		if (!managed_zone(zone))
+			continue;
+
+		size = zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
+		lru_size -= min(size, lru_size);
+	}
+
+	return lru_size;
+}
+
+/*
  * Add a shrinker callback to be called from the vm.
  */
 int register_shrinker(struct shrinker *shrinker)
@@ -2228,7 +2257,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_list_is_low(lruvec, true, sc) &&
-	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2295,7 +2324,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 		unsigned long size;
 		unsigned long scan;
 
-		size = lruvec_lru_size(lruvec, lru);
+		size = lruvec_lru_size_zone_idx(lruvec, lru, sc->reclaim_idx);
 		scan = size >> sc->priority;
 
 		if (!scan && pass && force_scan)
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-22 Thread Nils Holland
On Thu, Dec 22, 2016 at 08:17:19PM +0100, Michal Hocko wrote:
> TL;DR I still do not see what is going on here and it still smells like
> multiple issues. Please apply the patch below on _top_ of what you had.

I've run the usual procedure again with the new patch on top and the
log is now up at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-22_2.log.xz

As a little side note: It is likely, though I cannot say for sure yet,
that this issue is rather easy to reproduce. When I had some time today
at work, I set up a fresh Debian Sid installation in a VM (32 bit PAE
kernel, 4 GB RAM, btrfs as root fs). I used some late 4.9rc(8?) kernel
supplied by Debian - they don't seem to have 4.9 final yet, and I
didn't get around to building and using a custom 4.9 final kernel,
possibly with your patches applied. But the 4.9rc kernel there seemed
to behave very much the same as the 4.9 kernel on my real 32 bit
machines does: All I had to do was unpack a few big tarballs - firefox,
libreoffice and the kernel are my favorites - and the machine would
start OOMing.

This might suggest - although I have to admit, again, that this is
inconclusive, as I've not used a final 4.9 kernel - that you could
very easily reproduce the issue yourself by just setting up a 32 bit
system with a btrfs filesystem and then unpacking a few huge tarballs.
Of course, I'm more than happy to continue giving any patches sent to
me a spin, but I thought I'd still mention this in case it makes
things easier for you. :-)

Greetings
Nils


Re: OOM: Better, but still there on

2016-12-22 Thread Michal Hocko
TL;DR I still do not see what is going on here and it still smells like
multiple issues. Please apply the patch below on _top_ of what you had.

On Thu 22-12-16 11:10:29, Nils Holland wrote:
[...]
> http://ftp.tisys.org/pub/misc/boerne_2016-12-22.log.xz

It took me a while to realize that the tracepoint and printk messages
are not sorted by timestamp. Some massaging has fixed that:
$ xzcat boerne_2016-12-22.log.xz | sed -e 's@.*192.168.17.32:6665 \[[[:space:]]*\([0-9\.]\+\)\] @\1 @' -e 's@.*192.168.17.32:53062[[:space:]]*\([^[:space:]]\+\)[[:space:]].*[[:space:]]\([0-9\.]\+\):@\2 \1@' | sort -k1 -n -s

461.757468 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 
nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757501 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 
nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 
nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757504 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=11852 inactive=0 total_active=118195 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757508 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 
nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757535 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 
nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 
nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757537 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=11820 inactive=0 total_active=118195 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757543 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 
nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757584 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 
nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 
nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757588 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=11788 inactive=0 total_active=118195 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
[...]
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=9939 
inactive=0 total_active=120208 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=9939 
inactive=0 total_active=120208 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=89 
inactive=0 total_active=1301 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722385 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722386 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722391 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722391 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722396 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 
inactive=0 total_active=21 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722396 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=131 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 
inactive=0 total_active=21 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=131 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722401 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=450730 
inactive=0 total_active=206026 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
484.144971 collect2 invoked oom-killer: 
gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, 
order=0, oom_score_adj=0
[...]
484.146871 Node 0 active_anon:100688kB inactive_anon:380kB 
active_file:1296560kB inactive_file:1848044kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB mapped:32180kB dirty:20896kB 
writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 40960kB anon_thp: 776kB 
writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
484.147097 DMA free:4004kB min:788kB low:984kB high:1180kB active_anon:0kB 
inactive_anon:0kB active_file:8016kB inactive_file:12kB unevictable:0kB 
writepending:68kB present:15992kB managed:15916kB mlocked:0kB 
slab_reclaimable:2652kB slab_unreclaimable:1224kB kernel_stack:8kB 


Re: OOM: Better, but still there on

2016-12-22 Thread Tetsuo Handa
Nils Holland wrote:
> Well, the issue is that I could only do everything via ssh today and
> don't have any physical access to the machines. In fact, both seem to
> have suffered a genuine kernel panic, which is also visible in the
> last few lines of the log I provided today. So, basically, both
> machines are now sitting at my home in panic state and I'll only be
> able to resurrect them when I'm physically there again tonight.

# echo 10 > /proc/sys/kernel/panic
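
That makes the kernel reboot on its own 10 seconds after a panic instead
of sitting dead until somebody reaches the machine. A minimal sketch to
make it persistent across reboots (the sysctl.d file name is illustrative):

# kernel.panic is the reboot-after-panic timeout in seconds
echo 'kernel.panic = 10' > /etc/sysctl.d/90-panic-reboot.conf
sysctl --system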


Re: OOM: Better, but still there on

2016-12-22 Thread Nils Holland
On Thu, Dec 22, 2016 at 11:27:25AM +0100, Michal Hocko wrote:
> On Thu 22-12-16 11:10:29, Nils Holland wrote:
> 
> > However, the log comes from machine #2 again today, as I'm
> > unfortunately forced to try this via VPN from work to home today, so I
> > have exactly one attempt per machine before it goes down and locks up
> > (and I can only restart it later tonight).
> 
> This is really surprising to me. Are you sure that you have sysrq
> configured properly? At least sysrq+b shouldn't depend on any memory
> allocations and should allow you to reboot immediately. A sysrq+m right
> before the reboot might turn out being helpful as well.

Well, the issue is that I could only do everything via ssh today and
don't have any physical access to the machines. In fact, both seem to
have suffered a genuine kernel panic, which is also visible in the
last few lines of the log I provided today. So, basically, both
machines are now sitting at my home in panic state and I'll only be
able to resurrect them when I'm physically there again tonight. But
that was expected; I could have waited with the test until I'm at
home, which makes things easier, but I thought the sooner I can
provide a log for you to look at, the better. ;-)

Greetings
Nils


Re: OOM: Better, but still there on

2016-12-22 Thread Michal Hocko
On Thu 22-12-16 11:10:29, Nils Holland wrote:
> On Wed, Dec 21, 2016 at 08:36:59AM +0100, Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpected
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
> 
> Right, I did just that and can provide a new log. I was also able, in
> this case, to reproduce the OOM issues again and not just the "page
> allocation stalls" that were the only thing visible in the previous
> log.

Thanks a lot for testing! I will have a look later today.

> However, the log comes from machine #2 again today, as I'm
> unfortunately forced to try this via VPN from work to home today, so I
> have exactly one attempt per machine before it goes down and locks up
> (and I can only restart it later tonight).

This is really surprising to me. Are you sure that you have sysrq
configured properly? At least sysrq+b shouldn't depend on any memory
allocations and should allow you to reboot immediately. A sysrq+m right
before the reboot might turn out being helpful as well.
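
For the next attempt it may also help that sysrq does not need a console:
assuming CONFIG_MAGIC_SYSRQ is enabled, the same actions can be injected
from the ssh session before the box dies, e.g.

# allow all sysrq functions
echo 1 > /proc/sys/kernel/sysrq
# dump memory info (equivalent of sysrq+m), then reboot (sysrq+b)
echo m > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger
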
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-22 Thread Nils Holland
On Wed, Dec 21, 2016 at 08:36:59AM +0100, Michal Hocko wrote:
> TL;DR
> there is another version of the debugging patch. Just revert the
> previous one and apply this one instead. It's still not clear what
> is going on but I suspect either some misaccounting or unexpected
> pages on the LRU lists. I have added one more tracepoint, so please
> enable also mm_vmscan_inactive_list_is_low.

Right, I did just that and can provide a new log. I was also able, in
this case, to reproduce the OOM issues again and not just the "page
allocation stalls" that were the only thing visible in the previous
log. However, the log comes from machine #2 again today, as I'm
unfortunately forced to try this via VPN from work to home today, so I
have exactly one attempt per machine before it goes down and locks up
(and I can only restart it later tonight). Machine #1 failed to
produce good-looking results during its one attempt, but what machine #2
produced seems to be exactly what we've been trying to track down, and so
its log is now up at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-22.log.xz

Greetings
Nils


Re: OOM: Better, but still there on

2016-12-21 Thread Chris Mason

On Wed, Dec 21, 2016 at 12:16:53PM +0100, Michal Hocko wrote:

On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:

One thing to note here, when we are talking about 32b kernel, things
have changed in 4.8 when we moved from the zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing but it
might also be related to something else. So I would rather not blame
32b yet...


It might be interesting to put tracing on releasepage and see if btrfs 
is pinning pages around.  I can't see how 32bit kernels would be 
different, but maybe we're hitting a weird corner.
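
A sketch of how to do that without a patch, via a dynamic kprobe (the
symbol name btrfs_releasepage is assumed from the 4.9-era sources, and the
paths assume the /debug/trace tracefs mount used earlier in this thread):

# count hits on the btrfs releasepage hook
echo 'p:btrfs_rp btrfs_releasepage' > /debug/trace/kprobe_events
echo 1 > /debug/trace/events/kprobes/btrfs_rp/enable
cat /debug/trace/trace_pipe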


-chris



Re: OOM: Better, but still there on

2016-12-21 Thread Michal Hocko
On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpected
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
> > 
> > Hopefully the additional data will tell us more.
> > 
> > On Tue 20-12-16 03:08:29, Nils Holland wrote:
[...]
> > > http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz
> > 
> > This is the stall report:
> > [ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, 
> > order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> > [ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 
> > 4.9.0-gentoo #4
> > 
> > pid 1950 is trying to allocate for a _long_ time. Considering that this
> > is the only stall report, this means that reclaim took really long so we
> > didn't get to the page allocator for that long. It sounds really crazy!
> 
> warn_alloc() reports only if !__GFP_NOWARN.

Yes, and the above allocation clearly is a !__GFP_NOWARN allocation which is
reported after 611s! If there are no prior/lost warn_alloc() then it
implies we have spent _that_ much time in the reclaim. Considering the
tracing data we cannot really rule that out. All the reclaimers would
fight over the lru_lock and considering we are scanning the whole LRU
this will take some time.

[...]

> By the way, Michal, I find it strange that your analysis does not seem to
> refer to the implications of "x86_32 kernel". Maybe you already referred to
> x86_32 with "they are from the highmem zone", though.

Yes, highmem, as well as all those scanning anomalies, is a 32b-kernel-specific
thing. I believe I have already mentioned that the 32b kernel suffers from some
inherent issues, but I would like to understand what is going on here before
blaming the 32b.

One thing to note here, when we are talking about 32b kernel, things
have changed in 4.8 when we moved from the zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing but it
might also be related to something else. So I would rather not blame
32b yet...

-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-21 Thread Tetsuo Handa
Michal Hocko wrote:
> TL;DR
> there is another version of the debugging patch. Just revert the
> previous one and apply this one instead. It's still not clear what
> is going on but I suspect either some misaccounting or unexpected
> pages on the LRU lists. I have added one more tracepoint, so please
> enable also mm_vmscan_inactive_list_is_low.
> 
> Hopefully the additional data will tell us more.
> 
> On Tue 20-12-16 03:08:29, Nils Holland wrote:
> > On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:
> > 
> > > Unfortunately shrink_active_list doesn't have any tracepoint so we do
> > > not know whether we managed to rotate those pages. If they are referenced
> > > quickly enough we might just keep refaulting them... Could you try to 
> > > apply
> > > the following diff on top of what you have currently. It should add some more
> > > tracepoint data which might tell us more. We can reduce the amount of
> > > tracing data by enabling only mm_vmscan_lru_isolate,
> > > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
> > 
> > So, the results are in! I applied your patch and rebuilt the kernel,
> > then I rebooted the machine, set up tracing so that only the three
> > events you mentioned were being traced, and captured the output over
> > the network.
> > 
> > Things went a bit different this time: The trace events started to
> > appear after a while and a whole lot of them were generated, but
> > suddenly they stopped. A short while later, we get

"cat /debug/trace/trace_pipe > /dev/udp/$ip/$port" stops reporting if
/bin/cat is disturbed by page fault and/or memory allocation needed for
sending UDP packets. Since netconsole can send UDP packets without involving
memory allocation, printk() is preferable than tracing under OOM.
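
For reference, netconsole takes a single module parameter of the form
src-port@src-ip/dev,dst-port@dst-ip/dst-mac; a minimal sketch, with the
addresses, interface, and MAC all placeholders for the sending and
logging machines:

modprobe netconsole netconsole=6665@192.168.17.23/eth0,6665@192.168.17.1/00:11:22:33:44:55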

> 
> It is possible that you are hitting multiple issues so it would be
> great to focus at one at the time. The underlying problem might be
> same/similar in the end but this is hard to tell now. Could you try to
> reproduce and provide data for the OOM killer situation as well?
>  
> > [ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, 
> > order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> > 
> > along with a backtrace and memory information, and then there was
> > silence.
> 
> > When I walked up to the machine, it had completely died; it
> > wouldn't turn on its screen on key press any more, blindly trying to
> > reboot via SysRequest had no effect, but the caps lock LED also wasn't
> > blinking, like it normally does when a kernel panic occurs. Good
> > question what state it was in. The OOM reaper didn't really seem to
> > kick in and kill processes this time, it seems.
> > 
> > The complete capture is up at:
> > 
> > http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz
> 
> This is the stall report:
> [ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, order:0, 
> mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> [ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 
> 4.9.0-gentoo #4
> 
> pid 1950 is trying to allocate for a _long_ time. Considering that this
> is the only stall report, this means that reclaim took really long so we
> didn't get to the page allocator for that long. It sounds really crazy!

warn_alloc() reports only if !__GFP_NOWARN.

We can report where they were looping using kmallocwd at
http://lkml.kernel.org/r/1478416501-10104-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
(and extend it to call printk(), via SystemTap, to report the values your
trace hooks would report, only while memory allocations are stalling, without
the delay caused by page faults and/or memory allocations needed for sending
UDP packets).

But if trying to reboot via SysRq-b did not work, I think that the system
was in hard lockup state. That would be a different problem.

By the way, Michal, I find it strange that your analysis does not seem to
refer to the implications of "x86_32 kernel". Maybe you already referred to
x86_32 with "they are from the highmem zone", though.


Re: OOM: Better, but still there on

2016-12-20 Thread Michal Hocko
TL;DR
there is another version of the debugging patch. Just revert the
previous one and apply this one instead. It's still not clear what
is going on but I suspect either some misaccounting or unexpected
pages on the LRU lists. I have added one more tracepoint, so please
enable also mm_vmscan_inactive_list_is_low.

Hopefully the additional data will tell us more.

On Tue 20-12-16 03:08:29, Nils Holland wrote:
> On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:
> 
> > Unfortunately shrink_active_list doesn't have any tracepoint so we do
> > not know whether we managed to rotate those pages. If they are referenced
> > quickly enough we might just keep refaulting them... Could you try to apply
> > the following diff on top of what you have currently. It should add some more
> > tracepoint data which might tell us more. We can reduce the amount of
> > tracing data by enabling only mm_vmscan_lru_isolate,
> > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
> 
> So, the results are in! I applied your patch and rebuilt the kernel,
> then I rebooted the machine, set up tracing so that only the three
> events you mentioned were being traced, and captured the output over
> the network.
> 
> Things went a bit different this time: The trace events started to
> appear after a while and a whole lot of them were generated, but
> suddenly they stopped. A short while later, we get

It is possible that you are hitting multiple issues so it would be
great to focus at one at the time. The underlying problem might be
same/similar in the end but this is hard to tell now. Could you try to
reproduce and provide data for the OOM killer situation as well?
 
> [ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, order:0, 
> mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> 
> along with a backtrace and memory information, and then there was
> silence.

> When I walked up to the machine, it had completely died; it
> wouldn't turn on its screen on key press any more, blindly trying to
> reboot via SysRequest had no effect, but the caps lock LED also wasn't
> blinking, like it normally does when a kernel panic occurs. Good
> question what state it was in. The OOM reaper didn't really seem to
> kick in and kill processes this time, it seems.
> 
> The complete capture is up at:
> 
> http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz

This is the stall report:
[ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, order:0, 
mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
[ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 4.9.0-gentoo 
#4

pid 1950 is trying to allocate for a _long_ time. Considering that this
is the only stall report, this means that reclaim took really long so we
didn't get to the page allocator for that long. It sounds really crazy!

$ xzgrep -w 1950 teela_2016-12-20.log.xz | grep mm_vmscan_lru_shrink_inactive | 
sed 's@.*nr_reclaimed=\([0-9\]*\).*@\1@' | sort | uniq -c
509 0
  1 1
  1 10
  5 11
  1 12
  1 14
  1 16
  2 19
  5 2
  1 22
  2 23
  1 25
  3 28
  2 3
  1 4
  4 5

It barely managed to reclaim anything, even though it tried a lot. It
had a hard time actually isolating anything:

$ xzgrep -w 1950 teela_2016-12-20.log.xz | grep mm_vmscan_lru_isolate: | sed 
's@.*nr_taken=@@' | sort | uniq -c
   8284 0 file=1
  8 11 file=1
  4 14 file=1
  1 1 file=1
  7 23 file=1
  1 25 file=1
  9 2 file=1
501 32 file=1
  1 3 file=1
  7 5 file=1
  1 6 file=1

a typical mm_vmscan_lru_isolate looks as follows

btrfs-transacti-1950  [001] d...  1368.508008: mm_vmscan_lru_isolate: 
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=266727 nr_taken=0 
file=1

so it seems the whole inactive LRU has been scanned, but we couldn't
isolate a single page. There are two possibilities here. Either we skip
them all because they are from the highmem zone or we fail to
__isolate_lru_page them. Counters will not tell us because nr_scanned
includes skipped pages. I have updated the debugging patch to make this
distinction. I suspect we are skipping all of them...
The latter option would be really surprising because the only way to fail
__isolate_lru_page with the 0 isolate_mode is if get_page_unless_zero(page)
fails which would mean we would have pages with 0 reference count on the
LRU list.
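
Roughly, the distinction the updated patch has to make in
isolate_lru_pages() looks like this (paraphrased sketch, not the exact
patch):

	/*
	 * Pages from zones above the reclaim classzone are skipped but
	 * still counted in nr_scanned, so a separate nr_skipped counter
	 * is needed to tell "everything skipped as highmem" apart from
	 * "__isolate_lru_page() failed".
	 */
	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
		struct page *page = lru_to_page(src);

		if (page_zonenum(page) > classzone_idx) {
			list_move(&page->lru, &pages_skipped);
			nr_skipped[page_zonenum(page)]++; /* exported via the tracepoint */
			continue;
		}

		if (__isolate_lru_page(page, mode) == 0)
			list_move(&page->lru, dst); /* isolated successfully */
	}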

The stall message is from a later time so the situation might have
changed but
[ 1661.490170] Node 0   active_anon:139296kB inactive_anon:432kB active_file:1088996kB inactive_file:1114524kB
[ 1661.490745] DMA      active_anon:0kB inactive_anon:0kB active_file:9540kB inactive_file:0kB
[ 1661.491528] Normal   active_anon:0kB inactive_anon:0kB active_file:530560kB inactive_file:452kB
[ 1661.513077] HighMem  active_anon:139296kB inactive_anon:432kB active_file:548896kB inactive_file:1114068kB

suggests our inactive 

Re: OOM: Better, but still there on

2016-12-19 Thread Nils Holland
On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:

> Unfortunately shrink_active_list doesn't have any tracepoint so we do
> not know whether we managed to rotate those pages. If they are referenced
> quickly enough we might just keep refaulting them... Could you try to apply
> the following diff on top of what you have currently. It should add some more
> tracepoint data which might tell us more. We can reduce the amount of
> tracing data by enabling only mm_vmscan_lru_isolate,
> mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.

So, the results are in! I applied your patch and rebuilt the kernel,
then I rebooted the machine, set up tracing so that only the three
events you mentioned were being traced, and captured the output over
the network.
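
For reference, limiting the trace to just those three events amounts to
something like the following (paths assume the /debug/trace tracefs mount
from Michal's earlier instructions):

echo 0 > /debug/trace/events/vmscan/enable
echo 1 > /debug/trace/events/vmscan/mm_vmscan_lru_isolate/enable
echo 1 > /debug/trace/events/vmscan/mm_vmscan_lru_shrink_inactive/enable
echo 1 > /debug/trace/events/vmscan/mm_vmscan_lru_shrink_active/enable
cat /debug/trace/trace_pipe > /dev/udp/$ip/$port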

Things went a bit different this time: The trace events started to
appear after a while and a whole lot of them were generated, but
suddenly they stopped. A short while later, we get

[ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, order:0, 
mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)

along with a backtrace and memory information, and then there was
silence. When I walked up to the machine, it had completely died; it
wouldn't turn on its screen on key press any more, blindly trying to
reboot via SysRequest had no effect, but the caps lock LED also wasn't
blinking, like it normally does when a kernel panic occurs. Good
question what state it was in. The OOM reaper didn't really seem to
kick in and kill processes this time, it seems.

The complete capture is up at:

http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz

Greetings
Nils


Re: OOM: Better, but still there on

2016-12-19 Thread Michal Hocko
On Sat 17-12-16 22:06:47, Nils Holland wrote:
[...]
> Unfortunately, the reclaim trace messages stopped a while after the first
> OOM messages showed up - most likely my "cat" had been killed at that
> point or became unresponsive. :-/

The latter is more probable because I do not see the OOM killer kill
any cat process, and the first bash has been killed 10s after the first
OOM.

2016-12-17 21:36:56 192.168.17.23:6665 [ 1276.828639] Killed process 3894 (xz) 
total-vm:68640kB, anon-rss:65920kB, file-rss:1696kB, shmem-rss:0kB
2016-12-17 21:36:57 192.168.17.23:6665 [ 1277.598271] Killed process 3864 
(sandbox) total-vm:2192kB, anon-rss:128kB, file-rss:1400kB, shmem-rss:0kB
2016-12-17 21:36:57 192.168.17.23:6665 [ 1278.222416] Killed process 3086 
(emerge) total-vm:65064kB, anon-rss:52768kB, file-rss:7216kB, shmem-rss:0kB
2016-12-17 21:36:58 192.168.17.23:6665 [ 1278.846902] Killed process 2705 
(NetworkManager) total-vm:104376kB, anon-rss:4172kB, file-rss:10516kB, 
shmem-rss:0kB
2016-12-17 21:36:59 192.168.17.23:6665 [ 1279.862150] Killed process 2823 
(polkitd) total-vm:65536kB, anon-rss:2192kB, file-rss:8656kB, shmem-rss:0kB
2016-12-17 21:37:00 192.168.17.23:6665 [ 1280.496988] Killed process 3885 
(ebuild.sh) total-vm:10640kB, anon-rss:3340kB, file-rss:2244kB, shmem-rss:0kB
2016-12-17 21:37:04 192.168.17.23:6665 [ 1285.126052] Killed process 2824 
(wpa_supplicant) total-vm:8580kB, anon-rss:540kB, file-rss:5092kB, shmem-rss:0kB
2016-12-17 21:37:05 192.168.17.23:6665 [ 1286.124687] Killed process 2943 
(bash) total-vm:7320kB, anon-rss:368kB, file-rss:3240kB, shmem-rss:0kB
2016-12-17 21:37:07 192.168.17.23:6665 [ 1287.974353] Killed process 2878 
(sshd) total-vm:10524kB, anon-rss:700kB, file-rss:4908kB, shmem-rss:4kB
2016-12-17 21:37:16 192.168.17.23:6665 [ 1296.953350] Killed process 4048 
(ebuild.sh) total-vm:10640kB, anon-rss:3352kB, file-rss:1892kB, shmem-rss:0kB
2016-12-17 21:37:24 192.168.17.23:6665 [ 1304.398944] Killed process 1980 
(systemd-journal) total-vm:24640kB, anon-rss:332kB, file-rss:4608kB, 
shmem-rss:4kB
2016-12-17 21:37:25 192.168.17.23:6665 [ 1305.934472] Killed process 2918 
((sd-pam)) total-vm:9152kB, anon-rss:964kB, file-rss:1536kB, shmem-rss:0kB
2016-12-17 21:37:28 192.168.17.23:6665 [ 1308.878775] Killed process 2888 
(systemd) total-vm:7856kB, anon-rss:528kB, file-rss:4388kB, shmem-rss:0kB
2016-12-17 21:37:34 192.168.17.23:6665 [ 1314.268177] Killed process 2711 
(rsyslogd) total-vm:25200kB, anon-rss:1084kB, file-rss:2908kB, shmem-rss:0kB
2016-12-17 21:37:39 192.168.17.23:6665 [ 1319.634561] Killed process 2704 
(systemd-logind) total-vm:5980kB, anon-rss:340kB, file-rss:3568kB, shmem-rss:0kB
2016-12-17 21:37:43 192.168.17.23:6665 [ 1323.488894] Killed process 3103 
(htop) total-vm:7532kB, anon-rss:1024kB, file-rss:2872kB, shmem-rss:0kB
2016-12-17 21:38:42 192.168.17.23:6665 [ 1379.556282] Killed process 2701 
(systemd-timesyn) total-vm:15480kB, anon-rss:356kB, file-rss:3292kB, 
shmem-rss:0kB
2016-12-17 21:39:05 192.168.17.23:6665 [ 1403.130435] Killed process 3082 
(bash) total-vm:7324kB, anon-rss:380kB, file-rss:3324kB, shmem-rss:0kB
2016-12-17 21:39:17 192.168.17.23:6665 [ 1417.600367] Killed process 3077 
(start_trace) total-vm:6948kB, anon-rss:184kB, file-rss:2524kB, shmem-rss:0kB
2016-12-17 21:39:24 192.168.17.23:6665 [ 1423.955452] Killed process 3073 
(bash) total-vm:7324kB, anon-rss:380kB, file-rss:3284kB, shmem-rss:0kB
2016-12-17 21:39:27 192.168.17.23:6665 [ 1425.338670] Killed process 3099 
(bash) total-vm:7324kB, anon-rss:376kB, file-rss:3176kB, shmem-rss:0kB
2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800677] Killed process 3070 
(screen) total-vm:7440kB, anon-rss:960kB, file-rss:2360kB, shmem-rss:0kB
 
> In the end, the machine didn't completely panic, but after nothing new
> showed up being logged via the network, I walked up to the
> machine and found it in a state where I couldn't really log in to it
> anymore, but all that worked was, as always, a magic SysRequest reboot.
> 
> The complete log, from machine boot right up to the point where it
> wouldn't really do anything anymore, is up again on my web server (~42
> MB, 928 KB packed):
> 
> http://ftp.tisys.org/pub/misc/teela_2016-12-17.log.xz

$ xzgrep invoked teela_2016-12-17.log.xz | sed 
's@.*gfp_mask=0x[0-9a-f]*(\(.*\)), .*@\1@' | sort | uniq -c
  2 GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK
  1 GFP_KERNEL|__GFP_NOTRACK
  6 GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
  1 GFP_KERNEL|__GFP_NOWARN|__GFP_REPEAT|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
  2 GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK
  2 GFP_TEMPORARY
  5 GFP_TEMPORARY|__GFP_NOTRACK
  3 GFP_USER|__GFP_COLD

so all of them are lowmem requests which is in line with your previous
report. This basically means that only zone Normal is usable as I've
already mentioned before. In general lowmem problems are inherent to the
32b kernels but in this case we still have a _lot of_ page cache to
reclaim so we 

Re: OOM: Better, but still there on

2016-12-17 Thread Tetsuo Handa
Nils Holland wrote:
> On Sat, Dec 17, 2016 at 11:44:45PM +0900, Tetsuo Handa wrote:
> > On 2016/12/17 21:59, Nils Holland wrote:
> > > On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> > >> mount -t tracefs none /debug/trace
> > >> echo 1 > /debug/trace/events/vmscan/enable
> > >> cat /debug/trace/trace_pipe > trace.log
> > >>
> > >> should help
> > >> [...]
> > >
> > > No problem! I enabled writing the trace data to a file and then tried
> > > to trigger another OOM situation. That worked, this time without a
> > > complete kernel panic, but with only my processes being killed and the
> > > system becoming unresponsive.
> >
> > Under an OOM situation, writing to a file on disk is unlikely to work. Maybe
> > logging via the network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
> > if you are using bash) works better. (I wish we could do it from the kernel
> > so that /bin/cat is not disturbed by delays due to page faults.)
> >
> > If you can configure netconsole for logging OOM killer messages and
> > UDP socket for logging trace_pipe messages, udplogger at
> > https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
> > might fit for logging both output with timestamp into a single file.
>
> Actually, I decided to give this a try once more on machine #2, i.e.
> not the one that produced the previous trace, but the other one.
>
> I logged via netconsole as well as 'cat /debug/trace/trace_pipe' via
> the network to another machine running udplogger. After the machine
> had been freshly booted and I had set up the logging, unpacking of the
> firefox source tarball started. After it had been unpacking for a
> while, the first load of trace messages started to appear. Some time
> later, OOMs started to appear - I've got quite a lot of them in my
> capture file this time.

Thank you for capturing. I think it worked well. Let's wait for Michal.

The first OOM killer invocation was

  2016-12-17 21:36:56 192.168.17.23:6665 [ 1276.828639] Killed process 3894 
(xz) total-vm:68640kB, anon-rss:65920kB, file-rss:1696kB, shmem-rss:0kB

and the last OOM killer invocation was

  2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800677] Killed process 3070 
(screen) total-vm:7440kB, anon-rss:960kB, file-rss:2360kB, shmem-rss:0kB

and trace output was sent until

  2016-12-17 21:37:07 192.168.17.23:48468 kworker/u4:4-3896  [000]   
1287.202958: mm_shrink_slab_start: super_cache_scan+0x0/0x170 f4436ed4: nid: 0 
objects to shrink 86 gfp_flags GFP_NOFS|__GFP_NOFAIL pgs_scanned 32 lru_pgs 
406078 cache items 412 delta 0 total_scan 86

which (I hope) should be sufficient for analysis.

>
> Unfortunately, the reclaim trace messages stopped a while after the first
> OOM messages showed up - most likely my "cat" had been killed at that
> point or became unresponsive. :-/
>
> In the end, the machine didn't completely panic, but after nothing new
> showed up being logged via the network, I walked up to the
> machine and found it in a state where I couldn't really log in to it
> anymore, but all that worked was, as always, a magic SysRequest reboot.

There is a known issue (since Linux 2.6.32) where all memory allocation requests
get stuck due to a kswapd vs. shrink_inactive_list() livelock which occurs under
an almost-OOM situation ( http://lkml.kernel.org/r/20160211225929.GU14668@dastard
).
If we hit it, even "page allocation stalls for " messages do not show up.

Even if we didn't hit it, although agetty and sshd were still alive

  2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800614] [ 2800] 0  2800 
1152  494   6   30 0 agetty
  2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800618] [ 2802] 0  2802 
1457 1055   6   30 -1000 sshd

memory allocation was being delayed far too long

  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034624] btrfs-transacti: page 
allocation stalls for 93995ms, order:0, mode:0x2400840(GFP_NOFS|__GFP_NOFAIL)
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034628] CPU: 1 PID: 1949 Comm: 
btrfs-transacti Not tainted 4.9.0-gentoo #3
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034630] Hardware name: 
Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034638]  f162f94c c142bd8e 
0001  f162f970 c110ad7e c1b58833 02400840
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034645]  f162f978 f162f980 
c1b55814 f162f960 0160 f162fa38 c110b78c 02400840
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034652]  c1b55814 00016f2b 
 0040  f21d f21d 0001
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034653] Call Trace:
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034660]  [] 
dump_stack+0x47/0x69
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034666]  [] 
warn_alloc+0xce/0xf0
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034671]  [] 
__alloc_pages_nodemask+0x97c/0xd30
  2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034678]  

Re: OOM: Better, but still there on 4.9

2016-12-17 Thread Xin Zhou
Hi,
The system is supposed to have a special memory reservation for coredump and other
debug info when encountering a panic;
the size seems to be configurable.
Thanks,
Xin
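
The reservation referred to here is presumably the kdump "crashkernel" region,
which is carved out at boot and sized via the kernel command line. A minimal
sketch, assuming a GRUB-based setup with kexec-tools installed (the 128M figure
is only an example, not a recommendation):

  # /etc/default/grub -- reserve memory for a crash/capture kernel:
  GRUB_CMDLINE_LINUX="crashkernel=128M"

  # regenerate the bootloader config, then reboot:
  grub-mkconfig -o /boot/grub/grub.cfg

  # after reboot, a non-zero value confirms the reservation took effect:
  cat /sys/kernel/kexec_crash_size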
 
 

Sent: Saturday, December 17, 2016 at 6:44 AM
From: "Tetsuo Handa" <penguin-ker...@i-love.sakura.ne.jp>
To: "Nils Holland" <nholl...@tisys.org>, "Michal Hocko" <mho...@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux...@kvack.org, "Chris Mason" 
<c...@fb.com>, "David Sterba" <dste...@suse.cz>, linux-bt...@vger.kernel.org
Subject: Re: OOM: Better, but still there on
On 2016/12/17 21:59, Nils Holland wrote:
> On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
>> mount -t tracefs none /debug/trace
>> echo 1 > /debug/trace/events/vmscan/enable
>> cat /debug/trace/trace_pipe > trace.log
>>
>> should help
>> [...]
>
> No problem! I enabled writing the trace data to a file and then tried
> to trigger another OOM situation. That worked, this time without a
> complete kernel panic, but with only my processes being killed and the
> system becoming unresponsive. When that happened, I let it run for
> another minute or two so that in case it was still logging something
> to the trace file, it could continue to do so some time longer. Then I
> rebooted using the only thing that still worked, i.e. the magic
> SysRq key.

Under an OOM situation, writing to a file on disk is unlikely to work. Maybe
logging via the network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
if you are using bash) works better. (I wish we could do this from the kernel
so that /bin/cat is not disturbed by delays due to page faults.)

If you can configure netconsole for logging OOM killer messages and
UDP socket for logging trace_pipe messages, udplogger at
https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
might fit for logging both output with timestamp into a single file.



Re: OOM: Better, but still there on 4.9

2016-12-17 Thread Nils Holland
On Sat, Dec 17, 2016 at 11:44:45PM +0900, Tetsuo Handa wrote:
> On 2016/12/17 21:59, Nils Holland wrote:
> > On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> >> mount -t tracefs none /debug/trace
> >> echo 1 > /debug/trace/events/vmscan/enable
> >> cat /debug/trace/trace_pipe > trace.log
> >>
> >> should help
> >> [...]
> > 
> > No problem! I enabled writing the trace data to a file and then tried
> > to trigger another OOM situation. That worked, this time without a
> > complete kernel panic, but with only my processes being killed and the
> > system becoming unresponsive.
> 
> Under an OOM situation, writing to a file on disk is unlikely to work. Maybe
> logging via the network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
> if you are using bash) works better. (I wish we could do this from the kernel
> so that /bin/cat is not disturbed by delays due to page faults.)
> 
> If you can configure netconsole for logging OOM killer messages and
> UDP socket for logging trace_pipe messages, udplogger at
> https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
> might fit for logging both output with timestamp into a single file.

Actually, I decided to give this a try once more on machine #2, i.e.
not the one that produced the previous trace, but the other one.

I logged via netconsole as well as 'cat /debug/trace/trace_pipe' via
the network to another machine running udplogger. After the machine
had been freshly booted and I had set up the logging, unpacking of the
firefox source tarball started. After it had been unpacking for a
while, the first load of trace messages started to appear. Some time
later, OOMs started to appear - I've got quite a lot of them in my
capture file this time.

Unfortunately, the reclaim trace messages stopped a while after the first
OOM messages showed up - most likely my "cat" had been killed at that
point or had become unresponsive. :-/

In the end, the machine didn't completely panic, but after nothing new
showed up in the logs sent via the network, I walked up to the
machine and found it in a state where I couldn't really log in to it
anymore; all that still worked was, as always, a magic SysRq reboot.

The complete log, from machine boot right up to the point where it
wouldn't really do anything anymore, is up again on my web server (~42
MB, 928 KB packed):

http://ftp.tisys.org/pub/misc/teela_2016-12-17.log.xz

Greetings
Nils
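
For completeness: the netconsole side of such a setup needs little more than a
module parameter naming the target. A rough sketch, where the target IP and MAC
are placeholders for whatever host runs udplogger (parameter syntax as in
Documentation/networking/netconsole.txt):

  # on the machine under test; format is
  #   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
  # target address and MAC below are placeholders:
  modprobe netconsole netconsole=6665@192.168.17.23/eth0,6665@192.168.17.1/00:11:22:33:44:55

  # on the receiver, anything that captures UDP will do if udplogger
  # is not used, e.g.:
  nc -u -l -p 6665 | tee netconsole.log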


Re: OOM: Better, but still there on 4.9

2016-12-17 Thread Nils Holland
On Sat, Dec 17, 2016 at 11:44:45PM +0900, Tetsuo Handa wrote:
> On 2016/12/17 21:59, Nils Holland wrote:
> > On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> >> mount -t tracefs none /debug/trace
> >> echo 1 > /debug/trace/events/vmscan/enable
> >> cat /debug/trace/trace_pipe > trace.log
> >>
> >> should help
> >> [...]
> > 
> > No problem! I enabled writing the trace data to a file and then tried
> > to trigger another OOM situation. That worked, this time without a
> > complete kernel panic, but with only my processes being killed and the
> > system becoming unresponsive.
> > [...]
> 
> Under an OOM situation, writing to a file on disk is unlikely to work. Maybe
> logging via the network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
> if you are using bash) works better. (I wish we could do this from the kernel
> so that /bin/cat is not disturbed by delays due to page faults.)
> 
> If you can configure netconsole for logging OOM killer messages and
> UDP socket for logging trace_pipe messages, udplogger at
> https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
> might fit for logging both output with timestamp into a single file.

Thanks for the hint, that sounds very sane! I'll try to go that route for
the next log / trace I produce. Of course, if Michal says that the
trace file I've already posted, which was logged to a file, is
useless and that it would have been better to log to a
different machine via the network, I could also repeat the current
experiment and produce a new file at any time. :-)

Greetings
Nils


Re: OOM: Better, but still there on 4.9

2016-12-17 Thread Tetsuo Handa
On 2016/12/17 21:59, Nils Holland wrote:
> On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
>> mount -t tracefs none /debug/trace
>> echo 1 > /debug/trace/events/vmscan/enable
>> cat /debug/trace/trace_pipe > trace.log
>>
>> should help
>> [...]
> 
> No problem! I enabled writing the trace data to a file and then tried
> to trigger another OOM situation. That worked, this time without a
> complete kernel panic, but with only my processes being killed and the
> system becoming unresponsive. When that happened, I let it run for
> another minute or two so that in case it was still logging something
> to the trace file, it could continue to do so some time longer. Then I
> rebooted using the only thing that still worked, i.e. the magic
> SysRq key.

Under an OOM situation, writing to a file on disk is unlikely to work. Maybe
logging via the network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
if you are using bash) works better. (I wish we could do this from the kernel
so that /bin/cat is not disturbed by delays due to page faults.)

If you can configure netconsole for logging OOM killer messages and
UDP socket for logging trace_pipe messages, udplogger at
https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
might fit for logging both output with timestamp into a single file.
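
Putting the pieces together, the suggested capture boils down to something like
this (a sketch; $ip and $port stand for the udplogger host, and the /dev/udp
redirection is a bash feature, so no local disk writes are involved):

  mount -t tracefs none /debug/trace
  echo 1 > /debug/trace/events/vmscan/enable
  # stream the trace over UDP instead of writing to a local file:
  cat /debug/trace/trace_pipe > /dev/udp/$ip/$port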



Re: OOM: Better, but still there on 4.9

2016-12-17 Thread Nils Holland
On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> On Fri 16-12-16 19:47:00, Nils Holland wrote:
> > 
> > Dec 16 18:56:24 boerne.fritz.box kernel: Purging GPU memory, 37 pages 
> > freed, 10219 pages still pinned.
> > Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd invoked oom-killer: 
> > gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), 
> > nodemask=0, order=1, oom_score_adj=0
> > Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd cpuset=/ mems_allowed=0
> [...]
> > Dec 16 18:56:29 boerne.fritz.box kernel: Normal free:41008kB min:41100kB 
> > low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB 
> > active_file:470556kB inactive_file:148kB unevictable:0kB 
> > writepending:1616kB present:897016kB managed:831480kB mlocked:0kB 
> > slab_reclaimable:213172kB slab_unreclaimable:86236kB kernel_stack:1864kB 
> > pagetables:3572kB bounce:0kB free_pcp:532kB local_pcp:456kB free_cma:0kB
> 
> this is a GFP_KERNEL allocation so it cannot use the highmem zone again.
> There is no anonymous memory in this zone but the allocation
> context implies the full reclaim context so the file LRU should be
> reclaimable. For some reason ~470MB of the active file LRU is still
> there. This is quite unexpected. It is harder to tell more without
> further data. It would be great if you could enable reclaim related
> tracepoints:
> 
> mount -t tracefs none /debug/trace
> echo 1 > /debug/trace/events/vmscan/enable
> cat /debug/trace/trace_pipe > trace.log
> 
> should help
> [...]

No problem! I enabled writing the trace data to a file and then tried
to trigger another OOM situation. That worked, this time without a
complete kernel panic, but with only my processes being killed and the
system becoming unresponsive. When that happened, I let it run for
another minute or two so that in case it was still logging something
to the trace file, it could continue to do so some time longer. Then I
rebooted using the only thing that still worked, i.e. the magic
SysRq key.

The trace file has actually become rather big (around 21 MB). I didn't
dare to cut anything from it because I didn't want to risk deleting
something that might turn out to be important. So, due to the size, I'm not
attaching the trace file to this message, but it's up compressed
(about 536 KB) to be grabbed at:

http://ftp.tisys.org/pub/misc/trace.log.xz

For reference, here's the OOM report that goes along with this
incident and the trace file:

Dec 17 13:31:06 boerne.fritz.box kernel: Purging GPU memory, 145 pages freed, 
10287 pages still pinned.
Dec 17 13:31:07 boerne.fritz.box kernel: awesome invoked oom-killer: 
gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
Dec 17 13:31:07 boerne.fritz.box kernel: awesome cpuset=/ mems_allowed=0
Dec 17 13:31:07 boerne.fritz.box kernel: CPU: 1 PID: 5599 Comm: awesome Not 
tainted 4.9.0-gentoo #3
Dec 17 13:31:07 boerne.fritz.box kernel: Hardware name: TOSHIBA Satellite 
L500/KSWAA, BIOS V1.80 10/28/2009
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37c18
Dec 17 13:31:07 boerne.fritz.box kernel:  c1433406
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37d48
Dec 17 13:31:07 boerne.fritz.box kernel:  c5319280
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37c48
Dec 17 13:31:07 boerne.fritz.box kernel:  c1170011
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37c9c
Dec 17 13:31:07 boerne.fritz.box kernel:  00200286
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37c48
Dec 17 13:31:07 boerne.fritz.box kernel:  c1438fff
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37c4c
Dec 17 13:31:07 boerne.fritz.box kernel:  c72479c0
Dec 17 13:31:07 boerne.fritz.box kernel:  c60dd200
Dec 17 13:31:07 boerne.fritz.box kernel:  c5319280
Dec 17 13:31:07 boerne.fritz.box kernel:  c1ad1899
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37d48
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37c8c
Dec 17 13:31:07 boerne.fritz.box kernel:  c1114407
Dec 17 13:31:07 boerne.fritz.box kernel:  c10513a5
Dec 17 13:31:07 boerne.fritz.box kernel:  c5a37c78
Dec 17 13:31:07 boerne.fritz.box kernel:  c11140a1
Dec 17 13:31:07 boerne.fritz.box kernel:  0005
Dec 17 13:31:07 boerne.fritz.box kernel:  
Dec 17 13:31:07 boerne.fritz.box kernel:  
Dec 17 13:31:07 boerne.fritz.box kernel: Call Trace:
Dec 17 13:31:07 boerne.fritz.box kernel:  [] dump_stack+0x47/0x61
Dec 17 13:31:07 boerne.fritz.box kernel:  [] dump_header+0x5f/0x175
Dec 17 13:31:07 boerne.fritz.box kernel:  [] ? ___ratelimit+0x7f/0xe0
Dec 17 13:31:07 boerne.fritz.box kernel:  [] 
oom_kill_process+0x207/0x3c0
Dec 17 13:31:07 boerne.fritz.box kernel:  [] ? 
has_capability_noaudit+0x15/0x20
Dec 17 13:31:07 boerne.fritz.box kernel:  [] ? 
oom_badness.part.13+0xb1/0x120
Dec 17 13:31:07 boerne.fritz.box kernel:  [] out_of_memory+0xd4/0x270
Dec 17 13:31:07 boerne.fritz.box kernel:  [] 
__alloc_pages_nodemask+0xcf5/0xd60
Dec 17 13:31:07 boerne.fritz.box kernel:  [] ? 
skb_queue_purge+0x30/0x30
Dec 17 13:31:07 boerne.fritz.box kernel:  [] 

Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 19:47:00, Nils Holland wrote:
[...]
> Despite the fact that I'm no expert, I can see that there's no more
> GFP_NOFS being logged, which seems to be what the patches tried to
> achieve. What the still-present OOMs mean remains up for
> interpretation by the experts; all I can say is that in the (pre-4.8?)
> past, doing all of the things I just did would probably slow down my
> machine quite a bit, but I can't remember ever seeing it OOM or
> even crash completely.
> 
> Dec 16 18:56:24 boerne.fritz.box kernel: Purging GPU memory, 37 pages freed, 
> 10219 pages still pinned.
> Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd invoked oom-killer: 
> gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, 
> order=1, oom_score_adj=0
> Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd cpuset=/ mems_allowed=0
[...]
> Dec 16 18:56:29 boerne.fritz.box kernel: Normal free:41008kB min:41100kB 
> low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB 
> active_file:470556kB inactive_file:148kB unevictable:0kB writepending:1616kB 
> present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213172kB 
> slab_unreclaimable:86236kB kernel_stack:1864kB pagetables:3572kB bounce:0kB 
> free_pcp:532kB local_pcp:456kB free_cma:0kB

this is a GFP_KERNEL allocation so it cannot use the highmem zone again.
There is no anonymous memory in this zone but the allocation
context implies the full reclaim context so the file LRU should be
reclaimable. For some reason ~470MB of the active file LRU is still
there. This is quite unexpected. It is harder to tell more without
further data. It would be great if you could enable reclaim related
tracepoints:

mount -t tracefs none /debug/trace
echo 1 > /debug/trace/events/vmscan/enable
cat /debug/trace/trace_pipe > trace.log

should help
[...]

> Dec 16 18:56:31 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: 
> gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0

another allocation in a short time. Killing the task obviously
didn't help because the lowmem memory pressure hasn't been relieved.

[...]
> Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:41028kB min:41100kB 
> low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB 
> active_file:472164kB inactive_file:108kB unevictable:0kB writepending:112kB 
> present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB 
> slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2564kB bounce:32kB 
> free_pcp:180kB local_pcp:24kB free_cma:0kB

in fact we have even more pages on the file LRUs.

[...]

> Dec 16 18:56:32 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: 
> gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
[...]
> Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:40988kB min:41100kB 
> low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB 
> active_file:472436kB inactive_file:144kB unevictable:0kB writepending:312kB 
> present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB 
> slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2464kB bounce:32kB 
> free_pcp:116kB local_pcp:0kB free_cma:0kB

same here. All that suggests that the page cache cannot be reclaimed for
some reason. It is hard to tell why but there is definitely something
bad going on.
-- 
Michal Hocko
SUSE Labs
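
A low-tech way to watch the situation Michal describes while it develops is to
sample the Normal zone counters as the workload runs. A sketch, assuming the
per-zone nr_zone_* field names that 4.8+ kernels expose in /proc/zoneinfo:

  # print free and file-LRU page counts for the Normal zone once a second
  while sleep 1; do
      awk '/^Node/ { z = /Normal/ }
           z && /nr_free_pages|nr_zone_active_file|nr_zone_inactive_file/ { print $1, $2 }' /proc/zoneinfo
      echo ---
  done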


Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 17:47:25, Chris Mason wrote:
> On 12/16/2016 05:14 PM, Michal Hocko wrote:
> > On Fri 16-12-16 13:15:18, Chris Mason wrote:
> > > On 12/16/2016 02:39 AM, Michal Hocko wrote:
> > [...]
> > > > I believe the right way to go around this is to pursue what I've started
> > > > in [1]. I will try to prepare something for testing today for you. Stay
> > > > tuned. But I would be really happy if somebody from the btrfs camp could
> > > > check the NOFS aspect of this allocation. We have already seen
> > > > allocation stalls from this path quite recently
> > > 
> > > Just double checking, are you asking why we're using GFP_NOFS to avoid 
> > > going
> > > into btrfs from the btrfs writepages call, or are you asking why we aren't
> > > allowing highmem?
> > 
> > I am more interested in the NOFS part. Why cannot this be a full
> > GFP_KERNEL context? What kind of locks would we lock up when recursing
> > to the fs via slab shrinkers?
> > 
> 
> Since this is our writepages call, any jump into direct reclaim would go to
> writepage, which would end up calling the same set of code to read metadata
> blocks, which would do a GFP_KERNEL allocation and end up back in writepage
> again.

But we have not been doing pageout on the page cache from direct reclaim
for a long time. So basically the only way to recurse back into the fs
code is via the slab ([di]cache) shrinkers. Are those a problem as well?

-- 
Michal Hocko
SUSE Labs
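
For reference, the superblock shrinker already refuses to do filesystem work
when the calling context forbids it, keyed off the gfp mask carried in struct
shrink_control. Roughly (a simplified sketch of the pattern, not a verbatim
copy of fs/super.c):

  static unsigned long super_cache_scan(struct shrinker *shrink,
                                        struct shrink_control *sc)
  {
          unsigned long freed = 0;

          /* a GFP_NOFS caller must not recurse into filesystem code */
          if (!(sc->gfp_mask & __GFP_FS))
                  return SHRINK_STOP;

          /* ... otherwise prune the dentry and inode caches ... */
          return freed;
  }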


Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Chris Mason

On 12/16/2016 05:14 PM, Michal Hocko wrote:
> On Fri 16-12-16 13:15:18, Chris Mason wrote:
> > On 12/16/2016 02:39 AM, Michal Hocko wrote:
> [...]
> > > I believe the right way to go around this is to pursue what I've started
> > > in [1]. I will try to prepare something for testing today for you. Stay
> > > tuned. But I would be really happy if somebody from the btrfs camp could
> > > check the NOFS aspect of this allocation. We have already seen
> > > allocation stalls from this path quite recently
> > 
> > Just double checking, are you asking why we're using GFP_NOFS to avoid going
> > into btrfs from the btrfs writepages call, or are you asking why we aren't
> > allowing highmem?
> 
> I am more interested in the NOFS part. Why cannot this be a full
> GFP_KERNEL context? What kind of locks would we lock up when recursing
> to the fs via slab shrinkers?

Since this is our writepages call, any jump into direct reclaim would go
to writepage, which would end up calling the same set of code to read
metadata blocks, which would do a GFP_KERNEL allocation and end up back
in writepage again.

We'd also have issues with blowing through transaction reservations
since the writepage recursion would have to nest into the running
transaction.


-chris
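
The recursion Chris describes is the textbook reason writeback paths allocate
with GFP_NOFS. An illustrative sketch (not actual btrfs code):

  #include <linux/gfp.h>
  #include <linux/slab.h>

  /*
   * Illustrative only: a GFP_KERNEL allocation made from a writepage
   * path may enter direct reclaim, and reclaim may call back into the
   * same filesystem's writeback code -- recursing into it and nesting
   * a second transaction inside the one already running. Clearing
   * __GFP_FS (i.e. using GFP_NOFS) keeps reclaim out of fs code.
   */
  static int example_writeback_alloc(void **bufp)
  {
          *bufp = kmalloc(4096, GFP_NOFS);
          return *bufp ? 0 : -ENOMEM;
  }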



Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 13:15:18, Chris Mason wrote:
> On 12/16/2016 02:39 AM, Michal Hocko wrote:
[...]
> > I believe the right way to go around this is to pursue what I've started
> > in [1]. I will try to prepare something for testing today for you. Stay
> > tuned. But I would be really happy if somebody from the btrfs camp could
> > check the NOFS aspect of this allocation. We have already seen
> > allocation stalls from this path quite recently
> 
> Just double checking, are you asking why we're using GFP_NOFS to avoid going
> into btrfs from the btrfs writepages call, or are you asking why we aren't
> allowing highmem?

I am more interested in the NOFS part. Why cannot this be a full
GFP_KERNEL context? What kind of locks would we lock up when recursing
to the fs via slab shrinkers?
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Chris Mason

On 12/16/2016 02:39 AM, Michal Hocko wrote:
> [CC linux-mm and btrfs guys]
> 
> On Thu 15-12-16 23:57:04, Nils Holland wrote:
> [...]
> > Of course, none of these are workloads that are new / special in any
> > way - prior to 4.8, I never experienced any issues doing the exact
> > same things.
> > 
> > Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: 
> > gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
> > Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
> > Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 
> > 4.9.0-gentoo #2
> > Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook 
> > PC/21F7, BIOS F.22 08/06/2014
> > Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
> > Dec 15 19:02:18 teela kernel:  eff0b604 c142bcce eff0b734  eff0b634 
> > c1163332  0292
> > Dec 15 19:02:18 teela kernel:  eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 
> > e7fa2900 c1b58785 eff0b734
> > Dec 15 19:02:18 teela kernel:  eff0b678 c110795f c1043895 eff0b664 c11075c7 
> > 0007  
> > Dec 15 19:02:18 teela kernel: Call Trace:
> > Dec 15 19:02:18 teela kernel:  [] dump_stack+0x47/0x69
> > Dec 15 19:02:18 teela kernel:  [] dump_header+0x60/0x178
> > Dec 15 19:02:18 teela kernel:  [] ? ___ratelimit+0x86/0xe0
> > Dec 15 19:02:18 teela kernel:  [] oom_kill_process+0x20f/0x3d0
> > Dec 15 19:02:18 teela kernel:  [] ? has_capability_noaudit+0x15/0x20
> > Dec 15 19:02:18 teela kernel:  [] ? oom_badness.part.13+0xb7/0x130
> > Dec 15 19:02:18 teela kernel:  [] out_of_memory+0xd9/0x260
> > Dec 15 19:02:18 teela kernel:  [] __alloc_pages_nodemask+0xbfb/0xc80
> > Dec 15 19:02:18 teela kernel:  [] pagecache_get_page+0xad/0x270
> > Dec 15 19:02:18 teela kernel:  [] alloc_extent_buffer+0x116/0x3e0
> > Dec 15 19:02:18 teela kernel:  [] 
> > btrfs_find_create_tree_block+0xe/0x10
> > Dec 15 19:02:18 teela kernel:  [] btrfs_alloc_tree_block+0x1ef/0x5f0
> > Dec 15 19:02:18 teela kernel:  [] __btrfs_cow_block+0x143/0x5f0
> > Dec 15 19:02:18 teela kernel:  [] btrfs_cow_block+0x13a/0x220
> > Dec 15 19:02:18 teela kernel:  [] btrfs_search_slot+0x1d1/0x870
> > Dec 15 19:02:18 teela kernel:  [] btrfs_lookup_file_extent+0x4d/0x60
> > Dec 15 19:02:18 teela kernel:  [] __btrfs_drop_extents+0x176/0x1070
> > Dec 15 19:02:18 teela kernel:  [] ? kmem_cache_alloc+0xb7/0x190
> > Dec 15 19:02:18 teela kernel:  [] ? start_transaction+0x65/0x4b0
> > Dec 15 19:02:18 teela kernel:  [] ? __kmalloc+0x147/0x1e0
> > Dec 15 19:02:18 teela kernel:  [] cow_file_range_inline+0x215/0x6b0
> > Dec 15 19:02:18 teela kernel:  [] cow_file_range.isra.49+0x55c/0x6d0
> > Dec 15 19:02:18 teela kernel:  [] ? lock_extent_bits+0x75/0x1e0
> > Dec 15 19:02:18 teela kernel:  [] run_delalloc_range+0x441/0x470
> > Dec 15 19:02:18 teela kernel:  [] 
> > writepage_delalloc.isra.47+0x144/0x1e0
> > Dec 15 19:02:18 teela kernel:  [] __extent_writepage+0xd8/0x2b0
> > Dec 15 19:02:18 teela kernel:  [] extent_writepages+0x25c/0x380
> > Dec 15 19:02:18 teela kernel:  [] ? btrfs_real_readdir+0x610/0x610
> > Dec 15 19:02:18 teela kernel:  [] btrfs_writepages+0x1f/0x30
> > Dec 15 19:02:18 teela kernel:  [] do_writepages+0x15/0x40
> > Dec 15 19:02:18 teela kernel:  [] __writeback_single_inode+0x35/0x2f0
> > Dec 15 19:02:18 teela kernel:  [] writeback_sb_inodes+0x16e/0x340
> > Dec 15 19:02:18 teela kernel:  [] wb_writeback+0xaa/0x280
> > Dec 15 19:02:18 teela kernel:  [] wb_workfn+0xd8/0x3e0
> > Dec 15 19:02:18 teela kernel:  [] process_one_work+0x114/0x3e0
> > Dec 15 19:02:18 teela kernel:  [] worker_thread+0x2f/0x4b0
> > Dec 15 19:02:18 teela kernel:  [] ? create_worker+0x180/0x180
> > Dec 15 19:02:18 teela kernel:  [] kthread+0x97/0xb0
> > Dec 15 19:02:18 teela kernel:  [] ? __kthread_parkme+0x60/0x60
> > Dec 15 19:02:18 teela kernel:  [] ret_from_fork+0x1b/0x28
> > Dec 15 19:02:18 teela kernel: Mem-Info:
> > Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
> >    active_file:274324 inactive_file:281962 
> > isolated_file:0
> 
> OK, so there is still some anonymous memory that could be swapped out
> and quite a lot of page cache. This might be harder to reclaim because
> the allocation is a GFP_NOFS request which is limited in its reclaim
> capabilities. It might be possible that those pagecache pages are pinned
> in some way by the filesystem.

Reading harder, it's possible those pagecache pages are all from the
btree inode.  They shouldn't be pinned by btrfs; kswapd should be able
to wander in and free a good chunk.  What btrfs wants to happen is for
this allocation to sit and wait for kswapd to make progress.


-chris


Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Nils Holland
On Fri, Dec 16, 2016 at 04:58:06PM +0100, Michal Hocko wrote:
> On Fri 16-12-16 08:39:41, Michal Hocko wrote:
> [...]
> > That being said, the OOM killer invocation is clearly pointless and
> > premature. We normally do not invoke it for GFP_NOFS requests
> > exactly for these reasons. But this is GFP_NOFS|__GFP_NOFAIL, which
> > behaves differently. I am about to change that but my last attempt [1]
> > has to be rethought.
> > 
> > Now another thing is that the __GFP_NOFAIL which has this nasty side
> > effect has been introduced by me d1b5c5671d01 ("btrfs: Prevent from
> > early transaction abort") in 4.3 so I am quite surprised that this has
> > shown up only in 4.8. Anyway there might be some other changes in the
> > btrfs which could make it more subtle.
> > 
> > I believe the right way to go around this is to pursue what I've started
> > in [1]. I will try to prepare something for testing today for you. Stay
> > tuned. But I would be really happy if somebody from the btrfs camp could
> > check the NOFS aspect of this allocation. We have already seen
> > allocation stalls from this path quite recently
> 
> Could you try to run with the two following patches?

I tried the two patches you sent, and ... well, things are different
now, but probably still a bit problematic. ;-)

Once again, I freshly booted both of my machines and told Gentoo's
portage to unpack and build the firefox sources. The first machine,
the one from which yesterday's OOM report came, became unresponsive
during the tarball unpack phase and had to be power cycled.
Unfortunately, there's nothing concerning its OOMs in the logs. :-(

The second machine actually finished the unpack phase successfully and
started the build process (which, every now and then, had also worked
with previous problematic kernels). However, after it had been
building for a while, I decided to increase the stress level by
starting X, firefox, and a terminal, and unpacking a kernel source
tarball in it; it then also started OOMing, this time once more with a
genuine kernel panic.
the logs, which I'm including below.

Although I'm no expert, I can see that there's no more GFP_NOFS being
logged, which seems to be what the patches tried to achieve. What the
still-present OOMs mean remains up for interpretation by the experts;
all I can say is that in the (pre-4.8?) past, doing all of the things I
just did would probably have slowed my machine down quite a bit, but I
can't remember ever seeing it OOM or crash completely.
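
For reference, the gfp_mask values quoted in this thread decode cleanly
against the 4.8/4.9 flag layout. A small userspace sketch (the numeric
flag values are copied from that era's include/linux/gfp.h and are not
stable across kernel versions):

#include <stdio.h>

/* 4.8/4.9-era ___GFP_* bit values; they differ in other kernel versions. */
#define F_IO		0x40u		/* __GFP_IO */
#define F_FS		0x80u		/* __GFP_FS */
#define F_NOFAIL	0x800u		/* __GFP_NOFAIL */
#define F_DIRECT	0x400000u	/* __GFP_DIRECT_RECLAIM */

static void decode(unsigned int mask)
{
	printf("%#x ->%s%s%s%s\n", mask,
	       mask & F_FS ? " __GFP_FS" : " (no __GFP_FS)",
	       mask & F_IO ? " __GFP_IO" : "",
	       mask & F_NOFAIL ? " __GFP_NOFAIL" : "",
	       mask & F_DIRECT ? " __GFP_DIRECT_RECLAIM" : "");
}

int main(void)
{
	decode(0x2400840);	/* teela report: GFP_NOFS|__GFP_NOFAIL */
	decode(0x27080c0);	/* boerne report below:
				   GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK */
	return 0;
}

The second mask does carry __GFP_FS, i.e. the OOM report below really
is not a GFP_NOFS one.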

Dec 16 18:56:24 boerne.fritz.box kernel: Purging GPU memory, 37 pages freed, 
10219 pages still pinned.
Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd invoked oom-killer: 
gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, 
order=1, oom_score_adj=0
Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd cpuset=/ mems_allowed=0
Dec 16 18:56:29 boerne.fritz.box kernel: CPU: 1 PID: 2 Comm: kthreadd Not 
tainted 4.9.0-gentoo #3
Dec 16 18:56:29 boerne.fritz.box kernel: Hardware name: TOSHIBA Satellite 
L500/KSWAA, BIOS V1.80 10/28/2009
Dec 16 18:56:29 boerne.fritz.box kernel:  f4105d6c c1433406 f4105e9c c6611280 
f4105d9c c1170011 f4105df0 00200296
Dec 16 18:56:29 boerne.fritz.box kernel:  f4105d9c c1438fff f4105da0 edc1bc80 
ee32ce00 c6611280 c1ad1899 f4105e9c
Dec 16 18:56:29 boerne.fritz.box kernel:  f4105de0 c1114407 c10513a5 f4105dcc 
c11140a1 0001  
Dec 16 18:56:29 boerne.fritz.box kernel: Call Trace:
Dec 16 18:56:29 boerne.fritz.box kernel:  [] dump_stack+0x47/0x61
Dec 16 18:56:29 boerne.fritz.box kernel:  [] dump_header+0x5f/0x175
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ? ___ratelimit+0x7f/0xe0
Dec 16 18:56:29 boerne.fritz.box kernel:  [] 
oom_kill_process+0x207/0x3c0
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ? 
has_capability_noaudit+0x15/0x20
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ? 
oom_badness.part.13+0xb1/0x120
Dec 16 18:56:29 boerne.fritz.box kernel:  [] out_of_memory+0xd4/0x270
Dec 16 18:56:29 boerne.fritz.box kernel:  [] 
__alloc_pages_nodemask+0xcf5/0xd60
Dec 16 18:56:29 boerne.fritz.box kernel:  [] 
copy_process.part.52+0xd5/0x1410
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ? 
pick_next_task_fair+0x479/0x510
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ? 
__kthread_parkme+0x60/0x60
Dec 16 18:56:29 boerne.fritz.box kernel:  [] _do_fork+0xc7/0x360
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ? 
__kthread_parkme+0x60/0x60
Dec 16 18:56:29 boerne.fritz.box kernel:  [] kernel_thread+0x30/0x40
Dec 16 18:56:29 boerne.fritz.box kernel:  [] kthreadd+0x106/0x150
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ? kthread_park+0x50/0x50
Dec 16 18:56:29 boerne.fritz.box kernel:  [] ret_from_fork+0x1b/0x28
Dec 16 18:56:29 boerne.fritz.box kernel: Mem-Info:
Dec 16 18:56:29 boerne.fritz.box kernel: active_anon:132176 inactive_anon:11640 isolated_anon:0
  active_file:295257 

Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Chris Mason

On 12/16/2016 02:39 AM, Michal Hocko wrote:

[CC linux-mm and btrfs guys]

On Thu 15-12-16 23:57:04, Nils Holland wrote:
[...]

Of course, none of these workloads are new or special in any
way - prior to 4.8, I never experienced any issues doing the exact
same things.

Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 
4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook 
PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
Dec 15 19:02:18 teela kernel:  eff0b604 c142bcce eff0b734  eff0b634 
c1163332  0292
Dec 15 19:02:18 teela kernel:  eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 
e7fa2900 c1b58785 eff0b734
Dec 15 19:02:18 teela kernel:  eff0b678 c110795f c1043895 eff0b664 c11075c7 
0007  
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel:  [] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel:  [] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel:  [] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel:  [] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel:  [] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel:  [] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel:  [] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel:  [] __alloc_pages_nodemask+0xbfb/0xc80
Dec 15 19:02:18 teela kernel:  [] pagecache_get_page+0xad/0x270
Dec 15 19:02:18 teela kernel:  [] alloc_extent_buffer+0x116/0x3e0
Dec 15 19:02:18 teela kernel:  [] 
btrfs_find_create_tree_block+0xe/0x10
Dec 15 19:02:18 teela kernel:  [] btrfs_alloc_tree_block+0x1ef/0x5f0
Dec 15 19:02:18 teela kernel:  [] __btrfs_cow_block+0x143/0x5f0
Dec 15 19:02:18 teela kernel:  [] btrfs_cow_block+0x13a/0x220
Dec 15 19:02:18 teela kernel:  [] btrfs_search_slot+0x1d1/0x870
Dec 15 19:02:18 teela kernel:  [] btrfs_lookup_file_extent+0x4d/0x60
Dec 15 19:02:18 teela kernel:  [] __btrfs_drop_extents+0x176/0x1070
Dec 15 19:02:18 teela kernel:  [] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel:  [] ? start_transaction+0x65/0x4b0
Dec 15 19:02:18 teela kernel:  [] ? __kmalloc+0x147/0x1e0
Dec 15 19:02:18 teela kernel:  [] cow_file_range_inline+0x215/0x6b0
Dec 15 19:02:18 teela kernel:  [] cow_file_range.isra.49+0x55c/0x6d0
Dec 15 19:02:18 teela kernel:  [] ? lock_extent_bits+0x75/0x1e0
Dec 15 19:02:18 teela kernel:  [] run_delalloc_range+0x441/0x470
Dec 15 19:02:18 teela kernel:  [] 
writepage_delalloc.isra.47+0x144/0x1e0
Dec 15 19:02:18 teela kernel:  [] __extent_writepage+0xd8/0x2b0
Dec 15 19:02:18 teela kernel:  [] extent_writepages+0x25c/0x380
Dec 15 19:02:18 teela kernel:  [] ? btrfs_real_readdir+0x610/0x610
Dec 15 19:02:18 teela kernel:  [] btrfs_writepages+0x1f/0x30
Dec 15 19:02:18 teela kernel:  [] do_writepages+0x15/0x40
Dec 15 19:02:18 teela kernel:  [] __writeback_single_inode+0x35/0x2f0
Dec 15 19:02:18 teela kernel:  [] writeback_sb_inodes+0x16e/0x340
Dec 15 19:02:18 teela kernel:  [] wb_writeback+0xaa/0x280
Dec 15 19:02:18 teela kernel:  [] wb_workfn+0xd8/0x3e0
Dec 15 19:02:18 teela kernel:  [] process_one_work+0x114/0x3e0
Dec 15 19:02:18 teela kernel:  [] worker_thread+0x2f/0x4b0
Dec 15 19:02:18 teela kernel:  [] ? create_worker+0x180/0x180
Dec 15 19:02:18 teela kernel:  [] kthread+0x97/0xb0
Dec 15 19:02:18 teela kernel:  [] ? __kthread_parkme+0x60/0x60
Dec 15 19:02:18 teela kernel:  [] ret_from_fork+0x1b/0x28
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
   active_file:274324 inactive_file:281962 isolated_file:0


OK, so there is still some anonymous memory that could be swapped out
and quite a lot of page cache. This might be harder to reclaim because
the allocation is a GFP_NOFS request, which is limited in its reclaim
capabilities. It might be that those pagecache pages are pinned in some
way by the filesystem.


   unevictable:0 dirty:649 writeback:0 unstable:0
   slab_reclaimable:40662 slab_unreclaimable:17754
   mapped:7382 shmem:202 pagetables:351 bounce:0
   free:206736 free_pcp:332 free_cma:0
Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB 
active_file:1097296kB inactive_file:1127848kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB 
shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB 
writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB 
active_anon:0kB inactive_anon:0kB active_file:7316kB 

Re: OOM: Better, but still there on

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 08:39:41, Michal Hocko wrote:
[...]
> That being said, the OOM killer invocation is clearly pointless and
> premature. We normally do not invoke it for GFP_NOFS requests exactly
> for these reasons. But this is GFP_NOFS|__GFP_NOFAIL, which behaves
> differently. I am about to change that, but my last attempt [1] has to
> be rethought.
> 
> Now, another thing is that the __GFP_NOFAIL which has this nasty side
> effect was introduced by me in d1b5c5671d01 ("btrfs: Prevent from
> early transaction abort") in 4.3, so I am quite surprised that this has
> shown up only in 4.8. Anyway, there might be some other changes in
> btrfs which could make it more subtle.
> 
> I believe the right way to go about this is to pursue what I've started
> in [1]. I will try to prepare something for testing today for you. Stay
> tuned. But I would be really happy if somebody from the btrfs camp could
> check the NOFS aspect of this allocation. We have already seen
> allocation stalls from this path quite recently.

Could you try to run with the two following patches?
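
As to the gating described above: paraphrasing the 4.8-era
out_of_memory() in mm/oom_kill.c (not the literal code), the bailout
that normally spares GFP_NOFS requests is explicitly overridden by
__GFP_NOFAIL:

	/*
	 * Paraphrase of 4.8-era mm/oom_kill.c, out_of_memory(): requests
	 * without __GFP_FS normally return here without killing anything,
	 * but __GFP_NOFAIL disables the bailout - which is how a
	 * GFP_NOFS|__GFP_NOFAIL allocation reaches the OOM killer.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
		return true;	/* claim progress, do not kill */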


Re: OOM: Better, but still there on 4.9

2016-12-15 Thread Michal Hocko
[CC linux-mm and btrfs guys]

On Thu 15-12-16 23:57:04, Nils Holland wrote:
[...]
> Of course, none of these workloads are new or special in any
> way - prior to 4.8, I never experienced any issues doing the exact
> same things.
> 
> Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: 
> gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> oom_score_adj=0
> Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
> Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 
> 4.9.0-gentoo #2
> Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 
> Notebook PC/21F7, BIOS F.22 08/06/2014
> Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
> Dec 15 19:02:18 teela kernel:  eff0b604 c142bcce eff0b734  eff0b634 
> c1163332  0292
> Dec 15 19:02:18 teela kernel:  eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 
> e7fa2900 c1b58785 eff0b734
> Dec 15 19:02:18 teela kernel:  eff0b678 c110795f c1043895 eff0b664 c11075c7 
> 0007  
> Dec 15 19:02:18 teela kernel: Call Trace:
> Dec 15 19:02:18 teela kernel:  [] dump_stack+0x47/0x69
> Dec 15 19:02:18 teela kernel:  [] dump_header+0x60/0x178
> Dec 15 19:02:18 teela kernel:  [] ? ___ratelimit+0x86/0xe0
> Dec 15 19:02:18 teela kernel:  [] oom_kill_process+0x20f/0x3d0
> Dec 15 19:02:18 teela kernel:  [] ? has_capability_noaudit+0x15/0x20
> Dec 15 19:02:18 teela kernel:  [] ? oom_badness.part.13+0xb7/0x130
> Dec 15 19:02:18 teela kernel:  [] out_of_memory+0xd9/0x260
> Dec 15 19:02:18 teela kernel:  [] __alloc_pages_nodemask+0xbfb/0xc80
> Dec 15 19:02:18 teela kernel:  [] pagecache_get_page+0xad/0x270
> Dec 15 19:02:18 teela kernel:  [] alloc_extent_buffer+0x116/0x3e0
> Dec 15 19:02:18 teela kernel:  [] 
> btrfs_find_create_tree_block+0xe/0x10
> Dec 15 19:02:18 teela kernel:  [] btrfs_alloc_tree_block+0x1ef/0x5f0
> Dec 15 19:02:18 teela kernel:  [] __btrfs_cow_block+0x143/0x5f0
> Dec 15 19:02:18 teela kernel:  [] btrfs_cow_block+0x13a/0x220
> Dec 15 19:02:18 teela kernel:  [] btrfs_search_slot+0x1d1/0x870
> Dec 15 19:02:18 teela kernel:  [] btrfs_lookup_file_extent+0x4d/0x60
> Dec 15 19:02:18 teela kernel:  [] __btrfs_drop_extents+0x176/0x1070
> Dec 15 19:02:18 teela kernel:  [] ? kmem_cache_alloc+0xb7/0x190
> Dec 15 19:02:18 teela kernel:  [] ? start_transaction+0x65/0x4b0
> Dec 15 19:02:18 teela kernel:  [] ? __kmalloc+0x147/0x1e0
> Dec 15 19:02:18 teela kernel:  [] cow_file_range_inline+0x215/0x6b0
> Dec 15 19:02:18 teela kernel:  [] cow_file_range.isra.49+0x55c/0x6d0
> Dec 15 19:02:18 teela kernel:  [] ? lock_extent_bits+0x75/0x1e0
> Dec 15 19:02:18 teela kernel:  [] run_delalloc_range+0x441/0x470
> Dec 15 19:02:18 teela kernel:  [] 
> writepage_delalloc.isra.47+0x144/0x1e0
> Dec 15 19:02:18 teela kernel:  [] __extent_writepage+0xd8/0x2b0
> Dec 15 19:02:18 teela kernel:  [] extent_writepages+0x25c/0x380
> Dec 15 19:02:18 teela kernel:  [] ? btrfs_real_readdir+0x610/0x610
> Dec 15 19:02:18 teela kernel:  [] btrfs_writepages+0x1f/0x30
> Dec 15 19:02:18 teela kernel:  [] do_writepages+0x15/0x40
> Dec 15 19:02:18 teela kernel:  [] 
> __writeback_single_inode+0x35/0x2f0
> Dec 15 19:02:18 teela kernel:  [] writeback_sb_inodes+0x16e/0x340
> Dec 15 19:02:18 teela kernel:  [] wb_writeback+0xaa/0x280
> Dec 15 19:02:18 teela kernel:  [] wb_workfn+0xd8/0x3e0
> Dec 15 19:02:18 teela kernel:  [] process_one_work+0x114/0x3e0
> Dec 15 19:02:18 teela kernel:  [] worker_thread+0x2f/0x4b0
> Dec 15 19:02:18 teela kernel:  [] ? create_worker+0x180/0x180
> Dec 15 19:02:18 teela kernel:  [] kthread+0x97/0xb0
> Dec 15 19:02:18 teela kernel:  [] ? __kthread_parkme+0x60/0x60
> Dec 15 19:02:18 teela kernel:  [] ret_from_fork+0x1b/0x28
> Dec 15 19:02:18 teela kernel: Mem-Info:
> Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
>    active_file:274324 inactive_file:281962 isolated_file:0

OK, so there is still some anonymous memory that could be swapped out
and quite a lot of page cache. This might be harder to reclaim because
the allocation is a GFP_NOFS request, which is limited in its reclaim
capabilities. It might be that those pagecache pages are pinned in some
way by the filesystem.

>unevictable:0 dirty:649 writeback:0 unstable:0
>slab_reclaimable:40662 slab_unreclaimable:17754
>mapped:7382 shmem:202 pagetables:351 bounce:0
>free:206736 free_pcp:332 free_cma:0
> Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB 
> active_file:1097296kB inactive_file:1127848kB unevictable:0kB 
> isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB 
> writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 
> 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
> Dec 15 19:02:18 teela 
