Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-10 Thread Michal Hocko
On Wed 10-09-14 09:57:56, Dave Hansen wrote:
> On 09/10/2014 09:29 AM, Michal Hocko wrote:
> > I do not have a bigger machine to play with unfortunately. I think the
> > patch makes sense on its own. I would really appreciate if you could
> > give it a try on your machine with !root memcg case to see how much it
> > helped. I would expect similar results to your previous testing without
> > the revert and Johannes' patch.
> 
> So you want to see before/after this patch:
> 
> Subject: [PATCH] mm, memcg: Do not kill release batching in
>  free_pages_and_swap_cache
> 
> And you want it on top of a kernel with the revert or without?

Revert doesn't make any difference if you run the load inside a memcg
(without any limit set).
So just before and after the patch would be sufficient.

Thanks a lot Dave!
-- 
Michal Hocko
SUSE Labs


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-10 Thread Dave Hansen
On 09/10/2014 09:29 AM, Michal Hocko wrote:
> I do not have a bigger machine to play with unfortunately. I think the
> patch makes sense on its own. I would really appreciate if you could
> give it a try on your machine with !root memcg case to see how much it
> helped. I would expect similar results to your previous testing without
> the revert and Johannes' patch.

So you want to see before/after this patch:

Subject: [PATCH] mm, memcg: Do not kill release batching in
 free_pages_and_swap_cache

And you want it on top of a kernel with the revert or without?


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-10 Thread Michal Hocko
On Fri 05-09-14 11:25:37, Michal Hocko wrote:
> On Thu 04-09-14 13:27:26, Dave Hansen wrote:
> > On 09/04/2014 07:27 AM, Michal Hocko wrote:
> > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> > > because it reduces it to PAGEVEC_SIZE batches.
> > > 
> > > I think we really do not need PAGEVEC_SIZE batching anymore. We are
> > > already batching on tlb_gather layer. That one is limited so I think
> > > the below should be safe but I have to think about this some more. There
> > > is a risk of prolonged lru_lock wait times but the number of pages is
> > > limited to 10k and the heavy work is done outside of the lock. If this
> > > is really a problem then we can tear LRU part and the actual
> > > freeing/uncharging into a separate functions in this path.
> > > 
> > > Could you test with this half baked patch, please? I didn't get to test
> > > it myself unfortunately.
> > 
> > 3.16 settled out at about 11.5M faults/sec before the regression.  This
> > patch gets it back up to about 10.5M, which is good.
> 
> Dave, would you be willing to test the following patch as well? I do not
> have a huge machine at hand right now. It would be great if you could

I was playing with a 48 CPU machine with 32G of RAM, but the res_counter
lock didn't show up much in the traces (this was with 96 processes doing
mmap of a 256M private file, fault and unmap in parallel):
  |--0.75%-- __res_counter_charge
  |  res_counter_charge
  |  try_charge
  |  mem_cgroup_try_charge
  |  |  
  |  |--81.56%-- do_cow_fault
  |  |  handle_mm_fault
  |  |  __do_page_fault
  |  |  do_page_fault
  |  |  page_fault
[...]
  |  |  
  |   --18.44%-- __add_to_page_cache_locked
  | add_to_page_cache_lru
  | mpage_readpages
  | ext4_readpages
  | __do_page_cache_readahead
  | ondemand_readahead
  | page_cache_async_readahead
  | filemap_fault
  | __do_fault
  | do_cow_fault
  | handle_mm_fault
  | __do_page_fault
  | do_page_fault
  | page_fault

Nothing really changed in that regard when I reduced the mmap size to 128M
and ran with 4*CPUs.
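
For reference, a minimal user-space sketch of this kind of load (not the
actual test harness; the file path, sizes and iteration count are made up)
looks roughly like this:

/* Toy page-fault load in the spirit of the test described above: each
 * child mmaps a private file mapping, writes to every page to force
 * COW faults, and unmaps it again.  Hypothetical stand-in only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 96                  /* processes running in parallel */
#define MAP_SIZE (256UL << 20)    /* 256M, as in the run above */
#define ITERS 100

static void worker(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		_exit(1);
	}
	for (int it = 0; it < ITERS; it++) {
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			_exit(1);
		}
		for (size_t off = 0; off < MAP_SIZE; off += 4096)
			p[off] = 1;      /* do_cow_fault on every page */
		munmap(p, MAP_SIZE);     /* unmap -> release_pages/uncharge */
	}
	close(fd);
	_exit(0);
}

int main(int argc, char **argv)
{
	/* Expects a >= 256M scratch file as the only argument. */
	const char *path = argc > 1 ? argv[1] : "/tmp/scratchfile";

	for (int i = 0; i < NPROC; i++)
		if (fork() == 0)
			worker(path);
	while (wait(NULL) > 0)
		;
	return 0;
}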

I do not have a bigger machine to play with unfortunately. I think the
patch makes sense on its own. I would really appreciate if you could
give it a try on your machine with !root memcg case to see how much it
helped. I would expect similar results to your previous testing without
the revert and Johannes' patch.

[...]
-- 
Michal Hocko
SUSE Labs


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-09 Thread Dave Hansen
On 09/09/2014 07:50 AM, Johannes Weiner wrote:
> The mctz->lock is only taken when there is, or has been, soft limit
> excess.  However, the soft limit defaults to infinity, so unless you
> set it explicitly on the root level, I can't see how this could be
> mctz->lock contention.
> 
> It's more plausible that this is the res_counter lock for testing soft
> limit excess - for me, both these locks get inlined into check_events,
> could you please double check you got the right lock?

I got the wrong lock.  Here's how it looks after mainline, plus your 
free_pages_and_swap_cache() patch:

Samples: 2M of event 'cycles', Event count (approx.): 51647128377
+   60.60%  1.33%  page_fault2_processes  [.] testcase
+   59.14%  0.41%  [kernel]   [k] page_fault
+   58.72%  0.01%  [kernel]   [k] do_page_fault
+   58.70%  0.08%  [kernel]   [k] __do_page_fault
+   58.50%  0.29%  [kernel]   [k] handle_mm_fault
+   40.14%  0.28%  [kernel]   [k] do_cow_fault
-   34.56%  34.56%  [kernel]   [k] _raw_spin_lock
   - _raw_spin_lock
  - 78.11% __res_counter_charge
   res_counter_charge
   try_charge
 - mem_cgroup_try_charge
+ 99.99% do_cow_fault
  - 10.30% res_counter_uncharge_until
   res_counter_uncharge
   uncharge_batch
   uncharge_list
   mem_cgroup_uncharge_list
   release_pages
  + 4.75% free_pcppages_bulk
  + 3.65% do_cow_fault
  + 2.24% get_page_from_freelist

> You also said that this cost hasn't been there before, but I do see
> that trace in both v3.16 and v3.17-rc3 with roughly the same impact
> (although my machines show less contention than yours).  Could you
> please double check that this is in fact a regression independent of
> 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")?

Here's the same workload on the same machine with only Johannes' revert applied:

-   35.92%  35.92%  [kernel]   [k] _raw_spin_lock
   - _raw_spin_lock
  - 49.09% get_page_from_freelist
 - __alloc_pages_nodemask
+ 99.90% alloc_pages_vma
  - 43.67% free_pcppages_bulk
 - 100.00% free_hot_cold_page
+ 99.93% free_hot_cold_page_list
  - 7.08% do_cow_fault
   handle_mm_fault
   __do_page_fault
   do_page_fault
   page_fault
   testcase

So I think it's probably part of the same regression.


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-09 Thread Johannes Weiner
On Mon, Sep 08, 2014 at 08:47:37AM -0700, Dave Hansen wrote:
> On 09/05/2014 05:35 AM, Johannes Weiner wrote:
> > On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote:
> >> On 09/04/2014 07:27 AM, Michal Hocko wrote:
> >>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> >>> because it reduces it to PAGEVEC_SIZE batches.
> >>>
> >>> I think we really do not need PAGEVEC_SIZE batching anymore. We are
> >>> already batching on tlb_gather layer. That one is limited so I think
> >>> the below should be safe but I have to think about this some more. There
> >>> is a risk of prolonged lru_lock wait times but the number of pages is
> >>> limited to 10k and the heavy work is done outside of the lock. If this
> >>> is really a problem then we can tear LRU part and the actual
> >>> freeing/uncharging into a separate functions in this path.
> >>>
> >>> Could you test with this half baked patch, please? I didn't get to test
> >>> it myself unfortunately.
> >>
> >> 3.16 settled out at about 11.5M faults/sec before the regression.  This
> >> patch gets it back up to about 10.5M, which is good.  The top spinlock
> >> contention in the kernel is still from the resource counter code via
> >> mem_cgroup_commit_charge(), though.
> > 
> > Thanks for testing, that looks a lot better.
> > 
> > But commit doesn't touch resource counters - did you mean try_charge()
> > or uncharge() by any chance?
> 
> I don't have the perf output that I was looking at when I said this, but
> here's the path that I think I was referring to.  The inlining makes
> this non-obvious, but this memcg_check_events() calls
> mem_cgroup_update_tree() which is contending on mctz->lock.
> 
> So, you were right, it's not the resource counters code, it's a lock in
> 'struct mem_cgroup_tree_per_zone'.  But, the contention isn't _that_
> high (2% of CPU) in this case.  But, that is 2% that we didn't see before.
> 
> >  1.87% 1.87%  [kernel]   [k] _raw_spin_lock_irqsave
> >|
> >--- _raw_spin_lock_irqsave
> >   |
> >   |--107.09%-- memcg_check_events
> >   |  |
> >   |  |--79.98%-- mem_cgroup_commit_charge
> >   |  |  |
> >   |  |  |--99.81%-- do_cow_fault
> >   |  |  |  handle_mm_fault
> >   |  |  |  __do_page_fault
> >   |  |  |  do_page_fault
> >   |  |  |  page_fault
> >   |  |  |  testcase
> >   |  |   --0.19%-- [...]

The mctz->lock is only taken when there is, or has been, soft limit
excess.  However, the soft limit defaults to infinity, so unless you
set it explicitly on the root level, I can't see how this could be
mctz->lock contention.

It's more plausible that this is the res_counter lock for testing soft
limit excess - for me, both these locks get inlined into check_events,
could you please double check you got the right lock?

As the limit defaults to infinity and really doesn't mean anything on
the root level, there is no point in testing it there, and we can easily
eliminate that.
With the patch below, I don't have that trace show up in the profile
anymore.  Could you please give it a try?

You also said that this cost hasn't been there before, but I do see
that trace in both v3.16 and v3.17-rc3 with roughly the same impact
(although my machines show less contention than yours).  Could you
please double check that this is in fact a regression independent of
05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")?

Thanks!

---
From 465c5caa0628d640c2493e9d849dc9a1f0b373a4 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Tue, 9 Sep 2014 09:25:20 -0400
Subject: [patch] mm: memcontrol: do not track soft limit excess on the root
 level

Dave encounters res_counter lock contention from memcg_check_events()
when running a multi-threaded page fault benchmark in the root group.

This lock is taken to maintain the tree of soft limit excessors, which
is used by global reclaim to prioritize excess groups.  But that makes
no sense on the root level, which is the parent of all other groups, so
all this overhead is unnecessary.  Skip it.

[ The soft limit really shouldn't even be settable on the root level,
  but it's been like that forever, so don't risk breaking dopy user
  space over this now. ]

Reported-by: Dave Hansen 
Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
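
The hunks themselves are cut off in this archive. As a rough, self-contained
illustration of the idea (toy stand-in types and names, not the real kernel
symbols or the actual diff), skipping the soft-limit tree bookkeeping for the
root group amounts to an early bail-out:

/* Illustration only: made-up types, not the kernel's.  The point is the
 * early return for the root group, so its soft-limit excess is never
 * tested and the tree lock is never taken for it. */
#include <stdbool.h>
#include <stdio.h>

struct toy_memcg {
	bool is_root;
	unsigned long usage;        /* pages charged */
	unsigned long soft_limit;   /* defaults to "infinity" */
};

static void update_soft_limit_tree(struct toy_memcg *memcg)
{
	/* In the kernel this is where mctz->lock is taken and the group
	 * is reinserted into the per-zone soft limit excess tree. */
	printf("tree update: usage=%lu soft_limit=%lu\n",
	       memcg->usage, memcg->soft_limit);
}

static void toy_check_events(struct toy_memcg *memcg)
{
	if (memcg->is_root)
		return;             /* the patched behaviour: skip root */
	if (memcg->usage > memcg->soft_limit)
		update_soft_limit_tree(memcg);
}

int main(void)
{
	struct toy_memcg root  = { .is_root = true,  .usage = 1000, .soft_limit = 10 };
	struct toy_memcg child = { .is_root = false, .usage = 1000, .soft_limit = 10 };

	toy_check_events(&root);    /* no lock, no tree update */
	toy_check_events(&child);   /* updates the tree */
	return 0;
}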


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-08 Thread Dave Hansen
On 09/05/2014 05:35 AM, Johannes Weiner wrote:
> On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote:
>> On 09/04/2014 07:27 AM, Michal Hocko wrote:
>>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
>>> because it reduces it to PAGEVEC_SIZE batches.
>>>
>>> I think we really do not need PAGEVEC_SIZE batching anymore. We are
>>> already batching on tlb_gather layer. That one is limited so I think
>>> the below should be safe but I have to think about this some more. There
>>> is a risk of prolonged lru_lock wait times but the number of pages is
>>> limited to 10k and the heavy work is done outside of the lock. If this
>>> is really a problem then we can tear LRU part and the actual
>>> freeing/uncharging into a separate functions in this path.
>>>
>>> Could you test with this half baked patch, please? I didn't get to test
>>> it myself unfortunately.
>>
>> 3.16 settled out at about 11.5M faults/sec before the regression.  This
>> patch gets it back up to about 10.5M, which is good.  The top spinlock
>> contention in the kernel is still from the resource counter code via
>> mem_cgroup_commit_charge(), though.
> 
> Thanks for testing, that looks a lot better.
> 
> But commit doesn't touch resource counters - did you mean try_charge()
> or uncharge() by any chance?

I don't have the perf output that I was looking at when I said this, but
here's the path that I think I was referring to.  The inlining makes
this non-obvious, but this memcg_check_events() calls
mem_cgroup_update_tree() which is contending on mctz->lock.

So, you were right, it's not the resource counters code, it's a lock in
'struct mem_cgroup_tree_per_zone'.  But, the contention isn't _that_
high (2% of CPU) in this case.  But, that is 2% that we didn't see before.

>  1.87% 1.87%  [kernel]   [k] _raw_spin_lock_irqsave
>|
>--- _raw_spin_lock_irqsave
>   |
>   |--107.09%-- memcg_check_events
>   |  |
>   |  |--79.98%-- mem_cgroup_commit_charge
>   |  |  |
>   |  |  |--99.81%-- do_cow_fault
>   |  |  |  handle_mm_fault
>   |  |  |  __do_page_fault
>   |  |  |  do_page_fault
>   |  |  |  page_fault
>   |  |  |  testcase
>   |  |   --0.19%-- [...]




Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Fri 05-09-14 10:47:23, Johannes Weiner wrote:
> On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote:
> > @@ -900,10 +900,10 @@ void lru_add_drain_all(void)
> >   * grabbed the page via the LRU.  If it did, give up: 
> > shrink_inactive_list()
> >   * will free it.
> >   */
> > -void release_pages(struct page **pages, int nr, bool cold)
> > +static void release_lru_pages(struct page **pages, int nr,
> > + struct list_head *pages_to_free)
> >  {
> > int i;
> > -   LIST_HEAD(pages_to_free);
> > struct zone *zone = NULL;
> > struct lruvec *lruvec;
> > unsigned long uninitialized_var(flags);
> > @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold)
> > /* Clear Active bit in case of parallel mark_page_accessed */
> > __ClearPageActive(page);
> >  
> > -   list_add(&page->lru, &pages_to_free);
> > +   list_add(&page->lru, pages_to_free);
> > }
> > if (zone)
> > spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +}
> > +/*
> > + * Batched page_cache_release(). Frees and uncharges all given pages
> > + * for which the reference count drops to 0.
> > + */
> > +void release_pages(struct page **pages, int nr, bool cold)
> > +{
> > +   LIST_HEAD(pages_to_free);
> >  
> > +   while (nr) {
> > +   int batch = min(nr, PAGEVEC_SIZE);
> > +
> > +   release_lru_pages(pages, batch, &pages_to_free);
> > +   pages += batch;
> > +   nr -= batch;
> > +   }
> 
> We might be able to process a lot more pages in one go if nobody else
> needs the lock or the CPU.  Can't we just cycle the lock or reschedule
> if necessary?

Is it safe to cond_resched here for all callers? I hope it is but there
are way too many callers to check so I am not 100% sure.

Besides that, spin_needbreak doesn't seem to be available for all architectures.
git grep "arch_spin_is_contended(" -- arch/
arch/arm/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/arm64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/ia64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/mips/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/x86/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)

Moreover it doesn't seem to do anything for !CONFIG_PREEMPT but this
should be trivial to fix.

I am also not sure this will work well in all cases. If there is heavy
reclaim activity on other CPUs then this path might be interrupted too
often, resulting in too much lock bouncing. So I guess we want at least a
few pages to be processed in one run. On the other hand, if the lock is
not contended then doing batches and retaking the lock shouldn't add too
much overhead, no?

> diff --git a/mm/swap.c b/mm/swap.c
> index 6b2dc3897cd5..ee0cf21dd521 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold)
>   __ClearPageActive(page);
>  
>   list_add(&page->lru, &pages_to_free);
> +
> + if (should_resched() ||
> + (zone && spin_needbreak(&zone->lru_lock))) {
> + if (zone) {
> + spin_unlock_irqrestore(&zone->lru_lock, flags);
> + zone = NULL;
> + }
> + cond_resched();
> + }
>   }
>   if (zone)
>   spin_unlock_irqrestore(&zone->lru_lock, flags);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3e0ec83d000c..c487ca4682a4 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page)
>   */
>  void free_pages_and_swap_cache(struct page **pages, int nr)
>  {
> - struct page **pagep = pages;
> + int i;
>  
>   lru_add_drain();
> - while (nr) {
> - int todo = min(nr, PAGEVEC_SIZE);
> - int i;
> -
> - for (i = 0; i < todo; i++)
> - free_swap_cache(pagep[i]);
> - release_pages(pagep, todo, false);
> - pagep += todo;
> - nr -= todo;
> - }
> + for (i = 0; i < nr; i++)
> + free_swap_cache(pages[i]);
> + release_pages(pages, nr, false);
>  }
>  
>  /*

-- 
Michal Hocko
SUSE Labs


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Johannes Weiner
On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote:
> @@ -900,10 +900,10 @@ void lru_add_drain_all(void)
>   * grabbed the page via the LRU.  If it did, give up: shrink_inactive_list()
>   * will free it.
>   */
> -void release_pages(struct page **pages, int nr, bool cold)
> +static void release_lru_pages(struct page **pages, int nr,
> +   struct list_head *pages_to_free)
>  {
>   int i;
> - LIST_HEAD(pages_to_free);
>   struct zone *zone = NULL;
>   struct lruvec *lruvec;
>   unsigned long uninitialized_var(flags);
> @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold)
>   /* Clear Active bit in case of parallel mark_page_accessed */
>   __ClearPageActive(page);
>  
> - list_add(&page->lru, &pages_to_free);
> + list_add(&page->lru, pages_to_free);
>   }
>   if (zone)
>   spin_unlock_irqrestore(&zone->lru_lock, flags);
> +}
> +/*
> + * Batched page_cache_release(). Frees and uncharges all given pages
> + * for which the reference count drops to 0.
> + */
> +void release_pages(struct page **pages, int nr, bool cold)
> +{
> + LIST_HEAD(pages_to_free);
>  
> + while (nr) {
> + int batch = min(nr, PAGEVEC_SIZE);
> +
> + release_lru_pages(pages, batch, &pages_to_free);
> + pages += batch;
> + nr -= batch;
> + }

We might be able to process a lot more pages in one go if nobody else
needs the lock or the CPU.  Can't we just cycle the lock or reschedule
if necessary?

diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..ee0cf21dd521 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold)
__ClearPageActive(page);
 
list_add(&page->lru, &pages_to_free);
+
+   if (should_resched() ||
+   (zone && spin_needbreak(&zone->lru_lock))) {
+   if (zone) {
+   spin_unlock_irqrestore(&zone->lru_lock, flags);
+   zone = NULL;
+   }
+   cond_resched();
+   }
}
if (zone)
spin_unlock_irqrestore(&zone->lru_lock, flags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3e0ec83d000c..c487ca4682a4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page)
  */
 void free_pages_and_swap_cache(struct page **pages, int nr)
 {
-   struct page **pagep = pages;
+   int i;
 
lru_add_drain();
-   while (nr) {
-   int todo = min(nr, PAGEVEC_SIZE);
-   int i;
-
-   for (i = 0; i < todo; i++)
-   free_swap_cache(pagep[i]);
-   release_pages(pagep, todo, false);
-   pagep += todo;
-   nr -= todo;
-   }
+   for (i = 0; i < nr; i++)
+   free_swap_cache(pages[i]);
+   release_pages(pages, nr, false);
 }
 
 /*


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Johannes Weiner
On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote:
> On 09/04/2014 07:27 AM, Michal Hocko wrote:
> > Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> > because it reduces it to PAGEVEC_SIZE batches.
> > 
> > I think we really do not need PAGEVEC_SIZE batching anymore. We are
> > already batching on tlb_gather layer. That one is limited so I think
> > the below should be safe but I have to think about this some more. There
> > is a risk of prolonged lru_lock wait times but the number of pages is
> > limited to 10k and the heavy work is done outside of the lock. If this
> > is really a problem then we can tear LRU part and the actual
> > freeing/uncharging into a separate functions in this path.
> > 
> > Could you test with this half baked patch, please? I didn't get to test
> > it myself unfortunately.
> 
> 3.16 settled out at about 11.5M faults/sec before the regression.  This
> patch gets it back up to about 10.5M, which is good.  The top spinlock
> contention in the kernel is still from the resource counter code via
> mem_cgroup_commit_charge(), though.

Thanks for testing, that looks a lot better.

But commit doesn't touch resource counters - did you mean try_charge()
or uncharge() by any chance?



Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Thu 04-09-14 15:53:46, Dave Hansen wrote:
> On 09/04/2014 01:27 PM, Dave Hansen wrote:
> > On 09/04/2014 07:27 AM, Michal Hocko wrote:
> >> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> >> because it reduces it to PAGEVEC_SIZE batches.
> >>
> >> I think we really do not need PAGEVEC_SIZE batching anymore. We are
> >> already batching on tlb_gather layer. That one is limited so I think
> >> the below should be safe but I have to think about this some more. There
> >> is a risk of prolonged lru_lock wait times but the number of pages is
> >> limited to 10k and the heavy work is done outside of the lock. If this
> >> is really a problem then we can tear LRU part and the actual
> >> freeing/uncharging into a separate functions in this path.
> >>
> >> Could you test with this half baked patch, please? I didn't get to test
> >> it myself unfortunately.
> > 
> > 3.16 settled out at about 11.5M faults/sec before the regression.  This
> > patch gets it back up to about 10.5M, which is good.  The top spinlock
> > contention in the kernel is still from the resource counter code via
> > mem_cgroup_commit_charge(), though.
> > 
> > I'm running Johannes' patch now.
> 
> This looks pretty good.  The area where it plateaus (above 80 threads
> where hyperthreading kicks in) might be a bit slower than it was in
> 3.16, but that could easily be from other things.

Good news indeed. But I think it would be safer to apply Johannes'
revert for now. Both changes are still worth having anyway because they
have the potential to improve the memcg case.

> > https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f
> 
> Feel free to add my Tested-by:

Thanks a lot! I have posted another patch which restricts the PAGEVEC_SIZE
batching to the LRU handling, because dropping the batching completely would
be too risky. So I haven't added your Tested-by yet.
 

-- 
Michal Hocko
SUSE Labs


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Thu 04-09-14 13:27:26, Dave Hansen wrote:
> On 09/04/2014 07:27 AM, Michal Hocko wrote:
> > Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> > because it reduces it to PAGEVEC_SIZE batches.
> > 
> > I think we really do not need PAGEVEC_SIZE batching anymore. We are
> > already batching on tlb_gather layer. That one is limited so I think
> > the below should be safe but I have to think about this some more. There
> > is a risk of prolonged lru_lock wait times but the number of pages is
> > limited to 10k and the heavy work is done outside of the lock. If this
> > is really a problem then we can tear LRU part and the actual
> > freeing/uncharging into a separate functions in this path.
> > 
> > Could you test with this half baked patch, please? I didn't get to test
> > it myself unfortunately.
> 
> 3.16 settled out at about 11.5M faults/sec before the regression.  This
> patch gets it back up to about 10.5M, which is good.

Dave, would you be willing to test the following patch as well? I do not
have a huge machine at hand right now. It would be great if you could
run the same load within a !root memcg. We have basically the same
sub-optimality there as well. The root bypass will be re-introduced for
now, but I think we can make the regular memcg load better regardless, and
this would also be preparation for removing the root bypass again later.
---
From 17c4c8710f3ec37fe04866bd18dbc68a0f47b560 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 5 Sep 2014 11:16:17 +0200
Subject: [PATCH] mm, memcg: Do not kill release batching in
 free_pages_and_swap_cache

free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills the memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine where the lock was basically dominating on all
CPUs:
80.18%  80.18%  [kernel]   [k] _raw_spin_lock
  |
  --- _raw_spin_lock
 |
 |--66.59%-- res_counter_uncharge_until
 |  res_counter_uncharge
 |  uncharge_batch
 |  uncharge_list
 |  mem_cgroup_uncharge_list
 |  release_pages
 |  free_pages_and_swap_cache
 |  tlb_flush_mmu_free
 |  |
 |  |--90.12%-- unmap_single_vma
 |  |  unmap_vmas
 |  |  unmap_region
 |  |  do_munmap
 |  |  vm_munmap
 |  |  sys_munmap
 |  |  system_call_fastpath
 |  |  __GI___munmap
 |  |
 |   --9.88%-- tlb_flush_mmu
 | tlb_finish_mmu
 | unmap_region
 | do_munmap
 | vm_munmap
 | sys_munmap
 | system_call_fastpath
 | __GI___munmap

In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression.
But it makes sense to help loads which are running within a memcg
as well. So this is basically a performance optimization.

There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock hold times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.

Page reference count and LRU handling are moved to release_lru_pages,
which is run in PAGEVEC_SIZE batches.

Reported-by: Dave Hansen 
Signed-off-by: Michal Hocko 
---
 mm/swap.c   | 27 +--
 mm/swap_state.c | 14 --
 2 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..8af99dd68dd2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -888,9 +888,9 @@ void lru_add_drain_all(void)
 }
 
 /*
- * Batched page_cache_release().  Decrement the reference count on all the
- * passed pages.  If it fell to zero then remove the page from the LRU and
- * free it.
+ * Batched lru release. Decrement the reference count on all the passed pages.
+ * If it fell to zero then remove the page from the LRU and add it to the given
+ * list to be freed by the caller.
  *
  * Avoid taking zone->lru_lock if 
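
The rest of the diff is cut off here. As a self-contained toy model of the
split the changelog describes (invented names, a mutex standing in for
zone->lru_lock and plain ints standing in for pages, not the actual patch),
the idea looks like this:

/* Take the "lru lock" only in PAGEVEC_SIZE chunks, but collect all pages
 * onto one list so the uncharge/free pass runs once over the whole batch. */
#include <pthread.h>
#include <stdio.h>

#define PAGEVEC_SIZE 14

static pthread_mutex_t toy_lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Phase 1: per-chunk, under the lock (models the LRU removal). */
static void release_lru_pages(int *pages, int nr, int *to_free, int *nfree)
{
	pthread_mutex_lock(&toy_lru_lock);
	for (int i = 0; i < nr; i++)
		to_free[(*nfree)++] = pages[i];
	pthread_mutex_unlock(&toy_lru_lock);
}

/* Phase 2: once for the whole list (models uncharge_list + the actual
 * freeing), so the expensive work is batched and not done per chunk. */
static void uncharge_and_free(int *to_free, int nfree)
{
	printf("uncharging and freeing %d pages (first id %d) in one go\n",
	       nfree, nfree ? to_free[0] : -1);
}

static void toy_release_pages(int *pages, int nr)
{
	int to_free[1024];
	int nfree = 0;

	while (nr) {
		int batch = nr < PAGEVEC_SIZE ? nr : PAGEVEC_SIZE;

		release_lru_pages(pages, batch, to_free, &nfree);
		pages += batch;
		nr -= batch;
	}
	uncharge_and_free(to_free, nfree);
}

int main(void)
{
	int pages[100];

	for (int i = 0; i < 100; i++)
		pages[i] = i;
	toy_release_pages(pages, 100);
	return 0;
}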

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Thu 04-09-14 11:08:46, Johannes Weiner wrote:
[...]
> From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Thu, 4 Sep 2014 10:04:34 -0400
> Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter
> 
> Dave Hansen reports a massive scalability regression in an uncontained
> page fault benchmark with more than 30 concurrent threads, which he
> bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup
> res_counter") and pin-pointed on res_counter spinlock contention.
> 
> That change relied on the per-cpu charge caches to mostly swallow the
> res_counter costs, but it's apparent that the caches don't scale yet.
> 
> Revert memcg back to bypassing res_counters on the root level in order
> to restore performance for uncontained workloads.
> 
> Reported-by: Dave Hansen 
> Signed-off-by: Johannes Weiner 

The revert looks good to me.
Acked-by: Michal Hocko 

> ---
>  mm/memcontrol.c | 103 
> ++--
>  1 file changed, 78 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ec4dcf1b9562..085dc6d2f876 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
>   unsigned long long size;
>   int ret = 0;
>  
> + if (mem_cgroup_is_root(memcg))
> + goto done;
>  retry:
>   if (consume_stock(memcg, nr_pages))
>   goto done;
> @@ -2611,9 +2613,7 @@ nomem:
>   if (!(gfp_mask & __GFP_NOFAIL))
>   return -ENOMEM;
>  bypass:
> - memcg = root_mem_cgroup;
> - ret = -EINTR;
> - goto retry;
> + return -EINTR;
>  
>  done_restock:
>   if (batch > nr_pages)
> @@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, 
> unsigned int nr_pages)
>  {
>   unsigned long bytes = nr_pages * PAGE_SIZE;
>  
> + if (mem_cgroup_is_root(memcg))
> + return;
> +
>   res_counter_uncharge(&memcg->res, bytes);
>   if (do_swap_account)
>   res_counter_uncharge(&memcg->memsw, bytes);
> @@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct 
> mem_cgroup *memcg,
>  {
>   unsigned long bytes = nr_pages * PAGE_SIZE;
>  
> + if (mem_cgroup_is_root(memcg))
> + return;
> +
>   res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
>   if (do_swap_account)
>   res_counter_uncharge_until(&memcg->memsw,
> @@ -4093,6 +4099,46 @@ out:
>   return retval;
>  }
>  
> +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
> +enum mem_cgroup_stat_index idx)
> +{
> + struct mem_cgroup *iter;
> + long val = 0;
> +
> + /* Per-cpu values can be negative, use a signed accumulator */
> + for_each_mem_cgroup_tree(iter, memcg)
> + val += mem_cgroup_read_stat(iter, idx);
> +
> + if (val < 0) /* race ? */
> + val = 0;
> + return val;
> +}
> +
> +static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
> +{
> + u64 val;
> +
> + if (!mem_cgroup_is_root(memcg)) {
> + if (!swap)
> + return res_counter_read_u64(&memcg->res, RES_USAGE);
> + else
> + return res_counter_read_u64(&memcg->memsw, RES_USAGE);
> + }
> +
> + /*
> +  * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
> +  * as well as in MEM_CGROUP_STAT_RSS_HUGE.
> +  */
> + val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
> + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
> +
> + if (swap)
> + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
> +
> + return val << PAGE_SHIFT;
> +}
> +
> +
>  static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  struct cftype *cft)
>  {
> @@ -4102,8 +4148,12 @@ static u64 mem_cgroup_read_u64(struct 
> cgroup_subsys_state *css,
>  
>   switch (type) {
>   case _MEM:
> + if (name == RES_USAGE)
> + return mem_cgroup_usage(memcg, false);
>   return res_counter_read_u64(&memcg->res, name);
>   case _MEMSWAP:
> + if (name == RES_USAGE)
> + return mem_cgroup_usage(memcg, true);
>   return res_counter_read_u64(&memcg->memsw, name);
>   case _KMEM:
>   return res_counter_read_u64(&memcg->kmem, name);
> @@ -4572,10 +4622,7 @@ static void __mem_cgroup_threshold(struct mem_cgroup 
> *memcg, bool swap)
>   if (!t)
>   goto unlock;
>  
> - if (!swap)
> - usage = res_counter_read_u64(&memcg->res, RES_USAGE);
> - else
> - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> + usage = mem_cgroup_usage(memcg, swap);
>  
>   /*
>* current_threshold points to threshold just below or equal to usage.
> @@ -4673,10 +4720,10 @@ static int 

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Thu 04-09-14 11:08:46, Johannes Weiner wrote:
[...]
 From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001
 From: Johannes Weiner han...@cmpxchg.org
 Date: Thu, 4 Sep 2014 10:04:34 -0400
 Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter
 
 Dave Hansen reports a massive scalability regression in an uncontained
 page fault benchmark with more than 30 concurrent threads, which he
 bisected down to 05b843012335 (mm: memcontrol: use root_mem_cgroup
 res_counter) and pin-pointed on res_counter spinlock contention.
 
 That change relied on the per-cpu charge caches to mostly swallow the
 res_counter costs, but it's apparent that the caches don't scale yet.
 
 Revert memcg back to bypassing res_counters on the root level in order
 to restore performance for uncontained workloads.
 
 Reported-by: Dave Hansen d...@sr71.net
 Signed-off-by: Johannes Weiner han...@cmpxchg.org

The revert looks good to me.
Acked-by: Michal Hocko mho...@suse.cz

 ---
  mm/memcontrol.c | 103 
 ++--
  1 file changed, 78 insertions(+), 25 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index ec4dcf1b9562..085dc6d2f876 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
 gfp_mask,
   unsigned long long size;
   int ret = 0;
  
 + if (mem_cgroup_is_root(memcg))
 + goto done;
  retry:
   if (consume_stock(memcg, nr_pages))
   goto done;
 @@ -2611,9 +2613,7 @@ nomem:
   if (!(gfp_mask  __GFP_NOFAIL))
   return -ENOMEM;
  bypass:
 - memcg = root_mem_cgroup;
 - ret = -EINTR;
 - goto retry;
 + return -EINTR;
  
  done_restock:
   if (batch  nr_pages)
 @@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, 
 unsigned int nr_pages)
  {
   unsigned long bytes = nr_pages * PAGE_SIZE;
  
 + if (mem_cgroup_is_root(memcg))
 + return;
 +
   res_counter_uncharge(memcg-res, bytes);
   if (do_swap_account)
   res_counter_uncharge(memcg-memsw, bytes);
 @@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct 
 mem_cgroup *memcg,
  {
   unsigned long bytes = nr_pages * PAGE_SIZE;
  
 + if (mem_cgroup_is_root(memcg))
 + return;
 +
   res_counter_uncharge_until(memcg-res, memcg-res.parent, bytes);
   if (do_swap_account)
   res_counter_uncharge_until(memcg-memsw,
 @@ -4093,6 +4099,46 @@ out:
   return retval;
  }
  
 +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
 +enum mem_cgroup_stat_index idx)
 +{
 + struct mem_cgroup *iter;
 + long val = 0;
 +
 + /* Per-cpu values can be negative, use a signed accumulator */
 + for_each_mem_cgroup_tree(iter, memcg)
 + val += mem_cgroup_read_stat(iter, idx);
 +
 + if (val  0) /* race ? */
 + val = 0;
 + return val;
 +}
 +
 +static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 +{
 + u64 val;
 +
 + if (!mem_cgroup_is_root(memcg)) {
 + if (!swap)
 + return res_counter_read_u64(memcg-res, RES_USAGE);
 + else
 + return res_counter_read_u64(memcg-memsw, RES_USAGE);
 + }
 +
 + /*
 +  * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
 +  * as well as in MEM_CGROUP_STAT_RSS_HUGE.
 +  */
 + val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
 + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
 +
 + if (swap)
 + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
 +
 + return val  PAGE_SHIFT;
 +}
 +
 +
  static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
  struct cftype *cft)
  {
 @@ -4102,8 +4148,12 @@ static u64 mem_cgroup_read_u64(struct 
 cgroup_subsys_state *css,
  
   switch (type) {
   case _MEM:
 + if (name == RES_USAGE)
 + return mem_cgroup_usage(memcg, false);
   return res_counter_read_u64(memcg-res, name);
   case _MEMSWAP:
 + if (name == RES_USAGE)
 + return mem_cgroup_usage(memcg, true);
   return res_counter_read_u64(memcg-memsw, name);
   case _KMEM:
   return res_counter_read_u64(memcg-kmem, name);
 @@ -4572,10 +4622,7 @@ static void __mem_cgroup_threshold(struct mem_cgroup 
 *memcg, bool swap)
   if (!t)
   goto unlock;
  
 - if (!swap)
 - usage = res_counter_read_u64(memcg-res, RES_USAGE);
 - else
 - usage = res_counter_read_u64(memcg-memsw, RES_USAGE);
 + usage = mem_cgroup_usage(memcg, swap);
  
   /*
* current_threshold points to threshold just below or equal to usage.
 @@ -4673,10 +4720,10 @@ static int 

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Thu 04-09-14 13:27:26, Dave Hansen wrote:
 On 09/04/2014 07:27 AM, Michal Hocko wrote:
  Ouch. free_pages_and_swap_cache completely kills the uncharge batching
  because it reduces it to PAGEVEC_SIZE batches.
  
  I think we really do not need PAGEVEC_SIZE batching anymore. We are
  already batching on tlb_gather layer. That one is limited so I think
  the below should be safe but I have to think about this some more. There
  is a risk of prolonged lru_lock wait times but the number of pages is
  limited to 10k and the heavy work is done outside of the lock. If this
  is really a problem then we can tear LRU part and the actual
  freeing/uncharging into a separate functions in this path.
  
  Could you test with this half baked patch, please? I didn't get to test
  it myself unfortunately.
 
 3.16 settled out at about 11.5M faults/sec before the regression.  This
 patch gets it back up to about 10.5M, which is good.

Dave, would you be willing to test the following patch as well? I do not
have a huge machine at hand right now. It would be great if you could
run the same load within a !root memcg. We have basically the same
sub-optimality there as well. The root bypass will be re-introduced for
now but I think we can make the regular memcg load better regardless and
this would be also preparation for later root bypass removal again.
---
From 17c4c8710f3ec37fe04866bd18dbc68a0f47b560 Mon Sep 17 00:00:00 2001
From: Michal Hocko mho...@suse.cz
Date: Fri, 5 Sep 2014 11:16:17 +0200
Subject: [PATCH] mm, memcg: Do not kill release batching in
 free_pages_and_swap_cache

free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine when the lock was basically dominating on all
CPUs:
80.18%80.18%  [kernel]   [k] _raw_spin_lock
  |
  --- _raw_spin_lock
 |
 |--66.59%-- res_counter_uncharge_until
 |  res_counter_uncharge
 |  uncharge_batch
 |  uncharge_list
 |  mem_cgroup_uncharge_list
 |  release_pages
 |  free_pages_and_swap_cache
 |  tlb_flush_mmu_free
 |  |
 |  |--90.12%-- unmap_single_vma
 |  |  unmap_vmas
 |  |  unmap_region
 |  |  do_munmap
 |  |  vm_munmap
 |  |  sys_munmap
 |  |  system_call_fastpath
 |  |  __GI___munmap
 |  |
 |   --9.88%-- tlb_flush_mmu
 | tlb_finish_mmu
 | unmap_region
 | do_munmap
 | vm_munmap
 | sys_munmap
 | system_call_fastpath
 | __GI___munmap

In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 (mm: memcontrol: use
root_mem_cgroup res_counter) because this is a clear regression.
But it makes sense to help loads which are running within a memcg
as well. So this is basically a performance optimization.

There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.

Page reference count and LRU handling is moved to release_lru_pages and
that is run in PAGEVEC_SIZE batches.

Reported-by: Dave Hansen d...@sr71.net
Signed-off-by: Michal Hocko mho...@suse.cz
---
 mm/swap.c   | 27 +--
 mm/swap_state.c | 14 --
 2 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..8af99dd68dd2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -888,9 +888,9 @@ void lru_add_drain_all(void)
 }
 
 /*
- * Batched page_cache_release().  Decrement the reference count on all the
- * passed pages.  If it fell to zero then remove the page from the LRU and
- * free it.
+ * Batched lru release. Decrement the reference count on all the passed pages.
+ * If it fell to zero then remove the page from the LRU and add it to the given
+ * list to be freed by the caller.
  *
  * Avoid taking 

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Thu 04-09-14 15:53:46, Dave Hansen wrote:
 On 09/04/2014 01:27 PM, Dave Hansen wrote:
  On 09/04/2014 07:27 AM, Michal Hocko wrote:
  Ouch. free_pages_and_swap_cache completely kills the uncharge batching
  because it reduces it to PAGEVEC_SIZE batches.
 
  I think we really do not need PAGEVEC_SIZE batching anymore. We are
  already batching on tlb_gather layer. That one is limited so I think
  the below should be safe but I have to think about this some more. There
  is a risk of prolonged lru_lock wait times but the number of pages is
  limited to 10k and the heavy work is done outside of the lock. If this
  is really a problem then we can tear LRU part and the actual
  freeing/uncharging into a separate functions in this path.
 
  Could you test with this half baked patch, please? I didn't get to test
  it myself unfortunately.
  
  3.16 settled out at about 11.5M faults/sec before the regression.  This
  patch gets it back up to about 10.5M, which is good.  The top spinlock
  contention in the kernel is still from the resource counter code via
  mem_cgroup_commit_charge(), though.
  
  I'm running Johannes' patch now.
 
 This looks pretty good.  The area where it plateaus (above 80 threads
 where hyperthreading kicks in) might be a bit slower than it was in
 3.16, but that could easily be from other things.

Good news indeed. But I think it would be safer to apply Johannes'
revert for now. Both changes are still worth having anyway because they
have potential to improve memcg case.

  https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f
 
 Feel free to add my Tested-by:

Thanks a lot! I have posted another patch which reduces the batching for
LRU handling because this would be too risky. So I haven't added your
Tested-by yet.
 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Johannes Weiner
On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote:
 On 09/04/2014 07:27 AM, Michal Hocko wrote:
  Ouch. free_pages_and_swap_cache completely kills the uncharge batching
  because it reduces it to PAGEVEC_SIZE batches.
  
  I think we really do not need PAGEVEC_SIZE batching anymore. We are
  already batching on tlb_gather layer. That one is limited so I think
  the below should be safe but I have to think about this some more. There
  is a risk of prolonged lru_lock wait times but the number of pages is
  limited to 10k and the heavy work is done outside of the lock. If this
  is really a problem then we can tear LRU part and the actual
  freeing/uncharging into a separate functions in this path.
  
  Could you test with this half baked patch, please? I didn't get to test
  it myself unfortunately.
 
 3.16 settled out at about 11.5M faults/sec before the regression.  This
 patch gets it back up to about 10.5M, which is good.  The top spinlock
 contention in the kernel is still from the resource counter code via
 mem_cgroup_commit_charge(), though.

Thanks for testing, that looks a lot better.

But commit doesn't touch resource counters - did you mean try_charge()
or uncharge() by any chance?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Johannes Weiner
On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote:
 @@ -900,10 +900,10 @@ void lru_add_drain_all(void)
   * grabbed the page via the LRU.  If it did, give up: shrink_inactive_list()
   * will free it.
   */
 -void release_pages(struct page **pages, int nr, bool cold)
 +static void release_lru_pages(struct page **pages, int nr,
 +   struct list_head *pages_to_free)
  {
   int i;
 - LIST_HEAD(pages_to_free);
   struct zone *zone = NULL;
   struct lruvec *lruvec;
   unsigned long uninitialized_var(flags);
 @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool 
 cold)
   /* Clear Active bit in case of parallel mark_page_accessed */
   __ClearPageActive(page);
  
 - list_add(&page->lru, &pages_to_free);
 + list_add(&page->lru, pages_to_free);
   }
   if (zone)
   spin_unlock_irqrestore(&zone->lru_lock, flags);
 +}
 +/*
 + * Batched page_cache_release(). Frees and uncharges all given pages
 + * for which the reference count drops to 0.
 + */
 +void release_pages(struct page **pages, int nr, bool cold)
 +{
 + LIST_HEAD(pages_to_free);
  
 + while (nr) {
 + int batch = min(nr, PAGEVEC_SIZE);
 +
 + release_lru_pages(pages, batch, &pages_to_free);
 + pages += batch;
 + nr -= batch;
 + }

We might be able to process a lot more pages in one go if nobody else
needs the lock or the CPU.  Can't we just cycle the lock or reschedule
if necessary?

diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..ee0cf21dd521 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold)
__ClearPageActive(page);
 
list_add(&page->lru, &pages_to_free);
+
+   if (should_resched() ||
+   (zone && spin_needbreak(&zone->lru_lock))) {
+   if (zone) {
+   spin_unlock_irqrestore(&zone->lru_lock, flags);
+   zone = NULL;
+   }
+   cond_resched();
+   }
}
if (zone)
spin_unlock_irqrestore(&zone->lru_lock, flags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3e0ec83d000c..c487ca4682a4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page)
  */
 void free_pages_and_swap_cache(struct page **pages, int nr)
 {
-   struct page **pagep = pages;
+   int i;
 
lru_add_drain();
-   while (nr) {
-   int todo = min(nr, PAGEVEC_SIZE);
-   int i;
-
-   for (i = 0; i < todo; i++)
-   free_swap_cache(pagep[i]);
-   release_pages(pagep, todo, false);
-   pagep += todo;
-   nr -= todo;
-   }
+   for (i = 0; i < nr; i++)
+   free_swap_cache(pages[i]);
+   release_pages(pages, nr, false);
 }
 
 /*
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-05 Thread Michal Hocko
On Fri 05-09-14 10:47:23, Johannes Weiner wrote:
 On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote:
  @@ -900,10 +900,10 @@ void lru_add_drain_all(void)
* grabbed the page via the LRU.  If it did, give up: 
  shrink_inactive_list()
* will free it.
*/
  -void release_pages(struct page **pages, int nr, bool cold)
  +static void release_lru_pages(struct page **pages, int nr,
  + struct list_head *pages_to_free)
   {
  int i;
  -   LIST_HEAD(pages_to_free);
  struct zone *zone = NULL;
  struct lruvec *lruvec;
  unsigned long uninitialized_var(flags);
  @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool 
  cold)
  /* Clear Active bit in case of parallel mark_page_accessed */
  __ClearPageActive(page);
   
  -   list_add(&page->lru, &pages_to_free);
  +   list_add(&page->lru, pages_to_free);
  }
  if (zone)
  spin_unlock_irqrestore(&zone->lru_lock, flags);
  +}
  +/*
  + * Batched page_cache_release(). Frees and uncharges all given pages
  + * for which the reference count drops to 0.
  + */
  +void release_pages(struct page **pages, int nr, bool cold)
  +{
  +   LIST_HEAD(pages_to_free);
   
  +   while (nr) {
  +   int batch = min(nr, PAGEVEC_SIZE);
  +
  +   release_lru_pages(pages, batch, &pages_to_free);
  +   pages += batch;
  +   nr -= batch;
  +   }
 
 We might be able to process a lot more pages in one go if nobody else
 needs the lock or the CPU.  Can't we just cycle the lock or reschedule
 if necessary?

Is it safe to cond_resched here for all callers? I hope it is but there
are way too many callers to check so I am not 100% sure.

Besides that spin_needbreak doesn't seem to be available for all architectures.
git grep arch_spin_is_contended( -- arch/
arch/arm/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/arm64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/ia64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/mips/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/x86/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)

Moreover it doesn't seem to do anything for !CONFIG_PREEMPT but this
should be trivial to fix.
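
For reference, spin_needbreak() is roughly the following (sketch from memory,
not checked against the 3.17 tree), so on !CONFIG_PREEMPT the whole contention
check compiles down to 0:

static inline int spin_needbreak(spinlock_t *lock)
{
#ifdef CONFIG_PREEMPT
	/* report contention so the lock holder can drop the lock and yield */
	return spin_is_contended(lock);
#else
	return 0;
#endif
}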

I am also not sure this will work well in all cases. If we have a heavy
reclaim activity on other CPUs then this path might be interrupted too
often, resulting in too much lock bouncing. So I guess we want at least
a few pages to be processed in one run. On the other hand, if the lock is
not contended then doing batches and retaking the lock shouldn't add too
much overhead, no?
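
Something like the following would be a compromise (untested sketch on top of
your diff; "batched" is a hypothetical local counter and SWAP_CLUSTER_MAX is
just an arbitrary minimum):

		/* only consider dropping the lock after a minimum batch */
		if (++batched >= SWAP_CLUSTER_MAX &&
		    (should_resched() ||
		     (zone && spin_needbreak(&zone->lru_lock)))) {
			if (zone) {
				spin_unlock_irqrestore(&zone->lru_lock, flags);
				zone = NULL;
			}
			cond_resched();
			batched = 0;
		}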

 diff --git a/mm/swap.c b/mm/swap.c
 index 6b2dc3897cd5..ee0cf21dd521 100644
 --- a/mm/swap.c
 +++ b/mm/swap.c
 @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool 
 cold)
   __ClearPageActive(page);
  
 list_add(&page->lru, &pages_to_free);
 +
 + if (should_resched() ||
 + (zone && spin_needbreak(&zone->lru_lock))) {
 + if (zone) {
 + spin_unlock_irqrestore(&zone->lru_lock, flags);
 + zone = NULL;
 + }
 + cond_resched();
 + }
   }
   if (zone)
  spin_unlock_irqrestore(&zone->lru_lock, flags);
 diff --git a/mm/swap_state.c b/mm/swap_state.c
 index 3e0ec83d000c..c487ca4682a4 100644
 --- a/mm/swap_state.c
 +++ b/mm/swap_state.c
 @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page)
   */
  void free_pages_and_swap_cache(struct page **pages, int nr)
  {
 - struct page **pagep = pages;
 + int i;
  
   lru_add_drain();
 - while (nr) {
 - int todo = min(nr, PAGEVEC_SIZE);
 - int i;
 -
  - for (i = 0; i < todo; i++)
 - free_swap_cache(pagep[i]);
 - release_pages(pagep, todo, false);
 - pagep += todo;
 - nr -= todo;
 - }
  + for (i = 0; i < nr; i++)
 + free_swap_cache(pages[i]);
 + release_pages(pages, nr, false);
  }
  
  /*

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-04 Thread Dave Hansen
On 09/04/2014 01:27 PM, Dave Hansen wrote:
> On 09/04/2014 07:27 AM, Michal Hocko wrote:
>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
>> because it reduces it to PAGEVEC_SIZE batches.
>>
>> I think we really do not need PAGEVEC_SIZE batching anymore. We are
>> already batching on tlb_gather layer. That one is limited so I think
>> the below should be safe but I have to think about this some more. There
>> is a risk of prolonged lru_lock wait times but the number of pages is
>> limited to 10k and the heavy work is done outside of the lock. If this
>> is really a problem then we can tear LRU part and the actual
>> freeing/uncharging into a separate functions in this path.
>>
>> Could you test with this half baked patch, please? I didn't get to test
>> it myself unfortunately.
> 
> 3.16 settled out at about 11.5M faults/sec before the regression.  This
> patch gets it back up to about 10.5M, which is good.  The top spinlock
> contention in the kernel is still from the resource counter code via
> mem_cgroup_commit_charge(), though.
> 
> I'm running Johannes' patch now.

This looks pretty good.  The area where it plateaus (above 80 threads
where hyperthreading kicks in) might be a bit slower than it was in
3.16, but that could easily be from other things.

> https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f

Feel free to add my Tested-by:

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-04 Thread Dave Hansen
On 09/04/2014 08:08 AM, Johannes Weiner wrote:
> Dave Hansen reports a massive scalability regression in an uncontained
> page fault benchmark with more than 30 concurrent threads, which he
> bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup
> res_counter") and pin-pointed on res_counter spinlock contention.
> 
> That change relied on the per-cpu charge caches to mostly swallow the
> res_counter costs, but it's apparent that the caches don't scale yet.
> 
> Revert memcg back to bypassing res_counters on the root level in order
> to restore performance for uncontained workloads.

A quick sniff test shows performance coming back to what it was around
3.16 with this patch.  I'll run a more thorough set of tests and verify
that it's working well.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-04 Thread Dave Hansen
On 09/04/2014 07:27 AM, Michal Hocko wrote:
> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> because it reduces it to PAGEVEC_SIZE batches.
> 
> I think we really do not need PAGEVEC_SIZE batching anymore. We are
> already batching on tlb_gather layer. That one is limited so I think
> the below should be safe but I have to think about this some more. There
> is a risk of prolonged lru_lock wait times but the number of pages is
> limited to 10k and the heavy work is done outside of the lock. If this
> is really a problem then we can tear LRU part and the actual
> freeing/uncharging into a separate functions in this path.
> 
> Could you test with this half baked patch, please? I didn't get to test
> it myself unfortunately.

3.16 settled out at about 11.5M faults/sec before the regression.  This
patch gets it back up to about 10.5M, which is good.  The top spinlock
contention in the kernel is still from the resource counter code via
mem_cgroup_commit_charge(), though.

I'm running Johannes' patch now.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-04 Thread Johannes Weiner
On Tue, Sep 02, 2014 at 05:30:38PM -0700, Dave Hansen wrote:
> On 09/02/2014 05:10 PM, Johannes Weiner wrote:
> > On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote:
> >> On 09/02/2014 03:18 PM, Johannes Weiner wrote:
> >>> Accounting new pages is buffered through per-cpu caches, but taking
> >>> them off the counters on free is not, so I'm guessing that above a
> >>> certain allocation rate the cost of locking and changing the counters
> >>> takes over.  Is there a chance you could profile this to see if locks
> >>> and res_counter-related operations show up?
> >>
> >> It looks pretty much the same, although it might have equalized the
> >> charge and uncharge sides a bit.  Full 'perf top' output attached.
> > 
> > That looks like a partial profile, where did the page allocator, page
> > zeroing etc. go?  Because the distribution among these listed symbols
> > doesn't seem all that crazy:
> 
> Perf was only outputting the top 20 functions.  Believe it or not, page
> zeroing and the rest of the allocator path wasn't even in the path of
> the top 20 functions because there is so much lock contention.
> 
> Here's a longer run of 'perf top' along with the top 100 functions:
> 
>   http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz
> 
> you can at least see copy_page_rep in there.

Thanks for the clarification, that is truly horrible.  Does the
following revert restore performance in your case?

---
>From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Thu, 4 Sep 2014 10:04:34 -0400
Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter

Dave Hansen reports a massive scalability regression in an uncontained
page fault benchmark with more than 30 concurrent threads, which he
bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup
res_counter") and pin-pointed on res_counter spinlock contention.

That change relied on the per-cpu charge caches to mostly swallow the
res_counter costs, but it's apparent that the caches don't scale yet.

Revert memcg back to bypassing res_counters on the root level in order
to restore performance for uncontained workloads.

Reported-by: Dave Hansen 
Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c | 103 ++--
 1 file changed, 78 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec4dcf1b9562..085dc6d2f876 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
unsigned long long size;
int ret = 0;
 
+   if (mem_cgroup_is_root(memcg))
+   goto done;
 retry:
if (consume_stock(memcg, nr_pages))
goto done;
@@ -2611,9 +2613,7 @@ nomem:
if (!(gfp_mask & __GFP_NOFAIL))
return -ENOMEM;
 bypass:
-   memcg = root_mem_cgroup;
-   ret = -EINTR;
-   goto retry;
+   return -EINTR;
 
 done_restock:
if (batch > nr_pages)
@@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, 
unsigned int nr_pages)
 {
unsigned long bytes = nr_pages * PAGE_SIZE;
 
+   if (mem_cgroup_is_root(memcg))
+   return;
+
res_counter_uncharge(&memcg->res, bytes);
if (do_swap_account)
res_counter_uncharge(&memcg->memsw, bytes);
@@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct 
mem_cgroup *memcg,
 {
unsigned long bytes = nr_pages * PAGE_SIZE;
 
+   if (mem_cgroup_is_root(memcg))
+   return;
+
res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
if (do_swap_account)
res_counter_uncharge_until(&memcg->memsw,
@@ -4093,6 +4099,46 @@ out:
return retval;
 }
 
+static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
+  enum mem_cgroup_stat_index idx)
+{
+   struct mem_cgroup *iter;
+   long val = 0;
+
+   /* Per-cpu values can be negative, use a signed accumulator */
+   for_each_mem_cgroup_tree(iter, memcg)
+   val += mem_cgroup_read_stat(iter, idx);
+
+   if (val < 0) /* race ? */
+   val = 0;
+   return val;
+}
+
+static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
+{
+   u64 val;
+
+   if (!mem_cgroup_is_root(memcg)) {
+   if (!swap)
return res_counter_read_u64(&memcg->res, RES_USAGE);
else
return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+   }
+
+   /*
+* Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
+* as well as in MEM_CGROUP_STAT_RSS_HUGE.
+*/
+   val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
+   val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
+
+   if (swap)
+   val += 

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-04 Thread Michal Hocko
[Sorry to reply so late]

On Tue 02-09-14 13:57:22, Dave Hansen wrote:
> I, of course, forgot to include the most important detail.  This appears
> to be pretty run-of-the-mill spinlock contention in the resource counter
> code.  Nearly 80% of the CPU is spent spinning in the charge or uncharge
> paths in the kernel.  It is apparently spinning on res_counter->lock in
> both the charge and uncharge paths.
> 
> It already does _some_ batching here on the free side, but that
> apparently breaks down after ~40 threads.
> 
> It's a no-brainer since the patch in question removed an optimization
> skipping the charging, and now we're seeing overhead from the charging.
> 
> Here's the first entry from perf top:
> 
> 80.18%    80.18%  [kernel]   [k] _raw_spin_lock
>   |
>   --- _raw_spin_lock
>  |
>  |--66.59%-- res_counter_uncharge_until
>  |  res_counter_uncharge
>  |  uncharge_batch
>  |  uncharge_list
>  |  mem_cgroup_uncharge_list
>  |  release_pages
>  |  free_pages_and_swap_cache

Ouch. free_pages_and_swap_cache completely kills the uncharge batching
because it reduces it to PAGEVEC_SIZE batches.

I think we really do not need PAGEVEC_SIZE batching anymore. We are
already batching on tlb_gather layer. That one is limited so I think
the below should be safe but I have to think about this some more. There
is a risk of prolonged lru_lock wait times but the number of pages is
limited to 10k and the heavy work is done outside of the lock. If this
is really a problem then we can tear LRU part and the actual
freeing/uncharging into a separate functions in this path.

Could you test with this half baked patch, please? I didn't get to test
it myself unfortunately.
---
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..15918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
 void free_pages_and_swap_cache(struct page **pages, int nr)
 {
struct page **pagep = pages;
+   int i;
 
lru_add_drain();
-   while (nr) {
-   int todo = min(nr, PAGEVEC_SIZE);
-   int i;
-
-   for (i = 0; i < todo; i++)
-   free_swap_cache(pagep[i]);
-   release_pages(pagep, todo, false);
-   pagep += todo;
-   nr -= todo;
-   }
+   for (i = 0; i < nr; i++)
+   free_swap_cache(pagep[i]);
+   release_pages(pagep, nr, false);
 }
 
 /*
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 06:33 PM, Johannes Weiner wrote:
> kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to
> me whether Dave filtered symbols from only memcontrol.o, memory.o, and
> mmap.o in a similar way.  I'm not arguing against the regression, I'm
> just trying to make sense of the numbers from the *patched* kernel.

I guess I could have included it in the description, but that was a
pretty vanilla run:

perf top --call-graph=fp --stdio > foo.txt

I didn't use any filtering.  I didn't even know I _could_ filter. :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Johannes Weiner
On Tue, Sep 02, 2014 at 05:20:55PM -0700, Linus Torvalds wrote:
> On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner  wrote:
> >
> > That looks like a partial profile, where did the page allocator, page
> > zeroing etc. go?  Because the distribution among these listed symbols
> > doesn't seem all that crazy:
> 
> Please argue this *after* the commit has been reverted. You guys can
> try to make the memcontrol batching actually work and scale later.
> It's not appropriate to argue against major regressions when reported
> and bisected by users.

I'll send a clean revert later.

> Showing the spinlock at the top of the profile is very much crazy
> (apparently taking 68% of all cpu time), when it's all useless
> make-believe work. I don't understand why you wouldn't call that
> crazy.

If you limit perf to a subset of symbols, it will show a relative
distribution between them, i.e: perf top --symbols kfree,memset during
some disk access:

   PerfTop:1292 irqs/sec  kernel:84.4%  exact:  0.0% [4000Hz cycles],  
(all, 4 CPUs)
---

56.23%  [kernel]  [k] kfree 
41.86%  [kernel]  [k] memset
 1.91%  libc-2.19.so  [.] memset

kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to
me whether Dave filtered symbols from only memcontrol.o, memory.o, and
mmap.o in a similar way.  I'm not arguing against the regression, I'm
just trying to make sense of the numbers from the *patched* kernel.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 05:10 PM, Johannes Weiner wrote:
> On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote:
>> On 09/02/2014 03:18 PM, Johannes Weiner wrote:
>>> Accounting new pages is buffered through per-cpu caches, but taking
>>> them off the counters on free is not, so I'm guessing that above a
>>> certain allocation rate the cost of locking and changing the counters
>>> takes over.  Is there a chance you could profile this to see if locks
>>> and res_counter-related operations show up?
>>
>> It looks pretty much the same, although it might have equalized the
>> charge and uncharge sides a bit.  Full 'perf top' output attached.
> 
> That looks like a partial profile, where did the page allocator, page
> zeroing etc. go?  Because the distribution among these listed symbols
> doesn't seem all that crazy:

Perf was only outputting the top 20 functions.  Believe it or not, page
zeroing and the rest of the allocator path wasn't even in the path of
the top 20 functions because there is so much lock contention.

Here's a longer run of 'perf top' along with the top 100 functions:

http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz

you can at least see copy_page_rep in there.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Linus Torvalds
On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner  wrote:
>
> That looks like a partial profile, where did the page allocator, page
> zeroing etc. go?  Because the distribution among these listed symbols
> doesn't seem all that crazy:

Please argue this *after* the commit has been reverted. You guys can
try to make the memcontrol batching actually work and scale later.
It's not appropriate to argue against major regressions when reported
and bisected by users.

Showing the spinlock at the top of the profile is very much crazy
(apparently taking 68% of all cpu time), when it's all useless
make-believe work. I don't understand why you wouldn't call that
crazy.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Johannes Weiner
On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote:
> On 09/02/2014 03:18 PM, Johannes Weiner wrote:
> > Accounting new pages is buffered through per-cpu caches, but taking
> > them off the counters on free is not, so I'm guessing that above a
> > certain allocation rate the cost of locking and changing the counters
> > takes over.  Is there a chance you could profile this to see if locks
> > and res_counter-related operations show up?
> 
> It looks pretty much the same, although it might have equalized the
> charge and uncharge sides a bit.  Full 'perf top' output attached.

That looks like a partial profile, where did the page allocator, page
zeroing etc. go?  Because the distribution among these listed symbols
doesn't seem all that crazy:

>PerfTop:  275580 irqs/sec  kernel:98.0%  exact:  0.0% [4000Hz cycles],  
> (all, 160 CPUs)
> ---
> 
> 68.10%    68.10%  [kernel]   [k] _raw_spin_lock
>   |
>   --- _raw_spin_lock
>  |  
>  |--57.35%-- __res_counter_charge
>  |  res_counter_charge
>  |  try_charge
>  |  mem_cgroup_try_charge
>  |  |  
>  |  |--99.93%-- do_cow_fault
>  |  |  handle_mm_fault
>  |  |  __do_page_fault
>  |  |  do_page_fault
>  |  |  page_fault
>  |  |  testcase
>  |   --0.07%-- [...]
>  |  
>  |--53.93%-- res_counter_uncharge_until
>  |  res_counter_uncharge
>  |  refill_stock
>  |  uncharge_batch
>  |  uncharge_list
>  |  mem_cgroup_uncharge_list
>  |  release_pages
>  |  free_pages_and_swap_cache
>  |  tlb_flush_mmu_free
>  |  |  
>  |  |--98.62%-- unmap_single_vma
>  |  |  unmap_vmas
>  |  |  unmap_region
>  |  |  do_munmap
>  |  |  vm_munmap
>  |  |  sys_munmap
>  |  |  system_call_fastpath
>  |  |  __GI___munmap
>  |  |  
>  |   --1.38%-- tlb_flush_mmu
>  | tlb_finish_mmu
>  | unmap_region
>  | do_munmap
>  | vm_munmap
>  | sys_munmap
>  | system_call_fastpath
>  | __GI___munmap
>  |  
>  |--2.18%-- do_cow_fault
>  |  handle_mm_fault
>  |  __do_page_fault
>  |  do_page_fault
>  |  page_fault
>  |  testcase
>   --9307219025.55%-- [...]
> 
> 64.00% 1.34%  page_fault2_processes  [.] testcase 
>   |
>   --- testcase
> 
> 62.64% 0.37%  [kernel]   [k] page_fault   
>   |
>   --- page_fault
>  |  
>  |--114.21%-- testcase
>   --10118485450.89%-- [...]
> 
> 62.27% 0.01%  [kernel]   [k] do_page_fault
>   |
>   --- do_page_fault
>  |  
>  |--114.28%-- page_fault
>  |  |  
>  |  |--99.93%-- testcase
>  |   --0.07%-- [...]
>   --10178525138.33%-- [...]
> 
> 62.25% 0.07%  [kernel]   [k] __do_page_fault  
>   |
>   --- __do_page_fault
>  |  
>  |--114.27%-- do_page_fault
>  |  page_fault
>  |  |  
>  |  |--99.93%-- testcase
>  |   --0.07%-- [...]
>   --10182230022.41%-- [...]
> 
> 

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 03:18 PM, Johannes Weiner wrote:
> Accounting new pages is buffered through per-cpu caches, but taking
> them off the counters on free is not, so I'm guessing that above a
> certain allocation rate the cost of locking and changing the counters
> takes over.  Is there a chance you could profile this to see if locks
> and res_counter-related operations show up?

It looks pretty much the same, although it might have equalized the
charge and uncharge sides a bit.  Full 'perf top' output attached.
   PerfTop:  275580 irqs/sec  kernel:98.0%  exact:  0.0% [4000Hz cycles],  
(all, 160 CPUs)
---

68.10%    68.10%  [kernel]   [k] _raw_spin_lock
  |
  --- _raw_spin_lock
 |  
 |--57.35%-- __res_counter_charge
 |  res_counter_charge
 |  try_charge
 |  mem_cgroup_try_charge
 |  |  
 |  |--99.93%-- do_cow_fault
 |  |  handle_mm_fault
 |  |  __do_page_fault
 |  |  do_page_fault
 |  |  page_fault
 |  |  testcase
 |   --0.07%-- [...]
 |  
 |--53.93%-- res_counter_uncharge_until
 |  res_counter_uncharge
 |  refill_stock
 |  uncharge_batch
 |  uncharge_list
 |  mem_cgroup_uncharge_list
 |  release_pages
 |  free_pages_and_swap_cache
 |  tlb_flush_mmu_free
 |  |  
 |  |--98.62%-- unmap_single_vma
 |  |  unmap_vmas
 |  |  unmap_region
 |  |  do_munmap
 |  |  vm_munmap
 |  |  sys_munmap
 |  |  system_call_fastpath
 |  |  __GI___munmap
 |  |  
 |   --1.38%-- tlb_flush_mmu
 | tlb_finish_mmu
 | unmap_region
 | do_munmap
 | vm_munmap
 | sys_munmap
 | system_call_fastpath
 | __GI___munmap
 |  
 |--2.18%-- do_cow_fault
 |  handle_mm_fault
 |  __do_page_fault
 |  do_page_fault
 |  page_fault
 |  testcase
  --9307219025.55%-- [...]

64.00% 1.34%  page_fault2_processes  [.] testcase 
  |
  --- testcase

62.64% 0.37%  [kernel]   [k] page_fault   
  |
  --- page_fault
 |  
 |--114.21%-- testcase
  --10118485450.89%-- [...]

62.27% 0.01%  [kernel]   [k] do_page_fault
  |
  --- do_page_fault
 |  
 |--114.28%-- page_fault
 |  |  
 |  |--99.93%-- testcase
 |   --0.07%-- [...]
  --10178525138.33%-- [...]

62.25% 0.07%  [kernel]   [k] __do_page_fault  
  |
  --- __do_page_fault
 |  
 |--114.27%-- do_page_fault
 |  page_fault
 |  |  
 |  |--99.93%-- testcase
 |   --0.07%-- [...]
  --10182230022.41%-- [...]

62.08% 0.26%  [kernel]   [k] handle_mm_fault  
  |
  --- handle_mm_fault
 |  
 |--114.28%-- __do_page_fault
 |  do_page_fault
 |  page_fault
 |  |  
 |  |--99.94%-- testcase
 |   

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Johannes Weiner
Hi Dave,

On Tue, Sep 02, 2014 at 12:05:41PM -0700, Dave Hansen wrote:
> I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the
> memory cgroups code.  This is on a kernel with cgroups enabled at
> compile time, but not _used_ for anything.  See the green lines in the
> graph:
> 
>   https://www.sr71.net/~dave/intel/regression-from-05b843012.png
> 
> The workload is a little parallel microbenchmark doing page faults:

Ouch.

> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c
> 
> The hardware is an 8-socket Westmere box with 160 hardware threads.  For
> some reason, this does not affect the version of the microbenchmark
> which is doing completely anonymous page faults.
> 
> I bisected it down to this commit:
> 
> > commit 05b8430123359886ef6a4146fba384e30d771b3f
> > Author: Johannes Weiner 
> > Date:   Wed Aug 6 16:05:59 2014 -0700
> > 
> > mm: memcontrol: use root_mem_cgroup res_counter
> > 
> > Due to an old optimization to keep expensive res_counter changes at a
> > minimum, the root_mem_cgroup res_counter is never charged; there is no
> > limit at that level anyway, and any statistics can be generated on
> > demand by summing up the counters of all other cgroups.
> > 
> > However, with per-cpu charge caches, res_counter operations do not even
> > show up in profiles anymore, so this optimization is no longer
> > necessary.
> > 
> > Remove it to simplify the code.

Accounting new pages is buffered through per-cpu caches, but taking
them off the counters on free is not, so I'm guessing that above a
certain allocation rate the cost of locking and changing the counters
takes over.  Is there a chance you could profile this to see if locks
and res_counter-related operations show up?
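
For context, the charge side only hits the res_counter lock when the per-cpu
stock runs dry; the fast path is roughly this (simplified sketch, constant
from memory):

#define CHARGE_BATCH	32U	/* pages cached per CPU before touching res_counter */

/* simplified: try to satisfy a charge from this CPU's cached stock */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	bool ret = false;

	if (nr_pages > CHARGE_BATCH)
		return false;

	stock = &get_cpu_var(memcg_stock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;	/* no res_counter lock taken */
		ret = true;
	}
	put_cpu_var(memcg_stock);
	return ret;
}

The uncharge side has no such cache, which is what the patch below tries to
change.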

I can't reproduce this complete breakdown on my smaller test gear, but
I do see an improvement with the following patch:

---
>From 29a51326c24b7bb45d17e9c864c34506f10868f6 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Tue, 2 Sep 2014 18:11:39 -0400
Subject: [patch] mm: memcontrol: use per-cpu caches for uncharging

---
 mm/memcontrol.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec4dcf1b9562..cb79ecff399d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2365,6 +2365,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
res_counter_uncharge(&old->res, bytes);
if (do_swap_account)
res_counter_uncharge(&old->memsw, bytes);
+   memcg_oom_recover(old);
stock->nr_pages = 0;
}
stock->cached = NULL;
@@ -2405,6 +2406,13 @@ static void refill_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
stock->cached = memcg;
}
stock->nr_pages += nr_pages;
+   if (stock->nr_pages > CHARGE_BATCH * 4) {
+   res_counter_uncharge(&memcg->res, CHARGE_BATCH);
+   if (do_swap_account)
+   res_counter_uncharge(&memcg->memsw, CHARGE_BATCH);
+   memcg_oom_recover(memcg);
+   stock->nr_pages -= CHARGE_BATCH;
+   }
put_cpu_var(memcg_stock);
 }
 
@@ -6509,12 +6517,20 @@ static void uncharge_batch(struct mem_cgroup *memcg, 
unsigned long pgpgout,
 {
unsigned long flags;
 
-   if (nr_mem)
-   res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
-   if (nr_memsw)
-   res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE);
-
-   memcg_oom_recover(memcg);
+   /*
+* The percpu caches count both memory and memsw charges in a
+* single counter, but there might be less memsw charges when
+* some of the pages have been swapped out.
+*/
+   if (nr_mem == nr_memsw)
+   refill_stock(memcg, nr_mem);
+   else {
+   if (nr_mem)
+   res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
+   if (nr_memsw)
+   res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE);
+   memcg_oom_recover(memcg);
+   }
 
local_irq_save(flags);
__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon);
-- 
2.0.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
I, of course, forgot to include the most important detail.  This appears
to be pretty run-of-the-mill spinlock contention in the resource counter
code.  Nearly 80% of the CPU is spent spinning in the charge or uncharge
paths in the kernel.  It is apparently spinning on res_counter->lock in
both the charge and uncharge paths.

It already does _some_ batching here on the free side, but that
apparently breaks down after ~40 threads.

It's a no-brainer since the patch in question removed an optimization
skipping the charging, and now we're seeing overhead from the charging.
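
The lock being hammered is the per-counter spinlock that every charge and
uncharge takes while walking up the res_counter hierarchy, roughly this
(simplified sketch of kernel/res_counter.c from memory, error unwinding
omitted):

int res_counter_charge(struct res_counter *counter, unsigned long val,
		       struct res_counter **limit_fail_at)
{
	struct res_counter *c;
	unsigned long flags;
	int ret = 0;

	*limit_fail_at = NULL;
	for (c = counter; c != NULL; c = c->parent) {
		spin_lock_irqsave(&c->lock, flags);	/* the lock we spin on */
		ret = res_counter_charge_locked(c, val, false);
		spin_unlock_irqrestore(&c->lock, flags);
		if (ret < 0) {
			*limit_fail_at = c;
			break;	/* real code uncharges already-charged parents */
		}
	}
	return ret;
}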

Here's the first entry from perf top:

80.18%    80.18%  [kernel]   [k] _raw_spin_lock
  |
  --- _raw_spin_lock
 |
 |--66.59%-- res_counter_uncharge_until
 |  res_counter_uncharge
 |  uncharge_batch
 |  uncharge_list
 |  mem_cgroup_uncharge_list
 |  release_pages
 |  free_pages_and_swap_cache
 |  tlb_flush_mmu_free
 |  |
 |  |--90.12%-- unmap_single_vma
 |  |  unmap_vmas
 |  |  unmap_region
 |  |  do_munmap
 |  |  vm_munmap
 |  |  sys_munmap
 |  |  system_call_fastpath
 |  |  __GI___munmap
 |  |
 |   --9.88%-- tlb_flush_mmu
 | tlb_finish_mmu
 | unmap_region
 | do_munmap
 | vm_munmap
 | sys_munmap
 | system_call_fastpath
 | __GI___munmap
 |
 |--46.13%-- __res_counter_charge
 |  res_counter_charge
 |  try_charge
 |  mem_cgroup_try_charge
 |  |
 |  |--99.89%-- do_cow_fault
 |  |  handle_mm_fault
 |  |  __do_page_fault
 |  |  do_page_fault
 |  |  page_fault
 |  |  testcase
 |   --0.11%-- [...]
 |
 |--1.14%-- do_cow_fault
 |  handle_mm_fault
 |  __do_page_fault
 |  do_page_fault
 |  page_fault
 |  testcase
  --8217937613.29%-- [...]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 12:05 PM, Dave Hansen wrote:
> It does not revert cleanly because of the hunks below.  The code in
> those hunks was removed, so I tried running without properly merging
> them and it spews warnings because counter->usage is seen going negative.
> 
> So, it doesn't appear we can quickly revert this.

I'm fairly confident that I missed some of the cases (especially in the
charge-moving code), but the attached patch does at least work around
the regression for me.  It restores the original performance, or at
least gets _close_ to it.




---

 b/mm/memcontrol.c |   14 ++
 1 file changed, 14 insertions(+)

diff -puN mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch mm/memcontrol.c
--- a/mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch	2014-09-02 12:20:11.209527453 -0700
+++ b/mm/memcontrol.c	2014-09-02 13:10:28.756736862 -0700
@@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup
 	unsigned long long size;
 	int ret = 0;
 
+	if (mem_cgroup_is_root(memcg))
+		goto done;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2640,6 +2642,9 @@ static void __mem_cgroup_cancel_local_ch
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
 	if (do_swap_account)
 		res_counter_uncharge_until(&memcg->memsw,
@@ -6440,6 +6445,9 @@ void mem_cgroup_commit_charge(struct pag
 	VM_BUG_ON_PAGE(!page->mapping, page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	/*
@@ -6484,6 +6492,9 @@ void mem_cgroup_cancel_charge(struct pag
 {
 	unsigned int nr_pages = 1;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	/*
@@ -6509,6 +6520,9 @@ static void uncharge_batch(struct mem_cg
 {
 	unsigned long flags;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (nr_mem)
 		res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
 	if (nr_memsw)
_


regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the
memory cgroups code.  This is on a kernel with cgroups enabled at
compile time, but not _used_ for anything.  See the green lines in the
graph:

https://www.sr71.net/~dave/intel/regression-from-05b843012.png

The workload is a little parallel microbenchmark doing page faults:

> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c

The hardware is an 8-socket Westmere box with 160 hardware threads.  For
some reason, this does not affect the version of the microbenchmark
which is doing completely anonymous page faults.
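
The relevant difference is that page_fault2 faults on a private file mapping,
so every write fault goes through do_cow_fault() and every munmap() does a
bulk uncharge.  The core loop is essentially this (paraphrased sketch, not the
exact will-it-scale source; the 128MB size is an assumption):

	#define MEMSIZE (128UL * 1024 * 1024)

	/* fd is a per-thread temporary file sized to MEMSIZE,
	 * page_size comes from getpagesize() */
	for (;;) {
		char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, 0);
		unsigned long off;

		/* write fault (COW) on every page */
		for (off = 0; off < MEMSIZE; off += page_size)
			p[off] = 1;

		munmap(p, MEMSIZE);
	}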

I bisected it down to this commit:

> commit 05b8430123359886ef6a4146fba384e30d771b3f
> Author: Johannes Weiner 
> Date:   Wed Aug 6 16:05:59 2014 -0700
> 
> mm: memcontrol: use root_mem_cgroup res_counter
> 
> Due to an old optimization to keep expensive res_counter changes at a
> minimum, the root_mem_cgroup res_counter is never charged; there is no
> limit at that level anyway, and any statistics can be generated on
> demand by summing up the counters of all other cgroups.
> 
> However, with per-cpu charge caches, res_counter operations do not even
> show up in profiles anymore, so this optimization is no longer
> necessary.
> 
> Remove it to simplify the code.

It does not revert cleanly because of the hunks below.  The code in
those hunks was removed, so I tried running without properly merging
them and it spews warnings because counter->usage is seen going negative.

So, it doesn't appear we can quickly revert this.

> --- mm/memcontrol.c
> +++ mm/memcontrol.c
> @@ -3943,7 +3947,7 @@
>  * replacement page, so leave it alone when phasing out the
>  * page that is unused after the migration.
>  */
> -   if (!end_migration)
> +   if (!end_migration && !mem_cgroup_is_root(memcg))
> mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
>  
> return memcg;
> @@ -4076,7 +4080,8 @@
>  * We uncharge this because swap is freed.  This memcg can
>  * be obsolete one. We avoid calling css_tryget_online().
>  */
> -   res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> +   if (!mem_cgroup_is_root(memcg))
> +   res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> mem_cgroup_swap_statistics(memcg, false);
> css_put(&memcg->css);
> }
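
(For the record, the warnings come from the underflow check in the uncharge
path, roughly the following sketch from memory, so a skipped charge shows up
as usage < val on the very next uncharge:)

	if (WARN_ON(counter->usage < val))
		val = counter->usage;
	counter->usage -= val;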
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the
memory cgroups code.  This is on a kernel with cgroups enabled at
compile time, but not _used_ for anything.  See the green lines in the
graph:

https://www.sr71.net/~dave/intel/regression-from-05b843012.png

The workload is a little parallel microbenchmark doing page faults:

 https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c

The hardware is an 8-socket Westmere box with 160 hardware threads.  For
some reason, this does not affect the version of the microbenchmark
which is doing completely anonymous page faults.

I bisected it down to this commit:

 commit 05b8430123359886ef6a4146fba384e30d771b3f
 Author: Johannes Weiner han...@cmpxchg.org
 Date:   Wed Aug 6 16:05:59 2014 -0700
 
 mm: memcontrol: use root_mem_cgroup res_counter
 
 Due to an old optimization to keep expensive res_counter changes at a
 minimum, the root_mem_cgroup res_counter is never charged; there is no
 limit at that level anyway, and any statistics can be generated on
 demand by summing up the counters of all other cgroups.
 
 However, with per-cpu charge caches, res_counter operations do not even
 show up in profiles anymore, so this optimization is no longer
 necessary.
 
 Remove it to simplify the code.

It does not revert cleanly because of the hunks below.  The code in
those hunks was removed, so I tried running without properly merging
them and it spews warnings because counter-usage is seen going negative.

So, it doesn't appear we can quickly revert this.

 --- mm/memcontrol.c
 +++ mm/memcontrol.c
 @@ -3943,7 +3947,7 @@
  * replacement page, so leave it alone when phasing out the
  * page that is unused after the migration.
  */
 -   if (!end_migration)
 +   if (!end_migration  !mem_cgroup_is_root(memcg))
 mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
  
 return memcg;
 @@ -4076,7 +4080,8 @@
  * We uncharge this because swap is freed.  This memcg can
  * be obsolete one. We avoid calling css_tryget_online().
  */
 -   res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 +   if (!mem_cgroup_is_root(memcg))
 +   res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 mem_cgroup_swap_statistics(memcg, false);
 css_put(&memcg->css);
 }


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 12:05 PM, Dave Hansen wrote:
 It does not revert cleanly because of the hunks below.  The code in
 those hunks was removed, so I tried running without properly merging
 them and it spews warnings because counter->usage is seen going negative.
 
 So, it doesn't appear we can quickly revert this.

I'm fairly confident that I missed some of the cases (especially in the
charge-moving code), but the attached patch does at least work around
the regression for me.  It restores the original performance, or at
least gets _close_ to it.




---

 b/mm/memcontrol.c |   14 ++
 1 file changed, 14 insertions(+)

diff -puN mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch mm/memcontrol.c
--- a/mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch	2014-09-02 12:20:11.209527453 -0700
+++ b/mm/memcontrol.c	2014-09-02 13:10:28.756736862 -0700
@@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup
 	unsigned long long size;
 	int ret = 0;
 
+	if (mem_cgroup_is_root(memcg))
+		goto done;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2640,6 +2642,9 @@ static void __mem_cgroup_cancel_local_ch
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
 	if (do_swap_account)
 		res_counter_uncharge_until(&memcg->memsw,
@@ -6440,6 +6445,9 @@ void mem_cgroup_commit_charge(struct pag
 	VM_BUG_ON_PAGE(!page->mapping, page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	/*
@@ -6484,6 +6492,9 @@ void mem_cgroup_cancel_charge(struct pag
 {
 	unsigned int nr_pages = 1;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	/*
@@ -6509,6 +6520,9 @@ static void uncharge_batch(struct mem_cg
 {
 	unsigned long flags;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (nr_mem)
 		res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
 	if (nr_memsw)
_


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
I, of course, forgot to include the most important detail.  This appears
to be pretty run-of-the-mill spinlock contention in the resource counter
code.  Nearly 80% of the CPU is spent spinning in the charge or uncharge
paths in the kernel.  It is apparently spinning on res_counter->lock in
both the charge and uncharge paths.

It already does _some_ batching here on the free side, but that
apparently breaks down after ~40 threads.

This is no surprise: the patch in question removed an optimization that
skipped charging the root memcg, so we are now seeing the overhead of
that charging.
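
For anyone not familiar with the res_counter code, here is a minimal,
illustrative sketch of the charging pattern (trimmed down from the real
mm/res_counter.c; the "toy_" names, field layout and error handling are
mine, not the kernel's):

#include <linux/spinlock.h>
#include <linux/errno.h>

/* Illustrative only -- a trimmed-down res_counter-style charge walk. */
struct toy_res_counter {
	spinlock_t lock;		/* the lock showing up in the profile */
	unsigned long long usage;
	unsigned long long limit;
	struct toy_res_counter *parent;	/* the root memcg sits at the top */
};

static int toy_res_counter_charge(struct toy_res_counter *counter,
				  unsigned long val)
{
	struct toy_res_counter *c;

	/* Walk from the charged group up to the root, locking each level. */
	for (c = counter; c != NULL; c = c->parent) {
		spin_lock(&c->lock);
		if (c->usage + val > c->limit) {
			spin_unlock(&c->lock);
			return -ENOMEM;	/* rollback of lower levels omitted */
		}
		c->usage += val;
		spin_unlock(&c->lock);
	}
	return 0;
}

Every charge and uncharge walks up the hierarchy and takes each level's
spinlock, so once the root memcg is charged as well, all 160 threads end
up serializing on the root counter's lock for every fault and unmap.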

Here's the first entry from perf top:

80.18%80.18%  [kernel]   [k] _raw_spin_lock
  |
  --- _raw_spin_lock
 |
 |--66.59%-- res_counter_uncharge_until
 |  res_counter_uncharge
 |  uncharge_batch
 |  uncharge_list
 |  mem_cgroup_uncharge_list
 |  release_pages
 |  free_pages_and_swap_cache
 |  tlb_flush_mmu_free
 |  |
 |  |--90.12%-- unmap_single_vma
 |  |  unmap_vmas
 |  |  unmap_region
 |  |  do_munmap
 |  |  vm_munmap
 |  |  sys_munmap
 |  |  system_call_fastpath
 |  |  __GI___munmap
 |  |
 |   --9.88%-- tlb_flush_mmu
 | tlb_finish_mmu
 | unmap_region
 | do_munmap
 | vm_munmap
 | sys_munmap
 | system_call_fastpath
 | __GI___munmap
 |
 |--46.13%-- __res_counter_charge
 |  res_counter_charge
 |  try_charge
 |  mem_cgroup_try_charge
 |  |
 |  |--99.89%-- do_cow_fault
 |  |  handle_mm_fault
 |  |  __do_page_fault
 |  |  do_page_fault
 |  |  page_fault
 |  |  testcase
 |   --0.11%-- [...]
 |
 |--1.14%-- do_cow_fault
 |  handle_mm_fault
 |  __do_page_fault
 |  do_page_fault
 |  page_fault
 |  testcase
  --8217937613.29%-- [...]



Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Johannes Weiner
Hi Dave,

On Tue, Sep 02, 2014 at 12:05:41PM -0700, Dave Hansen wrote:
 I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the
 memory cgroups code.  This is on a kernel with cgroups enabled at
 compile time, but not _used_ for anything.  See the green lines in the
 graph:
 
   https://www.sr71.net/~dave/intel/regression-from-05b843012.png
 
 The workload is a little parallel microbenchmark doing page faults:

Ouch.

  https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c
 
 The hardware is an 8-socket Westmere box with 160 hardware threads.  For
 some reason, this does not affect the version of the microbenchmark
 which is doing completely anonymous page faults.
 
 I bisected it down to this commit:
 
  commit 05b8430123359886ef6a4146fba384e30d771b3f
  Author: Johannes Weiner han...@cmpxchg.org
  Date:   Wed Aug 6 16:05:59 2014 -0700
  
  mm: memcontrol: use root_mem_cgroup res_counter
  
  Due to an old optimization to keep expensive res_counter changes at a
  minimum, the root_mem_cgroup res_counter is never charged; there is no
  limit at that level anyway, and any statistics can be generated on
  demand by summing up the counters of all other cgroups.
  
  However, with per-cpu charge caches, res_counter operations do not even
  show up in profiles anymore, so this optimization is no longer
  necessary.
  
  Remove it to simplify the code.

Accounting new pages is buffered through per-cpu caches, but taking
them off the counters on free is not, so I'm guessing that above a
certain allocation rate the cost of locking and changing the counters
takes over.  Is there a chance you could profile this to see if locks
and res_counter-related operations show up?
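
To illustrate what "buffered through per-cpu caches" means, here is a
rough sketch in the spirit of the memcontrol.c charge cache (the "toy_"
names are made up and the batch size is only indicative; this is not
the kernel's actual memcg_stock_pcp code):

#include <linux/percpu.h>
#include <linux/memcontrol.h>

#define TOY_CHARGE_BATCH 32U	/* batch size, roughly what memcontrol.c uses */

/* Illustrative per-cpu charge cache. */
struct toy_memcg_stock {
	struct mem_cgroup *cached;	/* memcg the cached charges belong to */
	unsigned int nr_pages;		/* pre-charged pages available locally */
};
static DEFINE_PER_CPU(struct toy_memcg_stock, toy_stock);

/* Try to charge one page from the local stock without taking any lock. */
static bool toy_consume_stock(struct mem_cgroup *memcg)
{
	struct toy_memcg_stock *stock = &get_cpu_var(toy_stock);
	bool ret = false;

	if (memcg == stock->cached && stock->nr_pages) {
		stock->nr_pages--;
		ret = true;
	}
	put_cpu_var(toy_stock);
	return ret;
}

On a miss, the charge path grabs a whole batch from the shared
res_counter in one go and parks the surplus in the stock, so the counter
lock is taken roughly once per batch instead of once per fault.  Nothing
equivalent happens on the uncharge side, which is what the patch below
adds by routing uncharges through refill_stock().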

I can't reproduce this complete breakdown on my smaller test gear, but
I do see an improvement with the following patch:

---
From 29a51326c24b7bb45d17e9c864c34506f10868f6 Mon Sep 17 00:00:00 2001
From: Johannes Weiner han...@cmpxchg.org
Date: Tue, 2 Sep 2014 18:11:39 -0400
Subject: [patch] mm: memcontrol: use per-cpu caches for uncharging

---
 mm/memcontrol.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec4dcf1b9562..cb79ecff399d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2365,6 +2365,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
res_counter_uncharge(&old->res, bytes);
if (do_swap_account)
res_counter_uncharge(&old->memsw, bytes);
+   memcg_oom_recover(old);
stock->nr_pages = 0;
}
stock->cached = NULL;
@@ -2405,6 +2406,13 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
stock->cached = memcg;
}
stock->nr_pages += nr_pages;
+   if (stock->nr_pages > CHARGE_BATCH * 4) {
+   res_counter_uncharge(&memcg->res, CHARGE_BATCH);
+   if (do_swap_account)
+   res_counter_uncharge(&memcg->memsw, CHARGE_BATCH);
+   memcg_oom_recover(memcg);
+   stock->nr_pages -= CHARGE_BATCH;
+   }
put_cpu_var(memcg_stock);
 }
 
@@ -6509,12 +6517,20 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 {
unsigned long flags;
 
-   if (nr_mem)
-   res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
-   if (nr_memsw)
-   res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE);
-
-   memcg_oom_recover(memcg);
+   /*
+* The percpu caches count both memory and memsw charges in a
+* single counter, but there might be fewer memsw charges when
+* some of the pages have been swapped out.
+*/
+   if (nr_mem == nr_memsw)
+   refill_stock(memcg, nr_mem);
+   else {
+   if (nr_mem)
+   res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
+   if (nr_memsw)
+   res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE);
+   memcg_oom_recover(memcg);
+   }
 
local_irq_save(flags);
__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon);
-- 
2.0.4



Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 03:18 PM, Johannes Weiner wrote:
 Accounting new pages is buffered through per-cpu caches, but taking
 them off the counters on free is not, so I'm guessing that above a
 certain allocation rate the cost of locking and changing the counters
 takes over.  Is there a chance you could profile this to see if locks
 and res_counter-related operations show up?

It looks pretty much the same, although it might have equalized the
charge and uncharge sides a bit.  Full 'perf top' output attached.
   PerfTop:  275580 irqs/sec  kernel:98.0%  exact:  0.0% [4000Hz cycles],  (all, 160 CPUs)
---

68.10%68.10%  [kernel]   [k] _raw_spin_lock   
  |
  --- _raw_spin_lock
 |  
 |--57.35%-- __res_counter_charge
 |  res_counter_charge
 |  try_charge
 |  mem_cgroup_try_charge
 |  |  
 |  |--99.93%-- do_cow_fault
 |  |  handle_mm_fault
 |  |  __do_page_fault
 |  |  do_page_fault
 |  |  page_fault
 |  |  testcase
 |   --0.07%-- [...]
 |  
 |--53.93%-- res_counter_uncharge_until
 |  res_counter_uncharge
 |  refill_stock
 |  uncharge_batch
 |  uncharge_list
 |  mem_cgroup_uncharge_list
 |  release_pages
 |  free_pages_and_swap_cache
 |  tlb_flush_mmu_free
 |  |  
 |  |--98.62%-- unmap_single_vma
 |  |  unmap_vmas
 |  |  unmap_region
 |  |  do_munmap
 |  |  vm_munmap
 |  |  sys_munmap
 |  |  system_call_fastpath
 |  |  __GI___munmap
 |  |  
 |   --1.38%-- tlb_flush_mmu
 | tlb_finish_mmu
 | unmap_region
 | do_munmap
 | vm_munmap
 | sys_munmap
 | system_call_fastpath
 | __GI___munmap
 |  
 |--2.18%-- do_cow_fault
 |  handle_mm_fault
 |  __do_page_fault
 |  do_page_fault
 |  page_fault
 |  testcase
  --9307219025.55%-- [...]

64.00% 1.34%  page_fault2_processes  [.] testcase 
  |
  --- testcase

62.64% 0.37%  [kernel]   [k] page_fault   
  |
  --- page_fault
 |  
 |--114.21%-- testcase
  --10118485450.89%-- [...]

62.27% 0.01%  [kernel]   [k] do_page_fault
  |
  --- do_page_fault
 |  
 |--114.28%-- page_fault
 |  |  
 |  |--99.93%-- testcase
 |   --0.07%-- [...]
  --10178525138.33%-- [...]

62.25% 0.07%  [kernel]   [k] __do_page_fault  
  |
  --- __do_page_fault
 |  
 |--114.27%-- do_page_fault
 |  page_fault
 |  |  
 |  |--99.93%-- testcase
 |   --0.07%-- [...]
  --10182230022.41%-- [...]

62.08% 0.26%  [kernel]   [k] handle_mm_fault  
  |
  --- handle_mm_fault
 |  
 |--114.28%-- __do_page_fault
 |  do_page_fault
 |  page_fault
 |  |  
 |  |--99.94%-- testcase
 |   --0.06%-- 

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Johannes Weiner
On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote:
 On 09/02/2014 03:18 PM, Johannes Weiner wrote:
  Accounting new pages is buffered through per-cpu caches, but taking
  them off the counters on free is not, so I'm guessing that above a
  certain allocation rate the cost of locking and changing the counters
  takes over.  Is there a chance you could profile this to see if locks
  and res_counter-related operations show up?
 
 It looks pretty much the same, although it might have equalized the
 charge and uncharge sides a bit.  Full 'perf top' output attached.

That looks like a partial profile, where did the page allocator, page
zeroing etc. go?  Because the distribution among these listed symbols
doesn't seem all that crazy:

 PerfTop:  275580 irqs/sec  kernel:98.0%  exact:  0.0% [4000Hz cycles],  (all, 160 CPUs)
 ---
 
 68.10%68.10%  [kernel]   [k] _raw_spin_lock   
   |
   --- _raw_spin_lock
  |  
  |--57.35%-- __res_counter_charge
  |  res_counter_charge
  |  try_charge
  |  mem_cgroup_try_charge
  |  |  
  |  |--99.93%-- do_cow_fault
  |  |  handle_mm_fault
  |  |  __do_page_fault
  |  |  do_page_fault
  |  |  page_fault
  |  |  testcase
  |   --0.07%-- [...]
  |  
  |--53.93%-- res_counter_uncharge_until
  |  res_counter_uncharge
  |  refill_stock
  |  uncharge_batch
  |  uncharge_list
  |  mem_cgroup_uncharge_list
  |  release_pages
  |  free_pages_and_swap_cache
  |  tlb_flush_mmu_free
  |  |  
  |  |--98.62%-- unmap_single_vma
  |  |  unmap_vmas
  |  |  unmap_region
  |  |  do_munmap
  |  |  vm_munmap
  |  |  sys_munmap
  |  |  system_call_fastpath
  |  |  __GI___munmap
  |  |  
  |   --1.38%-- tlb_flush_mmu
  | tlb_finish_mmu
  | unmap_region
  | do_munmap
  | vm_munmap
  | sys_munmap
  | system_call_fastpath
  | __GI___munmap
  |  
  |--2.18%-- do_cow_fault
  |  handle_mm_fault
  |  __do_page_fault
  |  do_page_fault
  |  page_fault
  |  testcase
   --9307219025.55%-- [...]
 
 64.00% 1.34%  page_fault2_processes  [.] testcase 
   |
   --- testcase
 
 62.64% 0.37%  [kernel]   [k] page_fault   
   |
   --- page_fault
  |  
  |--114.21%-- testcase
   --10118485450.89%-- [...]
 
 62.27% 0.01%  [kernel]   [k] do_page_fault
   |
   --- do_page_fault
  |  
  |--114.28%-- page_fault
  |  |  
  |  |--99.93%-- testcase
  |   --0.07%-- [...]
   --10178525138.33%-- [...]
 
 62.25% 0.07%  [kernel]   [k] __do_page_fault  
   |
   --- __do_page_fault
  |  
  |--114.27%-- do_page_fault
  |  page_fault
  |  |  
  |  |--99.93%-- testcase
  |   --0.07%-- [...]
   --10182230022.41%-- [...]
 
 62.08% 0.26%  [kernel]   [k] handle_mm_fault  
   |
  

Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Linus Torvalds
On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner han...@cmpxchg.org wrote:

 That looks like a partial profile, where did the page allocator, page
 zeroing etc. go?  Because the distribution among these listed symbols
 doesn't seem all that crazy:

Please argue this *after* the commit has been reverted. You guys can
try to make the memcontrol batching actually work and scale later.
It's not appropriate to argue against major regressions when reported
and bisected by users.

Showing the spinlock at the top of the profile is very much crazy
(apparently taking 68% of all cpu time), when it's all useless
make-believe work. I don't understand why you wouldn't call that
crazy.

  Linus


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 05:10 PM, Johannes Weiner wrote:
 On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote:
 On 09/02/2014 03:18 PM, Johannes Weiner wrote:
 Accounting new pages is buffered through per-cpu caches, but taking
 them off the counters on free is not, so I'm guessing that above a
 certain allocation rate the cost of locking and changing the counters
 takes over.  Is there a chance you could profile this to see if locks
 and res_counter-related operations show up?

 It looks pretty much the same, although it might have equalized the
 charge and uncharge sides a bit.  Full 'perf top' output attached.
 
 That looks like a partial profile, where did the page allocator, page
 zeroing etc. go?  Because the distribution among these listed symbols
 doesn't seem all that crazy:

Perf was only outputting the top 20 functions.  Believe it or not, page
zeroing and the rest of the allocator path didn't even make the top 20
functions because there is so much lock contention.

Here's a longer run of 'perf top' along with the top 100 functions:

http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz

you can at least see copy_page_rep in there.


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Johannes Weiner
On Tue, Sep 02, 2014 at 05:20:55PM -0700, Linus Torvalds wrote:
 On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner han...@cmpxchg.org wrote:
 
  That looks like a partial profile, where did the page allocator, page
  zeroing etc. go?  Because the distribution among these listed symbols
  doesn't seem all that crazy:
 
 Please argue this *after* the commit has been reverted. You guys can
 try to make the memcontrol batching actually work and scale later.
 It's not appropriate to argue against major regressions when reported
 and bisected by users.

I'll send a clean revert later.

 Showing the spinlock at the top of the profile is very much crazy
 (apparently taking 68% of all cpu time), when it's all useless
 make-believe work. I don't understand why you wouldn't call that
 crazy.

If you limit perf to a subset of symbols, it will show the relative
distribution between only those symbols, e.g. perf top --symbols
kfree,memset during some disk access:

   PerfTop:1292 irqs/sec  kernel:84.4%  exact:  0.0% [4000Hz cycles],  (all, 4 CPUs)
---

56.23%  [kernel]  [k] kfree 
41.86%  [kernel]  [k] memset
 1.91%  libc-2.19.so  [.] memset

kfree isn't eating 56% of all cpu time here, and it wasn't clear to
me whether Dave filtered symbols from only memcontrol.o, memory.o, and
mmap.o in a similar way.  I'm not arguing against the regression, I'm
just trying to make sense of the numbers from the *patched* kernel.


Re: regression caused by cgroups optimization in 3.17-rc2

2014-09-02 Thread Dave Hansen
On 09/02/2014 06:33 PM, Johannes Weiner wrote:
 kfree isn't eating 56% of all cpu time here, and it wasn't clear to
 me whether Dave filtered symbols from only memcontrol.o, memory.o, and
 mmap.o in a similar way.  I'm not arguing against the regression, I'm
 just trying to make sense of the numbers from the *patched* kernel.

I guess I could have included it in the description, but that was a
pretty vanilla run:

perf top --call-graph=fp --stdio > foo.txt

I didn't use any filtering.  I didn't even know I _could_ filter. :)
