Re: regression caused by cgroups optimization in 3.17-rc2
On Wed 10-09-14 09:57:56, Dave Hansen wrote:
> On 09/10/2014 09:29 AM, Michal Hocko wrote:
> > I do not have a bigger machine to play with unfortunately. I think the
> > patch makes sense on its own. I would really appreciate it if you could
> > give it a try on your machine with the !root memcg case to see how much
> > it helped. I would expect similar results to your previous testing
> > without the revert and Johannes' patch.
>
> So you want to see before/after this patch:
>
>   Subject: [PATCH] mm, memcg: Do not kill release batching in
>    free_pages_and_swap_cache
>
> And you want it on top of a kernel with the revert or without?

The revert doesn't make any difference if you run the load inside a memcg
(without any limit set), so just before and after the patch would be
sufficient. Thanks a lot, Dave!
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/10/2014 09:29 AM, Michal Hocko wrote:
> I do not have a bigger machine to play with unfortunately. I think the
> patch makes sense on its own. I would really appreciate it if you could
> give it a try on your machine with the !root memcg case to see how much
> it helped. I would expect similar results to your previous testing
> without the revert and Johannes' patch.

So you want to see before/after this patch:

  Subject: [PATCH] mm, memcg: Do not kill release batching in
   free_pages_and_swap_cache

And you want it on top of a kernel with the revert or without?
Re: regression caused by cgroups optimization in 3.17-rc2
On Fri 05-09-14 11:25:37, Michal Hocko wrote:
> On Thu 04-09-14 13:27:26, Dave Hansen wrote:
> > On 09/04/2014 07:27 AM, Michal Hocko wrote:
> > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> > > because it reduces it to PAGEVEC_SIZE batches.
> > >
> > > I think we really do not need PAGEVEC_SIZE batching anymore. We are
> > > already batching on the tlb_gather layer. That one is limited, so I
> > > think the below should be safe, but I have to think about this some
> > > more. There is a risk of prolonged lru_lock wait times, but the number
> > > of pages is limited to 10k and the heavy work is done outside of the
> > > lock. If this is really a problem then we can tear the LRU part and
> > > the actual freeing/uncharging into separate functions in this path.
> > >
> > > Could you test with this half-baked patch, please? I didn't get to
> > > test it myself unfortunately.
> >
> > 3.16 settled out at about 11.5M faults/sec before the regression. This
> > patch gets it back up to about 10.5M, which is good.
>
> Dave, would you be willing to test the following patch as well? I do not
> have a huge machine at hand right now. It would be great if you could

I was playing with a 48-CPU machine with 32G of RAM, but the res_counter
lock didn't show up in the traces much (this was with 96 processes doing
mmap (256M private file), fault, unmap in parallel):

|--0.75%-- __res_counter_charge
|          res_counter_charge
|          try_charge
|          mem_cgroup_try_charge
|          |
|          |--81.56%-- do_cow_fault
|          |          handle_mm_fault
|          |          __do_page_fault
|          |          do_page_fault
|          |          page_fault
[...]
|          |
|           --18.44%-- __add_to_page_cache_locked
|                      add_to_page_cache_lru
|                      mpage_readpages
|                      ext4_readpages
|                      __do_page_cache_readahead
|                      ondemand_readahead
|                      page_cache_async_readahead
|                      filemap_fault
|                      __do_fault
|                      do_cow_fault
|                      handle_mm_fault
|                      __do_page_fault
|                      do_page_fault
|                      page_fault

Nothing really changed in that regard when I reduced the mmap size to
128M and ran with 4*CPUs processes.
I do not have a bigger machine to play with unfortunately. I think the
patch makes sense on its own. I would really appreciate it if you could
give it a try on your machine with the !root memcg case to see how much
it helped. I would expect similar results to your previous testing
without the revert and Johannes' patch.
[...]
--
Michal Hocko
SUSE Labs
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/09/2014 07:50 AM, Johannes Weiner wrote:
> The mctz->lock is only taken when there is, or has been, soft limit
> excess. However, the soft limit defaults to infinity, so unless you
> set it explicitly on the root level, I can't see how this could be
> mctz->lock contention.
>
> It's more plausible that this is the res_counter lock for testing soft
> limit excess - for me, both these locks get inlined into check_events,
> could you please double check you got the right lock?

I got the wrong lock. Here's how it looks on mainline plus your
free_pages_and_swap_cache() patch:

Samples: 2M of event 'cycles', Event count (approx.): 51647128377
+  60.60%  1.33%  page_fault2_processes  [.] testcase
+  59.14%  0.41%  [kernel]               [k] page_fault
+  58.72%  0.01%  [kernel]               [k] do_page_fault
+  58.70%  0.08%  [kernel]               [k] __do_page_fault
+  58.50%  0.29%  [kernel]               [k] handle_mm_fault
+  40.14%  0.28%  [kernel]               [k] do_cow_fault
-  34.56% 34.56%  [kernel]               [k] _raw_spin_lock
   - _raw_spin_lock
      - 78.11% __res_counter_charge
           res_counter_charge
           try_charge
         - mem_cgroup_try_charge
            + 99.99% do_cow_fault
      - 10.30% res_counter_uncharge_until
           res_counter_uncharge
           uncharge_batch
           uncharge_list
           mem_cgroup_uncharge_list
           release_pages
      + 4.75% free_pcppages_bulk
      + 3.65% do_cow_fault
      + 2.24% get_page_from_freelist

> You also said that this cost hasn't been there before, but I do see
> that trace in both v3.16 and v3.17-rc3 with roughly the same impact
> (although my machines show less contention than yours). Could you
> please double check that this is in fact a regression independent of
> 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")?
Here's the same workload on the same machine with only Johannes' revert
applied:

-  35.92% 35.92%  [kernel]  [k] _raw_spin_lock
   - _raw_spin_lock
      - 49.09% get_page_from_freelist
         - __alloc_pages_nodemask
            + 99.90% alloc_pages_vma
      - 43.67% free_pcppages_bulk
         - 100.00% free_hot_cold_page
            + 99.93% free_hot_cold_page_list
      - 7.08% do_cow_fault
           handle_mm_fault
           __do_page_fault
           do_page_fault
           page_fault
           testcase

So I think it's probably part of the same regression.
Re: regression caused by cgroups optimization in 3.17-rc2
On Mon, Sep 08, 2014 at 08:47:37AM -0700, Dave Hansen wrote:
> On 09/05/2014 05:35 AM, Johannes Weiner wrote:
> > On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote:
> >> On 09/04/2014 07:27 AM, Michal Hocko wrote:
> >>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> >>> because it reduces it to PAGEVEC_SIZE batches.
> >>>
> >>> I think we really do not need PAGEVEC_SIZE batching anymore. We are
> >>> already batching on the tlb_gather layer. That one is limited, so I
> >>> think the below should be safe, but I have to think about this some
> >>> more. There is a risk of prolonged lru_lock wait times, but the number
> >>> of pages is limited to 10k and the heavy work is done outside of the
> >>> lock. If this is really a problem then we can tear the LRU part and
> >>> the actual freeing/uncharging into separate functions in this path.
> >>>
> >>> Could you test with this half-baked patch, please? I didn't get to
> >>> test it myself unfortunately.
> >>
> >> 3.16 settled out at about 11.5M faults/sec before the regression. This
> >> patch gets it back up to about 10.5M, which is good. The top spinlock
> >> contention in the kernel is still from the resource counter code via
> >> mem_cgroup_commit_charge(), though.
> >
> > Thanks for testing, that looks a lot better.
> >
> > But commit doesn't touch resource counters - did you mean try_charge()
> > or uncharge() by any chance?
>
> I don't have the perf output that I was looking at when I said this, but
> here's the path that I think I was referring to. The inlining makes
> this non-obvious, but this memcg_check_events() calls
> mem_cgroup_update_tree(), which is contending on mctz->lock.
>
> So, you were right: it's not the resource counter code, it's a lock in
> 'struct mem_cgroup_tree_per_zone'. But the contention isn't _that_
> high (2% of CPU) in this case - though it is 2% that we didn't see
> before.
> 1.87%  1.87%  [kernel]  [k] _raw_spin_lock_irqsave
> |
> --- _raw_spin_lock_irqsave
>    |
>    |--107.09%-- memcg_check_events
>    |          |
>    |          |--79.98%-- mem_cgroup_commit_charge
>    |          |          |
>    |          |          |--99.81%-- do_cow_fault
>    |          |          |          handle_mm_fault
>    |          |          |          __do_page_fault
>    |          |          |          do_page_fault
>    |          |          |          page_fault
>    |          |          |          testcase
>    |          |           --0.19%-- [...]

The mctz->lock is only taken when there is, or has been, soft limit
excess. However, the soft limit defaults to infinity, so unless you
set it explicitly on the root level, I can't see how this could be
mctz->lock contention.

It's more plausible that this is the res_counter lock for testing soft
limit excess - for me, both these locks get inlined into check_events.
Could you please double-check you got the right lock?

As the limit defaults to infinity and really doesn't mean anything on
the root level, it's idiotic to test it there, and we can easily
eliminate that. With the patch below, I don't have that trace show up
in the profile anymore. Could you please give it a try?

You also said that this cost hasn't been there before, but I do see
that trace in both v3.16 and v3.17-rc3 with roughly the same impact
(although my machines show less contention than yours). Could you
please double-check that this is in fact a regression independent of
05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")?

Thanks!

---
From 465c5caa0628d640c2493e9d849dc9a1f0b373a4 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Tue, 9 Sep 2014 09:25:20 -0400
Subject: [patch] mm: memcontrol: do not track soft limit excess on the root
 level

Dave encounters res_counter lock contention from memcg_check_events()
when running a multi-threaded page fault benchmark in the root group.
This lock is taken to maintain the tree of soft limit excessors, which
is used by global reclaim to prioritize excess groups.
But that makes no sense on the root level - the parent to all other
groups - so all this overhead is unnecessary. Skip it.

[ The soft limit really shouldn't even be settable on the root level,
  but it's been like that forever, so don't risk breaking dopy user
  space over this now. ]

Reported-by: Dave Hansen
Signed-off-by: Johannes Weiner
---
 mm/memcontrol.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 085dc6d2f876..b4de17e4f267 100644
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/05/2014 05:35 AM, Johannes Weiner wrote:
> On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote:
>> On 09/04/2014 07:27 AM, Michal Hocko wrote:
>>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
>>> because it reduces it to PAGEVEC_SIZE batches.
>>> [...]
>>> Could you test with this half-baked patch, please? I didn't get to
>>> test it myself unfortunately.
>>
>> 3.16 settled out at about 11.5M faults/sec before the regression. This
>> patch gets it back up to about 10.5M, which is good. The top spinlock
>> contention in the kernel is still from the resource counter code via
>> mem_cgroup_commit_charge(), though.
>
> Thanks for testing, that looks a lot better.
>
> But commit doesn't touch resource counters - did you mean try_charge()
> or uncharge() by any chance?

I don't have the perf output that I was looking at when I said this, but
here's the path that I think I was referring to. The inlining makes this
non-obvious, but this memcg_check_events() calls
mem_cgroup_update_tree(), which is contending on mctz->lock.

So, you were right: it's not the resource counter code, it's a lock in
'struct mem_cgroup_tree_per_zone'. But the contention isn't _that_ high
(2% of CPU) in this case - though it is 2% that we didn't see before.
> 1.87%  1.87%  [kernel]  [k] _raw_spin_lock_irqsave
> |
> --- _raw_spin_lock_irqsave
>    |
>    |--107.09%-- memcg_check_events
>    |          |
>    |          |--79.98%-- mem_cgroup_commit_charge
>    |          |          |
>    |          |          |--99.81%-- do_cow_fault
>    |          |          |          handle_mm_fault
>    |          |          |          __do_page_fault
>    |          |          |          do_page_fault
>    |          |          |          page_fault
>    |          |          |          testcase
>    |          |           --0.19%-- [...]
Re: regression caused by cgroups optimization in 3.17-rc2
On Fri 05-09-14 10:47:23, Johannes Weiner wrote:
> On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote:
> > @@ -900,10 +900,10 @@ void lru_add_drain_all(void)
> >   * grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
> >   * will free it.
> >   */
> > -void release_pages(struct page **pages, int nr, bool cold)
> > +static void release_lru_pages(struct page **pages, int nr,
> > +			      struct list_head *pages_to_free)
> >  {
> >  	int i;
> > -	LIST_HEAD(pages_to_free);
> >  	struct zone *zone = NULL;
> >  	struct lruvec *lruvec;
> >  	unsigned long uninitialized_var(flags);
> > @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold)
> >  		/* Clear Active bit in case of parallel mark_page_accessed */
> >  		__ClearPageActive(page);
> >  
> > -		list_add(&page->lru, &pages_to_free);
> > +		list_add(&page->lru, pages_to_free);
> >  	}
> >  	if (zone)
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +}
> > +
> > +/*
> > + * Batched page_cache_release(). Frees and uncharges all given pages
> > + * for which the reference count drops to 0.
> > + */
> > +void release_pages(struct page **pages, int nr, bool cold)
> > +{
> > +	LIST_HEAD(pages_to_free);
> >
> > +	while (nr) {
> > +		int batch = min(nr, PAGEVEC_SIZE);
> > +
> > +		release_lru_pages(pages, batch, &pages_to_free);
> > +		pages += batch;
> > +		nr -= batch;
> > +	}
>
> We might be able to process a lot more pages in one go if nobody else
> needs the lock or the CPU. Can't we just cycle the lock or reschedule
> if necessary?

Is it safe to cond_resched here for all callers? I hope it is, but there
are way too many callers to check, so I am not 100% sure. Besides that,
spin_needbreak doesn't seem to be available for all architectures:
$ git grep "arch_spin_is_contended(" -- arch/
arch/arm/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/arm64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/ia64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/mips/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)
arch/x86/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock)

Moreover, it doesn't seem to do anything for !CONFIG_PREEMPT, though
that should be trivial to fix.

I am also not sure this will work well in all cases. If we have heavy
reclaim activity on other CPUs, then this path might be interrupted too
often, resulting in too much lock bouncing, so I guess we want at least
a few pages to be processed in one run. On the other hand, if the lock
is not contended, then doing batches and retaking the lock shouldn't add
too much overhead, no?

> diff --git a/mm/swap.c b/mm/swap.c
> index 6b2dc3897cd5..ee0cf21dd521 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold)
>  		__ClearPageActive(page);
>  
>  		list_add(&page->lru, &pages_to_free);
> +
> +		if (should_resched() ||
> +		    (zone && spin_needbreak(&zone->lru_lock))) {
> +			if (zone) {
> +				spin_unlock_irqrestore(&zone->lru_lock, flags);
> +				zone = NULL;
> +			}
> +			cond_resched();
> +		}
>  	}
>  	if (zone)
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3e0ec83d000c..c487ca4682a4 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page)
>   */
>  void free_pages_and_swap_cache(struct page **pages, int nr)
>  {
> -	struct page **pagep = pages;
> +	int i;
>  
>  	lru_add_drain();
> -	while (nr) {
> -		int todo = min(nr, PAGEVEC_SIZE);
> -		int i;
> -
> -		for (i = 0; i < todo; i++)
> -			free_swap_cache(pagep[i]);
> -
> -		release_pages(pagep, todo, false);
> -		pagep += todo;
> -		nr -= todo;
> -	}
> +	for (i = 0; i < nr; i++)
> +		free_swap_cache(pages[i]);
> +	release_pages(pages, nr, false);
>  }
>  
>  /*
--
Michal Hocko
SUSE Labs
Re: regression caused by cgroups optimization in 3.17-rc2
On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote:
> @@ -900,10 +900,10 @@ void lru_add_drain_all(void)
>   * grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
>   * will free it.
>   */
> -void release_pages(struct page **pages, int nr, bool cold)
> +static void release_lru_pages(struct page **pages, int nr,
> +			      struct list_head *pages_to_free)
>  {
>  	int i;
> -	LIST_HEAD(pages_to_free);
>  	struct zone *zone = NULL;
>  	struct lruvec *lruvec;
>  	unsigned long uninitialized_var(flags);
> @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold)
>  		/* Clear Active bit in case of parallel mark_page_accessed */
>  		__ClearPageActive(page);
>  
> -		list_add(&page->lru, &pages_to_free);
> +		list_add(&page->lru, pages_to_free);
>  	}
>  	if (zone)
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +}
> +
> +/*
> + * Batched page_cache_release(). Frees and uncharges all given pages
> + * for which the reference count drops to 0.
> + */
> +void release_pages(struct page **pages, int nr, bool cold)
> +{
> +	LIST_HEAD(pages_to_free);
>
> +	while (nr) {
> +		int batch = min(nr, PAGEVEC_SIZE);
> +
> +		release_lru_pages(pages, batch, &pages_to_free);
> +		pages += batch;
> +		nr -= batch;
> +	}

We might be able to process a lot more pages in one go if nobody else
needs the lock or the CPU. Can't we just cycle the lock or reschedule
if necessary?
diff --git a/mm/swap.c b/mm/swap.c index 6b2dc3897cd5..ee0cf21dd521 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold) __ClearPageActive(page); list_add(>lru, _to_free); + + if (should_resched() || + (zone && spin_needbreak(>lru_lock))) { + if (zone) { + spin_unlock_irqrestore(>lru_lock, flags); + zone = NULL; + } + cond_resched(); + } } if (zone) spin_unlock_irqrestore(>lru_lock, flags); diff --git a/mm/swap_state.c b/mm/swap_state.c index 3e0ec83d000c..c487ca4682a4 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page) */ void free_pages_and_swap_cache(struct page **pages, int nr) { - struct page **pagep = pages; + int i; lru_add_drain(); - while (nr) { - int todo = min(nr, PAGEVEC_SIZE); - int i; - - for (i = 0; i < todo; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, todo, false); - pagep += todo; - nr -= todo; - } + for (i = 0; i < nr; i++) + free_swap_cache(pages[i]); + release_pages(pages, nr, false); } /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression caused by cgroups optimization in 3.17-rc2
On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote:
> On 09/04/2014 07:27 AM, Michal Hocko wrote:
> > Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> > because it reduces it to PAGEVEC_SIZE batches.
> >
> > I think we really do not need PAGEVEC_SIZE batching anymore. We are
> > already batching on tlb_gather layer. That one is limited so I think
> > the below should be safe but I have to think about this some more. There
> > is a risk of prolonged lru_lock wait times but the number of pages is
> > limited to 10k and the heavy work is done outside of the lock. If this
> > is really a problem then we can tear LRU part and the actual
> > freeing/uncharging into a separate functions in this path.
> >
> > Could you test with this half baked patch, please? I didn't get to test
> > it myself unfortunately.
>
> 3.16 settled out at about 11.5M faults/sec before the regression. This
> patch gets it back up to about 10.5M, which is good. The top spinlock
> contention in the kernel is still from the resource counter code via
> mem_cgroup_commit_charge(), though.

Thanks for testing, that looks a lot better. But commit doesn't touch
resource counters - did you mean try_charge() or uncharge() by any
chance?
Re: regression caused by cgroups optimization in 3.17-rc2
On Thu 04-09-14 15:53:46, Dave Hansen wrote:
> On 09/04/2014 01:27 PM, Dave Hansen wrote:
> > On 09/04/2014 07:27 AM, Michal Hocko wrote:
> >> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> >> because it reduces it to PAGEVEC_SIZE batches.
> >>
> >> I think we really do not need PAGEVEC_SIZE batching anymore. We are
> >> already batching on tlb_gather layer. That one is limited so I think
> >> the below should be safe but I have to think about this some more. There
> >> is a risk of prolonged lru_lock wait times but the number of pages is
> >> limited to 10k and the heavy work is done outside of the lock. If this
> >> is really a problem then we can tear LRU part and the actual
> >> freeing/uncharging into a separate functions in this path.
> >>
> >> Could you test with this half baked patch, please? I didn't get to test
> >> it myself unfortunately.
> >
> > 3.16 settled out at about 11.5M faults/sec before the regression. This
> > patch gets it back up to about 10.5M, which is good. The top spinlock
> > contention in the kernel is still from the resource counter code via
> > mem_cgroup_commit_charge(), though.
> >
> > I'm running Johannes' patch now.
>
> This looks pretty good. The area where it plateaus (above 80 threads
> where hyperthreading kicks in) might be a bit slower than it was in
> 3.16, but that could easily be from other things.

Good news indeed. But I think it would be safer to apply Johannes'
revert for now. Both changes are still worth having anyway because they
have potential to improve memcg case.

> > https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f
>
> Feel free to add my Tested-by:

Thanks a lot! I have posted another patch which reduces the batching for
LRU handling because this would be too risky. So I haven't added your
Tested-by yet.

-- 
Michal Hocko
SUSE Labs
Re: regression caused by cgroups optimization in 3.17-rc2
On Thu 04-09-14 13:27:26, Dave Hansen wrote:
> On 09/04/2014 07:27 AM, Michal Hocko wrote:
> > Ouch. free_pages_and_swap_cache completely kills the uncharge batching
> > because it reduces it to PAGEVEC_SIZE batches.
> >
> > I think we really do not need PAGEVEC_SIZE batching anymore. We are
> > already batching on tlb_gather layer. That one is limited so I think
> > the below should be safe but I have to think about this some more. There
> > is a risk of prolonged lru_lock wait times but the number of pages is
> > limited to 10k and the heavy work is done outside of the lock. If this
> > is really a problem then we can tear LRU part and the actual
> > freeing/uncharging into a separate functions in this path.
> >
> > Could you test with this half baked patch, please? I didn't get to test
> > it myself unfortunately.
>
> 3.16 settled out at about 11.5M faults/sec before the regression. This
> patch gets it back up to about 10.5M, which is good.

Dave, would you be willing to test the following patch as well? I do not
have a huge machine at hand right now. It would be great if you could
run the same load within a !root memcg. We have basically the same
sub-optimality there as well. The root bypass will be re-introduced for
now, but I think we can make the regular memcg load better regardless,
and this would also be preparation for a later root bypass removal.
---
From 17c4c8710f3ec37fe04866bd18dbc68a0f47b560 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 5 Sep 2014 11:16:17 +0200
Subject: [PATCH] mm, memcg: Do not kill release batching in
 free_pages_and_swap_cache

free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention.

Dave has noticed this with his page fault scalability test case on a
large machine when the lock was basically dominating on all CPUs:

    80.18%    80.18%  [kernel]  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--66.59%-- res_counter_uncharge_until
               |          res_counter_uncharge
               |          uncharge_batch
               |          uncharge_list
               |          mem_cgroup_uncharge_list
               |          release_pages
               |          free_pages_and_swap_cache
               |          tlb_flush_mmu_free
               |          |
               |          |--90.12%-- unmap_single_vma
               |          |          unmap_vmas
               |          |          unmap_region
               |          |          do_munmap
               |          |          vm_munmap
               |          |          sys_munmap
               |          |          system_call_fastpath
               |          |          __GI___munmap
               |          |
               |           --9.88%-- tlb_flush_mmu
               |                     tlb_finish_mmu
               |                     unmap_region
               |                     do_munmap
               |                     vm_munmap
               |                     sys_munmap
               |                     system_call_fastpath
               |                     __GI___munmap

In his case the load was running in the root memcg and that part has
been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression. But
it makes sense to help loads which are running within a memcg as well.
So this is basically a performance optimization.

There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.

Page reference count and LRU handling is moved to release_lru_pages and
that is run in PAGEVEC_SIZE batches.

Reported-by: Dave Hansen
Signed-off-by: Michal Hocko
---
 mm/swap.c       | 27 +--
 mm/swap_state.c | 14 --
 2 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..8af99dd68dd2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -888,9 +888,9 @@ void lru_add_drain_all(void)
 }
 
 /*
- * Batched page_cache_release().  Decrement the reference count on all the
- * passed pages.  If it fell to zero then remove the page from the LRU and
- * free it.
+ * Batched lru release. Decrement the reference count on all the passed pages.
+ * If it fell to zero then remove the page from the LRU and add it to the given
+ * list to be freed by the caller.
  *
  * Avoid taking zone->lru_lock if
Re: regression caused by cgroups optimization in 3.17-rc2
On Thu 04-09-14 11:08:46, Johannes Weiner wrote:
[...]
> From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner
> Date: Thu, 4 Sep 2014 10:04:34 -0400
> Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter
>
> Dave Hansen reports a massive scalability regression in an uncontained
> page fault benchmark with more than 30 concurrent threads, which he
> bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup
> res_counter") and pin-pointed on res_counter spinlock contention.
>
> That change relied on the per-cpu charge caches to mostly swallow the
> res_counter costs, but it's apparent that the caches don't scale yet.
>
> Revert memcg back to bypassing res_counters on the root level in order
> to restore performance for uncontained workloads.
>
> Reported-by: Dave Hansen
> Signed-off-by: Johannes Weiner

The revert looks good to me.

Acked-by: Michal Hocko

> ---
>  mm/memcontrol.c | 103 ++--
>  1 file changed, 78 insertions(+), 25 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ec4dcf1b9562..085dc6d2f876 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned long long size;
>  	int ret = 0;
>  
> +	if (mem_cgroup_is_root(memcg))
> +		goto done;
>  retry:
>  	if (consume_stock(memcg, nr_pages))
>  		goto done;
> @@ -2611,9 +2613,7 @@ nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
>  bypass:
> -	memcg = root_mem_cgroup;
> -	ret = -EINTR;
> -	goto retry;
> +	return -EINTR;
>  
>  done_restock:
>  	if (batch > nr_pages)
> @@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>  	unsigned long bytes = nr_pages * PAGE_SIZE;
>  
> +	if (mem_cgroup_is_root(memcg))
> +		return;
> +
>  	res_counter_uncharge(&memcg->res, bytes);
>  	if (do_swap_account)
>  		res_counter_uncharge(&memcg->memsw, bytes);
> @@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
>  {
>  	unsigned long bytes = nr_pages * PAGE_SIZE;
>  
> +	if (mem_cgroup_is_root(memcg))
> +		return;
> +
>  	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
>  	if (do_swap_account)
>  		res_counter_uncharge_until(&memcg->memsw,
> @@ -4093,6 +4099,46 @@ out:
>  	return retval;
>  }
>  
> +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
> +					       enum mem_cgroup_stat_index idx)
> +{
> +	struct mem_cgroup *iter;
> +	long val = 0;
> +
> +	/* Per-cpu values can be negative, use a signed accumulator */
> +	for_each_mem_cgroup_tree(iter, memcg)
> +		val += mem_cgroup_read_stat(iter, idx);
> +
> +	if (val < 0) /* race ? */
> +		val = 0;
> +	return val;
> +}
> +
> +static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
> +{
> +	u64 val;
> +
> +	if (!mem_cgroup_is_root(memcg)) {
> +		if (!swap)
> +			return res_counter_read_u64(&memcg->res, RES_USAGE);
> +		else
> +			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +	}
> +
> +	/*
> +	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
> +	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
> +	 */
> +	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
> +	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
> +
> +	if (swap)
> +		val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
> +
> +	return val << PAGE_SHIFT;
> +}
> +
> +
>  static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  			       struct cftype *cft)
>  {
> @@ -4102,8 +4148,12 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  
>  	switch (type) {
>  	case _MEM:
> +		if (name == RES_USAGE)
> +			return mem_cgroup_usage(memcg, false);
>  		return res_counter_read_u64(&memcg->res, name);
>  	case _MEMSWAP:
> +		if (name == RES_USAGE)
> +			return mem_cgroup_usage(memcg, true);
>  		return res_counter_read_u64(&memcg->memsw, name);
>  	case _KMEM:
>  		return res_counter_read_u64(&memcg->kmem, name);
> @@ -4572,10 +4622,7 @@ static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>  	if (!t)
>  		goto unlock;
>  
> -	if (!swap)
> -		usage = res_counter_read_u64(&memcg->res, RES_USAGE);
> -	else
> -		usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +	usage = mem_cgroup_usage(memcg, swap);
>  
>  	/*
>  	 * current_threshold points to threshold just below or equal to usage.
> @@
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/04/2014 01:27 PM, Dave Hansen wrote:
> On 09/04/2014 07:27 AM, Michal Hocko wrote:
>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching
>> because it reduces it to PAGEVEC_SIZE batches.
>>
>> I think we really do not need PAGEVEC_SIZE batching anymore. We are
>> already batching on tlb_gather layer. That one is limited so I think
>> the below should be safe but I have to think about this some more. There
>> is a risk of prolonged lru_lock wait times but the number of pages is
>> limited to 10k and the heavy work is done outside of the lock. If this
>> is really a problem then we can tear LRU part and the actual
>> freeing/uncharging into a separate functions in this path.
>>
>> Could you test with this half baked patch, please? I didn't get to test
>> it myself unfortunately.
>
> 3.16 settled out at about 11.5M faults/sec before the regression. This
> patch gets it back up to about 10.5M, which is good. The top spinlock
> contention in the kernel is still from the resource counter code via
> mem_cgroup_commit_charge(), though.
>
> I'm running Johannes' patch now.

This looks pretty good. The area where it plateaus (above 80 threads
where hyperthreading kicks in) might be a bit slower than it was in
3.16, but that could easily be from other things.

> https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f

Feel free to add my Tested-by:
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/04/2014 08:08 AM, Johannes Weiner wrote: > Dave Hansen reports a massive scalability regression in an uncontained > page fault benchmark with more than 30 concurrent threads, which he > bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup > res_counter") and pin-pointed on res_counter spinlock contention. > > That change relied on the per-cpu charge caches to mostly swallow the > res_counter costs, but it's apparent that the caches don't scale yet. > > Revert memcg back to bypassing res_counters on the root level in order > to restore performance for uncontained workloads. A quick sniff test shows performance coming back to what it was around 3.16 with this patch. I'll run a more thorough set of tests and verify that it's working well.
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/04/2014 07:27 AM, Michal Hocko wrote: > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > because it reduces it to PAGEVEC_SIZE batches. > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > already batching on tlb_gather layer. That one is limited so I think > the below should be safe but I have to think about this some more. There > is a risk of prolonged lru_lock wait times but the number of pages is > limited to 10k and the heavy work is done outside of the lock. If this > is really a problem then we can tear the LRU part and the actual > freeing/uncharging into separate functions in this path. > > Could you test with this half-baked patch, please? I didn't get to test > it myself unfortunately. 3.16 settled out at about 11.5M faults/sec before the regression. This patch gets it back up to about 10.5M, which is good. The top spinlock contention in the kernel is still from the resource counter code via mem_cgroup_commit_charge(), though. I'm running Johannes' patch now.
Re: regression caused by cgroups optimization in 3.17-rc2
On Tue, Sep 02, 2014 at 05:30:38PM -0700, Dave Hansen wrote: > On 09/02/2014 05:10 PM, Johannes Weiner wrote: > > On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: > >> On 09/02/2014 03:18 PM, Johannes Weiner wrote: > >>> Accounting new pages is buffered through per-cpu caches, but taking > >>> them off the counters on free is not, so I'm guessing that above a > >>> certain allocation rate the cost of locking and changing the counters > >>> takes over. Is there a chance you could profile this to see if locks > >>> and res_counter-related operations show up? > >> > >> It looks pretty much the same, although it might have equalized the > >> charge and uncharge sides a bit. Full 'perf top' output attached. > > > > That looks like a partial profile, where did the page allocator, page > > zeroing etc. go? Because the distribution among these listed symbols > > doesn't seem all that crazy: > > Perf was only outputting the top 20 functions. Believe it or not, page > zeroing and the rest of the allocator path wasn't even in the path of > the top 20 functions because there is so much lock contention. > > Here's a longer run of 'perf top' along with the top 100 functions: > > http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz > > you can at least see copy_page_rep in there. Thanks for the clarification, that is truly horrible. Does the following revert restore performance in your case? --- >From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 4 Sep 2014 10:04:34 -0400 Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter Dave Hansen reports a massive scalability regression in an uncontained page fault benchmark with more than 30 concurrent threads, which he bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter") and pin-pointed on res_counter spinlock contention. 
That change relied on the per-cpu charge caches to mostly swallow the res_counter costs, but it's apparent that the caches don't scale yet.

Revert memcg back to bypassing res_counters on the root level in order to restore performance for uncontained workloads.

Reported-by: Dave Hansen
Signed-off-by: Johannes Weiner
---
 mm/memcontrol.c | 103 ++--
 1 file changed, 78 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec4dcf1b9562..085dc6d2f876 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long long size;
 	int ret = 0;
 
+	if (mem_cgroup_is_root(memcg))
+		goto done;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2611,9 +2613,7 @@ nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
 bypass:
-	memcg = root_mem_cgroup;
-	ret = -EINTR;
-	goto retry;
+	return -EINTR;
 
 done_restock:
 	if (batch > nr_pages)
@@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	res_counter_uncharge(&memcg->res, bytes);
 	if (do_swap_account)
 		res_counter_uncharge(&memcg->memsw, bytes);
@@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
 	if (do_swap_account)
 		res_counter_uncharge_until(&memcg->memsw,
@@ -4093,6 +4099,46 @@ out:
 	return retval;
 }
 
+static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
+					       enum mem_cgroup_stat_index idx)
+{
+	struct mem_cgroup *iter;
+	long val = 0;
+
+	/* Per-cpu values can be negative, use a signed accumulator */
+	for_each_mem_cgroup_tree(iter, memcg)
+		val += mem_cgroup_read_stat(iter, idx);
+
+	if (val < 0) /* race ? */
+		val = 0;
+	return val;
+}
+
+static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
+{
+	u64 val;
+
+	if (!mem_cgroup_is_root(memcg)) {
+		if (!swap)
+			return res_counter_read_u64(&memcg->res, RES_USAGE);
+		else
+			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+	}
+
+	/*
+	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
+	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
+	 */
+	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
+	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
+
+	if (swap)
+		val +=
Re: regression caused by cgroups optimization in 3.17-rc2
[Sorry to reply so late] On Tue 02-09-14 13:57:22, Dave Hansen wrote: > I, of course, forgot to include the most important detail. This appears > to be pretty run-of-the-mill spinlock contention in the resource counter > code. Nearly 80% of the CPU is spent spinning in the charge or uncharge > paths in the kernel. It is apparently spinning on res_counter->lock in > both the charge and uncharge paths. > > It already does _some_ batching here on the free side, but that > apparently breaks down after ~40 threads. > > Here's the first entry from perf top: > > 80.18%  80.18%  [kernel]  [k] _raw_spin_lock > | > --- _raw_spin_lock > | > |--66.59%-- res_counter_uncharge_until > | res_counter_uncharge > | uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache Ouch. free_pages_and_swap_cache completely kills the uncharge batching because it reduces it to PAGEVEC_SIZE batches. I think we really do not need PAGEVEC_SIZE batching anymore. We are already batching on tlb_gather layer. That one is limited so I think the below should be safe but I have to think about this some more. There is a risk of prolonged lru_lock wait times but the number of pages is limited to 10k and the heavy work is done outside of the lock. If this is really a problem then we can tear the LRU part and the actual freeing/uncharging into separate functions in this path. Could you test with this half-baked patch, please? I didn't get to test it myself unfortunately. 
---
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..15918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
 void free_pages_and_swap_cache(struct page **pages, int nr)
 {
 	struct page **pagep = pages;
+	int i;
 
 	lru_add_drain();
-	while (nr) {
-		int todo = min(nr, PAGEVEC_SIZE);
-		int i;
-
-		for (i = 0; i < todo; i++)
-			free_swap_cache(pagep[i]);
-		release_pages(pagep, todo, false);
-		pagep += todo;
-		nr -= todo;
-	}
+	for (i = 0; i < nr; i++)
+		free_swap_cache(pagep[i]);
+	release_pages(pagep, nr, false);
 }
 
 /*
-- 
Michal Hocko
SUSE Labs
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/02/2014 06:33 PM, Johannes Weiner wrote: > kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to > me whether Dave filtered symbols from only memcontrol.o, memory.o, and > mmap.o in a similar way. I'm not arguing against the regression, I'm > just trying to make sense of the numbers from the *patched* kernel. I guess I could have included it in the description, but that was a pretty vanilla run: perf top --call-graph=fp --stdio > foo.txt I didn't use any filtering. I didn't even know I _could_ filter. :)
Re: regression caused by cgroups optimization in 3.17-rc2
On Tue, Sep 02, 2014 at 05:20:55PM -0700, Linus Torvalds wrote: > On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner wrote: > > > > That looks like a partial profile, where did the page allocator, page > > zeroing etc. go? Because the distribution among these listed symbols > > doesn't seem all that crazy: > > Please argue this *after* the commit has been reverted. You guys can > try to make the memcontrol batching actually work and scale later. > It's not appropriate to argue against major regressions when reported > and bisected by users. I'll send a clean revert later. > Showing the spinlock at the top of the profile is very much crazy > (apparently taking 68% of all cpu time), when it's all useless > make-believe work. I don't understand why you wouldn't call that > crazy. If you limit perf to a subset of symbols, it will show a relative distribution between them, i.e: perf top --symbols kfree,memset during some disk access: PerfTop: 1292 irqs/sec kernel:84.4% exact: 0.0% [4000Hz cycles], (all, 4 CPUs) --- 56.23% [kernel] [k] kfree 41.86% [kernel] [k] memset 1.91% libc-2.19.so [.] memset kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to me whether Dave filtered symbols from only memcontrol.o, memory.o, and mmap.o in a similar way. I'm not arguing against the regression, I'm just trying to make sense of the numbers from the *patched* kernel.
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/02/2014 05:10 PM, Johannes Weiner wrote: > On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: >> On 09/02/2014 03:18 PM, Johannes Weiner wrote: >>> Accounting new pages is buffered through per-cpu caches, but taking >>> them off the counters on free is not, so I'm guessing that above a >>> certain allocation rate the cost of locking and changing the counters >>> takes over. Is there a chance you could profile this to see if locks >>> and res_counter-related operations show up? >> >> It looks pretty much the same, although it might have equalized the >> charge and uncharge sides a bit. Full 'perf top' output attached. > > That looks like a partial profile, where did the page allocator, page > zeroing etc. go? Because the distribution among these listed symbols > doesn't seem all that crazy: Perf was only outputting the top 20 functions. Believe it or not, page zeroing and the rest of the allocator path wasn't even in the path of the top 20 functions because there is so much lock contention. Here's a longer run of 'perf top' along with the top 100 functions: http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz you can at least see copy_page_rep in there.
Re: regression caused by cgroups optimization in 3.17-rc2
On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner wrote: > > That looks like a partial profile, where did the page allocator, page > zeroing etc. go? Because the distribution among these listed symbols > doesn't seem all that crazy: Please argue this *after* the commit has been reverted. You guys can try to make the memcontrol batching actually work and scale later. It's not appropriate to argue against major regressions when reported and bisected by users. Showing the spinlock at the top of the profile is very much crazy (apparently taking 68% of all cpu time), when it's all useless make-believe work. I don't understand why you wouldn't call that crazy. Linus
Re: regression caused by cgroups optimization in 3.17-rc2
On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: > On 09/02/2014 03:18 PM, Johannes Weiner wrote: > > Accounting new pages is buffered through per-cpu caches, but taking > > them off the counters on free is not, so I'm guessing that above a > > certain allocation rate the cost of locking and changing the counters > > takes over. Is there a chance you could profile this to see if locks > > and res_counter-related operations show up? > > It looks pretty much the same, although it might have equalized the > charge and uncharge sides a bit. Full 'perf top' output attached. That looks like a partial profile, where did the page allocator, page zeroing etc. go? Because the distribution among these listed symbols doesn't seem all that crazy: >PerfTop: 275580 irqs/sec kernel:98.0% exact: 0.0% [4000Hz cycles], > (all, 160 CPUs) > --- > > 68.10%68.10% [kernel] [k] _raw_spin_lock > | > --- _raw_spin_lock > | > |--57.35%-- __res_counter_charge > | res_counter_charge > | try_charge > | mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > | > |--53.93%-- res_counter_uncharge_until > | res_counter_uncharge > | refill_stock > | uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > | > |--2.18%-- do_cow_fault > | handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault > | testcase > --9307219025.55%-- [...] > > 64.00% 1.34% page_fault2_processes [.] 
testcase > | > --- testcase > > 62.64% 0.37% [kernel] [k] page_fault > | > --- page_fault > | > |--114.21%-- testcase > --10118485450.89%-- [...] > > 62.27% 0.01% [kernel] [k] do_page_fault > | > --- do_page_fault > | > |--114.28%-- page_fault > | | > | |--99.93%-- testcase > | --0.07%-- [...] > --10178525138.33%-- [...] > > 62.25% 0.07% [kernel] [k] __do_page_fault > | > --- __do_page_fault > | > |--114.27%-- do_page_fault > | page_fault > | | > | |--99.93%-- testcase > | --0.07%-- [...] > --10182230022.41%-- [...] > >
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/02/2014 03:18 PM, Johannes Weiner wrote: > Accounting new pages is buffered through per-cpu caches, but taking > them off the counters on free is not, so I'm guessing that above a > certain allocation rate the cost of locking and changing the counters > takes over. Is there a chance you could profile this to see if locks > and res_counter-related operations show up? It looks pretty much the same, although it might have equalized the charge and uncharge sides a bit. Full 'perf top' output attached. PerfTop: 275580 irqs/sec kernel:98.0% exact: 0.0% [4000Hz cycles], (all, 160 CPUs) --- 68.10%68.10% [kernel] [k] _raw_spin_lock | --- _raw_spin_lock | |--57.35%-- __res_counter_charge | res_counter_charge | try_charge | mem_cgroup_try_charge | | | |--99.93%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault | | testcase | --0.07%-- [...] | |--53.93%-- res_counter_uncharge_until | res_counter_uncharge | refill_stock | uncharge_batch | uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--98.62%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --1.38%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap | |--2.18%-- do_cow_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | testcase --9307219025.55%-- [...] 64.00% 1.34% page_fault2_processes [.] testcase | --- testcase 62.64% 0.37% [kernel] [k] page_fault | --- page_fault | |--114.21%-- testcase --10118485450.89%-- [...] 62.27% 0.01% [kernel] [k] do_page_fault | --- do_page_fault | |--114.28%-- page_fault | | | |--99.93%-- testcase | --0.07%-- [...] --10178525138.33%-- [...] 62.25% 0.07% [kernel] [k] __do_page_fault | --- __do_page_fault | |--114.27%-- do_page_fault | page_fault | | | |--99.93%-- testcase | --0.07%-- [...] 
--10182230022.41%-- [...] 62.08% 0.26% [kernel] [k] handle_mm_fault | --- handle_mm_fault | |--114.28%-- __do_page_fault | do_page_fault | page_fault | | | |--99.94%-- testcase |
Re: regression caused by cgroups optimization in 3.17-rc2
Hi Dave, On Tue, Sep 02, 2014 at 12:05:41PM -0700, Dave Hansen wrote: > I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the > memory cgroups code. This is on a kernel with cgroups enabled at > compile time, but not _used_ for anything. See the green lines in the > graph: > > https://www.sr71.net/~dave/intel/regression-from-05b843012.png > > The workload is a little parallel microbenchmark doing page faults: Ouch. > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c > > The hardware is an 8-socket Westmere box with 160 hardware threads. For > some reason, this does not affect the version of the microbenchmark > which is doing completely anonymous page faults. > > I bisected it down to this commit: > > > commit 05b8430123359886ef6a4146fba384e30d771b3f > > Author: Johannes Weiner > > Date: Wed Aug 6 16:05:59 2014 -0700 > > > > mm: memcontrol: use root_mem_cgroup res_counter > > > > Due to an old optimization to keep expensive res_counter changes at a > > minimum, the root_mem_cgroup res_counter is never charged; there is no > > limit at that level anyway, and any statistics can be generated on > > demand by summing up the counters of all other cgroups. > > > > However, with per-cpu charge caches, res_counter operations do not even > > show up in profiles anymore, so this optimization is no longer > > necessary. > > > > Remove it to simplify the code. Accounting new pages is buffered through per-cpu caches, but taking them off the counters on free is not, so I'm guessing that above a certain allocation rate the cost of locking and changing the counters takes over. Is there a chance you could profile this to see if locks and res_counter-related operations show up? 
I can't reproduce this complete breakdown on my smaller test gear, but I do see an improvement with the following patch: --- >From 29a51326c24b7bb45d17e9c864c34506f10868f6 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 2 Sep 2014 18:11:39 -0400 Subject: [patch] mm: memcontrol: use per-cpu caches for uncharging --- mm/memcontrol.c | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ec4dcf1b9562..cb79ecff399d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2365,6 +2365,7 @@ static void drain_stock(struct memcg_stock_pcp *stock) res_counter_uncharge(>res, bytes); if (do_swap_account) res_counter_uncharge(>memsw, bytes); + memcg_oom_recover(old); stock->nr_pages = 0; } stock->cached = NULL; @@ -2405,6 +2406,13 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) stock->cached = memcg; } stock->nr_pages += nr_pages; + if (stock->nr_pages > CHARGE_BATCH * 4) { + res_counter_uncharge(>res, CHARGE_BATCH); + if (do_swap_account) + res_counter_uncharge(>memsw, CHARGE_BATCH); + memcg_oom_recover(memcg); + stock->nr_pages -= CHARGE_BATCH; + } put_cpu_var(memcg_stock); } @@ -6509,12 +6517,20 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, { unsigned long flags; - if (nr_mem) - res_counter_uncharge(>res, nr_mem * PAGE_SIZE); - if (nr_memsw) - res_counter_uncharge(>memsw, nr_memsw * PAGE_SIZE); - - memcg_oom_recover(memcg); + /* +* The percpu caches count both memory and memsw charges in a +* single conuter, but there might be less memsw charges when +* some of the pages have been swapped out. 
+	 */
+	if (nr_mem == nr_memsw)
+		refill_stock(memcg, nr_mem);
+	else {
+		if (nr_mem)
+			res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
+		if (nr_memsw)
+			res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE);
+		memcg_oom_recover(memcg);
+	}

 	local_irq_save(flags);
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon);
--
2.0.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: regression caused by cgroups optimization in 3.17-rc2
I, of course, forgot to include the most important detail.  This
appears to be pretty run-of-the-mill spinlock contention in the
resource counter code.  Nearly 80% of the CPU is spent spinning in the
charge or uncharge paths in the kernel.  It is apparently spinning on
res_counter->lock in both the charge and uncharge paths.

It already does _some_ batching here on the free side, but that
apparently breaks down after ~40 threads.

It's a no-brainer since the patch in question removed an optimization
skipping the charging, and now we're seeing overhead from the
charging.

Here's the first entry from perf top:

    80.18%  80.18%  [kernel]  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--66.59%-- res_counter_uncharge_until
               |          res_counter_uncharge
               |          uncharge_batch
               |          uncharge_list
               |          mem_cgroup_uncharge_list
               |          release_pages
               |          free_pages_and_swap_cache
               |          tlb_flush_mmu_free
               |          |
               |          |--90.12%-- unmap_single_vma
               |          |          unmap_vmas
               |          |          unmap_region
               |          |          do_munmap
               |          |          vm_munmap
               |          |          sys_munmap
               |          |          system_call_fastpath
               |          |          __GI___munmap
               |          |
               |           --9.88%-- tlb_flush_mmu
               |                     tlb_finish_mmu
               |                     unmap_region
               |                     do_munmap
               |                     vm_munmap
               |                     sys_munmap
               |                     system_call_fastpath
               |                     __GI___munmap
               |
               |--46.13%-- __res_counter_charge
               |          res_counter_charge
               |          try_charge
               |          mem_cgroup_try_charge
               |          |
               |          |--99.89%-- do_cow_fault
               |          |          handle_mm_fault
               |          |          __do_page_fault
               |          |          do_page_fault
               |          |          page_fault
               |          |          testcase
               |           --0.11%-- [...]
               |
               |--1.14%-- do_cow_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          testcase
                --8217937613.29%-- [...]
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/02/2014 12:05 PM, Dave Hansen wrote:
> It does not revert cleanly because of the hunks below.  The code in
> those hunks was removed, so I tried running without properly merging
> them and it spews warnings because counter->usage is seen going
> negative.
>
> So, it doesn't appear we can quickly revert this.

I'm fairly confident that I missed some of the cases (especially in
the charge-moving code), but the attached patch does at least work
around the regression for me.  It restores the original performance,
or at least gets _close_ to it.

---

 b/mm/memcontrol.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff -puN mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch mm/memcontrol.c
--- a/mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch	2014-09-02 12:20:11.209527453 -0700
+++ b/mm/memcontrol.c	2014-09-02 13:10:28.756736862 -0700
@@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup
 	unsigned long long size;
 	int ret = 0;

+	if (mem_cgroup_is_root(memcg))
+		goto done;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2640,6 +2642,9 @@ static void __mem_cgroup_cancel_local_ch
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;

+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
 	if (do_swap_account)
 		res_counter_uncharge_until(&memcg->memsw,
@@ -6440,6 +6445,9 @@ void mem_cgroup_commit_charge(struct pag
 	VM_BUG_ON_PAGE(!page->mapping, page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);

+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	/*
@@ -6484,6 +6492,9 @@ void mem_cgroup_cancel_charge(struct pag
 {
 	unsigned int nr_pages = 1;

+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	/*
@@ -6509,6 +6520,9 @@ static void uncharge_batch(struct mem_cg
 {
 	unsigned long flags;

+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	if (nr_mem)
 		res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE);
 	if (nr_memsw)
 		res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE);
_
regression caused by cgroups optimization in 3.17-rc2
I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from
the memory cgroups code.  This is on a kernel with cgroups enabled at
compile time, but not _used_ for anything.  See the green lines in the
graph:

	https://www.sr71.net/~dave/intel/regression-from-05b843012.png

The workload is a little parallel microbenchmark doing page faults:

> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c

The hardware is an 8-socket Westmere box with 160 hardware threads.  For
some reason, this does not affect the version of the microbenchmark
which is doing completely anonymous page faults.

I bisected it down to this commit:

> commit 05b8430123359886ef6a4146fba384e30d771b3f
> Author: Johannes Weiner
> Date:   Wed Aug 6 16:05:59 2014 -0700
>
>     mm: memcontrol: use root_mem_cgroup res_counter
>
>     Due to an old optimization to keep expensive res_counter changes at a
>     minimum, the root_mem_cgroup res_counter is never charged; there is no
>     limit at that level anyway, and any statistics can be generated on
>     demand by summing up the counters of all other cgroups.
>
>     However, with per-cpu charge caches, res_counter operations do not even
>     show up in profiles anymore, so this optimization is no longer
>     necessary.
>
>     Remove it to simplify the code.

It does not revert cleanly because of the hunks below.  The code in
those hunks was removed, so I tried running without properly merging
them and it spews warnings because counter->usage is seen going
negative.

So, it doesn't appear we can quickly revert this.

> --- mm/memcontrol.c
> +++ mm/memcontrol.c
> @@ -3943,7 +3947,7 @@
>  	 * replacement page, so leave it alone when phasing out the
>  	 * page that is unused after the migration.
>  	 */
> -	if (!end_migration)
> +	if (!end_migration && !mem_cgroup_is_root(memcg))
>  		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
>
>  	return memcg;
> @@ -4076,7 +4080,8 @@
>  	 * We uncharge this because swap is freed.  This memcg can
>  	 * be obsolete one.
>  	 * We avoid calling css_tryget_online().
>  	 */
> -	res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> +	if (!mem_cgroup_is_root(memcg))
> +		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
>  	mem_cgroup_swap_statistics(memcg, false);
>  	css_put(&memcg->css);
>  }
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/02/2014 03:18 PM, Johannes Weiner wrote:
> Accounting new pages is buffered through per-cpu caches, but taking
> them off the counters on free is not, so I'm guessing that above a
> certain allocation rate the cost of locking and changing the counters
> takes over.  Is there a chance you could profile this to see if locks
> and res_counter-related operations show up?

It looks pretty much the same, although it might have equalized the
charge and uncharge sides a bit.  Full 'perf top' output attached.

   PerfTop:  275580 irqs/sec  kernel:98.0%  exact:  0.0% [4000Hz cycles],  (all, 160 CPUs)
---------------------------------------------------------------------------

    68.10%  68.10%  [kernel]  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--57.35%-- __res_counter_charge
               |          res_counter_charge
               |          try_charge
               |          mem_cgroup_try_charge
               |          |
               |          |--99.93%-- do_cow_fault
               |          |          handle_mm_fault
               |          |          __do_page_fault
               |          |          do_page_fault
               |          |          page_fault
               |          |          testcase
               |           --0.07%-- [...]
               |
               |--53.93%-- res_counter_uncharge_until
               |          res_counter_uncharge
               |          refill_stock
               |          uncharge_batch
               |          uncharge_list
               |          mem_cgroup_uncharge_list
               |          release_pages
               |          free_pages_and_swap_cache
               |          tlb_flush_mmu_free
               |          |
               |          |--98.62%-- unmap_single_vma
               |          |          unmap_vmas
               |          |          unmap_region
               |          |          do_munmap
               |          |          vm_munmap
               |          |          sys_munmap
               |          |          system_call_fastpath
               |          |          __GI___munmap
               |          |
               |           --1.38%-- tlb_flush_mmu
               |                     tlb_finish_mmu
               |                     unmap_region
               |                     do_munmap
               |                     vm_munmap
               |                     sys_munmap
               |                     system_call_fastpath
               |                     __GI___munmap
               |
               |--2.18%-- do_cow_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          testcase
                --9307219025.55%-- [...]

    64.00%   1.34%  page_fault2_processes  [.] testcase
            |
            --- testcase

    62.64%   0.37%  [kernel]  [k] page_fault
            |
            --- page_fault
               |
               |--114.21%-- testcase
                --10118485450.89%-- [...]

    62.27%   0.01%  [kernel]  [k] do_page_fault
            |
            --- do_page_fault
               |
               |--114.28%-- page_fault
               |          |
               |          |--99.93%-- testcase
               |           --0.07%-- [...]
                --10178525138.33%-- [...]

    62.25%   0.07%  [kernel]  [k] __do_page_fault
            |
            --- __do_page_fault
               |
               |--114.27%-- do_page_fault
               |          page_fault
               |          |
               |          |--99.93%-- testcase
               |           --0.07%-- [...]
                --10182230022.41%-- [...]
    62.08%   0.26%  [kernel]  [k] handle_mm_fault
            |
            --- handle_mm_fault
               |
               |--114.28%-- __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--99.94%-- testcase
               |           --0.06%--
Re: regression caused by cgroups optimization in 3.17-rc2
On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote:
> On 09/02/2014 03:18 PM, Johannes Weiner wrote:
>> Accounting new pages is buffered through per-cpu caches, but taking
>> them off the counters on free is not, so I'm guessing that above a
>> certain allocation rate the cost of locking and changing the counters
>> takes over.  Is there a chance you could profile this to see if locks
>> and res_counter-related operations show up?
>
> It looks pretty much the same, although it might have equalized the
> charge and uncharge sides a bit.  Full 'perf top' output attached.

That looks like a partial profile, where did the page allocator, page
zeroing etc. go?  Because the distribution among these listed symbols
doesn't seem all that crazy:

>    PerfTop:  275580 irqs/sec  kernel:98.0%  exact:  0.0% [4000Hz cycles],  (all, 160 CPUs)
> ---------------------------------------------------------------------------
>
>     68.10%  68.10%  [kernel]  [k] _raw_spin_lock
>             |
>             --- _raw_spin_lock
>                |
>                |--57.35%-- __res_counter_charge
>                |          res_counter_charge
>                |          try_charge
>                |          mem_cgroup_try_charge
>                |          |
>                |          |--99.93%-- do_cow_fault
>                |          |          handle_mm_fault
>                |          |          __do_page_fault
>                |          |          do_page_fault
>                |          |          page_fault
>                |          |          testcase
>                |           --0.07%-- [...]
>                |
>                |--53.93%-- res_counter_uncharge_until
>                |          res_counter_uncharge
>                |          refill_stock
>                |          uncharge_batch
>                |          uncharge_list
>                |          mem_cgroup_uncharge_list
>                |          release_pages
>                |          free_pages_and_swap_cache
>                |          tlb_flush_mmu_free
>                |          |
>                |          |--98.62%-- unmap_single_vma
>                |          |          unmap_vmas
>                |          |          unmap_region
>                |          |          do_munmap
>                |          |          vm_munmap
>                |          |          sys_munmap
>                |          |          system_call_fastpath
>                |          |          __GI___munmap
>                |          |
>                |           --1.38%-- tlb_flush_mmu
>                |                     tlb_finish_mmu
>                |                     unmap_region
>                |                     do_munmap
>                |                     vm_munmap
>                |                     sys_munmap
>                |                     system_call_fastpath
>                |                     __GI___munmap
>                |
>                |--2.18%-- do_cow_fault
>                |          handle_mm_fault
>                |          __do_page_fault
>                |          do_page_fault
>                |          page_fault
>                |          testcase
>                 --9307219025.55%-- [...]
>
>     64.00%   1.34%  page_fault2_processes  [.] testcase
>             |
>             --- testcase
>
>     62.64%   0.37%  [kernel]  [k] page_fault
>             |
>             --- page_fault
>                |
>                |--114.21%-- testcase
>                 --10118485450.89%-- [...]
>     62.27%   0.01%  [kernel]  [k] do_page_fault
>             |
>             --- do_page_fault
>                |
>                |--114.28%-- page_fault
>                |          |
>                |          |--99.93%-- testcase
>                |           --0.07%-- [...]
>                 --10178525138.33%-- [...]
>
>     62.25%   0.07%  [kernel]  [k] __do_page_fault
>             |
>             --- __do_page_fault
>                |
>                |--114.27%-- do_page_fault
>                |          page_fault
>                |          |
>                |          |--99.93%-- testcase
>                |           --0.07%-- [...]
>                 --10182230022.41%-- [...]
>
>     62.08%   0.26%  [kernel]  [k] handle_mm_fault
>             |
Re: regression caused by cgroups optimization in 3.17-rc2
On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner <han...@cmpxchg.org> wrote:
> That looks like a partial profile, where did the page allocator, page
> zeroing etc. go?  Because the distribution among these listed symbols
> doesn't seem all that crazy:

Please argue this *after* the commit has been reverted.  You guys can
try to make the memcontrol batching actually work and scale later.
It's not appropriate to argue against major regressions when reported
and bisected by users.

Showing the spinlock at the top of the profile is very much crazy
(apparently taking 68% of all cpu time), when it's all useless
make-believe work.  I don't understand why you wouldn't call that
crazy.

                Linus
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/02/2014 05:10 PM, Johannes Weiner wrote:
> On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote:
>> On 09/02/2014 03:18 PM, Johannes Weiner wrote:
>>> Accounting new pages is buffered through per-cpu caches, but taking
>>> them off the counters on free is not, so I'm guessing that above a
>>> certain allocation rate the cost of locking and changing the counters
>>> takes over.  Is there a chance you could profile this to see if locks
>>> and res_counter-related operations show up?
>>
>> It looks pretty much the same, although it might have equalized the
>> charge and uncharge sides a bit.  Full 'perf top' output attached.
>
> That looks like a partial profile, where did the page allocator, page
> zeroing etc. go?  Because the distribution among these listed symbols
> doesn't seem all that crazy:

Perf was only outputting the top 20 functions.  Believe it or not,
page zeroing and the rest of the allocator path weren't even in the
top 20 functions because there is so much lock contention.

Here's a longer run of 'perf top' along with the top 100 functions:

	http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz

You can at least see copy_page_rep in there.
Re: regression caused by cgroups optimization in 3.17-rc2
On Tue, Sep 02, 2014 at 05:20:55PM -0700, Linus Torvalds wrote:
> On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner <han...@cmpxchg.org> wrote:
>> That looks like a partial profile, where did the page allocator, page
>> zeroing etc. go?  Because the distribution among these listed symbols
>> doesn't seem all that crazy:
>
> Please argue this *after* the commit has been reverted.  You guys can
> try to make the memcontrol batching actually work and scale later.
> It's not appropriate to argue against major regressions when reported
> and bisected by users.

I'll send a clean revert later.

> Showing the spinlock at the top of the profile is very much crazy
> (apparently taking 68% of all cpu time), when it's all useless
> make-believe work.  I don't understand why you wouldn't call that
> crazy.

If you limit perf to a subset of symbols, it will show a relative
distribution between them, i.e. `perf top --symbols kfree,memset`
during some disk access:

   PerfTop:    1292 irqs/sec  kernel:84.4%  exact:  0.0% [4000Hz cycles],  (all, 4 CPUs)
---------------------------------------------------------------------------

    56.23%  [kernel]      [k] kfree
    41.86%  [kernel]      [k] memset
     1.91%  libc-2.19.so  [.] memset

kfree isn't eating 56% of all cpu time here, and it wasn't clear to me
whether Dave filtered symbols from only memcontrol.o, memory.o, and
mmap.o in a similar way.  I'm not arguing against the regression, I'm
just trying to make sense of the numbers from the *patched* kernel.
Re: regression caused by cgroups optimization in 3.17-rc2
On 09/02/2014 06:33 PM, Johannes Weiner wrote:
> kfree isn't eating 56% of all cpu time here, and it wasn't clear to me
> whether Dave filtered symbols from only memcontrol.o, memory.o, and
> mmap.o in a similar way.  I'm not arguing against the regression, I'm
> just trying to make sense of the numbers from the *patched* kernel.

I guess I could have included it in the description, but that was a
pretty vanilla run:

	perf top --call-graph=fp --stdio > foo.txt

I didn't use any filtering.  I didn't even know I _could_ filter. :)