Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-20 Thread Mel Gorman
On Fri, Oct 18, 2019 at 06:40:24PM -0700, Andrew Morton wrote:
> On Fri, 18 Oct 2019 15:09:59 +0100 Mel Gorman  
> wrote:
> 
> > > > Cc: sta...@vger.kernel.org # v4.15+
> > > 
> > > Hmm, are you sure about 4.15? Doesn't this go all the way down to
> > > deferred initialization? I do not see any recent changes on when
> > > setup_per_cpu_pageset is called.
> > > 
> > 
> > No, I'm not 100% sure. It looks like this was always an issue from the
> > code but did not happen on at least one 4.12-based distribution kernel for
> > reasons that are non-obvious. Either way, the tag should have been "v4.1+"
> 
> I could mark
> 
> mm-pcp-share-common-code-between-memory-hotplug-and-percpu-sysctl-handler.patch
> mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch
> 
> as Cc:[4.1+]
> 

That would be fine.

> But for backporting purposes it's a bit cumbersome that [2/3] is the
> important patch.  I think I'll switch the ordering so that
> mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch
> is the first patch and the other two can be queued for 5.5-rc1, OK?
> 

It might be easier to simply collapse patch 1 and 2 together. They were
only split to make the review easier and to avoid two relatively big
changes in one patch.

> Also, is a Reported-by:Matt appropriate here?
> 

I don't object but I'm not actually sure who reported this first. I think
it was Thomas who talked to Boris about an EPYC performance issue, who
talked to Matt thinking it might be a scheduler issue who identified it
was my problem :P

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-18 Thread Andrew Morton
On Fri, 18 Oct 2019 15:09:59 +0100 Mel Gorman  
wrote:

> > > Cc: sta...@vger.kernel.org # v4.15+
> > 
> > Hmm, are you sure about 4.15? Doesn't this go all the way down to
> > deferred initialization? I do not see any recent changes on when
> > setup_per_cpu_pageset is called.
> > 
> 
> No, I'm not 100% sure. It looks like this was always an issue from the
> code but did not happen on at least one 4.12-based distribution kernel for
> reasons that are non-obvious. Either way, the tag should have been "v4.1+"

I could mark

mm-pcp-share-common-code-between-memory-hotplug-and-percpu-sysctl-handler.patch
mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch

as Cc:  [4.1+]

But for backporting purposes it's a bit cumbersome that [2/3] is the
important patch.  I think I'll switch the ordering so that
mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch
is the first patch and the other two can be queued for 5.5-rc1, OK?

Also, is a Reported-by:Matt appropriate here?


From: Mel Gorman 
Subject: mm, meminit: recalculate pcpu batch and high limits after init 
completes

Deferred memory initialisation updates zone->managed_pages during the
initialisation phase but before that finishes, the per-cpu page allocator
(pcpu) calculates the number of pages allocated/freed in batches as well
as the maximum number of pages allowed on a per-cpu list.  As
zone->managed_pages is not up to date yet, the pcpu initialisation
calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases with the
degree of severity depending on how many CPUs share a local zone and the
size of the zone.  A private report indicated that kernel build times were
excessive with extremely high system CPU usage.  A perf profile indicated
that a large chunk of time was lost on zone->lock contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes on each node.  It was tested on a 2-socket AMD
EPYC 2 machine using a kernel compilation workload -- allmodconfig and all
available CPUs.

mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.

kernbench
5.4.0-rc3  5.4.0-rc3
  vanilla resetpcpu-v1r1
Amean user-25613249.50 (   0.00%)15928.40 * -20.22%*
Amean syst-25614760.30 (   0.00%) 4551.77 *  69.16%*
Amean elsp-256  162.42 (   0.00%)  118.46 *  27.06%*
Stddevuser-256   42.97 (   0.00%)   50.83 ( -18.30%)
Stddevsyst-256  336.87 (   0.00%)   33.70 (  90.00%)
Stddevelsp-2562.46 (   0.00%)0.81 (  67.01%)

   5.4.0-rc3   5.4.0-rc3
 vanillaresetpcpu-v1r1
Duration User   39766.2447802.92
Duration System 44298.1013671.93
Duration Elapsed  519.11  387.65

The patch reduces system CPU usage by 69.16% and total build time by
27.06%.  The variance of system CPU usage is also much reduced.

Link: http://lkml.kernel.org/r/20191018105606.3249-3-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Tested-by: Matt Fleming 
Acked-by: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: [4.1+]
Signed-off-by: Andrew Morton 
---

 mm/page_alloc.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

--- 
a/mm/page_alloc.c~mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes
+++ a/mm/page_alloc.c
@@ -1818,6 +1818,14 @@ static int __init deferred_init_memmap(v
 */
while (spfn < epfn)
nr_pages += deferred_init_maxorder(, zone, , );
+
+   /*
+* The number of managed pages has changed due to the initialisation
+* so the pcpu batch and high limits needs to be updated or the limits
+* will be artificially small.
+*/
+   zone_pcp_update(zone);
+
 zone_empty:
pgdat_resize_unlock(pgdat, );
 
@@ -8514,7 +8522,6 @@ void free_contig_range(unsigned long pfn
WARN(count != 0, "%d pages are still in use!\n", count);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalulated.
@@ -8528,7 +8535,6 @@ void __meminit zone_pcp_update(struct zo
per_cpu_ptr(zone->pageset, cpu));
mutex_unlock(_batch_high_lock);
 }
-#endif
 
 void zone_pcp_reset(struct zone *zone)
 {
_



Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-18 Thread Mel Gorman
On Fri, Oct 18, 2019 at 03:01:27PM +0200, Michal Hocko wrote:
> On Fri 18-10-19 11:56:05, Mel Gorman wrote:
> > Deferred memory initialisation updates zone->managed_pages during
> > the initialisation phase but before that finishes, the per-cpu page
> > allocator (pcpu) calculates the number of pages allocated/freed in
> > batches as well as the maximum number of pages allowed on a per-cpu list.
> > As zone->managed_pages is not up to date yet, the pcpu initialisation
> > calculates inappropriately low batch and high values.
> > 
> > This increases zone lock contention quite severely in some cases with the
> > degree of severity depending on how many CPUs share a local zone and the
> > size of the zone. A private report indicated that kernel build times were
> > excessive with extremely high system CPU usage. A perf profile indicated
> > that a large chunk of time was lost on zone->lock contention.
> > 
> > This patch recalculates the pcpu batch and high values after deferred
> > initialisation completes on each node. It was tested on a 2-socket AMD
> > EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> > all available CPUs.
> > 
> > mmtests configuration: config-workload-kernbench-max
> > Configuration was modified to build on a fresh XFS partition.
> > 
> > kernbench
> > 5.4.0-rc3  5.4.0-rc3
> >   vanilla resetpcpu-v1r1
> > Amean user-25613249.50 (   0.00%)15928.40 * -20.22%*
> > Amean syst-25614760.30 (   0.00%) 4551.77 *  69.16%*
> > Amean elsp-256  162.42 (   0.00%)  118.46 *  27.06%*
> > Stddevuser-256   42.97 (   0.00%)   50.83 ( -18.30%)
> > Stddevsyst-256  336.87 (   0.00%)   33.70 (  90.00%)
> > Stddevelsp-2562.46 (   0.00%)0.81 (  67.01%)
> > 
> >5.4.0-rc3   5.4.0-rc3
> >  vanillaresetpcpu-v1r1
> > Duration User   39766.2447802.92
> > Duration System 44298.1013671.93
> > Duration Elapsed  519.11  387.65
> > 
> > The patch reduces system CPU usage by 69.16% and total build time by
> > 27.06%. The variance of system CPU usage is also much reduced.
> 
> The fix makes sense. It would be nice to see the difference in the batch
> sizes from the initial setup compared to the one after the deferred
> intialization is done
> 

Before, this was the breakdown of batch and high values over all zones
were

256   batch: 1
256   batch: 63
512   batch: 7

256   high:  0
256   high:  378
512   high:  42

i.e. 512 pcpu pagesets had a batch limit of 7 and a high limit of 42.
These were for the NORMAL zones on the system. After the patch

256   batch: 1
768   batch: 63

256   high:  0
768   high:  378

> > Cc: sta...@vger.kernel.org # v4.15+
> 
> Hmm, are you sure about 4.15? Doesn't this go all the way down to
> deferred initialization? I do not see any recent changes on when
> setup_per_cpu_pageset is called.
> 

No, I'm not 100% sure. It looks like this was always an issue from the
code but did not happen on at least one 4.12-based distribution kernel for
reasons that are non-obvious. Either way, the tag should have been "v4.1+"

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-18 Thread Michal Hocko
On Fri 18-10-19 11:56:05, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes on each node. It was tested on a 2-socket AMD
> EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
> 5.4.0-rc3  5.4.0-rc3
>   vanilla resetpcpu-v1r1
> Amean user-25613249.50 (   0.00%)15928.40 * -20.22%*
> Amean syst-25614760.30 (   0.00%) 4551.77 *  69.16%*
> Amean elsp-256  162.42 (   0.00%)  118.46 *  27.06%*
> Stddevuser-256   42.97 (   0.00%)   50.83 ( -18.30%)
> Stddevsyst-256  336.87 (   0.00%)   33.70 (  90.00%)
> Stddevelsp-2562.46 (   0.00%)0.81 (  67.01%)
> 
>5.4.0-rc3   5.4.0-rc3
>  vanillaresetpcpu-v1r1
> Duration User   39766.2447802.92
> Duration System 44298.1013671.93
> Duration Elapsed  519.11  387.65
> 
> The patch reduces system CPU usage by 69.16% and total build time by
> 27.06%. The variance of system CPU usage is also much reduced.

The fix makes sense. It would be nice to see the difference in the batch
sizes from the initial setup compared to the one after the deferred
intialization is done

> Cc: sta...@vger.kernel.org # v4.15+

Hmm, are you sure about 4.15? Doesn't this go all the way down to
deferred initialization? I do not see any recent changes on when
setup_per_cpu_pageset is called.

> Signed-off-by: Mel Gorman 

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cafe568d36f6..0a0dd74edc83 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1818,6 +1818,14 @@ static int __init deferred_init_memmap(void *data)
>*/
>   while (spfn < epfn)
>   nr_pages += deferred_init_maxorder(, zone, , );
> +
> + /*
> +  * The number of managed pages has changed due to the initialisation
> +  * so the pcpu batch and high limits needs to be updated or the limits
> +  * will be artificially small.
> +  */
> + zone_pcp_update(zone);
> +
>  zone_empty:
>   pgdat_resize_unlock(pgdat, );
>  
> @@ -8516,7 +8524,6 @@ void free_contig_range(unsigned long pfn, unsigned int 
> nr_pages)
>   WARN(count != 0, "%d pages are still in use!\n", count);
>  }
>  
> -#ifdef CONFIG_MEMORY_HOTPLUG
>  /*
>   * The zone indicated has a new number of managed_pages; batch sizes and 
> percpu
>   * page high values need to be recalulated.
> @@ -8527,7 +8534,6 @@ void __meminit zone_pcp_update(struct zone *zone)
>   __zone_pcp_update(zone);
>   mutex_unlock(_batch_high_lock);
>  }
> -#endif
>  
>  void zone_pcp_reset(struct zone *zone)
>  {
> -- 
> 2.16.4
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-18 Thread Matt Fleming
On Fri, 18 Oct, at 11:56:05AM, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes on each node. It was tested on a 2-socket AMD
> EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
> 5.4.0-rc3  5.4.0-rc3
>   vanilla resetpcpu-v1r1
> Amean user-25613249.50 (   0.00%)15928.40 * -20.22%*
> Amean syst-25614760.30 (   0.00%) 4551.77 *  69.16%*
> Amean elsp-256  162.42 (   0.00%)  118.46 *  27.06%*
> Stddevuser-256   42.97 (   0.00%)   50.83 ( -18.30%)
> Stddevsyst-256  336.87 (   0.00%)   33.70 (  90.00%)
> Stddevelsp-2562.46 (   0.00%)0.81 (  67.01%)
> 
>5.4.0-rc3   5.4.0-rc3
>  vanillaresetpcpu-v1r1
> Duration User   39766.2447802.92
> Duration System 44298.1013671.93
> Duration Elapsed  519.11  387.65
> 
> The patch reduces system CPU usage by 69.16% and total build time by
> 27.06%. The variance of system CPU usage is also much reduced.
> 
> Cc: sta...@vger.kernel.org # v4.15+
> Signed-off-by: Mel Gorman 
> ---
>  mm/page_alloc.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)

Tested-by: Matt Fleming