Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-27 Thread Daniel Jordan
On Tue, Nov 27, 2018 at 12:12:28AM +, Elliott, Robert (Persistent Memory) 
wrote:
> I ran a short test with:
> * HPE ProLiant DL360 Gen9 system
> * Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and 
>   18 hyperthreaded cores (36-53)
> * DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
> * fio workload generator
> * cores on one CPU socket talking to a pmem device on the same CPU
> * large (1 MiB) random writes (to minimize the threads getting CPU cache
>   hits from each other)
> 
> Results:
> * 31.7 GB/s   four threads, four physical cores (0,1,2,3)
> * 22.2 GB/s   four threads, two physical cores (0,1,36,37)
> * 21.4 GB/s   two threads, two physical cores (0,1)
> * 12.1 GB/s   two threads, one physical core (0,36)
> * 11.2 GB/s   one thread, one physical core (0)
> 
> So, I think it's important that the initialization threads run on
> separate physical cores.

Thanks for running this.  And fair enough, in this test using both siblings of
a core gives only a 4-8% speedup over using one, so it makes sense to count
only physical cores in the calculation.

As for how to actually do this, some arches have smp_num_siblings, but a
generic interface would be needed to expose it.

It's also possible to calculate this from the existing
topology_sibling_cpumask, but the first option is better IMHO.  Open to
suggestions.
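
For concreteness, a minimal sketch of the topology_sibling_cpumask() option
(not part of the patch; it assumes the arch fills in the sibling masks and
simply counts the first CPU of each sibling group):

#include <linux/cpumask.h>
#include <linux/topology.h>

/* Rough count of the physical cores represented in @cpumask. */
static unsigned int nr_physical_cores(const struct cpumask *cpumask)
{
	unsigned int cpu, cores = 0;

	for_each_cpu(cpu, cpumask) {
		/* Count a CPU only if it's the first sibling of its core. */
		if (cpu == cpumask_first(topology_sibling_cpumask(cpu)))
			cores++;
	}

	return cores;
}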

> For the number of cores to use, one approach is:
> memory bandwidth (number of interleaved channels * speed)
> divided by 
> CPU core max sustained write bandwidth
> 
> For example, this 2133 MT/s system is roughly:
> 68 GB/s    (4 * 17 GB/s nominal)
> divided by
> 11.2 GB/s  (one core's performance)
> which is 
> 6 cores
> 
> ACPI HMAT will report that 68 GB/s number.  I'm not sure of
> a good way to discover the 11.2 GB/s number.

Yes, this would be nice to do if we could get the per-core number.  The caveat
is that a single number like this is most meaningful for the CPU-memory pair
it was measured on, so the kernel could at least calculate it for jobs
operating on local memory.

Some BogoMIPS-like calibration may work, but I'll wait for ACPI HMAT support in
the kernel.
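
In case anyone wants to experiment, a very rough sketch of such a calibration
(everything here is an assumption for illustration: the 64 MiB buffer, using
memset() as the streaming write, reporting MB/s; it also measures the cached
store path rather than movnti, so it's a ballpark at best):

#include <linux/ktime.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

/* Time one CPU writing through a buffer larger than the last-level cache. */
static unsigned long calibrate_core_write_bw_mbps(void)
{
	const size_t len = 64UL << 20;		/* 64 MiB */
	void *buf = vmalloc(len);
	ktime_t start;
	s64 us;

	if (!buf)
		return 0;

	start = ktime_get();
	memset(buf, 0, len);
	us = ktime_us_delta(ktime_get(), start);
	vfree(buf);

	/* bytes per microsecond is roughly MB/s */
	return us > 0 ? len / (size_t)us : 0;
}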


RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-26 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: Daniel Jordan [mailto:daniel.m.jor...@oracle.com]
> Sent: Monday, November 19, 2018 10:02 AM
> On Mon, Nov 12, 2018 at 10:15:46PM +, Elliott, Robert (Persistent Memory) 
> wrote:
> >
> > > -Original Message-
> > > From: Daniel Jordan 
> > > Sent: Monday, November 12, 2018 11:54 AM
> > >
> > > On Sat, Nov 10, 2018 at 03:48:14AM +, Elliott, Robert (Persistent
> > > Memory) wrote:
> > > > > -Original Message-
> > > > > From: linux-kernel-ow...@vger.kernel.org <linux-kernel-ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> > > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > > initialization within each node
> > > > >
> > ...
> > > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > > CPUs on the node.
> > > > ...
> > > > > + /*
> > > > > +  * We'd like to know the memory bandwidth of the chip to
> > > > > +  * calculate the most efficient number of threads to start,
> > > > > +  * but we can't.  In testing, a good value for a variety of
> > > > > +  * systems was a quarter of the CPUs on the node.
> > > > > +  */
> > > > > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > > >
> > > >
> > > > You might want to base that calculation on and limit the threads to
> > > > physical cores, not hyperthreaded cores.
> > >
> > > Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> > > don't have data that shows that in this case.
> >
> > I think that's only if there are some register-based calculations to do
> > while waiting. If both threads are just doing memory accesses, they'll
> > both stall, and there doesn't seem to be any benefit in having two
> > contexts generate the IOs rather than one (at least on the systems I've
> > used). I think it takes longer to switch contexts than to just turn
> > around the next IO.
> 
> (Sorry for the delay, Plumbers is over now...)
> 
> I guess we're both just waving our hands without data.  I've only got x86, so
> using a quarter of the CPUs rules out HT on my end.  Do you have a system that
> you can test this on, where using a quarter of the CPUs will involve HT?

I ran a short test with:
* HPE ProLiant DL360 Gen9 system
* Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and 
  18 hyperthreaded cores (36-53)
* DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
* fio workload generator
* cores on one CPU socket talking to a pmem device on the same CPU
* large (1 MiB) random writes (to minimize the threads getting CPU cache
  hits from each other)

Results:
* 31.7 GB/s   four threads, four physical cores (0,1,2,3)
* 22.2 GB/s   four threads, two physical cores (0,1,36,37)
* 21.4 GB/s   two threads, two physical cores (0,1)
* 12.1 GB/s   two threads, one physical core (0,36)
* 11.2 GB/s   one thread, one physical core (0)

So, I think it's important that the initialization threads run on
separate physical cores.

For the number of cores to use, one approach is:
memory bandwidth (number of interleaved channels * speed)
divided by 
CPU core max sustained write bandwidth

For example, this 2133 MT/s system is roughly:
68 GB/s    (4 * 17 GB/s nominal)
divided by
11.2 GB/s  (one core's performance)
which is 
6 cores

ACPI HMAT will report that 68 GB/s number.  I'm not sure of
a good way to discover the 11.2 GB/s number.
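
A sketch of that arithmetic in code (assuming both bandwidth figures were
available in MB/s, e.g. the first from ACPI HMAT and the second from some
calibration; the function name and the rounding/clamping choices are made up):

#include <linux/kernel.h>

static unsigned int deferred_init_nr_threads(unsigned int node_bw_mbps,
					     unsigned int core_bw_mbps,
					     unsigned int nr_node_cores)
{
	unsigned int nr;

	if (!core_bw_mbps || !nr_node_cores)
		return 1;

	/* e.g. 68000 / 11200 rounds to 6, matching the example above */
	nr = DIV_ROUND_CLOSEST(node_bw_mbps, core_bw_mbps);

	return clamp(nr, 1U, nr_node_cores);
}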


fio job file:
[global]
direct=1
ioengine=sync
norandommap
randrepeat=0
bs=1M
runtime=20
time_based=1
group_reporting
thread
gtod_reduce=1
zero_buffers
cpus_allowed_policy=split
# pick the desired number of threads
numjobs=4
numjobs=2
numjobs=1

# CPU0: cores 0-17, hyperthreaded cores 36-53
[pmem0]
filename=/dev/pmem0
# pick the desired cpus_allowed list
cpus_allowed=0,1,2,3
cpus_allowed=0,1,36,37
cpus_allowed=0,36
cpus_allowed=0,1
cpus_allowed=0
rw=randwrite

Although most CPU time is in movnti instructions (non-temporal stores),
there is overhead in clearing the page cache and in the pmem block
driver; those won't be present in your initialization function. 
perf top shows:
  82.00%  [kernel]  [k] memcpy_flushcache
   5.23%  [kernel]  [k] gup_pgd_range
   3.41%  [kernel]  [k] __blkdev_direct_IO_simple
   2.38%  [kernel]  [k] pmem_make_request
   1.46%  [kernel]  [k] write_pmem
   1.29%  [kernel]  [k] pmem_do_bvec


---
Robert Elliott, HPE Persistent Memory





Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-19 Thread Daniel Jordan
On Mon, Nov 12, 2018 at 08:54:12AM -0800, Daniel Jordan wrote:
> On Sat, Nov 10, 2018 at 03:48:14AM +, Elliott, Robert (Persistent Memory) 
> wrote:
> > > -Original Message-
> > > From: linux-kernel-ow...@vger.kernel.org <linux-kernel-ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > > 
> > > ...  The kernel doesn't
> > > know the memory bandwidth of a given system to get the most efficient
> > > number of threads, so there's some guesswork involved.  
> > 
> > The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
> > that kind of information, and could facilitate automatic tuning.
> > 
> > There was discussion last year about kernel support for it:
> > https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwis...@linux.intel.com/
> 
> Thanks for bringing this up.  I'm traveling but will take a closer look when I
> get back.

So this series would give the total bandwidth for a memory target, but there's
not a way to map that to a CPU count.  In other words, it seems we couldn't
determine how many CPUs it takes to reach the max bandwidth.  If I haven't
missed something, I'm going to remove that comment.


Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-19 Thread Daniel Jordan
On Mon, Nov 12, 2018 at 10:15:46PM +, Elliott, Robert (Persistent Memory) 
wrote:
> 
> 
> > -Original Message-
> > From: Daniel Jordan 
> > Sent: Monday, November 12, 2018 11:54 AM
> > To: Elliott, Robert (Persistent Memory) 
> > Cc: Daniel Jordan ; linux...@kvack.org;
> > k...@vger.kernel.org; linux-kernel@vger.kernel.org; aarca...@redhat.com;
> > aaron...@intel.com; a...@linux-foundation.org; alex.william...@redhat.com;
> > b...@redhat.com; darrick.w...@oracle.com; dave.han...@linux.intel.com;
> > j...@mellanox.com; jwad...@google.com; jiangshan...@gmail.com;
> > mho...@kernel.org; mike.krav...@oracle.com; pavel.tatas...@microsoft.com;
> > prasad.singamse...@oracle.com; rdun...@infradead.org;
> > steven.sist...@oracle.com; tim.c.c...@intel.com; t...@kernel.org;
> > vba...@suse.cz
> > Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> > 
> > On Sat, Nov 10, 2018 at 03:48:14AM +, Elliott, Robert (Persistent
> > Memory) wrote:
> > > > -Original Message-
> > > > From: linux-kernel-ow...@vger.kernel.org <linux-kernel-ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > initialization within each node
> > > >
> ...
> > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > CPUs on the node.
> > > ...
> > > > +   /*
> > > > +    * We'd like to know the memory bandwidth of the chip to
> > > > +    * calculate the most efficient number of threads to start,
> > > > +    * but we can't.  In testing, a good value for a variety of
> > > > +    * systems was a quarter of the CPUs on the node.
> > > > +    */
> > > > +   nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > >
> > >
> > > You might want to base that calculation on and limit the threads to
> > > physical cores, not hyperthreaded cores.
> > 
> > Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> > don't have data that shows that in this case.
> 
> I think that's only if there are some register-based calculations to do while
> waiting. If both threads are just doing memory accesses, they'll both stall,
> and there doesn't seem to be any benefit in having two contexts generate the
> IOs rather than one (at least on the systems I've used). I think it takes
> longer to switch contexts than to just turn around the next IO.

(Sorry for the delay, Plumbers is over now...)

I guess we're both just waving our hands without data.  I've only got x86, so
using a quarter of the CPUs rules out HT on my end.  Do you have a system that
you can test this on, where using a quarter of the CPUs will involve HT?

Thanks,
Daniel


RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-12 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: Daniel Jordan 
> Sent: Monday, November 12, 2018 11:54 AM
> To: Elliott, Robert (Persistent Memory) 
> Cc: Daniel Jordan ; linux...@kvack.org;
> k...@vger.kernel.org; linux-kernel@vger.kernel.org; aarca...@redhat.com;
> aaron...@intel.com; a...@linux-foundation.org; alex.william...@redhat.com;
> b...@redhat.com; darrick.w...@oracle.com; dave.han...@linux.intel.com;
> j...@mellanox.com; jwad...@google.com; jiangshan...@gmail.com;
> mho...@kernel.org; mike.krav...@oracle.com; pavel.tatas...@microsoft.com;
> prasad.singamse...@oracle.com; rdun...@infradead.org;
> steven.sist...@oracle.com; tim.c.c...@intel.com; t...@kernel.org;
> vba...@suse.cz
> Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
> 
> On Sat, Nov 10, 2018 at 03:48:14AM +, Elliott, Robert (Persistent
> Memory) wrote:
> > > -Original Message-
> > > From: linux-kernel-ow...@vger.kernel.org <linux-kernel-ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > >
...
> > > In testing, a reasonable value turned out to be about a quarter of the
> > > CPUs on the node.
> > ...
> > > + /*
> > > +  * We'd like to know the memory bandwidth of the chip to
> > > +  * calculate the most efficient number of threads to start,
> > > +  * but we can't.  In testing, a good value for a variety of
> > > +  * systems was a quarter of the CPUs on the node.
> > > +  */
> > > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> >
> >
> > You might want to base that calculation on and limit the threads to
> > physical cores, not hyperthreaded cores.
> 
> Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> don't have data that shows that in this case.

I think that's only if there are some register-based calculations to do while
waiting. If both threads are just doing memory accesses, they'll both stall,
and there doesn't seem to be any benefit in having two contexts generate the
IOs rather than one (at least on the systems I've used). I think it takes
longer to switch contexts than to just turn around the next IO.


---
Robert Elliott, HPE Persistent Memory





Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-12 Thread Daniel Jordan
On Sat, Nov 10, 2018 at 03:48:14AM +, Elliott, Robert (Persistent Memory) 
wrote:
> > -Original Message-
> > From: linux-kernel-ow...@vger.kernel.org <linux-kernel-ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> > Sent: Monday, November 05, 2018 10:56 AM
> > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> > 
> > ...  The kernel doesn't
> > know the memory bandwidth of a given system to get the most efficient
> > number of threads, so there's some guesswork involved.  
> 
> The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
> that kind of information, and could facilitate automatic tuning.
> 
> There was discussion last year about kernel support for it:
> https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwis...@linux.intel.com/

Thanks for bringing this up.  I'm traveling but will take a closer look when I
get back.

> > In testing, a reasonable value turned out to be about a quarter of the
> > CPUs on the node.
> ...
> > +   /*
> > +    * We'd like to know the memory bandwidth of the chip to
> > +    * calculate the most efficient number of threads to start,
> > +    * but we can't.  In testing, a good value for a variety of
> > +    * systems was a quarter of the CPUs on the node.
> > +    */
> > +   nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> 
> 
> You might want to base that calculation on and limit the threads to
> physical cores, not hyperthreaded cores.

Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
don't have data that shows that in this case.


RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-09 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org <linux-kernel-ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> Sent: Monday, November 05, 2018 10:56 AM
> Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
> 
> ...  The kernel doesn't
> know the memory bandwidth of a given system to get the most efficient
> number of threads, so there's some guesswork involved.  

The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
that kind of information, and could facilitate automatic tuning.

There was discussion last year about kernel support for it:
https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwis...@linux.intel.com/


> In testing, a reasonable value turned out to be about a quarter of the
> CPUs on the node.
...
> + /*
> +  * We'd like to know the memory bandwidth of the chip to
> +  * calculate the most efficient number of threads to start,
> +  * but we can't.  In testing, a good value for a variety of
> +  * systems was a quarter of the CPUs on the node.
> +  */
> + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);


You might want to base that calculation on and limit the threads to
physical cores, not hyperthreaded cores.
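
A sketch of what that could look like against the quoted line, assuming a
hypothetical nr_physical_cores() helper (for example one built on
topology_sibling_cpumask(), as discussed elsewhere in the thread):

	/* Quarter of the physical cores rather than of all CPUs. */
	nr_node_cpus = DIV_ROUND_UP(nr_physical_cores(cpumask), 4);

Limiting where the threads actually run would additionally require binding
them to one sibling per core.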

---
Robert Elliott, HPE Persistent Memory