Re: [PATCH V5 2/2] genirq/affinity: Spread vectors on node according to nr_cpu ratio

2019-08-16 Thread Derrick, Jonathan
On Fri, 2019-08-16 at 09:53 -0600, Keith Busch wrote:
> On Thu, Aug 15, 2019 at 07:28:49PM -0700, Ming Lei wrote:
> > Currently __irq_build_affinity_masks() spreads vectors evenly per node,
> > but not all vectors may get spread when the NUMA nodes have different
> > numbers of CPUs, which can trigger the warning in
> > irq_build_affinity_masks().
> > 
> > Improve the spreading algorithm by assigning vectors according to the
> > ratio of each node's nr_cpus to nr_remaining_cpus, and by running the
> > assignment from smaller nodes to bigger nodes so that every active node
> > is guaranteed at least one vector. This avoids cross-node spreading in
> > the normal situation.
> > 
> > It also fixes the reported warning.
> > 
> > Another benefit is that the spread becomes fairer when the nodes have
> > different numbers of CPUs.
> > 
> > For example, on the following machine:
> > [root@ktest-01 ~]# lscpu
> > ...
> > CPU(s):  16
> > On-line CPU(s) list: 0-15
> > Thread(s) per core:  1
> > Core(s) per socket:  8
> > Socket(s):   2
> > NUMA node(s):2
> > ...
> > NUMA node0 CPU(s):   0,1,3,5-9,11,13-15
> > NUMA node1 CPU(s):   2,4,10,12
> > 
> > When a driver requests 8 vectors, the following spread results:
> > irq 31, cpu list 2,4
> > irq 32, cpu list 10,12
> > irq 33, cpu list 0-1
> > irq 34, cpu list 3,5
> > irq 35, cpu list 6-7
> > irq 36, cpu list 8-9
> > irq 37, cpu list 11,13
> > irq 38, cpu list 14-15
> > 
> > Without this patch, the kernel warning is triggered in the above
> > situation, and the allocation was expected to be 4 vectors for each node.
> > 
> > Cc: Christoph Hellwig 
> > Cc: Keith Busch 
> > Cc: linux-n...@lists.infradead.org,
> > Cc: Jon Derrick 
> > Cc: Jens Axboe 
> > Reported-by: Jon Derrick 
> > Signed-off-by: Ming Lei 
> 
> I had every intention to thoroughly test this on imbalanced node
> configurations, but that's not going to happen anytime soon. It looks
> correct to me, so I'll append my review here.
> 
I can only test this with 2 nodes but I have varied nr_cpus as well as
using different devices with fewer and more vectors than CPUs. Spread
looks good.

Thank you

Reviewed-by: Jon Derrick 


[snip]




Re: [PATCH V5 2/2] genirq/affinity: Spread vectors on node according to nr_cpu ratio

2019-08-16 Thread Ming Lei
On Fri, Aug 16, 2019 at 11:56 PM Keith Busch  wrote:
>
> On Thu, Aug 15, 2019 at 07:28:49PM -0700, Ming Lei wrote:
> > Currently __irq_build_affinity_masks() spreads vectors evenly per node,
> > but not all vectors may get spread when the NUMA nodes have different
> > numbers of CPUs, which can trigger the warning in
> > irq_build_affinity_masks().
> >
> > Improve the spreading algorithm by assigning vectors according to the
> > ratio of each node's nr_cpus to nr_remaining_cpus, and by running the
> > assignment from smaller nodes to bigger nodes so that every active node
> > is guaranteed at least one vector. This avoids cross-node spreading in
> > the normal situation.
> >
> > It also fixes the reported warning.
> >
> > Another benefit is that the spread becomes fairer when the nodes have
> > different numbers of CPUs.
> >
> > For example, on the following machine:
> >   [root@ktest-01 ~]# lscpu
> >   ...
> >   CPU(s):  16
> >   On-line CPU(s) list: 0-15
> >   Thread(s) per core:  1
> >   Core(s) per socket:  8
> >   Socket(s):   2
> >   NUMA node(s):2
> >   ...
> >   NUMA node0 CPU(s):   0,1,3,5-9,11,13-15
> >   NUMA node1 CPU(s):   2,4,10,12
> >
> > When a driver requests 8 vectors, the following spread results:
> >   irq 31, cpu list 2,4
> >   irq 32, cpu list 10,12
> >   irq 33, cpu list 0-1
> >   irq 34, cpu list 3,5
> >   irq 35, cpu list 6-7
> >   irq 36, cpu list 8-9
> >   irq 37, cpu list 11,13
> >   irq 38, cpu list 14-15
> >
> > Without this patch, the kernel warning is triggered in the above
> > situation, and the allocation was expected to be 4 vectors for each node.
> >
> > Cc: Christoph Hellwig 
> > Cc: Keith Busch 
> > Cc: linux-n...@lists.infradead.org,
> > Cc: Jon Derrick 
> > Cc: Jens Axboe 
> > Reported-by: Jon Derrick 
> > Signed-off-by: Ming Lei 
>
> I had every intention to thoroughly test this on imbalanced node
> configurations, but that's not going to happen anytime soon. It looks
> correct to me, so I'll append my review here.

Thanks for your review!

>
> I'm not sure I'd include the detailed correctness explanation in the
> comments, though. It essentially boils down to "the sum of the parts
> doesn't exceed the whole", and the key to carving up the parts is the
> sorted iteration you've implemented.

There are two invariants guaranteed by this approach:

1) Every node with a non-zero number of active CPUs is allocated at least
one vector.

2) For each node, the number of allocated vectors is <= the number of
active CPUs in that node.

Neither point is obvious, but both are essential to the correctness of
this approach.

That is why I added the long correctness proof in the comment.
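
To make that concrete, here is a minimal userspace sketch of the
smallest-node-first ratio assignment (illustrative only: assign_vectors()
and struct node are made-up names, not code from the patch, and it assumes
every listed node has at least one active CPU and that numvecs is larger
than the number of active nodes, as in the kernel path that does this
allocation), run on the 2-node example from the commit message:

    #include <stdio.h>

    struct node { unsigned ncpus, nvectors; };

    /* nodes[] must already be sorted by ncpus, smallest first */
    static void assign_vectors(struct node *nodes, unsigned nr_nodes,
                               unsigned numvecs)
    {
            unsigned remaining_ncpus = 0, n;

            for (n = 0; n < nr_nodes; n++)
                    remaining_ncpus += nodes[n].ncpus;

            /* never hand out more vectors than active CPUs */
            if (numvecs > remaining_ncpus)
                    numvecs = remaining_ncpus;

            for (n = 0; n < nr_nodes; n++) {
                    unsigned ncpus = nodes[n].ncpus;
                    unsigned nvec = numvecs * ncpus / remaining_ncpus;

                    /* invariant 1: every active node gets >= 1 vector */
                    if (nvec < 1)
                            nvec = 1;
                    /*
                     * invariant 2: nvec <= ncpus, because numvecs does not
                     * exceed remaining_ncpus; the long comment in the patch
                     * is what argues this stays true on every iteration.
                     */
                    nodes[n].nvectors = nvec;
                    remaining_ncpus -= ncpus;
                    numvecs -= nvec;
            }
    }

    int main(void)
    {
            /* node1 has 4 active CPUs, node0 has 12; 8 vectors requested */
            struct node nodes[] = { { 4, 0 }, { 12, 0 } };
            unsigned n;

            assign_vectors(nodes, 2, 8);
            for (n = 0; n < 2; n++)
                    printf("node with %2u CPUs -> %u vectors\n",
                           nodes[n].ncpus, nodes[n].nvectors);
            return 0;
    }

This prints 2 vectors for the 4-CPU node and 6 for the 12-CPU node,
matching the irq 31-38 spread in the commit message.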

Thanks,
Ming
>
> Reviewed-by: Keith Busch 
>
> > ---
> >  kernel/irq/affinity.c | 223 --
> >  1 file changed, 193 insertions(+), 30 deletions(-)
> >
> > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > index c7cca942bd8a..32e07e58ce81 100644
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -7,6 +7,7 @@
> >  #include <linux/kernel.h>
> >  #include <linux/slab.h>
> >  #include <linux/cpu.h>
> > +#include <linux/sort.h>
> >
> >  static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> >   unsigned int cpus_per_vec)
> > @@ -94,6 +95,155 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
> >   return nodes;
> >  }
> >
> > +struct node_vectors {
> > + unsigned id;
> > +
> > + union {
> > + unsigned nvectors;
> > + unsigned ncpus;
> > + };
> > +};
> > +
> > +static int ncpus_cmp_func(const void *l, const void *r)
> > +{
> > + const struct node_vectors *ln = l;
> > + const struct node_vectors *rn = r;
> > +
> > + return ln->ncpus - rn->ncpus;
> > +}
> > +
> > +/*
> > + * Allocate vector number for each node, so that for each node:
> > + *
> > + * 1) the allocated number is >= 1
> > + *
> > + * 2) the allocated number is <= active CPU number of this node
> > + *
> > + * The actual allocated total vectors may be less than @numvecs when
> > + * active total CPU number is less than @numvecs.
> > + *
> > + * Active CPUs means the CPUs in '@cpu_mask AND @node_to_cpumask[]'
> > + * for each node.
> > + */
> > +static void alloc_nodes_vectors(unsigned int numvecs,
> > + const cpumask_var_t *node_to_cpumask,
> > + const struct cpumask *cpu_mask,
> > + const nodemask_t nodemsk,
> > + struct cpumask *nmsk,
> > + struct node_vectors *node_vectors)
> > +{
> > + unsigned n, remaining_ncpus = 0;
> > +
> > + for (n = 0; n < nr_node_ids; n++) {
> > + node_vectors[n].id = n;
> > + node_vectors[n].ncpus = UINT_MAX;
> > + }
> > +
> > + for_each_node_mask(n, 

Re: [PATCH V5 2/2] genirq/affinity: Spread vectors on node according to nr_cpu ratio

2019-08-16 Thread Keith Busch
On Thu, Aug 15, 2019 at 07:28:49PM -0700, Ming Lei wrote:
> Currently __irq_build_affinity_masks() spreads vectors evenly per node,
> but not all vectors may get spread when the NUMA nodes have different
> numbers of CPUs, which can trigger the warning in
> irq_build_affinity_masks().
> 
> Improve the spreading algorithm by assigning vectors according to the
> ratio of each node's nr_cpus to nr_remaining_cpus, and by running the
> assignment from smaller nodes to bigger nodes so that every active node
> is guaranteed at least one vector. This avoids cross-node spreading in
> the normal situation.
> 
> It also fixes the reported warning.
> 
> Another benefit is that the spread becomes fairer when the nodes have
> different numbers of CPUs.
> 
> For example, on the following machine:
>   [root@ktest-01 ~]# lscpu
>   ...
>   CPU(s):  16
>   On-line CPU(s) list: 0-15
>   Thread(s) per core:  1
>   Core(s) per socket:  8
>   Socket(s):   2
>   NUMA node(s):2
>   ...
>   NUMA node0 CPU(s):   0,1,3,5-9,11,13-15
>   NUMA node1 CPU(s):   2,4,10,12
> 
> When a driver requests 8 vectors, the following spread results:
>   irq 31, cpu list 2,4
>   irq 32, cpu list 10,12
>   irq 33, cpu list 0-1
>   irq 34, cpu list 3,5
>   irq 35, cpu list 6-7
>   irq 36, cpu list 8-9
>   irq 37, cpu list 11,13
>   irq 38, cpu list 14-15
> 
> Without this patch, the kernel warning is triggered in the above
> situation, and the allocation was expected to be 4 vectors for each node.
> 
> Cc: Christoph Hellwig 
> Cc: Keith Busch 
> Cc: linux-n...@lists.infradead.org,
> Cc: Jon Derrick 
> Cc: Jens Axboe 
> Reported-by: Jon Derrick 
> Signed-off-by: Ming Lei 

I had every intention to thoroughly test this on imbalanced node
configurations, but that's not going to happen anytime soon. It looks
correct to me, so I'll append my review here.

I'm not sure I'd include the detailed correctness explanation in the
comments, though. It essentially boils down to "the sum of the parts
doesn't exceed the whole", and the key to carving up the parts is the
sorted iteration you've implemented.
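
For illustration (made-up numbers, not from the patch): with three nodes
of 1, 1 and 14 active CPUs and 4 vectors to place, the ascending walk
gives (integer division throughout)

        node with  1 CPU : max(1, 4 * 1 / 16) -> 1   (3 vectors, 15 CPUs left)
        node with  1 CPU : max(1, 3 * 1 / 15) -> 1   (2 vectors, 14 CPUs left)
        node with 14 CPUs:        2 * 14 / 14 -> 2   (total: 4 vectors)

whereas walking the same nodes largest-first would hand out 3 + 1 + 1 = 5
vectors, over-allocating by one. Sorting smallest-first burns off the
forced "at least one vector" cases while there is still slack, which is
what keeps the sum of the parts within the whole.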

Reviewed-by: Keith Busch 

> ---
>  kernel/irq/affinity.c | 223 --
>  1 file changed, 193 insertions(+), 30 deletions(-)
> 
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index c7cca942bd8a..32e07e58ce81 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -7,6 +7,7 @@
>  #include <linux/kernel.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
> +#include <linux/sort.h>
>  
>  static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>   unsigned int cpus_per_vec)
> @@ -94,6 +95,155 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
>   return nodes;
>  }
>  
> +struct node_vectors {
> + unsigned id;
> +
> + union {
> + unsigned nvectors;
> + unsigned ncpus;
> + };
> +};
> +
> +static int ncpus_cmp_func(const void *l, const void *r)
> +{
> + const struct node_vectors *ln = l;
> + const struct node_vectors *rn = r;
> +
> + return ln->ncpus - rn->ncpus;
> +}
> +
> +/*
> + * Allocate vector number for each node, so that for each node:
> + *
> + * 1) the allocated number is >= 1
> + *
> + * 2) the allocated number is <= active CPU number of this node
> + *
> + * The actual allocated total vectors may be less than @numvecs when
> + * active total CPU number is less than @numvecs.
> + *
> + * Active CPUs means the CPUs in '@cpu_mask AND @node_to_cpumask[]'
> + * for each node.
> + */
> +static void alloc_nodes_vectors(unsigned int numvecs,
> + const cpumask_var_t *node_to_cpumask,
> + const struct cpumask *cpu_mask,
> + const nodemask_t nodemsk,
> + struct cpumask *nmsk,
> + struct node_vectors *node_vectors)
> +{
> + unsigned n, remaining_ncpus = 0;
> +
> + for (n = 0; n < nr_node_ids; n++) {
> + node_vectors[n].id = n;
> + node_vectors[n].ncpus = UINT_MAX;
> + }
> +
> + for_each_node_mask(n, nodemsk) {
> + unsigned ncpus;
> +
> + cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
> + ncpus = cpumask_weight(nmsk);
> +
> + if (!ncpus)
> + continue;
> + remaining_ncpus += ncpus;
> + node_vectors[n].ncpus = ncpus;
> + }
> +
> + numvecs = min_t(unsigned, remaining_ncpus, numvecs);
> +
> + sort(node_vectors, nr_node_ids, sizeof(node_vectors[0]),
> +  ncpus_cmp_func, NULL);
> +
> + /*
> +  * Allocate vectors for each node according to the ratio of this
> +  * node's nr_cpus to remaining un-assigned ncpus. 'numvecs' is
> +  * bigger than number of active numa nodes. Always start the
> +  *