Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts

2024-11-01 Thread Thomas Gleixner
On Fri, Nov 01 2024 at 11:03, mapicccy wrote:
>> On Oct 31, 2024 at 18:35, Thomas Gleixner  wrote:
>>> +   get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>>> +
>>> +   for_each_node_mask(n, nodemsk) {
>>> +   cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>>> +   cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>> 
>> How is this managed_irqs_cpumsk array protected against concurrency?
>
> My intention was to allocate up to `managed_irqs_per_node` cpu bits from
> `managed_irqs_cpumask[n]`, even if another task modifies some of the bits in
> `managed_irqs_cpumask[n]` at the same time.

That may have been your intention, but how is this even remotely
correct?

Aside of that. If it's intentional and you think it's correct then you
should have documented that in the code and also annotated it to not
trigger sanitizers.
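For reference, a minimal sketch (not part of the patch) of how the read-modify-write
cycles on the per-node masks could be serialized, reusing the patch's
managed_irqs_cpumsk, managed_irqs_per_node and __group_prepare_affinity() and
assuming all callers run in process context (group_cpus_evenly() already allocates
with GFP_KERNEL). This only addresses the data race, not the stale-state problem
on CPU hotplug raised elsewhere in this thread:

static DEFINE_MUTEX(managed_irqs_lock);

static void __group_prepare_affinity(struct cpumask *premask,
				     cpumask_var_t *node_to_cpumask)
{
	nodemask_t nodemsk = NODE_MASK_NONE;
	unsigned int ncpus, n;

	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);

	/* Serialize all read-modify-write cycles on the per-node masks */
	mutex_lock(&managed_irqs_lock);
	for_each_node_mask(n, nodemsk) {
		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);

		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
		if (ncpus < managed_irqs_per_node) {
			/* Refill node n from the current node cpumask */
			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
		}
	}
	mutex_unlock(&managed_irqs_lock);
}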

>> Given the limitations of the x86 vector space, which is not going away
>> anytime soon, there are only two options IMO to handle such a scenario.
>> 
>>   1) Tell the nvme/block layer to disable queue affinity management
>> 
>>   2) Restrict the devices and queues to the nodes they sit on
>
> I have tried fixing this issue through nvme driver, but later
> discovered that the same issue exists with virtio net.  Therefore, I
> want to address this with a more general solution.

I understand, but a general solution for this problem won't exist
ever.

It's very reasonable to restrict this for one particular device type or
subsystem while maintaining the strict managed property for others, no?

General solutions are definitely preferred, but not at the price of breaking
existing, completely correct and working setups, which is what your 2/2 patch
does for sure.

Thanks,

tglx



Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts

2024-11-01 Thread Jiri Slaby

Hi,

On 31. 10. 24, 8:46, 'Guanjun' wrote:

From: Guanjun 

Commit c410abbbacb9 (genirq/affinity: Add is_managed to struct irq_affinity_desc)
introduced the is_managed bit to struct irq_affinity_desc. Because queue
interrupts are treated as managed interrupts, in scenarios where a large number
of devices are present (using massive msix queue interrupts), an excessive
number of IRQ matrix bits (about num_online_cpus() * nvecs) are reserved during
interrupt allocation. This subsequently leads to the situation where interrupts
for some devices cannot be properly allocated.

Add support for limiting the number of managed interrupts on every node per
allocation.

Signed-off-by: Guanjun 
---
  .../admin-guide/kernel-parameters.txt |  9 +++
  block/blk-mq-cpumap.c |  2 +-
  drivers/virtio/virtio_vdpa.c  |  2 +-
  fs/fuse/virtio_fs.c   |  2 +-
  include/linux/group_cpus.h|  2 +-
  kernel/irq/affinity.c | 11 ++--
  lib/group_cpus.c  | 55 ++-
  7 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9b61097a6448..ac80f35d04c9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3238,6 +3238,15 @@
different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
  
+	managed_irqs_per_node=
+   [KNL,SMP] Support for limiting the number of managed
+   interrupts on every node to prevent the case that
+   interrupts cannot be properly allocated where a large
+   number of devices are present. The default number is 0,
+   which means no limit on the number of managed irqs.
+   Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
+   Default: 0


Kernel parameters suck. Esp. here you have to guess to even properly 
boot. Could this be auto-tuned instead?
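A hedged sketch of what auto-tuning could look like, purely for illustration:
derive the per-node cap from the possible-CPU/possible-node topology at init
time when no value was given. The helper name and the divide-by-two heuristic
are invented here, not taken from the patch:

#include <linux/cpumask.h>
#include <linux/minmax.h>
#include <linux/nodemask.h>

/* Hypothetical fallback when managed_irqs_per_node was not set on the command line */
static unsigned int managed_irqs_default_per_node(void)
{
	unsigned int cpus_per_node;

	/* num_possible_nodes() is at least 1; the max_t() is only defensive */
	cpus_per_node = num_possible_cpus() /
			max_t(unsigned int, 1, num_possible_nodes());

	/* Leave at most half of a node's CPUs reserved for managed vectors */
	return max(1U, cpus_per_node / 2);
}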



--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -11,6 +11,30 @@
  
  #ifdef CONFIG_SMP
  
+static unsigned int __read_mostly managed_irqs_per_node;

+static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {


This is quite excessive. On SUSE configs, this is 8192 cpu bits * 1024 
nodes = 1 M. For everyone. You have to allocate this dynamically 
instead. See e.g. setup_node_to_cpumask_map().



+   [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
+};
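Picking up that point, a minimal sketch of allocating the masks dynamically
along the lines of setup_node_to_cpumask_map(), sized by nr_node_ids instead of
MAX_NUMNODES; the init function name is hypothetical and not part of the patch:

#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/nodemask.h>
#include <linux/slab.h>

static cpumask_var_t *managed_irqs_cpumsk;

static int __init managed_irqs_alloc_masks(void)
{
	unsigned int n;

	/* nr_node_ids reflects the booted platform, not the config maximum */
	managed_irqs_cpumsk = kcalloc(nr_node_ids, sizeof(*managed_irqs_cpumsk),
				      GFP_KERNEL);
	if (!managed_irqs_cpumsk)
		return -ENOMEM;

	for (n = 0; n < nr_node_ids; n++) {
		if (!alloc_cpumask_var(&managed_irqs_cpumsk[n], GFP_KERNEL))
			return -ENOMEM;
		/* Same starting state as the static CPU_BITS_ALL initializer */
		cpumask_setall(managed_irqs_cpumsk[n]);
	}
	return 0;
}
early_initcall(managed_irqs_alloc_masks);

Call sites would then use managed_irqs_cpumsk[n] (a cpumask_var_t) instead of
&managed_irqs_cpumsk[n], and a real version would free the already-allocated
masks on the error path; both are omitted here for brevity.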
+
+static int __init irq_managed_setup(char *str)
+{
+   int ret;
+
+   ret = kstrtouint(str, 10, &managed_irqs_per_node);
+   if (ret < 0) {
+   pr_warn("managed_irqs_per_node= cannot parse, ignored\n");


could not be parsed


+   return 0;
+   }
+
+   if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) {
+   managed_irqs_per_node = num_possible_cpus() / num_possible_nodes();
+   pr_warn("managed_irqs_per_node= cannot be larger than %u\n",
+   managed_irqs_per_node);
+   }
+   return 1;
+}
+__setup("managed_irqs_per_node=", irq_managed_setup);
+
  static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
unsigned int cpus_per_grp)
  {

...

@@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  /**
   * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
   * @numgrps: number of groups
+ * @is_managed: if these groups managed by kernel


are managed by the kernel


   *
   * Return: cpumask array if successful, NULL otherwise. And each element
   * includes CPUs assigned to this group


thanks,
--
js
suse labs




Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts

2024-10-31 Thread Jason Wang
On Fri, Nov 1, 2024 at 11:12 AM mapicccy  wrote:
>
>
>
> On Oct 31, 2024 at 18:50, Ming Lei  wrote:
>
> On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner  wrote:
>
>
> On Thu, Oct 31 2024 at 15:46, guan...@linux.alibaba.com wrote:
>
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>
> +static void __group_prepare_affinity(struct cpumask *premask,
> +  cpumask_var_t *node_to_cpumask)
> +{
> + nodemask_t nodemsk = NODE_MASK_NONE;
> + unsigned int ncpus, n;
> +
> + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> + for_each_node_mask(n, nodemsk) {
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> + if (ncpus < managed_irqs_per_node) {
> + /* Reset node n to current node cpumask */
> + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
>   1) Tell the nvme/block layer to disable queue affinity management
>
>
> +1
>
> There are other use cases, such as cpu isolation, which can benefit from
> this way too.
>
> https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming@redhat.com/
>

I wonder if we need to do the same for virtio-blk.

>
> Thanks for your reminder. However, the patch in that link only modifies the NVMe
> driver, while the same issue exists in the virtio net driver as well.

I guess you meant virtio-blk actually?

>
> Guanjun
>
>
> Thanks,
>

Thanks




Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts

2024-10-31 Thread mapicccy



> On Oct 31, 2024 at 18:35, Thomas Gleixner  wrote:
> 
> On Thu, Oct 31 2024 at 15:46, guan...@linux.alibaba.com wrote:
>> #ifdef CONFIG_SMP
>> 
>> +static unsigned int __read_mostly managed_irqs_per_node;
>> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
>> +[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
>> +};
>> 
>> +static void __group_prepare_affinity(struct cpumask *premask,
>> + cpumask_var_t *node_to_cpumask)
>> +{
>> +nodemask_t nodemsk = NODE_MASK_NONE;
>> +unsigned int ncpus, n;
>> +
>> +get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>> +
>> +for_each_node_mask(n, nodemsk) {
>> +cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>> +cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
> 
> How is this managed_irqs_cpumsk array protected against concurrency?

My intention was to allocate up to `managed_irqs_per_node` cpu bits from
`managed_irqs_cpumask[n]`, even if another task modifies some of the bits in
`managed_irqs_cpumask[n]` at the same time.

> 
>> +ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
>> +if (ncpus < managed_irqs_per_node) {
>> +/* Reset node n to current node cpumask */
>> +cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
> 
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.

Sorry, I realize there is indeed a logic issue here (caused by developing on
5.10 LTS and rebasing to the latest linux-next).

> 
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
> 
>   1) Tell the nvme/block layer to disable queue affinity management
> 
>   2) Restrict the devices and queues to the nodes they sit on

I have tried fixing this issue through the nvme driver, but later discovered
that the same issue exists with virtio net. Therefore, I want to address this
with a more general solution.

Thanks,
Guanjun

> 
> Thanks,
> 
>tglx




Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts

2024-10-31 Thread Ming Lei
On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner  wrote:
>
> On Thu, Oct 31 2024 at 15:46, guan...@linux.alibaba.com wrote:
> >  #ifdef CONFIG_SMP
> >
> > +static unsigned int __read_mostly managed_irqs_per_node;
> > +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> > + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> > +};
> >
> > +static void __group_prepare_affinity(struct cpumask *premask,
> > +  cpumask_var_t *node_to_cpumask)
> > +{
> > + nodemask_t nodemsk = NODE_MASK_NONE;
> > + unsigned int ncpus, n;
> > +
> > + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> > +
> > + for_each_node_mask(n, nodemsk) {
> > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> > + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> > + if (ncpus < managed_irqs_per_node) {
> > + /* Reset node n to current node cpumask */
> > + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
>1) Tell the nvme/block layer to disable queue affinity management

+1

There are other use cases, such as cpu isolation, which can benefit from
this way too.

https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming@redhat.com/

Thanks,




Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts

2024-10-31 Thread Thomas Gleixner
On Thu, Oct 31 2024 at 15:46, guan...@linux.alibaba.com wrote:
>  #ifdef CONFIG_SMP
>  
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>  
> +static void __group_prepare_affinity(struct cpumask *premask,
> +  cpumask_var_t *node_to_cpumask)
> +{
> + nodemask_t nodemsk = NODE_MASK_NONE;
> + unsigned int ncpus, n;
> +
> + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> + for_each_node_mask(n, nodemsk) {
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);

How is this managed_irqs_cpumsk array protected against concurrency?

> + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> + if (ncpus < managed_irqs_per_node) {
> + /* Reset node n to current node cpumask */
> + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);

This whole logic is incomprehensible and aside of the concurrency
problem it's broken when CPUs are made present at run-time because these
cpu masks are static and represent the stale state of the last
invocation.

Given the limitations of the x86 vector space, which is not going away
anytime soon, there are only two options IMO to handle such a scenario.

   1) Tell the nvme/block layer to disable queue affinity management

   2) Restrict the devices and queues to the nodes they sit on
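For reference, a hedged sketch of what option 1 could look like from the driver
side, using only the standard pci_alloc_irq_vectors*() API; the helper and its
nr_queues/managed parameters are hypothetical, not an existing knob in any
subsystem:

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical helper contrasting managed and non-managed vector allocation */
static int example_alloc_queue_irqs(struct pci_dev *pdev,
				    unsigned int nr_queues, bool managed)
{
	struct irq_affinity affd = { .pre_vectors = 1 };	/* e.g. an admin queue */

	if (managed)
		/*
		 * Managed spreading: the core reserves vector space for the
		 * spread masks up front, which is what exhausts the vector
		 * matrix when many such devices are present.
		 */
		return pci_alloc_irq_vectors_affinity(pdev, 1, nr_queues + 1,
						      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
						      &affd);

	/* Non-managed: plain MSI-X vectors without managed-affinity bookkeeping */
	return pci_alloc_irq_vectors(pdev, 1, nr_queues + 1, PCI_IRQ_MSIX);
}

How such a choice gets exposed (module parameter, subsystem flag, etc.) is left
open; the point is that it stays a per-driver or per-subsystem decision rather
than a global one.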

Thanks,

tglx