from:"Thomas Gleixner"

Re: [PATCH v2 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

2024-05-08 Thread Thomas Gleixner

On Tue, May 07 2024 at 15:23, Vignesh Balasubramanian wrote:
> +struct xfeat_component {
> + u32 xfeat_type;
> + u32 xfeat_sz;
> + u32 xfeat_off;
> + u32 xfeat_flags;
> +} __packed;

Why repeating xfeat_ for all member names?

u32   type;
u32   size;
u32   offset;
u32   flags;

is sufficient and obvious, no?

> +enum custom_feature {
> + FEATURE_XSAVE_FP = 0,
> + FEATURE_XSAVE_SSE = 1,
> + FEATURE_XSAVE_YMM = 2,
> + FEATURE_XSAVE_BNDREGS = 3,
> + FEATURE_XSAVE_BNDCSR = 4,
> + FEATURE_XSAVE_OPMASK = 5,
> + FEATURE_XSAVE_ZMM_Hi256 = 6,
> + FEATURE_XSAVE_Hi16_ZMM = 7,
> + FEATURE_XSAVE_PT = 8,
> + FEATURE_XSAVE_PKRU = 9,
> + FEATURE_XSAVE_PASID = 10,
> + FEATURE_XSAVE_CET_USER = 11,
> + FEATURE_XSAVE_CET_SHADOW_STACK = 12,
> + FEATURE_XSAVE_HDC = 13,
> + FEATURE_XSAVE_UINTR = 14,
> + FEATURE_XSAVE_LBR = 15,
> + FEATURE_XSAVE_HWP = 16,
> + FEATURE_XSAVE_XTILE_CFG = 17,
> + FEATURE_XSAVE_XTILE_DATA = 18,
> + FEATURE_MAX,
> + FEATURE_XSAVE_EXTENDED_START = FEATURE_XSAVE_YMM,
> + FEATURE_XSAVE_EXTENDED_END = FEATURE_XSAVE_XTILE_DATA,
> +};

Why can't this use the existing 'enum xfeature' which is providing
exactly the same information already?

> +#ifdef CONFIG_COREDUMP
> +static int get_sub_leaf(int custom_xfeat)
> +{
> + switch (custom_xfeat) {
> + case FEATURE_XSAVE_YMM: return XFEATURE_YMM;
> + case FEATURE_XSAVE_BNDREGS: return XFEATURE_BNDREGS;
> + case FEATURE_XSAVE_BNDCSR:  return XFEATURE_BNDCSR;
> + case FEATURE_XSAVE_OPMASK:  return XFEATURE_OPMASK;
> + case FEATURE_XSAVE_ZMM_Hi256:   return XFEATURE_ZMM_Hi256;
> + case FEATURE_XSAVE_Hi16_ZMM:return XFEATURE_Hi16_ZMM;
> + case FEATURE_XSAVE_PT:  return 
> XFEATURE_PT_UNIMPLEMENTED_SO_FAR;
> + case FEATURE_XSAVE_PKRU:return XFEATURE_PKRU;
> + case FEATURE_XSAVE_PASID:   return XFEATURE_PASID;
> + case FEATURE_XSAVE_CET_USER:return XFEATURE_CET_USER;
> + case FEATURE_XSAVE_CET_SHADOW_STACK:return 
> XFEATURE_CET_KERNEL_UNUSED;
> + case FEATURE_XSAVE_HDC: return XFEATURE_RSRVD_COMP_13;
> + case FEATURE_XSAVE_UINTR:   return XFEATURE_RSRVD_COMP_14;
> + case FEATURE_XSAVE_LBR: return XFEATURE_LBR;
> + case FEATURE_XSAVE_HWP: return XFEATURE_RSRVD_COMP_16;
> + case FEATURE_XSAVE_XTILE_CFG:   return XFEATURE_XTILE_CFG;
> + case FEATURE_XSAVE_XTILE_DATA:  return XFEATURE_XTILE_DATA;
> + default:
> + pr_warn_ratelimited("Not a valid XSAVE Feature.");
> + return 0;
> + }
> +}

This function then maps the identical enums one to one. The only actual
"functionality" is the default case and that's completely pointless.

> +/*
> + * Dump type, size, offset and flag values for every xfeature that is 
> present.
> + */
> +static int dump_xsave_layout_desc(struct coredump_params *cprm)
> +{
> + u32 supported_features = 0;
> + struct xfeat_component xc;
> + u32 eax, ebx, ecx, edx;
> + int num_records = 0;
> + int sub_leaf = 0;
> + int i;
> +
> + /* Find supported extended features */
> + cpuid_count(XSTATE_CPUID, 0, , , , );
> + supported_features = eax;

Why does this need to re-evaluate CPUID instead of just using the
existing fpu_user_cfg.max_features?

> + for (i = FEATURE_XSAVE_EXTENDED_START;
> + i <= FEATURE_XSAVE_EXTENDED_END; i++) {

Please use the full 100 character line width.

> + sub_leaf = get_sub_leaf(i);
> + if (!sub_leaf)
> + continue;
> + if (supported_features & (1U << sub_leaf)) {
> + cpuid_count(XSTATE_CPUID, sub_leaf, , , , 
> );
> + xc.xfeat_type = i;
> + xc.xfeat_sz = eax;
> + xc.xfeat_off = ebx;
> + /* Reserved for future use */
> + xc.xfeat_flags = 0;
> +
> + if (!dump_emit(cprm, ,
> +sizeof(struct xfeat_component)))

sizeof(xc), no?

> + return 0;
> + num_records++;
> + }
> + }

This whole thing can be written as:

for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
struct xfeat_component xc = {
.type   = i,
.size   = xstate_sizes[i],
.offset = xstate_offsets[i],
};

if (!dump_emit(cprm, , sizeof(xc)))
return 0;
num_records++;
}

It omits the features which are supported by the CPU, but not enabled by
the kernel. That's perfectly fine because:

  1) the corresponding xfeature

Re: [PATCHv12 1/4] genirq: Provide a snapshot mechanism for interrupt statistics

2024-04-10 Thread Thomas Gleixner

On Wed, Apr 10 2024 at 14:45, Bitao Hu wrote:
> On 2024/4/9 17:58, Thomas Gleixner wrote:
> By the way, what do you think of my reason for using printk() instead of
> pr_crit()? Should I change this part of the code in v13?

Either way is fine. Just put a proper explanation into the change log if
you stick with printk().

Thanks,

tglx

Re: [PATCHv12 1/4] genirq: Provide a snapshot mechanism for interrupt statistics

2024-04-09 Thread Thomas Gleixner

On Wed, Mar 06 2024 at 20:52, Bitao Hu wrote:
> The soft lockup detector lacks a mechanism to identify interrupt storms
> as root cause of a lockup. To enable this the detector needs a
> mechanism to snapshot the interrupt count statistics on a CPU when the
> detector observes a potential lockup scenario and compare that against
> the interrupt count when it warns about the lockup later on. The number
> of interrupts in that period give a hint whether the lockup might be
> caused by an interrupt storm.
>
> Instead of having extra storage in the lockup detector and accessing
> the internals of the interrupt descriptor directly, convert the per CPU
> irq_desc::kstat_irq member to a data structure which contains the
> counter plus a snapshot member and provide interfaces to take a
> snapshot of all interrupts on the current CPU and to retrieve the delta
> of a specific interrupt later on.
>
> Originally-by: Thomas Gleixner 
> Signed-off-by: Bitao Hu 
> Reviewed-by: Liu Song 
> Reviewed-by: Douglas Anderson 

This does not apply anymore.

Also can you please split this apart to convert kstat_irqs to a struct
with just the count in it and then add the snapshot mechanics on top.

Thanks,

tglx

Re: [PATCHv12 4/4] watchdog/softlockup: report the most frequent interrupts

2024-03-23 Thread Thomas Gleixner

On Wed, Mar 06 2024 at 20:52, Bitao Hu wrote:
> + if (__this_cpu_read(snapshot_taken)) {
> + for_each_active_irq(i) {
> + count = kstat_get_irq_since_snapshot(i);
> + tabulate_irq_count(irq_counts_sorted, i, count, 
> NUM_HARDIRQ_REPORT);
> + }
> +
> + /*
> +  * We do not want the "watchdog: " prefix on every line,
> +  * hence we use "printk" instead of "pr_crit".
> +  */

You are not providing any justification why the prefix is not
wanted. Just saying 'We do not want' does not cut it and who is 'We'. I
certainly not.

I really disagree because the prefixes are very useful for searching log
files. So not having it makes it harder to filter out for no reason.

Thanks,

tglx

Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

2024-03-15 Thread Thomas Gleixner

On Thu, Mar 14 2024 at 17:29, Borislav Petkov wrote:
> On Thu, Mar 14, 2024 at 09:19:15AM -0700, Dave Hansen wrote:
>> Are you envisioning an *XSAVE* state component that's not described by
>> CPUID?
>
> I want to be prepared. You can imagine some of the short cuts and
> corners cutting hw guys would do so I'd want to be prepared there and
> not tie this to CPUID.

Anything which is not enumerated in CPUID does not exist in
XSTATE. Period and end of story.

Stop paving the way for hardware people to have an excuse for being
stupid.

Aside of that XSTATE is a dead end as we all know.

Thanks,

tglx

Re: [v2 PATCH 0/3] arch: mm, vdso: consolidate PAGE_SIZE definition

2024-03-06 Thread Thomas Gleixner

On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote:
> From: Arnd Bergmann 
>
> Naresh noticed that the newly added usage of the PAGE_SIZE macro in
> include/vdso/datapage.h introduced a build regression. I had an older
> patch that I revived to have this defined through Kconfig rather than
> through including asm/page.h, which is not allowed in vdso code.
>
> The vdso patch series now has a temporary workaround, but I still want to
> get this into v6.9 so we can place the hack with CONFIG_PAGE_SIZE
> in the vdso.

Thank you for cleaning this up!

  tglx

Re: [PATCH v2 3/3] arch: define CONFIG_PAGE_SIZE_*KB on all architectures

2024-03-06 Thread Thomas Gleixner

On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote:

> From: Arnd Bergmann 
>
> Most architectures only support a single hardcoded page size. In order
> to ensure that each one of these sets the corresponding Kconfig symbols,
> change over the PAGE_SHIFT definition to the common one and allow
> only the hardware page size to be selected.
>
> Acked-by: Guo Ren 
> Acked-by: Heiko Carstens 
> Acked-by: Stafford Horne 
> Acked-by: Johannes Berg 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner

Re: [PATCH v2 2/3] arch: simplify architecture specific page size configuration

2024-03-06 Thread Thomas Gleixner

On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote:
> From: Arnd Bergmann 
>
> arc, arm64, parisc and powerpc all have their own Kconfig symbols
> in place of the common CONFIG_PAGE_SIZE_4KB symbols. Change these
> so the common symbols are the ones that are actually used, while
> leaving the arhcitecture specific ones as the user visible
> place for configuring it, to avoid breaking user configs.
>
> Reviewed-by: Christophe Leroy  (powerpc32)
> Acked-by: Catalin Marinas 
> Acked-by: Helge Deller  # parisc
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner

Re: [PATCH v2 1/3] arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions

2024-03-06 Thread Thomas Gleixner

On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote:
> From: Arnd Bergmann 
>
> These four architectures define the same Kconfig symbols for configuring
> the page size. Move the logic into a common place where it can be shared
> with all other architectures.
>
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner

Re: [PATCHv11 2/4] genirq: Provide a snapshot mechanism for interrupt statistics

2024-03-05 Thread Thomas Gleixner

On Tue, Mar 05 2024 at 18:57, Bitao Hu wrote:
> On 2024/3/4 22:24, Thomas Gleixner wrote:
> "GENERIC_IRQ_STAT_SNAPSHOT" visible to the user. However, after
> analyzing the previous emails, it seems that what you were actually
> proposing was to directly disable "GENERIC_IRQ_STAT_SNAPSHOT" when
> "SOFTLOCKUP_DETECTOR_INTR_STORM" is not enabled, as a way to save
> memory. If my current understanding is correct, then the code for that
> part would look something like the following.

Correct.

> diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
> index 2531f3496ab6..a28e5ac5fc79 100644
> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -108,6 +108,10 @@ config GENERIC_IRQ_MATRIX_ALLOCATOR
>   config GENERIC_IRQ_RESERVATION_MODE
>  bool
>
> +# Snapshot for interrupt statistics
> +config GENERIC_IRQ_STAT_SNAPSHOT
> +   bool
> +
>   # Support forced irq threading
>   config IRQ_FORCED_THREADING
>  bool
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 49f652674bd8..899b69fcb598 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1032,6 +1032,7 @@ config SOFTLOCKUP_DETECTOR
>   config SOFTLOCKUP_DETECTOR_INTR_STORM
>  bool "Detect Interrupt Storm in Soft Lockups"
>  depends on SOFTLOCKUP_DETECTOR && IRQ_TIME_ACCOUNTING
> +   select GENERIC_IRQ_STAT_SNAPSHOT

This goes into the patch which adds the lockup detector parts.

Thanks,

tglx

Re: [PATCHv11 2/4] genirq: Provide a snapshot mechanism for interrupt statistics

2024-03-04 Thread Thomas Gleixner

On Mon, Mar 04 2024 at 20:00, Bitao Hu wrote:
>> +# Snapshot for interrupt statistics
>> +config GENERIC_IRQ_STAT_SNAPSHOT
>> +bool
>> +
>>   # Support forced irq threading
>>   config IRQ_FORCED_THREADING
>>  bool
>
> I think we should follow Douglas's suggestion by making
> "config GENERIC_IRQ_STAT_SNAPSHOT" automatically selectable by
> "config SOFTLOCKUP_DETECTOR_INTR_STORM". This can prevent users
> from inadvertently disabling "config GENERIC_IRQ_STAT_SNAPSHOT"
> while enabling "config SOFTLOCKUP_DETECTOR_INTR_STORM".

The above is not even configurable by the user. It's only selectable by
some other config option.

> +# Snapshot for interrupt statistics
> +config GENERIC_IRQ_STAT_SNAPSHOT
> +   bool
> +   help
> +
> + Say Y here to enable the kernel to provide a snapshot mechanism
> + for interrupt statistics.

That makes is visible which is pointless because it's only relevant when
there is an actual user.

Thanks,

tglx

Re: [PATCH v2 4/7] mm/x86: Drop two unnecessary pud_leaf() definitions

2024-03-04 Thread Thomas Gleixner

On Thu, Feb 29 2024 at 16:42, pet...@redhat.com wrote:
> From: Peter Xu 
>
> pud_leaf() has a fallback macro defined in include/linux/pgtable.h already.
> Drop the extra two for x86.
>
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: x...@kernel.org
> Signed-off-by: Peter Xu 

Acked-by: Thomas Gleixner

Re: [PATCHv11 2/4] genirq: Provide a snapshot mechanism for interrupt statistics

2024-03-01 Thread Thomas Gleixner

Doug!

On Wed, Feb 28 2024 at 14:44, Doug Anderson wrote:
> I won't insist on it, but I continue to worry about memory
> implications with large numbers of CPUs. With a 4-byte int, 8192 max
> CPUs, and 100 IRQs the extra "ref" value takes up over 3MB of memory
> (8192 * 4 bytes * 100).
>
> Technically, you could add a new symbol like "config
> NEED_IRQ_SNAPSHOTS". This wouldn't be a symbol selectable by the end
> user but would automatically be selected by "config
> SOFTLOCKUP_DETECTOR_INTR_STORM". If the config wasn't defined then the
> struct wouldn't contain "ref" and the snapshot routines would just be
> static inline stubs.
>
> Maybe Thomas has an opinion about whether this is something to worry
> about. Worst case it wouldn't be hard to do in a follow-up patch.

I'd say it makes sense to give people a choice to save memory especially
when the softlock detector code is not enabled.

It's rather straight forward to do.

Thanks,

tglx
---
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -24,7 +24,9 @@ struct pt_regs;
  */
 struct irqstat {
unsigned intcnt;
+#ifdef CONFIG_GENIRQ_STAT_SNAPSHOT
unsigned intref;
+#endif
 };
 
 /**
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -978,6 +978,7 @@ static unsigned int kstat_irqs(unsigned
return sum;
 }
 
+#ifdef CONFIG_GENIRQ_STAT_SNAPSHOT
 void kstat_snapshot_irqs(void)
 {
struct irq_desc *desc;
@@ -998,6 +999,7 @@ unsigned int kstat_get_irq_since_snapsho
return 0;
return this_cpu_read(desc->kstat_irqs->cnt) - 
this_cpu_read(desc->kstat_irqs->ref);
 }
+#endif
 
 /**
  * kstat_irqs_usr - Get the statistics for an interrupt from thread context
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -108,6 +108,10 @@ config GENERIC_IRQ_MATRIX_ALLOCATOR
 config GENERIC_IRQ_RESERVATION_MODE
bool
 
+# Snapshot for interrupt statistics
+config GENERIC_IRQ_STAT_SNAPSHOT
+   bool
+
 # Support forced irq threading
 config IRQ_FORCED_THREADING
bool

Re: [PATCHv10 3/4] genirq: Avoid summation loops for /proc/interrupts

2024-02-27 Thread Thomas Gleixner

On Tue, Feb 27 2024 at 19:20, Bitao Hu wrote:
> On 2024/2/27 17:26, Thomas Gleixner wrote:
>> 
>> and then let kstat_irqs() and show_interrupts() use it. See?
>
> I have a concern. kstat_irqs() uses for_each_possible_cpu() for
> summation. However, show_interrupts() uses for_each_online_cpu(),
> which means it only outputs interrupt statistics for online cpus.
> If we use for_each_possible_cpu() in show_interrupts() to calculate
> 'any_count', there could be a problem with the following scenario:
> If an interrupt has a count of zero on online cpus but a non-zero
> count on possible cpus, then 'any_count' would not be zero, and the
> statistics for that interrupt would be output, which is not the
> desired behavior for show_interrupts(). Therefore, I think it's not
> good to have kstat_irqs() and show_interrupts() both use the same
> logic. What do you think?

Good point. But you simply can have

unsigned int kstat_irq_desc(struct irq_desc *desc, const struct cpumask *mask)

and hand in the appropriate cpumask, which still shares the code, no?

Thanks,

tglx

Re: [PATCHv10 3/4] genirq: Avoid summation loops for /proc/interrupts

2024-02-27 Thread Thomas Gleixner

On Mon, Feb 26 2024 at 10:09, Bitao Hu wrote:
> We could use the irq_desc::tot_count member to avoid the summation
> loop for interrupts which are not marked as 'PER_CPU' interrupts in
> 'show_interrupts'. This could reduce the time overhead of reading
> /proc/interrupts.

"Could" is not really a technical term. Either we do or we do not. Also
please provide context for your change and avoid the 'We'.

> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -121,6 +121,8 @@ static inline void irq_unlock_sparse(void) { }
>  extern struct irq_desc irq_desc[NR_IRQS];
>  #endif
>
> +extern bool irq_is_nmi(struct irq_desc *desc);
> +

If at all this wants to be in kernel/irq/internal.h. There is zero
reason to expose this globally.

> -static bool irq_is_nmi(struct irq_desc *desc)
> +bool irq_is_nmi(struct irq_desc *desc)
>  {
>   return desc->istate & IRQS_NMI;
>  }

If at all this really wants to be a static inline in internals.h, but
instead of blindly copying code this can be done smarter:

unsigned int kstat_irq_desc(struct irq_desc *desc)
{
unsigned int sum = 0;
int cpu;

if (!irq_settings_is_per_cpu_devid(desc) &&
!irq_settings_is_per_cpu(desc) &&
!irq_is_nmi(desc))
return data_race(desc->tot_count);

for_each_possible_cpu(cpu)
sum += data_race(*per_cpu_ptr(desc->kstat_irqs, cpu));
return sum;
}

and then let kstat_irqs() and show_interrupts() use it. See?

With that a proper changelog would be:

   show_interrupts() unconditionally accumulates the per CPU interrupt
   statistics to determine whether an interrupt was ever raised.

   This can be avoided for all interrupts which are not strictly per CPU
   and not of type NMI because those interrupts provide already an
   accumulated counter. The required logic is already implemented in
   kstat_irqs().

   Split the inner access logic out of kstat_irqs() and use it for
   kstat_irqs() and show_interrupts() to avoid the accumulation loop
   when possible.

Thanks,

tglx

Re: [PATCHv9 2/3] irq: use a struct for the kstat_irqs in the interrupt descriptor

2024-02-22 Thread Thomas Gleixner

On Fri, Feb 23 2024 at 15:18, Bitao Hu wrote:
> On 2024/2/22 21:22, Thomas Gleixner wrote:
>>> -   if (desc->kstat_irqs) {
>>> -   for_each_online_cpu(j)
>>> -   any_count |= data_race(*per_cpu_ptr(desc->kstat_irqs, 
>>> j));
>>> -   }
>>> +   if (desc->kstat_irqs)
>>> +   any_count = data_race(desc->tot_count);
>> 
>> This is an unrelated change and needs to be split out into a separate
>> patch with a proper changelog which explains why this is equivalent.
>> 
>
> Alright, I will remove this change witch is not related to the purpose
> of this patch.
>
> I guess that the purpose of suggesting this change in your V1 response
> was to speedup the 'show_interrupts'. However, after reviewing the
> usage of 'desc->tot_count' in 'unsigned int kstat_irqs(unsigned int 
> irq)', I think the change might be as follows:
>
> diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
> index 623b8136e9af..53b8d6edd7ac 100644
> --- a/kernel/irq/proc.c
> +++ b/kernel/irq/proc.c
> @@ -489,8 +489,13 @@ int show_interrupts(struct seq_file *p, void *v)
>  goto outsparse;
>
>  if (desc->kstat_irqs) {
> -   for_each_online_cpu(j)
> -   any_count |= 
> data_race(per_cpu(desc->kstat_irqs->cnt, j));
> +   if (!irq_settings_is_per_cpu_devid(desc) &&
> +   !irq_settings_is_per_cpu(desc) &&
> +   !irq_is_nmi(desc))
> +   any_count = data_race(desc->tot_count);
> +   else
> +   for_each_online_cpu(j)
> +   any_count |= 
> data_race(per_cpu(desc->kstat_irqs->cnt, j));
>  }
>
>  if ((!desc->action || irq_desc_is_chained(desc)) && !any_count)
>
> Is my idea correct?

Yes.

Re: [PATCHv9 2/3] irq: use a struct for the kstat_irqs in the interrupt descriptor

2024-02-22 Thread Thomas Gleixner

On Thu, Feb 22 2024 at 17:34, Bitao Hu wrote:

First of all the subsystem prefix is 'genirq:'. 'git log kernel/irq/'
gives you a pretty good hint. It's documented

Secondly the subject line does not match what this patch is about. It's
not about using a struct, it's about providing a snapshot mechanism, no?

> The current implementation uses an int for the kstat_irqs in the
> interrupt descriptor.
>
> However, we need to know the number of interrupts which happened
> since softlockup detection took a snapshot in order to analyze
> the problem caused by an interrupt storm.
>
> Replacing an int with a struct and providing sensible interfaces
> for the watchdog code can keep it self contained to the interrupt
> core code.

So something like this makes a useful change log for this:

 Subject: genirq: Provide a snapshot mechanism for interrupt statistics

 The soft lockup detector lacks a mechanism to identify interrupt storms
 as root cause of a lockup. To enable this the detector needs a
 mechanism to snapshot the interrupt count statistics on a CPU when the
 detector observes a potential lockup scenario and compare that against
 the interrupt count when it warns about the lockup later on. The number
 of interrupts in that period give a hint whether the lockup might be
 caused by an interrupt storm.

 Instead of having extra storage in the lockup detector and accessing
 the internals of the interrupt descriptor directly, convert the per CPU
 irq_desc::kstat_irq member to a data structure which contains the
 counter plus a snapshot member and provide interfaces to take a
 snapshot of all interrupts on the current CPU and to retrieve the delta
 of a specific interrupt later on.

Hmm?

> Signed-off-by: Bitao Hu 

Interesting. You fully authored the patch?

That's not how it works. You cannot take work from others and claim that
it is yours. The minimal courtesy is to add a 'Originally-by:' tag.

> diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
> index 623b8136e9af..3ad40cf30c66 100644
> --- a/kernel/irq/proc.c
> +++ b/kernel/irq/proc.c
> @@ -488,18 +488,15 @@ int show_interrupts(struct seq_file *p, void *v)
>   if (!desc || irq_settings_is_hidden(desc))
>   goto outsparse;
>  
> - if (desc->kstat_irqs) {
> - for_each_online_cpu(j)
> - any_count |= data_race(*per_cpu_ptr(desc->kstat_irqs, 
> j));
> - }
> + if (desc->kstat_irqs)
> + any_count = data_race(desc->tot_count);

This is an unrelated change and needs to be split out into a separate
patch with a proper changelog which explains why this is equivalent.

Thanks,

tglx

Re: [PATCHv8 2/2] watchdog/softlockup: report the most frequent interrupts

2024-02-20 Thread Thomas Gleixner

On Tue, Feb 20 2024 at 00:19, Bitao Hu wrote:
>  arch/mips/dec/setup.c|   2 +-
>  arch/parisc/kernel/smp.c |   2 +-
>  arch/powerpc/kvm/book3s_hv_rm_xics.c |   2 +-
>  include/linux/irqdesc.h  |   9 ++-
>  include/linux/kernel_stat.h  |   4 +
>  kernel/irq/internals.h   |   2 +-
>  kernel/irq/irqdesc.c |  34 ++--
>  kernel/irq/proc.c|   9 +--

This really wants to be split into two patches. Interrupt infrastructure
first and then the actual usage site in the watchdog code.

Thanks,

tglx

Re: [PATCH tty v1 00/74] serial: wrappers for uart port lock

2023-09-15 Thread Thomas Gleixner

On Thu, Sep 14 2023 at 20:01, Maciej W. Rozycki wrote:

> On Thu, 14 Sep 2023, John Ogness wrote:
>
>> Patches 2-74 switch all uart port locking call sites to use the new
>> wrappers. These patches were automatically generated using coccinelle.
>
>  Hmm, no need to do this for drivers/tty/serial/zs.c?

zs.c does not use port lock at all. It has like a couple of other
drivers a local homebrewn spinlock.

Thanks,

tglx

Re: [PATCH v4 00/10] Introduce SMT level and add PowerPC support

2023-07-28 Thread Thomas Gleixner

On Fri, Jul 28 2023 at 14:23, Rui Zhang wrote:
> On Fri, 2023-07-28 at 09:40 +0200, Thomas Gleixner wrote:
>> On Sun, Jul 09 2023 at 15:25, Rui Zhang wrote:
>> > I ran into a boot hang regression with latest upstream code, and it
>> > took me a while to bisect the offending commit and workaround it.
>> 
>> Where is the bug report and the analysis? And what's the workaround?
>
> As it is an iwlwifi regression, I didn't paste the link here.
>
> The regression was reported at
> https://lore.kernel.org/all/b533071f38804247f06da9e52a04f15cce7a3836.ca...@intel.com/
>
> And it was fixed later by below commit in 6.5-rc2.

Ah, ok. I was worried that you ran into issues with the parallel bootup
muck.

Re: [PATCH v4 00/10] Introduce SMT level and add PowerPC support

2023-07-28 Thread Thomas Gleixner

Laurent, Michael!

On Wed, Jul 05 2023 at 16:51, Laurent Dufour wrote:
> I'm taking over the series Michael sent previously [1] which is smartly
> reviewing the initial series I sent [2].  This series is addressing the
> comments sent by Thomas and me on the Michael's one.

Thanks for getting this into shape.

I've merged it into:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/core

and tagged it at patch 7 for consumption into the powerpc tree, so the
powerpc specific changes can be applied there on top:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
smp-core-for-ppc-23-07-28

Thanks,

tglx

Re: [PATCH v4 00/10] Introduce SMT level and add PowerPC support

2023-07-28 Thread Thomas Gleixner

Rui!

On Sun, Jul 09 2023 at 15:25, Rui Zhang wrote:
> I ran into a boot hang regression with latest upstream code, and it
> took me a while to bisect the offending commit and workaround it.

Where is the bug report and the analysis? And what's the workaround?

Thanks,

tglx

Re: [PATCH 07/10] cpu/SMT: Allow enabling partial SMT states via sysfs

2023-06-22 Thread Thomas Gleixner

On Thu, Jun 15 2023 at 17:46, Laurent Dufour wrote:
>  
> - if (ctrlval != cpu_smt_control) {
> + orig_threads = cpu_smt_num_threads;
> + cpu_smt_num_threads = num_threads;
> +
> + if (num_threads > orig_threads) {
> + ret = cpuhp_smt_enable();
> + } else if (num_threads < orig_threads) {
> + ret = cpuhp_smt_disable(ctrlval);
> + } else if (ctrlval != cpu_smt_control) {
>   switch (ctrlval) {
>   case CPU_SMT_ENABLED:
>   ret = cpuhp_smt_enable();

This switch() is still as pointless as in the previous version.

OFF -> ON, ON -> OFF, ON -> FORCE_OFF are covered by the num_threads
comparisons.

So the only case where (ctrlval != cpu_smt_control) is relevant is the
OFF -> FORCE_OFF transition because in that case the number of threads
is not changing.

  force_off = ctrlval != cpu_smt_control && ctrval == 
CPU_SMT_FORCE_DISABLED;

  if (num_threads > orig_threads)
  ret = cpuhp_smt_enable();
  else if (num_threads < orig_threads || force_off)
  ret = cpuhp_smt_disable(ctrlval);

Should just work, no?

Thanks,

tglx

Re: [PATCH 01/10] cpu/SMT: Move SMT prototypes into cpu_smt.h

2023-06-22 Thread Thomas Gleixner

On Thu, Jun 15 2023 at 17:46, Laurent Dufour wrote:
> From: Michael Ellerman 
>
> A subsequent patch would like to use the cpuhp_smt_control enum as part
> of the interface between generic and arch code.

This still has the 'patch' and 'arch' style which I pointed out
before. It seems you fixed it only for one patch in the series.

Thanks,

tglx

Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-18 Thread Thomas Gleixner

Kent!

On Sun, Jun 18 2023 at 19:14, Kent Overstreet wrote:
> On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote:
>
> Thomas, you're confusing an internal interface with external

No. I am not.

Whether that's an internal function or not does not make any difference
at all.

Having a function growing _eight_ parameters where _six_ of them are
derived from a well defined data structure is a clear sign of design
fail.

It's not rocket science to do:

struct generic_allocation_info {

};

struct execmem_info {

   struct generic_allocation_info alloc_info;
;

struct execmem_param {
   ...
   struct execmem_info[NTYPES];
};

and having a function which can either operate on execmem_param and type
or on generic_allocation_info itself. It does not matter as long as the
data structure which is handed into this internal function is
describing it completely or needs a supplementary argument, i.e. flags.

Having tons of wrappers which do:

   a = generic_info.a;
   b = generic_info.b;

   n = generic_info.n;

   internal_func(a, b, ,, n);

is just hillarious and to repeat myself tasteless and therefore
disgusting.

That's CS course first semester hackery, but TBH, I can only tell from
imagination because I did not take CS courses - maybe that's the
problem...

Data structure driven design works not from the usage site down to the
internals. It's the other way round:

  1) Define a data structure which describes what the internal function
 needs to know

  2) Implement use case specific variants which describe that

  3) Hand the use case specific variant to the internal function
 eventually with some minimal supplementary information.

Object oriented basics, right?

There is absolutely nothing wrong with having

  internal_func(generic_info *, modifier);

but having

  internal_func(a, b, ... n)

is fundamentally wrong in the context of an extensible and future proof
internal function.

Guess what. Today it's sufficient to have _eight_ arguments and we are
happy to have 10 nonsensical wrappers around this internal
function. Tomorrow there happens to be a use case which needs another
argument so you end up:

  Changing 10 wrappers plus the function declaration and definition in
  one go

instead of adding

  One new naturally 0 initialized member to generic_info and be done
  with it.

Look at the evolution of execmem_alloc() in this very pathset which I
pointed out. That very patchset covers _two_ of at least _six_ cases
Song and myself identified. It already had _three_ steps of evolution
from _one_ to _five_ to _eight_ parameters.

C is not like e.g. python where you can "solve" that problem by simply
doing:

- internal_func(a, b, c):
+ internal_func(a, b, c, d=None, ..., n=None):

But that's not a solution either. That's a horrible workaround even in
python once your parameter space gets sufficiently large. The way out in
python is to have **kwargs. But that's not an option in C, and not
necessarily the best option for python either.

Even in python or any other object oriented language you get to the
point where you have to rethink your approach, go back to the drawing
board and think about data representation.

But creating a new interface based on "let's see what we need over
time and add parameters as we see fit" is simply wrong to begin with
independent of the programming language.

Even if the _eight_ parameters are the end of the range, then they are
beyond justifyable because that's way beyond the natural register
argument space of any architecture and you are offloading your lazyness
to wrappers and the compiler to emit pointlessly horrible code.

There is a reason why new syscalls which need more than a few parameters
are based on 'struct DATA_WHICH_I_NEED_TO_KNOW' and 'flags'.

We've got burned on the non-extensibilty often enough. Why would a new
internal function have any different requirements especially as it is
neither implemented to the full extent nor a hotpath function?

Now you might argue that it _is_ a "hotpath" due to the BPF usage, but
then even more so as any intermediate wrapper which converts from one
data representation to another data representation is not going to
increase performance, right?

> ... I made the same mistake reviewing Song's patchset...

Songs series had rough edges, but was way more data structure driven
and palatable than this hackery.

The fact that you made a mistake while reviewing Songs series has
absolutely nothing to do with the above or my previous reply to Mike.

Thanks,

tglx

Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-18 Thread Thomas Gleixner

Mike!

Sorry for being late on this ...

On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote:
>  
> +void *execmem_data_alloc(size_t size)
> +{
> + unsigned long start = execmem_params.modules.data.start;
> + unsigned long end = execmem_params.modules.data.end;
> + pgprot_t pgprot = execmem_params.modules.data.pgprot;
> + unsigned int align = execmem_params.modules.data.alignment;
> + unsigned long fallback_start = 
> execmem_params.modules.data.fallback_start;
> + unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> + bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;

While I know for sure that you read up on the discussion I had with Song
about data structures, it seems you completely failed to understand it.

> + return execmem_alloc(size, start, end, align, pgprot,
> +  fallback_start, fallback_end, kasan);

Having _seven_ intermediate variables to fill _eight_ arguments of a
function instead of handing in @size and a proper struct pointer is
tasteless and disgusting at best.

Six out of those seven parameters are from:

execmem_params.module.data

while the KASAN shadow part is retrieved from

execmem_params.module.flags

So what prevents you from having a uniform data structure, which is
extensible and decribes _all_ types of allocations?

Absolutely nothing. The flags part can either be in the type dependend
part or you make the type configs an array as I had suggested originally
and then execmem_alloc() becomes:

void *execmem_alloc(type, size)

and

static inline void *execmem_data_alloc(size_t size)
{
return execmem_alloc(EXECMEM_TYPE_DATA, size);
}

which gets the type independent parts from @execmem_param.

Just read through your own series and watch the evolution of
execmem_alloc():

  static void *execmem_alloc(size_t size)

  static void *execmem_alloc(size_t size, unsigned long start,
 unsigned long end, unsigned int align,
 pgprot_t pgprot)

  static void *execmem_alloc(size_t len, unsigned long start,
 unsigned long end, unsigned int align,
 pgprot_t pgprot,
 unsigned long fallback_start,
 unsigned long fallback_end,
 bool kasan)

In a month from now this function will have _ten_ parameters and tons of
horrible wrappers which convert an already existing data structure into
individual function arguments.

Seriously?

If you want this function to be [ab]used outside of the exec_param
configuration space for whatever non-sensical reasons then this still
can be either:

void *execmem_alloc(params, type, size)

static inline void *execmem_data_alloc(size_t size)
{
return execmem_alloc(_param, EXECMEM_TYPE_DATA, size);
}

or

void *execmem_alloc(type_params, size);

static inline void *execmem_data_alloc(size_t size)
{
return execmem_alloc(_param.data, size);
}

which both allows you to provide alternative params, right?

Coming back to my conversation with Song:

   "Bad programmers worry about the code. Good programmers worry about
data structures and their relationships."

You might want to reread:

https://lore.kernel.org/r/87lenuukj0.ffs@tglx

and the subsequent thread.

The fact that my suggestions had a 'mod_' namespace prefix does not make
any of my points moot.

Song did an extremly good job in abstracting things out, but you decided
to ditch his ground work instead of building on it and keeping the good
parts. That's beyond sad.

Worse, you went the full 'not invented here' path with an outcome which is
even worse than the original hackery from Song which started the above
referenced thread.

I don't know what caused you to decide that ad hoc hackery is better
than proper data structure based design patterns. I actually don't want
to know.

As much as my voice counts:

  NAK-ed-by: Thomas Gleixner 

Thanks,

tglx

Re: [PATCH 3/9] cpu/SMT: Store the current/max number of threads

2023-06-13 Thread Thomas Gleixner

On Tue, Jun 13 2023 at 19:16, Laurent Dufour wrote:
> On 10/06/2023 23:26:18, Thomas Gleixner wrote:
>> On Thu, May 25 2023 at 01:56, Michael Ellerman wrote:
>>>  #ifdef CONFIG_HOTPLUG_SMT
>>>  enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
>>> +static unsigned int cpu_smt_max_threads __ro_after_init;
>>> +unsigned int cpu_smt_num_threads;
>> 
>> Why needs this to be global? cpu_smt_control is pointlessly global already.
>
> I agree that cpu_smt_*_threads should be static.
>
> Howwever, regarding cpu_smt_control, it is used in 2 places in the x86 code:
>  - arch/x86/power/hibernate.c in arch_resume_nosmt()
>  - arch/x86/kernel/cpu/bugs.c in spectre_v2_user_select_mitigation()

Bah. I must have fatfingered the grep then.

> An accessor function may be introduced to read that value in these 2
> functions, but I'm wondering if that's really the best option.
>
> Unless there is a real need to change this through this series, I think
> cpu_smt_control can remain global.

That's fine.

Thanks,

tglx

Re: [PATCH 8/9] powerpc: Add HOTPLUG_SMT support

2023-06-12 Thread Thomas Gleixner

On Mon, Jun 12 2023 at 17:20, Laurent Dufour wrote:
> On 10/06/2023 23:10:02, Thomas Gleixner wrote:
>> This needs more thoughts to avoid a completely inconsistent duct tape
>> mess.
>
> Despite the test against smt_enabled_at_boot, mentioned above, I can't see
> anything else to rework. Am I missing something?

See my comments on the core code changes.

>> Btw, the command line parser and the variable smt_enabled_at_boot being
>> type int allow negative number of threads too... Maybe not what you want.
>
> I do agree, it should an unsigned type.

I assume you'll fix that yourself. :)

Thanks,

tglx

Re: [PATCH 5/9] cpu/SMT: Create topology_smt_thread_allowed()

2023-06-10 Thread Thomas Gleixner

On Thu, May 25 2023 at 01:56, Michael Ellerman wrote:
> A subsequent patch will enable partial SMT states, ie. when not all SMT
> threads are brought online.

Nitpick. I stumbled over this 'subsequent patch' theme a couple of times
now because it's very similar to the 'This patch does' phrase.

Just explain what you want to achieve at the end.

>  #else
>  #define topology_max_packages()  (1)
>  static inline int
> @@ -159,6 +160,7 @@ static inline int topology_max_smt_threads(void) { return 
> 1; }
>  static inline bool topology_is_primary_thread(unsigned int cpu) { return 
> true; }
>  static inline bool topology_smt_supported(void) { return false; }
>  static inline bool topology_smt_threads_supported(unsigned int threads) { 
> return false; }
> +static inline bool topology_smt_thread_allowed(unsigned int cpu) { return 
> false; }

Not all these functions need a !SMP stub. Think about the context in
which they are called. There is probably precedence for pointless ones,
but that does not make an argument to add more.

> +/**
> + * topology_smt_thread_allowed - When enabling SMT check whether this 
> particular
> + *CPU thread is allowed to be brought online.
> + * @cpu: CPU to check
> + */
> +bool topology_smt_thread_allowed(unsigned int cpu)
> +{
> + /*
> +  * No extra logic s required here to support different thread values
> +  * because threads will always == 1 or smp_num_siblings because of
> +  * topology_smt_threads_supported().
> +  */
> + return true;
> +}
> +

As x86 only supoorts the on/off model there is no need for this function
if you pick up the CONFIG_SMT_NUM_THREADS_DYNAMIC idea.

You still need something like that for your PPC use case, but that
reduces the overall impact, right?

Thanks,

tglx

Re: [PATCH 4/9] cpu/SMT: Create topology_smt_threads_supported()

2023-06-10 Thread Thomas Gleixner

On Thu, May 25 2023 at 01:56, Michael Ellerman wrote:
> +/**
> + * topology_smt_threads_supported - Check if the given number of SMT threads
> + *   is supported.
> + *
> + * @threads: The number of SMT threads.
> + */
> +bool topology_smt_threads_supported(unsigned int threads)
> +{
> + // Only support a single thread or all threads.
> + return threads == 1 || threads == smp_num_siblings;
> +}

You can make that a simple core function when cpu_smt_*_threads is
consistent along the lines of my previous reply.

static bool cpu_smt_num_threads_valid(unsigned int threads)
{
if (IS_ENABLED(CONFIG_SMT_NUM_THREADS_DYNAMIC))
return threads >= 1 && threads <= cpu_smt_max_threads;
return threads == 1 || threads == cpu_smt_max_threads;
}

Or something like that.

Thanks,

tglx

Re: [PATCH 3/9] cpu/SMT: Store the current/max number of threads

2023-06-10 Thread Thomas Gleixner

On Sat, Jun 10 2023 at 23:26, Thomas Gleixner wrote:
> On Thu, May 25 2023 at 01:56, Michael Ellerman wrote:
>  /*
>   * The decision whether SMT is supported can only be done after the full
>   * CPU identification. Called from architecture code.
>   */
> -void __init cpu_smt_check_topology(void)
> +void __init cpu_smt_set_num_threads(unsigned int max_threads, unsigned int 
> num_threads)
>  {
> - if (!topology_smt_supported())
> + if (max_threads == 1)

Which makes topology_smt_supported() redundant, i.e. it can be removed.

Thanks,

tglx

Re: [PATCH 3/9] cpu/SMT: Store the current/max number of threads

2023-06-10 Thread Thomas Gleixner

On Thu, May 25 2023 at 01:56, Michael Ellerman wrote:
>  #ifdef CONFIG_HOTPLUG_SMT
>  enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
> +static unsigned int cpu_smt_max_threads __ro_after_init;
> +unsigned int cpu_smt_num_threads;

Why needs this to be global? cpu_smt_control is pointlessly global already.

>  void __init cpu_smt_disable(bool force)
>  {
> @@ -433,10 +435,18 @@ void __init cpu_smt_disable(bool force)
>   * The decision whether SMT is supported can only be done after the full
>   * CPU identification. Called from architecture code.
>   */
> -void __init cpu_smt_check_topology(void)
> +void __init cpu_smt_check_topology(unsigned int num_threads)
>  {
>   if (!topology_smt_supported())
>   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
> +
> + cpu_smt_max_threads = num_threads;
> +
> + // May already be disabled by nosmt command line parameter
> + if (cpu_smt_control != CPU_SMT_ENABLED)
> + cpu_smt_num_threads = 1;
> + else
> + cpu_smt_num_threads = num_threads;

Taking Laurents findings into account this should be something like
the incomplete below.

x86 would simply invoke cpu_smt_set_num_threads() with both arguments as
smp_num_siblings while PPC can funnel its command line parameter through
the num_threads argument.

Thanks,

tglx
---
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -414,6 +414,8 @@ void __weak arch_smt_update(void) { }
 
 #ifdef CONFIG_HOTPLUG_SMT
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
+static unsigned int cpu_smt_max_threads __ro_after_init;
+static unsigned int cpu_smt_num_threads = UINT_MAX;
 
 void __init cpu_smt_disable(bool force)
 {
@@ -427,24 +429,31 @@ void __init cpu_smt_disable(bool force)
pr_info("SMT: disabled\n");
cpu_smt_control = CPU_SMT_DISABLED;
}
+   cpu_smt_num_threads = 1;
 }
 
 /*
  * The decision whether SMT is supported can only be done after the full
  * CPU identification. Called from architecture code.
  */
-void __init cpu_smt_check_topology(void)
+void __init cpu_smt_set_num_threads(unsigned int max_threads, unsigned int 
num_threads)
 {
-   if (!topology_smt_supported())
+   if (max_threads == 1)
cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
-}
 
-static int __init smt_cmdline_disable(char *str)
-{
-   cpu_smt_disable(str && !strcmp(str, "force"));
-   return 0;
+   cpu_smt_max_threads = max_threads;
+
+   /*
+* If SMT has been disabled via the kernel command line or SMT is
+* not supported, set cpu_smt_num_threads to 1 for consistency.
+* If enabled, take the architecture requested number of threads
+* to bring up into account.
+*/
+   if (cpu_smt_control != CPU_SMT_ENABLED)
+   cpu_smt_num_threads = 1;
+   else if (num_threads < cpu_smt_num_threads)
+   cpu_smt_num_threads = num_threads;
 }
-early_param("nosmt", smt_cmdline_disable);
 
 static inline bool cpu_smt_allowed(unsigned int cpu)
 {
@@ -463,6 +472,13 @@ static inline bool cpu_smt_allowed(unsig
return !cpumask_test_cpu(cpu, _booted_once_mask);
 }
 
+static int __init smt_cmdline_disable(char *str)
+{
+   cpu_smt_disable(str && !strcmp(str, "force"));
+   return 0;
+}
+early_param("nosmt", smt_cmdline_disable);
+
 /* Returns true if SMT is not supported of forcefully (irreversibly) disabled 
*/
 bool cpu_smt_possible(void)
 {

Re: [PATCH 8/9] powerpc: Add HOTPLUG_SMT support

2023-06-10 Thread Thomas Gleixner

On Thu, Jun 01 2023 at 18:19, Laurent Dufour wrote:
> @@ -435,12 +435,17 @@ void __init cpu_smt_disable(bool force)
>   * The decision whether SMT is supported can only be done after the full
>   * CPU identification. Called from architecture code.
>   */
> -void __init cpu_smt_check_topology(unsigned int num_threads)
> +void __init cpu_smt_check_topology(unsigned int num_threads,
> +unsigned int max_threads)
>  {
>   if (!topology_smt_supported())
>   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
>  
> - cpu_smt_max_threads = num_threads;
> + cpu_smt_max_threads = max_threads;
> +
> + WARN_ON(num_threads > max_threads);
> + if (num_threads > max_threads)
> + num_threads = max_threads;

This does not work. The call site does:

> + cpu_smt_check_topology(smt_enabled_at_boot, threads_per_core);

smt_enabled_at_boot is 0 when 'smt-enabled=off', which is not what the
hotplug core expects. If SMT is disabled it brings up the primary
thread, which means cpu_smt_num_threads = 1.

This needs more thoughts to avoid a completely inconsistent duct tape
mess.

Btw, the command line parser and the variable smt_enabled_at_boot being
type int allow negative number of threads too... Maybe not what you want.

Thanks,

tglx

Re: [PATCH 6/9] cpu/SMT: Allow enabling partial SMT states via sysfs

2023-06-10 Thread Thomas Gleixner

On Thu, May 25 2023 at 01:56, Michael Ellerman wrote:
> There is a hook which allows arch code to control how many threads per

Can you please write out architecture in changelogs and comments?

I know 'arch' is commonly used but while my brain parser tolerates
'arch_' prefixes it raises an exception on 'arch' in prose as 'arch' is
a regular word with a completely different meaning. Changelogs and
comments are not space constraint.

> @@ -2505,20 +2505,38 @@ __store_smt_control(struct device *dev, struct 
> device_attribute *attr,
>   if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
>   return -ENODEV;
>  
> - if (sysfs_streq(buf, "on"))
> + if (sysfs_streq(buf, "on")) {
>   ctrlval = CPU_SMT_ENABLED;
> - else if (sysfs_streq(buf, "off"))
> + num_threads = cpu_smt_max_threads;
> + } else if (sysfs_streq(buf, "off")) {
>   ctrlval = CPU_SMT_DISABLED;
> - else if (sysfs_streq(buf, "forceoff"))
> + num_threads = 1;
> + } else if (sysfs_streq(buf, "forceoff")) {
>   ctrlval = CPU_SMT_FORCE_DISABLED;
> - else
> + num_threads = 1;
> + } else if (kstrtoint(buf, 10, _threads) == 0) {
> + if (num_threads == 1)
> + ctrlval = CPU_SMT_DISABLED;
> + else if (num_threads > 1 && 
> topology_smt_threads_supported(num_threads))
> + ctrlval = CPU_SMT_ENABLED;
> + else
> + return -EINVAL;
> + } else {
>   return -EINVAL;
> + }
>  
>   ret = lock_device_hotplug_sysfs();
>   if (ret)
>   return ret;
>  
> - if (ctrlval != cpu_smt_control) {
> + orig_threads = cpu_smt_num_threads;
> + cpu_smt_num_threads = num_threads;
> +
> + if (num_threads > orig_threads) {
> + ret = cpuhp_smt_enable();
> + } else if (num_threads < orig_threads) {
> + ret = cpuhp_smt_disable(ctrlval);
> + } else if (ctrlval != cpu_smt_control) {
>   switch (ctrlval) {
>   case CPU_SMT_ENABLED:
>   ret = cpuhp_smt_enable();

This switch case does not make sense anymore.

The only situation which reaches this is when the control value goes
from CPU_SMT_DISABLED to CPU_SMT_FORCE_DISABLED because that's not
changing the number of threads.

So something like this is completely sufficient:

if (num_threads > orig_threads)
ret = cpuhp_smt_enable();
else if (num_threads < orig_threads || ctrval == CPU_SMT_FORCE_DISABLED)
ret = cpuhp_smt_disable(ctrlval);

No?

Thanks,

tglx

Re: [PATCH 6/9] cpu/SMT: Allow enabling partial SMT states via sysfs

2023-06-10 Thread Thomas Gleixner

On Sat, Jun 10 2023 at 22:09, Thomas Gleixner wrote:

> On Thu, May 25 2023 at 01:56, Michael Ellerman wrote:
>> There is a hook which allows arch code to control how many threads per
>
> Can you please write out architecture in changelogs and comments?
>
> I know 'arch' is commonly used but while my brain parser tolerates
> 'arch_' prefixes it raises an exception on 'arch' in prose as 'arch' is
> a regular word with a completely different meaning. Changelogs and
> comments are not space constraint.
>
>> @@ -2505,20 +2505,38 @@ __store_smt_control(struct device *dev, struct 
>> device_attribute *attr,
>>  if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
>>  return -ENODEV;
>>  
>> -if (sysfs_streq(buf, "on"))
>> +if (sysfs_streq(buf, "on")) {
>>  ctrlval = CPU_SMT_ENABLED;
>> -else if (sysfs_streq(buf, "off"))
>> +num_threads = cpu_smt_max_threads;
>> +} else if (sysfs_streq(buf, "off")) {
>>  ctrlval = CPU_SMT_DISABLED;
>> -else if (sysfs_streq(buf, "forceoff"))
>> +num_threads = 1;
>> +} else if (sysfs_streq(buf, "forceoff")) {
>>  ctrlval = CPU_SMT_FORCE_DISABLED;
>> -else
>> +num_threads = 1;
>> +} else if (kstrtoint(buf, 10, _threads) == 0) {
>> +if (num_threads == 1)
>> +ctrlval = CPU_SMT_DISABLED;
>> +else if (num_threads > 1 && 
>> topology_smt_threads_supported(num_threads))

Why does this not simply check cpu_smt_max_threads?

else if (num_threads > 1 && num_threads <= cpu_smt_max_threads)

cpu_smt_max_threads should have been established already, no?

Thanks,

tglx

Re: [patch 05/39] genirq/msi: Remove filter from msi_free_descs_free_range()

2023-03-01 Thread Thomas Gleixner

Miquel!

On Wed, Mar 01 2023 at 11:55, Miquel Raynal wrote:
> t...@linutronix.de wrote on Fri, 11 Nov 2022 14:54:22 +0100 (CET):
>
>> When a range of descriptors is freed then all of them are not associated to
>> a linux interrupt. Remove the filter and add a warning to the free function.
>> +/* Leak the descriptor when it is still referenced */
>> +if (WARN_ON_ONCE(msi_desc_match(desc, MSI_DESC_ASSOCIATED)))
>> +continue;
>> +msi_free_desc(desc);
>>  }
>>  }
>
> It looks like since this commit I am getting warnings upon EPROBE_DEFER
> errors in the mvpp2 Marvell Ethernet driver. I looked a bit at the
> internals to understand why this warning was shown, and it seems that
> nothing "de-references" the descriptors, which would mean here:
> resetting desc->irq to 0.

Correct. This platform-msi ^(*&!@&^ hack really needs to die ASAP.

Marc, where are we on that? Is this still in limbo?

> I am wondering how useful thisd WARN_ON() is, or otherwise where the

It is useful as it caught bugs already.

> desc->irq entry should be zeroed (if I understand that correctly), any
> help will be appreciated.

Untested workaround below. I hate it with a passion, but *shrug*.

Thanks,

tglx
---
 drivers/base/platform-msi.c |1 +
 include/linux/msi.h |2 ++
 kernel/irq/msi.c|   23 ++-
 3 files changed, 25 insertions(+), 1 deletion(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -324,6 +324,7 @@ void platform_msi_device_domain_free(str
struct platform_msi_priv_data *data = domain->host_data;
 
msi_lock_descs(data->dev);
+   msi_domain_depopulate_descs(data->dev, virq, nr_irqs);
irq_domain_free_irqs_common(domain, virq, nr_irqs);
msi_free_msi_descs_range(data->dev, virq, virq + nr_irqs - 1);
msi_unlock_descs(data->dev);
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -631,6 +631,8 @@ int msi_domain_prepare_irqs(struct irq_d
int nvec, msi_alloc_info_t *args);
 int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
 int virq, int nvec, msi_alloc_info_t *args);
+void msi_domain_depopulate_descs(struct device *dev, int virq, int nvec);
+
 struct irq_domain *
 __platform_msi_create_device_domain(struct device *dev,
unsigned int nvec,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -1109,14 +1109,35 @@ int msi_domain_populate_irqs(struct irq_
return 0;
 
 fail:
-   for (--virq; virq >= virq_base; virq--)
+   for (--virq; virq >= virq_base; virq--) {
+   msi_domain_depopulate_descs(dev, virq, 1);
irq_domain_free_irqs_common(domain, virq, 1);
+   }
msi_domain_free_descs(dev, );
 unlock:
msi_unlock_descs(dev);
return ret;
 }
 
+void msi_domain_depopulate_descs(struct device *dev, int virq_base, int nvec)
+{
+   struct msi_ctrl ctrl = {
+   .domid  = MSI_DEFAULT_DOMAIN,
+   .first  = virq_base,
+   .last   = virq_base + nvec - 1,
+   };
+   struct msi_desc *desc;
+   struct xarray *xa;
+   unsigned long idx;
+
+   if (!msi_ctrl_valid(dev, ))
+   return;
+
+   xa = >msi.data->__domains[ctrl.domid].store;
+   xa_for_each_range(xa, idx, desc, ctrl.first, ctrl.last)
+   desc->irq = 0;
+}
+
 /*
  * Carefully check whether the device can use reservation mode. If
  * reservation mode is enabled then the early activation will assign a

Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-28 Thread Thomas Gleixner

On Sun, Nov 27 2022 at 09:53, Paul E. McKenney wrote:
> On Sun, Nov 27, 2022 at 01:40:28PM +0100, Thomas Gleixner wrote:
>> There are quite some reasons why a CPU-hotplug or a hot-unplug operation
>> can fail, which is not a fatal problem, really.
>> 
>> So if a CPU hotplug operation fails, then why can't the torture test
>> just move on and validate that the system still behaves correctly?
>> 
>> That gives us more coverage than just testing the good case and giving
>> up when something unexpected happens.
>
> Agreed, with access to a function like the tick_nohz_full_timekeeper()
> suggested earlier in this email thread, then yes, it would make sense to
> try to offline the CPU anyway, then forgive the failure in cases where
> the CPU matches that indicated by tick_nohz_full_timekeeper().

Why special casing this? There are other valid reasons why offlining can
fail. So we special case timekeeper today and then next week we special
case something else just because. That does not make sense. If it fails
there is a reason and you can log it. The important part is that the
system is functional and stable after the fail and the rollback.

>> I even argue that the torture test should inject random failures into
>> the hotplug state machine to achieve extended code coverage.
>
> I could imagine torture_onoff() telling various CPU-hotplug notifiers
> to refuse the transition using some TBD interface.

There is already an interface which is exposed to sysfs which allows you
to enforce a "fail" at a defined hotplug state.

> That would better test the CPU-hotplug common code's ability to deal
> with failures.

Correct.

> Or did you have something else/additional in mind?

No.

Thanks,

tglx

Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-27 Thread Thomas Gleixner

Zhouyi,

On Sun, Nov 27 2022 at 10:45, Zhouyi Zhou wrote:
> On Sun, Nov 27, 2022 at 1:05 AM Thomas Gleixner  wrote:
>
> So, I should construct my patch as:
> We avoid ... by ...

Not "We avoid".

Avoid this behaviour by 

>> No. We are not exporting this just to make a bogus test case happy.
>>
>> Fix the torture code to handle -EBUSY correctly.
> I am going to do a study on this, for now, I do a grep in the kernel tree:
> find . -name "*.c"|xargs grep cpuhp_setup_state|wc -l
> The result of the grep command shows that there are 268
> cpuhp_setup_state* cases.
> which may make our task more complicated.

Why? The whole point of this torture thing is to stress the
infrastructure.

There are quite some reasons why a CPU-hotplug or a hot-unplug operation
can fail, which is not a fatal problem, really.

So if a CPU hotplug operation fails, then why can't the torture test
just move on and validate that the system still behaves correctly?

That gives us more coverage than just testing the good case and giving
up when something unexpected happens.

I even argue that the torture test should inject random failures into
the hotplug state machine to achieve extended code coverage.

Thanks,

tglx

Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-26 Thread Thomas Gleixner

On Mon, Nov 21 2022 at 11:51, Zhouyi Zhou wrote:
> During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to
> offline tick_do_timer_cpu, the operation will fail because in
> function tick_nohz_cpu_down:
> ```
> if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
>   return -EBUSY;
> ```
> Above bug was first discovered in torture tests performed in PPC VM

How is this a bug?

> of Open Source Lab of Oregon State University, and reproducable in RISC-V
> and X86-64 (with additional kernel commandline cpu0_hotplug).
>
> In this patch, we avoid offline tick_do_timer_cpu by distribute
> the offlining cpu among remaining cpus.

Please read Documentation/process. Search for 'this patch'...

>
> Signed-off-by: Zhouyi Zhou 
> ---
>  include/linux/tick.h|  1 +
>  kernel/time/tick-common.c   |  1 +
>  kernel/time/tick-internal.h |  1 -
>  kernel/torture.c| 10 ++
>  4 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index bfd571f18cfd..23cc0b205853 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -14,6 +14,7 @@
>  #include 
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
> +extern int tick_do_timer_cpu __read_mostly;
>  extern void __init tick_init(void);
>  /* Should be core only, but ARM BL switcher requires it */
>  extern void tick_suspend_local(void);
> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> index 46789356f856..87b9b9afa320 100644
> --- a/kernel/time/tick-common.c
> +++ b/kernel/time/tick-common.c
> @@ -48,6 +48,7 @@ ktime_t tick_next_period;
>   *procedure also covers cpu hotplug.
>   */
>  int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT;
> +EXPORT_SYMBOL_GPL(tick_do_timer_cpu);

No. We are not exporting this just to make a bogus test case happy.

Fix the torture code to handle -EBUSY correctly.

Thanks,

tglx

Re: [patch 39/39] x86/apic: Remove X86_IRQ_ALLOC_CONTIGUOUS_VECTORS

2022-11-17 Thread Thomas Gleixner

Jason, Bjorn, Ashok!

On Wed, Nov 16 2022 at 14:05, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:55:17PM +0100, Thomas Gleixner wrote:
>> Now that the PCI/MSI core code does early checking for multi-MSI support
>> X86_IRQ_ALLOC_CONTIGUOUS_VECTORS is not required anymore.
>
>> Remove the flag and rely on MSI_FLAG_MULTI_PCI_MSI.
>
> Reviewed-by: Jason Gunthorpe 

Thanks for taking the time to go through this pile!

   tglx

Re: [patch 01/39] PCI/MSI: Check for MSI enabled in __pci_msix_enable()

2022-11-17 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 07:39, Ashok Raj wrote:
> On Fri, Nov 11, 2022 at 02:54:15PM +0100, Thomas Gleixner wrote:
>
> Can the pre-enabled checks for msi and msix be moved up before any vector
> range check?
>
> not that it matters for how it fails, does EBUSY sound better?

Does any caller care about the error code or about the ordering in which
the caller stupity is detected?

Re: [patch 34/39] PCI/MSI: Reject multi-MSI early

2022-11-17 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 10:31, Bjorn Helgaas wrote:
> On Fri, Nov 11, 2022 at 02:55:09PM +0100, Thomas Gleixner wrote:
>>  
>> +/**
>> + * pci_msi_domain_supports - Check for support of a particular feature flag
>> + * @pdev:   The PCI device to operate on
>> + * @feature_mask:   The feature mask to check for (full match)
>> + * @mode:   If ALLOW_LEGACY this grants the feature when there is 
>> no irq domain
>> + *  associated to the device. If DENY_LEGACY the lack of an 
>> irq domain
>> + *  makes the feature unsupported
>
> Looks like some of these might be wider than 80 columns, which I think
> was the typical width of this file.

I accommodated to the general sentiment that 80 columns is yesterday. My
new cutoff is 100.

Thanks,

tglx

Re: [patch 13/39] PCI/MSI: Use msi_domain_info::bus_token

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 13:51, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:54:35PM +0100, Thomas Gleixner wrote:
>>  /* PCI-MSI is oneshot-safe */
>>  info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
>> +/* Let the core update the bus token */
>> +info->bus_token = DOMAIN_BUS_PCI_MSI;
>
> comment seems a bit obvious

:)

> Reviewed-by: Jason Gunthorpe 
>
> Should the callers be updated to set this in their "struct
> msi_domain_info" ?

For PCI/MSI we can handle that in the core for all of them. :)

The other msi_domain_info usage in various places needs obviously
special care.

Thanks,

tglx

Re: [patch 12/39] genirq/msi: Add bus token to struct msi_domain_info

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 13:49, Jason Gunthorpe wrote:

> On Fri, Nov 11, 2022 at 02:54:33PM +0100, Thomas Gleixner wrote:
>> From: Ahmed S. Darwish 
>> 
>> Add a bus token member to struct msi_domain_info and let
>> msi_create_irq_domain() set the bus token.
>> 
>> That allows to remove the bus token updates at the call sites.
>> 
>> Suggested-by: Thomas Gleixner 
>> Signed-off-by: Ahmed S. Darwish 
>> Signed-off-by: Thomas Gleixner 
>> ---
>>  include/linux/msi.h |   19 +++
>>  kernel/irq/msi.c|7 +--
>>  2 files changed, 16 insertions(+), 10 deletions(-)
>
> Reviewed-by: Jason Gunthorpe 
>
>>  struct msi_domain_info {
>> -u32 flags;
>> -struct msi_domain_ops   *ops;
>> -struct irq_chip *chip;
>> -void*chip_data;
>> -irq_flow_handler_t  handler;
>> -void*handler_data;
>> -const char  *handler_name;
>> -void*data;
>> +u32 flags;
>> +enum irq_domain_bus_token   bus_token;
>> +struct msi_domain_ops   *ops;
>> +struct irq_chip *chip;
>> +void*chip_data;
>> +irq_flow_handler_t  handler;
>> +void*handler_data;
>> +const char  *handler_name;
>> +void*data;
>>  };
>
> This is why I've been frowning on horizontal alignment :(

Yes, it's annoying when you have to adjust it, but it's fundamentaly
simpler to parse than the clogged together word salad.

Re: [patch 08/39] genirq/msi: Provide msi_domain_ops::post_free()

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 13:44, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:54:27PM +0100, Thomas Gleixner wrote:
>> To prepare for removing the exposure of __msi_domain_free_irqs() provide a
>> post_free() callback in the MSI domain ops which can be used to solve
>> the problem of the only user of __msi_domain_free_irqs() in arch/powerpc.
>> 
>> Signed-off-by: Thomas Gleixner 
>> ---
>>  include/linux/msi.h |4 
>>  kernel/irq/msi.c|2 ++
>>  2 files changed, 6 insertions(+)
>
> Make sense, but I do wonder why PPC needs this ordering..

Some hypervisor thing.

Re: [patch 32/39] PCI/MSI: Reorder functions in msi.c

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 10:28, Bjorn Helgaas wrote:

> On Fri, Nov 11, 2022 at 02:55:06PM +0100, Thomas Gleixner wrote:
>> From: Ahmed S. Darwish 
>> 
>> There is no way to navigate msi.c without banging the head against the wall
>> every now and then because MSI and MSI-X specific functions are
>> intermingled and the code flow is completely non-obvious.
>> 
>> Reorder everthing so common helpers, MSI and MSI-X specific functions are
>> grouped together.
>
> s/everthing/everything/
>
>> Suggested-by: Thomas Gleixner 
>> Signed-off-by: Ahmed S. Darwish 
>> Signed-off-by: Thomas Gleixner 
>
> Acked-by: Bjorn Helgaas 
>
> I assume this is pure code movement, so I didn't even look at the
> text below.

It is.

Re: [patch 27/39] PCI/MSI: Move pci_disable_msix() to api.c

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 10:26, Bjorn Helgaas wrote:
>>  /**
>> + * pci_disable_msix() - Disable MSI-X interrupt mode on device
>> + * @dev: the PCI device to operate on
>> + *
>> + * Legacy device driver API to disable MSI-X interrupt mode on device,
>> + * free earlier-allocated interrupt vectors, and restore INTx emulation.
>
> Isn't INTx *emulation* a PCIe implementation detail?  Doesn't seem
> relevant to callers that it's emulated.

Don't know how the emulation ended up there. It restores INTx which is
obvioulsy not a wire on PCIe... But yes, it's uninteresting.

Re: [patch 23/39] PCI/MSI: Move pci_alloc_irq_vectors_affinity() to api.c

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 10:23, Bjorn Helgaas wrote:
> On Fri, Nov 11, 2022 at 02:54:51PM +0100, Thomas Gleixner wrote:
>> + * Same as pci_alloc_irq_vectors(), but with the extra @affd parameter.
>> + * Check that function docs, and  irq_affinity, for more details.
>
> Is " irq_affinity" some kernel-doc syntax, or is the "&"
> superfluous?

The latter.

Re: [patch 22/39] PCI/MSI: Move pci_alloc_irq_vectors() to api.c

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 10:22, Bjorn Helgaas wrote:
>> + * Allocate up to @max_vecs interrupt vectors on device. MSI-X irq
>
> s/irq/IRQ/
>
>> + * vector allocation has a higher precedence over plain MSI, which has a
>> + * higher precedence over legacy INTx emulation.
>> + *
>> + * Upon a successful allocation, the caller should use pci_irq_vector()
>> + * to get the Linux IRQ number to be passed to request_threaded_irq().
>> + * The driver must call pci_free_irq_vectors() on cleanup.
>> + *
>> + * Return: number of allocated vectors (which might be smaller than
>> + * @max_vecs), -ENOSPC if less than @min_vecs interrupt vectors are
>
> s/less/fewer/ (also in some previous patches, IIRC)

Will fix.

Re: [patch 20/39] PCI/MSI: Move pci_enable_msi() API to api.c

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 10:18, Bjorn Helgaas wrote:
> On Fri, Nov 11, 2022 at 02:54:46PM +0100, Thomas Gleixner wrote:
>> From: Ahmed S. Darwish 
>> 
>> To distangle the maze in msi.c all exported device-driver MSI APIs are now
>> to be grouped in one file, api.c.
>> 
>> Move pci_enable_msi() and make its kernel-doc comprehensive.
>> 
>> Signed-off-by: Ahmed S. Darwish 
>> Signed-off-by: Thomas Gleixner 
>
> Acked-by: Bjorn Helgaas 
>
> Nit: suggest "disentangle" or "untangle" for "distangle" here and in
> subsequent patches.

My fault. I suggested the word to Ahmed well knowing that this is one of
the words I never get spelled correctly :)

Re: [patch 15/39] PCI/MSI: Get rid of PCI_MSI_IRQ_DOMAIN

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 10:12, Bjorn Helgaas wrote:
> On Fri, Nov 11, 2022 at 02:54:38PM +0100, Thomas Gleixner wrote:
>> What a zoo:
>> 
>>  PCI_MSI
>>  select GENERIC_MSI_IRQ
>> 
>>  PCI_MSI_IRQ_DOMAIN
>>  def_bool y
>>  depends on PCI_MSI
>>  select GENERIC_MSI_IRQ_DOMAIN
>> 
>> Ergo PCI_MSI enables PCI_MSI_IRQ_DOMAIN which in turn selects
>> GENERIC_MSI_IRQ_DOMAIN. So all the dependencies on PCI_MSI_IRQ_DOMAIN are
>> just an indirection to PCI_MSI.
>> 
>> Match the reality and just admit that PCI_MSI requires
>> GENERIC_MSI_IRQ_DOMAIN.
>> 
>> Signed-off-by: Thomas Gleixner 
>
> Acked-by: Bjorn Helgaas 
>
> Just FYI, this will conflict with my work-in-progress to add more
> COMPILE_TEST coverage:
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?id=72b5e7c401a1
>
> No *actual* conflicts, just textually next door, so should be sipmle
> to resolve.  Worst case I can postpone my patch until the next cycle.

Linus should be able to resolve that conflict :)

Re: [patch 03/39] iommu/amd: Remove bogus check for multi MSI-X

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 08:02, Ashok Raj wrote:
> On Fri, Nov 11, 2022 at 02:54:19PM +0100, Thomas Gleixner wrote:
>> PCI/Multi-MSI is MSI specific and not supported for MSI-X
>> 
>> Signed-off-by: Thomas Gleixner 
>> ---
>>  drivers/iommu/amd/iommu.c |3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>> 
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -3294,8 +3294,7 @@ static int irq_remapping_alloc(struct ir
>>  
>>  if (!info)
>>  return -EINVAL;
>> -if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI &&
>> -info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
>> +if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI)
>>  return -EINVAL;
>>  
>>  /*
>> 
>
> nit:
>
> maybe better to merge patch2/3 since both seem related?

What's better about it? It's two different drivers.

Re: [patch 02/39] iommu/vt-d: Remove bogus check for multi MSI-X

2022-11-16 Thread Thomas Gleixner

On Wed, Nov 16 2022 at 07:52, Ashok Raj wrote:
> On Fri, Nov 11, 2022 at 02:54:17PM +0100, Thomas Gleixner wrote:
>> PCI/Multi-MSI is MSI specific and not supported for MSI-X.
>> 
>> Signed-off-by: Thomas Gleixner 
>> ---
>>  drivers/iommu/intel/irq_remapping.c |3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>> 
>> --- a/drivers/iommu/intel/irq_remapping.c
>> +++ b/drivers/iommu/intel/irq_remapping.c
>> @@ -1334,8 +1334,7 @@ static int intel_irq_remapping_alloc(str
>>  
>>  if (!info || !iommu)
>>  return -EINVAL;
>> -if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI &&
>> -info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
>> +if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI)
>>  return -EINVAL;
>>  
>>  /*
>> 
>
> This check is only making sure that when multi-msi is requested that the
> type has to be either MSI/MSIX.

MSI-X does not support multi vector allocations on a single entry.

> Wouldn't this change return -EINVAL when type = MSIX?

Rightfully so. MSIX vectors are allocated one by one. Has been that way
forever.

Thanks,

tglx

[patch 32/39] PCI/MSI: Reorder functions in msi.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

There is no way to navigate msi.c without banging the head against the wall
every now and then because MSI and MSI-X specific functions are
intermingled and the code flow is completely non-obvious.

Reorder everthing so common helpers, MSI and MSI-X specific functions are
grouped together.

Suggested-by: Thomas Gleixner 
Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/msi.c |  577 +-
 1 file changed, 295 insertions(+), 282 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -16,6 +16,97 @@
 int pci_msi_enable = 1;
 int pci_msi_ignore_mask;
 
+/**
+ * pci_msi_supported - check whether MSI may be enabled on a device
+ * @dev: pointer to the pci_dev data structure of MSI device function
+ * @nvec: how many MSIs have been requested?
+ *
+ * Look at global flags, the device itself, and its parent buses
+ * to determine if MSI/-X are supported for the device. If MSI/-X is
+ * supported return 1, else return 0.
+ **/
+static int pci_msi_supported(struct pci_dev *dev, int nvec)
+{
+   struct pci_bus *bus;
+
+   /* MSI must be globally enabled and supported by the device */
+   if (!pci_msi_enable)
+   return 0;
+
+   if (!dev || dev->no_msi)
+   return 0;
+
+   /*
+* You can't ask to have 0 or less MSIs configured.
+*  a) it's stupid ..
+*  b) the list manipulation code assumes nvec >= 1.
+*/
+   if (nvec < 1)
+   return 0;
+
+   /*
+* Any bridge which does NOT route MSI transactions from its
+* secondary bus to its primary bus must set NO_MSI flag on
+* the secondary pci_bus.
+*
+* The NO_MSI flag can either be set directly by:
+* - arch-specific PCI host bus controller drivers (deprecated)
+* - quirks for specific PCI bridges
+*
+* or indirectly by platform-specific PCI host bridge drivers by
+* advertising the 'msi_domain' property, which results in
+* the NO_MSI flag when no MSI domain is found for this bridge
+* at probe time.
+*/
+   for (bus = dev->bus; bus; bus = bus->parent)
+   if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
+   return 0;
+
+   return 1;
+}
+
+static void pcim_msi_release(void *pcidev)
+{
+   struct pci_dev *dev = pcidev;
+
+   dev->is_msi_managed = false;
+   pci_free_irq_vectors(dev);
+}
+
+/*
+ * Needs to be separate from pcim_release to prevent an ordering problem
+ * vs. msi_device_data_release() in the MSI core code.
+ */
+static int pcim_setup_msi_release(struct pci_dev *dev)
+{
+   int ret;
+
+   if (!pci_is_managed(dev) || dev->is_msi_managed)
+   return 0;
+
+   ret = devm_add_action(>dev, pcim_msi_release, dev);
+   if (!ret)
+   dev->is_msi_managed = true;
+   return ret;
+}
+
+/*
+ * Ordering vs. devres: msi device data has to be installed first so that
+ * pcim_msi_release() is invoked before it on device release.
+ */
+static int pci_setup_msi_context(struct pci_dev *dev)
+{
+   int ret = msi_setup_device_data(>dev);
+
+   if (!ret)
+   ret = pcim_setup_msi_release(dev);
+   return ret;
+}
+
+/*
+ * Helper functions for mask/unmask and MSI message handling
+ */
+
 void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
 {
raw_spinlock_t *lock = _pci_dev(desc->dev)->msi_lock;
@@ -163,15 +254,8 @@ void pci_write_msi_msg(unsigned int irq,
 }
 EXPORT_SYMBOL_GPL(pci_write_msi_msg);
 
-void pci_free_msi_irqs(struct pci_dev *dev)
-{
-   pci_msi_teardown_msi_irqs(dev);
 
-   if (dev->msix_base) {
-   iounmap(dev->msix_base);
-   dev->msix_base = NULL;
-   }
-}
+/* PCI/MSI specific functionality */
 
 static void pci_intx_for_msi(struct pci_dev *dev, int enable)
 {
@@ -190,111 +274,6 @@ static void pci_msi_set_enable(struct pc
pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control);
 }
 
-/*
- * Architecture override returns true when the PCI MSI message should be
- * written by the generic restore function.
- */
-bool __weak arch_restore_msi_irqs(struct pci_dev *dev)
-{
-   return true;
-}
-
-void __pci_restore_msi_state(struct pci_dev *dev)
-{
-   struct msi_desc *entry;
-   u16 control;
-
-   if (!dev->msi_enabled)
-   return;
-
-   entry = irq_get_msi_desc(dev->irq);
-
-   pci_intx_for_msi(dev, 0);
-   pci_msi_set_enable(dev, 0);
-   if (arch_restore_msi_irqs(dev))
-   __pci_write_msi_msg(entry, >msg);
-
-   pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, );
-   pci_msi_update_mask(entry, 0, 0);
-   control &= ~PCI_MSI_FLAGS_QSIZE;
-   control |= (entry->pci.msi_attrib.multiple << 4) | PCI_MSI_FLAGS_

[patch 36/39] PCI/MSI: Validate MSIX contiguous restriction early

2022-11-11 Thread Thomas Gleixner

With interrupt domains the sanity check for MSI-X vector validation can be
done _before_ any allocation happens. The sanity check only applies to the
allocation functions which have an 'entries' array argument. The entries
array is filled by the caller with the requested MSI-X indicies. Some drivers
have gaps in the index space which is not supported on all architectures.

The PCI/MSI irqdomain has a 'feature' bit to enforce this validation late
during the allocation phase.

Just do it right away before doing any other work along with the other
sanity checks on that array.

Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/msi.c |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -725,13 +725,17 @@ static int msix_capability_init(struct p
return ret;
 }
 
-static bool pci_msix_validate_entries(struct msix_entry *entries, int nvec, 
int hwsize)
+static bool pci_msix_validate_entries(struct pci_dev *dev, struct msix_entry 
*entries,
+ int nvec, int hwsize)
 {
+   bool nogap;
int i, j;
 
if (!entries)
return true;
 
+   nogap = pci_msi_domain_supports(dev, MSI_FLAG_MSIX_CONTIGUOUS, 
DENY_LEGACY);
+
for (i = 0; i < nvec; i++) {
/* Entry within hardware limit? */
if (entries[i].entry >= hwsize)
@@ -742,6 +746,9 @@ static bool pci_msix_validate_entries(st
if (entries[i].entry == entries[j].entry)
return false;
}
+   /* Check for unsupported gaps */
+   if (nogap && entries[i].entry != i)
+   return false;
}
return true;
 }
@@ -773,7 +780,7 @@ int __pci_enable_msix_range(struct pci_d
if (hwsize < 0)
return hwsize;
 
-   if (!pci_msix_validate_entries(entries, nvec, hwsize))
+   if (!pci_msix_validate_entries(dev, entries, nvec, hwsize))
return -EINVAL;
 
/* PCI_IRQ_VIRTUAL is a horrible hack! */

[patch 23/39] PCI/MSI: Move pci_alloc_irq_vectors_affinity() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_alloc_irq_vectors_affinity() and let its kernel-doc reference
pci_alloc_irq_vectors() documentation added in parent commit.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 59 +++-
 drivers/pci/msi/msi.c | 65 
+
 2 files changed, 59 insertions(+), 65 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 1714905943fb..8546749afa6e 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -123,3 +123,62 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned 
int min_vecs,
  flags, NULL);
 }
 EXPORT_SYMBOL(pci_alloc_irq_vectors);
+
+/**
+ * pci_alloc_irq_vectors_affinity() - Allocate multiple device interrupt
+ *vectors with affinity requirements
+ * @dev:  the PCI device to operate on
+ * @min_vecs: minimum required number of vectors (must be >= 1)
+ * @max_vecs: maximum desired number of vectors
+ * @flags:allocation flags, as in pci_alloc_irq_vectors()
+ * @affd: affinity requirements (can be %NULL).
+ *
+ * Same as pci_alloc_irq_vectors(), but with the extra @affd parameter.
+ * Check that function docs, and  irq_affinity, for more details.
+ */
+int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
+  unsigned int max_vecs, unsigned int flags,
+  struct irq_affinity *affd)
+{
+   struct irq_affinity msi_default_affd = {0};
+   int nvecs = -ENOSPC;
+
+   if (flags & PCI_IRQ_AFFINITY) {
+   if (!affd)
+   affd = _default_affd;
+   } else {
+   if (WARN_ON(affd))
+   affd = NULL;
+   }
+
+   if (flags & PCI_IRQ_MSIX) {
+   nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
+   affd, flags);
+   if (nvecs > 0)
+   return nvecs;
+   }
+
+   if (flags & PCI_IRQ_MSI) {
+   nvecs = __pci_enable_msi_range(dev, min_vecs, max_vecs, affd);
+   if (nvecs > 0)
+   return nvecs;
+   }
+
+   /* use legacy IRQ if allowed */
+   if (flags & PCI_IRQ_LEGACY) {
+   if (min_vecs == 1 && dev->irq) {
+   /*
+* Invoke the affinity spreading logic to ensure that
+* the device driver can adjust queue configuration
+* for the single interrupt case.
+*/
+   if (affd)
+   irq_create_affinity_masks(1, affd);
+   pci_intx(dev, 1);
+   return 1;
+   }
+   }
+
+   return nvecs;
+}
+EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 6700ef1c734e..a028774f438d 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -887,71 +887,6 @@ int __pci_enable_msix_range(struct pci_dev *dev,
 }
 
 /**
- * pci_alloc_irq_vectors_affinity - allocate multiple IRQs for a device
- * @dev:   PCI device to operate on
- * @min_vecs:  minimum number of vectors required (must be >= 1)
- * @max_vecs:  maximum (desired) number of vectors
- * @flags: flags or quirks for the allocation
- * @affd:  optional description of the affinity requirements
- *
- * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X or MSI
- * vectors if available, and fall back to a single legacy vector
- * if neither is available.  Return the number of vectors allocated,
- * (which might be smaller than @max_vecs) if successful, or a negative
- * error code on error. If less than @min_vecs interrupt vectors are
- * available for @dev the function will fail with -ENOSPC.
- *
- * To get the Linux IRQ number used for a vector that can be passed to
- * request_irq() use the pci_irq_vector() helper.
- */
-int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
-  unsigned int max_vecs, unsigned int flags,
-  struct irq_affinity *affd)
-{
-   struct irq_affinity msi_default_affd = {0};
-   int nvecs = -ENOSPC;
-
-   if (flags & PCI_IRQ_AFFINITY) {
-   if (!affd)
-   affd = _default_affd;
-   } else {
-   if (WARN_ON(affd))
-   affd = NULL;
-   }
-
-   if (flags & PCI_IRQ_MSIX) {
-   nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
-

[patch 31/39] Documentation: PCI: Add reference to PCI/MSI device driver APIs

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

All exported device-driver MSI APIs are now grouped in one place at
drivers/pci/msi/api.c with comprehensive kernel-docs added.

Reference these kernel-docs in the official PCI/MSI howto.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 Documentation/PCI/msi-howto.rst |   10 ++
 1 file changed, 10 insertions(+)
---
--- a/Documentation/PCI/msi-howto.rst
+++ b/Documentation/PCI/msi-howto.rst
@@ -285,3 +285,13 @@ to bridges between the PCI root and the
 It is also worth checking the device driver to see whether it supports MSIs.
 For example, it may contain calls to pci_alloc_irq_vectors() with the
 PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
+
+
+List of device drivers MSI(-X) APIs
+===
+
+The PCI/MSI subystem has a dedicated C file for its exported device driver
+APIs — `drivers/pci/msi/api.c`. The following functions are exported:
+
+.. kernel-doc:: drivers/pci/msi/api.c
+   :export:

[patch 27/39] PCI/MSI: Move pci_disable_msix() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_disable_msix() and make its kernel-doc comprehensive.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 24 
 drivers/pci/msi/msi.c | 14 +-
 drivers/pci/msi/msi.h |  1 +
 3 files changed, 26 insertions(+), 13 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 83ea38ffa116..653a61868ae6 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -112,6 +112,30 @@ int pci_enable_msix_range(struct pci_dev *dev, struct 
msix_entry *entries,
 EXPORT_SYMBOL(pci_enable_msix_range);
 
 /**
+ * pci_disable_msix() - Disable MSI-X interrupt mode on device
+ * @dev: the PCI device to operate on
+ *
+ * Legacy device driver API to disable MSI-X interrupt mode on device,
+ * free earlier-allocated interrupt vectors, and restore INTx emulation.
+ * The PCI device Linux IRQ (@dev->irq) is restored to its default pin
+ * assertion IRQ. This is the cleanup pair of pci_enable_msix_range().
+ *
+ * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API
+ * pair should, in general, be used instead.
+ */
+void pci_disable_msix(struct pci_dev *dev)
+{
+   if (!pci_msi_enabled() || !dev || !dev->msix_enabled)
+   return;
+
+   msi_lock_descs(>dev);
+   pci_msix_shutdown(dev);
+   pci_free_msi_irqs(dev);
+   msi_unlock_descs(>dev);
+}
+EXPORT_SYMBOL(pci_disable_msix);
+
+/**
  * pci_alloc_irq_vectors() - Allocate multiple device interrupt vectors
  * @dev:  the PCI device to operate on
  * @min_vecs: minimum required number of vectors (must be >= 1)
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 1226d66da992..6fa90d07d2e4 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -736,7 +736,7 @@ static int __pci_enable_msix(struct pci_dev *dev, struct 
msix_entry *entries,
return msix_capability_init(dev, entries, nvec, affd);
 }
 
-static void pci_msix_shutdown(struct pci_dev *dev)
+void pci_msix_shutdown(struct pci_dev *dev)
 {
struct msi_desc *desc;
 
@@ -758,18 +758,6 @@ static void pci_msix_shutdown(struct pci_dev *dev)
pcibios_alloc_irq(dev);
 }
 
-void pci_disable_msix(struct pci_dev *dev)
-{
-   if (!pci_msi_enable || !dev || !dev->msix_enabled)
-   return;
-
-   msi_lock_descs(>dev);
-   pci_msix_shutdown(dev);
-   pci_free_msi_irqs(dev);
-   msi_unlock_descs(>dev);
-}
-EXPORT_SYMBOL(pci_disable_msix);
-
 int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
   struct irq_affinity *affd)
 {
diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h
index 8c4a5289432d..77e2587f7e4f 100644
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -86,6 +86,7 @@ static inline __attribute_const__ u32 msi_multi_mask(struct 
msi_desc *desc)
 
 /* MSI internal functions invoked from the public APIs */
 void pci_msi_shutdown(struct pci_dev *dev);
+void pci_msix_shutdown(struct pci_dev *dev);
 void pci_free_msi_irqs(struct pci_dev *dev);
 int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct 
irq_affinity *affd);
 int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, 
int minvec,

[patch 33/39] PCI/MSI: Sanitize MSI-X checks

2022-11-11 Thread Thomas Gleixner

There is no point in doing the same sanity checks over and over in a loop
during MSI-X enablement. Put them in front of the loop and return early
when they fail.

Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/msi.c |   67 +-
 1 file changed, 34 insertions(+), 33 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -721,47 +721,31 @@ static int msix_capability_init(struct p
return ret;
 }
 
-static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
-int nvec, struct irq_affinity *affd, int flags)
+static bool pci_msix_validate_entries(struct msix_entry *entries, int nvec, 
int hwsize)
 {
-   int nr_entries;
int i, j;
 
-   if (!pci_msi_supported(dev, nvec) || dev->current_state != PCI_D0)
-   return -EINVAL;
+   if (!entries)
+   return true;
 
-   nr_entries = pci_msix_vec_count(dev);
-   if (nr_entries < 0)
-   return nr_entries;
-   if (nvec > nr_entries && !(flags & PCI_IRQ_VIRTUAL))
-   return nr_entries;
-
-   if (entries) {
-   /* Check for any invalid entries */
-   for (i = 0; i < nvec; i++) {
-   if (entries[i].entry >= nr_entries)
-   return -EINVAL; /* invalid entry */
-   for (j = i + 1; j < nvec; j++) {
-   if (entries[i].entry == entries[j].entry)
-   return -EINVAL; /* duplicate entry */
-   }
+   for (i = 0; i < nvec; i++) {
+   /* Entry within hardware limit? */
+   if (entries[i].entry >= hwsize)
+   return false;
+
+   /* Check for duplicate entries */
+   for (j = i + 1; j < nvec; j++) {
+   if (entries[i].entry == entries[j].entry)
+   return false;
}
}
-
-   /* Check whether driver already requested for MSI IRQ */
-   if (dev->msi_enabled) {
-   pci_info(dev, "can't enable MSI-X (MSI IRQ already 
assigned)\n");
-   return -EINVAL;
-   }
-   return msix_capability_init(dev, entries, nvec, affd);
+   return true;
 }
 
-int __pci_enable_msix_range(struct pci_dev *dev,
-   struct msix_entry *entries, int minvec,
-   int maxvec, struct irq_affinity *affd,
-   int flags)
+int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, 
int minvec,
+   int maxvec, struct irq_affinity *affd, int flags)
 {
-   int rc, nvec = maxvec;
+   int hwsize, rc, nvec = maxvec;
 
if (maxvec < minvec)
return -ERANGE;
@@ -774,6 +758,23 @@ int __pci_enable_msix_range(struct pci_d
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;
 
+   if (!pci_msi_supported(dev, nvec) || dev->current_state != PCI_D0)
+   return -EINVAL;
+
+   hwsize = pci_msix_vec_count(dev);
+   if (hwsize < 0)
+   return hwsize;
+
+   if (!pci_msix_validate_entries(entries, nvec, hwsize))
+   return -EINVAL;
+
+   /* PCI_IRQ_VIRTUAL is a horrible hack! */
+   if (nvec > hwsize && !(flags & PCI_IRQ_VIRTUAL))
+   nvec = hwsize;
+
+   if (nvec < minvec)
+   return -ENOSPC;
+
rc = pci_setup_msi_context(dev);
if (rc)
return rc;
@@ -785,7 +786,7 @@ int __pci_enable_msix_range(struct pci_d
return -ENOSPC;
}
 
-   rc = __pci_enable_msix(dev, entries, nvec, affd, flags);
+   rc = msix_capability_init(dev, entries, nvec, affd);
if (rc == 0)
return nvec;

[patch 18/39] PCI/MSI: Move mask and unmask helpers to msi.h

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

The upcoming support for per device MSI interrupt domains needs to share
some of the inline helpers with the MSI implementation.

Move them to the header file.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/msi.c | 61 +--
 drivers/pci/msi/msi.h | 83 +---
 2 files changed, 74 insertions(+), 70 deletions(-)
---
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 160af9f01669..5c310df55d0d 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -16,7 +16,7 @@
 static int pci_msi_enable = 1;
 int pci_msi_ignore_mask;
 
-static noinline void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 
set)
+void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
 {
raw_spinlock_t *lock = _pci_dev(desc->dev)->msi_lock;
unsigned long flags;
@@ -32,65 +32,6 @@ static noinline void pci_msi_update_mask(struct msi_desc 
*desc, u32 clear, u32 s
raw_spin_unlock_irqrestore(lock, flags);
 }
 
-static inline void pci_msi_mask(struct msi_desc *desc, u32 mask)
-{
-   pci_msi_update_mask(desc, 0, mask);
-}
-
-static inline void pci_msi_unmask(struct msi_desc *desc, u32 mask)
-{
-   pci_msi_update_mask(desc, mask, 0);
-}
-
-static inline void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
-{
-   return desc->pci.mask_base + desc->msi_index * PCI_MSIX_ENTRY_SIZE;
-}
-
-/*
- * This internal function does not flush PCI writes to the device.  All
- * users must ensure that they read from the device before either assuming
- * that the device state is up to date, or returning out of this file.
- * It does not affect the msi_desc::msix_ctrl cache either. Use with care!
- */
-static void pci_msix_write_vector_ctrl(struct msi_desc *desc, u32 ctrl)
-{
-   void __iomem *desc_addr = pci_msix_desc_addr(desc);
-
-   if (desc->pci.msi_attrib.can_mask)
-   writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
-}
-
-static inline void pci_msix_mask(struct msi_desc *desc)
-{
-   desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
-   pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl);
-   /* Flush write to device */
-   readl(desc->pci.mask_base);
-}
-
-static inline void pci_msix_unmask(struct msi_desc *desc)
-{
-   desc->pci.msix_ctrl &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
-   pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl);
-}
-
-static void __pci_msi_mask_desc(struct msi_desc *desc, u32 mask)
-{
-   if (desc->pci.msi_attrib.is_msix)
-   pci_msix_mask(desc);
-   else
-   pci_msi_mask(desc, mask);
-}
-
-static void __pci_msi_unmask_desc(struct msi_desc *desc, u32 mask)
-{
-   if (desc->pci.msi_attrib.is_msix)
-   pci_msix_unmask(desc);
-   else
-   pci_msi_unmask(desc, mask);
-}
-
 /**
  * pci_msi_mask_irq - Generic IRQ chip callback to mask PCI/MSI interrupts
  * @data:  pointer to irqdata associated to that interrupt
diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h
index fc92603b33e1..d8f62d911f08 100644
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -8,21 +8,67 @@
 int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
 void pci_msi_teardown_msi_irqs(struct pci_dev *dev);
 
-#ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS
-int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
-void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev);
-#else
-static inline int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, 
int type)
+/* Mask/unmask helpers */
+void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set);
+
+static inline void pci_msi_mask(struct msi_desc *desc, u32 mask)
 {
-   WARN_ON_ONCE(1);
-   return -ENODEV;
+   pci_msi_update_mask(desc, 0, mask);
 }
 
-static inline void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev)
+static inline void pci_msi_unmask(struct msi_desc *desc, u32 mask)
 {
-   WARN_ON_ONCE(1);
+   pci_msi_update_mask(desc, mask, 0);
+}
+
+static inline void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
+{
+   return desc->pci.mask_base + desc->msi_index * PCI_MSIX_ENTRY_SIZE;
+}
+
+/*
+ * This internal function does not flush PCI writes to the device.  All
+ * users must ensure that they read from the device before either assuming
+ * that the device state is up to date, or returning out of this file.
+ * It does not affect the msi_desc::msix_ctrl cache either. Use with care!
+ */
+static inline void pci_msix_write_vector_ctrl(struct msi_desc *desc, u32 ctrl)
+{
+   void __iomem *desc_addr = pci_msix_desc_addr(desc);
+
+   if (desc->pci.msi_attrib.can_mask)
+   writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
+}
+
+static inline void pci_msix_mask(struct msi_desc *desc)
+{
+   d

[patch 19/39] PCI/MSI: Move pci_disable_msi() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

msi.c is a maze of randomly sorted functions which makes the code
unreadable. As a first step split the driver visible API and the internal
implementation which also allows proper API documentation via one file.

Create drivers/pci/msi/api.c to group all exported device-driver PCI/MSI
APIs in one C file.

Begin by moving pci_disable_msi() there and add kernel-doc for the function
as appropriate.

Suggested-by: Thomas Gleixner 
Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/Makefile |  3 +--
 drivers/pci/msi/api.c| 37 +
 drivers/pci/msi/msi.c| 22 +-
 drivers/pci/msi/msi.h|  4 
 4 files changed, 47 insertions(+), 19 deletions(-)
 create mode 100644 drivers/pci/msi/api.c
---
diff --git a/drivers/pci/msi/Makefile b/drivers/pci/msi/Makefile
index 4e0a7e07965e..839ff72d72a8 100644
--- a/drivers/pci/msi/Makefile
+++ b/drivers/pci/msi/Makefile
@@ -2,6 +2,5 @@
 #
 # Makefile for the PCI/MSI
 obj-$(CONFIG_PCI)  += pcidev_msi.o
-obj-$(CONFIG_PCI_MSI)  += msi.o
-obj-$(CONFIG_PCI_MSI)  += irqdomain.o
+obj-$(CONFIG_PCI_MSI)  += api.o msi.o irqdomain.o
 obj-$(CONFIG_PCI_MSI_ARCH_FALLBACKS)   += legacy.o
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
new file mode 100644
index ..7485942cbe5d
--- /dev/null
+++ b/drivers/pci/msi/api.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI MSI/MSI-X — Exported APIs for device drivers
+ *
+ * Copyright (C) 2003-2004 Intel
+ * Copyright (C) Tom Long Nguyen (tom.l.ngu...@intel.com)
+ * Copyright (C) 2016 Christoph Hellwig.
+ * Copyright (C) 2022 Linutronix GmbH
+ */
+
+#include 
+
+#include "msi.h"
+
+/**
+ * pci_disable_msi() - Disable MSI interrupt mode on device
+ * @dev: the PCI device to operate on
+ *
+ * Legacy device driver API to disable MSI interrupt mode on device,
+ * free earlier allocated interrupt vectors, and restore INTx emulation.
+ * The PCI device Linux IRQ (@dev->irq) is restored to its default
+ * pin-assertion IRQ. This is the cleanup pair of pci_enable_msi().
+ *
+ * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API
+ * pair should, in general, be used instead.
+ */
+void pci_disable_msi(struct pci_dev *dev)
+{
+   if (!pci_msi_enabled() || !dev || !dev->msi_enabled)
+   return;
+
+   msi_lock_descs(>dev);
+   pci_msi_shutdown(dev);
+   pci_free_msi_irqs(dev);
+   msi_unlock_descs(>dev);
+}
+EXPORT_SYMBOL(pci_disable_msi);
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 5c310df55d0d..4a1300b74518 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -163,7 +163,7 @@ void pci_write_msi_msg(unsigned int irq, struct msi_msg 
*msg)
 }
 EXPORT_SYMBOL_GPL(pci_write_msi_msg);
 
-static void free_msi_irqs(struct pci_dev *dev)
+void pci_free_msi_irqs(struct pci_dev *dev)
 {
pci_msi_teardown_msi_irqs(dev);
 
@@ -413,7 +413,7 @@ static int msi_capability_init(struct pci_dev *dev, int 
nvec,
 
 err:
pci_msi_unmask(entry, msi_multi_mask(entry));
-   free_msi_irqs(dev);
+   pci_free_msi_irqs(dev);
 fail:
dev->msi_enabled = 0;
 unlock:
@@ -531,7 +531,7 @@ static int msix_setup_interrupts(struct pci_dev *dev, void 
__iomem *base,
goto out_unlock;
 
 out_free:
-   free_msi_irqs(dev);
+   pci_free_msi_irqs(dev);
 out_unlock:
msi_unlock_descs(>dev);
kfree(masks);
@@ -680,7 +680,7 @@ int pci_msi_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msi_vec_count);
 
-static void pci_msi_shutdown(struct pci_dev *dev)
+void pci_msi_shutdown(struct pci_dev *dev)
 {
struct msi_desc *desc;
 
@@ -701,18 +701,6 @@ static void pci_msi_shutdown(struct pci_dev *dev)
pcibios_alloc_irq(dev);
 }
 
-void pci_disable_msi(struct pci_dev *dev)
-{
-   if (!pci_msi_enable || !dev || !dev->msi_enabled)
-   return;
-
-   msi_lock_descs(>dev);
-   pci_msi_shutdown(dev);
-   free_msi_irqs(dev);
-   msi_unlock_descs(>dev);
-}
-EXPORT_SYMBOL(pci_disable_msi);
-
 /**
  * pci_msix_vec_count - return the number of device's MSI-X table entries
  * @dev: pointer to the pci_dev data structure of MSI-X device function
@@ -797,7 +785,7 @@ void pci_disable_msix(struct pci_dev *dev)
 
msi_lock_descs(>dev);
pci_msix_shutdown(dev);
-   free_msi_irqs(dev);
+   pci_free_msi_irqs(dev);
msi_unlock_descs(>dev);
 }
 EXPORT_SYMBOL(pci_disable_msix);
diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h
index d8f62d911f08..634879277349 100644
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -84,6 +84,10 @@ static inline __attribute_const__ u32 msi_multi_mask(struct 
msi_desc *desc)
return (1 << (1 << desc->pci.msi_attrib.multi_cap)) - 1;
 }
 
+/* MSI internal functions invoked fro

[patch 28/39] PCI/MSI: Move pci_irq_get_affinity() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_irq_get_affinity() and let its kernel-doc match rest of the
file.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 43 +++
 drivers/pci/msi/msi.c | 38 --
 2 files changed, 43 insertions(+), 38 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 653a61868ae6..473df7ba0584 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -9,6 +9,7 @@
  */
 
 #include 
+#include 
 
 #include "msi.h"
 
@@ -251,6 +252,48 @@ int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
 EXPORT_SYMBOL(pci_irq_vector);
 
 /**
+ * pci_irq_get_affinity() - Get a device interrupt vector affinity
+ * @dev: the PCI device to operate on
+ * @nr:  device-relative interrupt vector index (0-based); has different
+ *   meanings, depending on interrupt mode
+ * MSI-Xthe index in the MSI-X vector table
+ * MSI  the index of the enabled MSI vectors
+ * INTx must be 0
+ *
+ * Return: MSI/MSI-X vector affinity, NULL if @nr is out of range or if
+ * the MSI(-X) vector was allocated without explicit affinity
+ * requirements (e.g., by pci_enable_msi(), pci_enable_msix_range(), or
+ * pci_alloc_irq_vectors() without the %PCI_IRQ_AFFINITY flag). Return a
+ * generic set of CPU ids representing all possible CPUs available
+ * during system boot if the device is in legacy INTx mode.
+ */
+const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
+{
+   int idx, irq = pci_irq_vector(dev, nr);
+   struct msi_desc *desc;
+
+   if (WARN_ON_ONCE(irq <= 0))
+   return NULL;
+
+   desc = irq_get_msi_desc(irq);
+   /* Non-MSI does not have the information handy */
+   if (!desc)
+   return cpu_possible_mask;
+
+   /* MSI[X] interrupts can be allocated without affinity descriptor */
+   if (!desc->affinity)
+   return NULL;
+
+   /*
+* MSI has a mask array in the descriptor.
+* MSI-X has a single mask.
+*/
+   idx = dev->msi_enabled ? nr : 0;
+   return >affinity[idx].mask;
+}
+EXPORT_SYMBOL(pci_irq_get_affinity);
+
+/**
  * pci_free_irq_vectors() - Free previously allocated IRQs for a device
  * @dev: the PCI device to operate on
  *
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 6fa90d07d2e4..d78646d1c116 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -854,44 +854,6 @@ int __pci_enable_msix_range(struct pci_dev *dev,
}
 }
 
-/**
- * pci_irq_get_affinity - return the affinity of a particular MSI vector
- * @dev:   PCI device to operate on
- * @nr:device-relative interrupt vector index (0-based).
- *
- * @nr has the following meanings depending on the interrupt mode:
- *   MSI-X:The index in the MSI-X vector table
- *   MSI:  The index of the enabled MSI vectors
- *   INTx: Must be 0
- *
- * Return: A cpumask pointer or NULL if @nr is out of range
- */
-const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
-{
-   int idx, irq = pci_irq_vector(dev, nr);
-   struct msi_desc *desc;
-
-   if (WARN_ON_ONCE(irq <= 0))
-   return NULL;
-
-   desc = irq_get_msi_desc(irq);
-   /* Non-MSI does not have the information handy */
-   if (!desc)
-   return cpu_possible_mask;
-
-   /* MSI[X] interrupts can be allocated without affinity descriptor */
-   if (!desc->affinity)
-   return NULL;
-
-   /*
-* MSI has a mask array in the descriptor.
-* MSI-X has a single mask.
-*/
-   idx = dev->msi_enabled ? nr : 0;
-   return >affinity[idx].mask;
-}
-EXPORT_SYMBOL(pci_irq_get_affinity);
-
 struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc)
 {
return to_pci_dev(desc->dev);

[patch 20/39] PCI/MSI: Move pci_enable_msi() API to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c all exported device-driver MSI APIs are now
to be grouped in one file, api.c.

Move pci_enable_msi() and make its kernel-doc comprehensive.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 23 +++
 drivers/pci/msi/msi.c | 14 ++
 drivers/pci/msi/msi.h |  1 +
 3 files changed, 26 insertions(+), 12 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 7485942cbe5d..63d7f8f6a284 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -13,6 +13,29 @@
 #include "msi.h"
 
 /**
+ * pci_enable_msi() - Enable MSI interrupt mode on device
+ * @dev: the PCI device to operate on
+ *
+ * Legacy device driver API to enable MSI interrupts mode on device and
+ * allocate a single interrupt vector. On success, the allocated vector
+ * Linux IRQ will be saved at @dev->irq. The driver must invoke
+ * pci_disable_msi() on cleanup.
+ *
+ * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API
+ * pair should, in general, be used instead.
+ *
+ * Return: 0 on success, errno otherwise
+ */
+int pci_enable_msi(struct pci_dev *dev)
+{
+   int rc = __pci_enable_msi_range(dev, 1, 1, NULL);
+   if (rc < 0)
+   return rc;
+   return 0;
+}
+EXPORT_SYMBOL(pci_enable_msi);
+
+/**
  * pci_disable_msi() - Disable MSI interrupt mode on device
  * @dev: the PCI device to operate on
  *
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 4a1300b74518..98f07ad9af62 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -790,8 +790,8 @@ void pci_disable_msix(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_disable_msix);
 
-static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
- struct irq_affinity *affd)
+int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
+  struct irq_affinity *affd)
 {
int nvec;
int rc;
@@ -844,16 +844,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int 
minvec, int maxvec,
}
 }
 
-/* deprecated, don't use */
-int pci_enable_msi(struct pci_dev *dev)
-{
-   int rc = __pci_enable_msi_range(dev, 1, 1, NULL);
-   if (rc < 0)
-   return rc;
-   return 0;
-}
-EXPORT_SYMBOL(pci_enable_msi);
-
 static int __pci_enable_msix_range(struct pci_dev *dev,
   struct msix_entry *entries, int minvec,
   int maxvec, struct irq_affinity *affd,
diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h
index 634879277349..00bb98d5bb0e 100644
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -87,6 +87,7 @@ static inline __attribute_const__ u32 msi_multi_mask(struct 
msi_desc *desc)
 /* MSI internal functions invoked from the public APIs */
 void pci_msi_shutdown(struct pci_dev *dev);
 void pci_free_msi_irqs(struct pci_dev *dev);
+int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct 
irq_affinity *affd);
 
 /* Legacy (!IRQDOMAIN) fallbacks */
 #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS

[patch 25/39] PCI/MSI: Move pci_free_irq_vectors() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_free_irq_vectors() and make its kernel-doc comprehensive.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 15 +++
 drivers/pci/msi/msi.c | 13 -
 2 files changed, 15 insertions(+), 13 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 0f1ec87e3f9f..2ff2a9ccfc47 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -205,3 +205,18 @@ int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
return irq ? irq : -EINVAL;
 }
 EXPORT_SYMBOL(pci_irq_vector);
+
+/**
+ * pci_free_irq_vectors() - Free previously allocated IRQs for a device
+ * @dev: the PCI device to operate on
+ *
+ * Undo the interrupt vector allocations and possible device MSI/MSI-X
+ * enablement earlier done through pci_alloc_irq_vectors_affinity() or
+ * pci_alloc_irq_vectors().
+ */
+void pci_free_irq_vectors(struct pci_dev *dev)
+{
+   pci_disable_msix(dev);
+   pci_disable_msi(dev);
+}
+EXPORT_SYMBOL(pci_free_irq_vectors);
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 38ad2fe4b85c..ed8caf5ac99f 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -887,19 +887,6 @@ int __pci_enable_msix_range(struct pci_dev *dev,
 }
 
 /**
- * pci_free_irq_vectors - free previously allocated IRQs for a device
- * @dev:   PCI device to operate on
- *
- * Undoes the allocations and enabling in pci_alloc_irq_vectors().
- */
-void pci_free_irq_vectors(struct pci_dev *dev)
-{
-   pci_disable_msix(dev);
-   pci_disable_msi(dev);
-}
-EXPORT_SYMBOL(pci_free_irq_vectors);
-
-/**
  * pci_irq_get_affinity - return the affinity of a particular MSI vector
  * @dev:   PCI device to operate on
  * @nr:device-relative interrupt vector index (0-based).

[patch 24/39] PCI/MSI: Move pci_irq_vector() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_irq_vector() and let its kernel-doc match the rest of the file.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 23 +++
 drivers/pci/msi/msi.c | 24 
 2 files changed, 23 insertions(+), 24 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 8546749afa6e..0f1ec87e3f9f 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -182,3 +182,26 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, 
unsigned int min_vecs,
return nvecs;
 }
 EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
+
+/**
+ * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector
+ * @dev: the PCI device to operate on
+ * @nr:  device-relative interrupt vector index (0-based); has different
+ *   meanings, depending on interrupt mode
+ * MSI-Xthe index in the MSI-X vector table
+ * MSI  the index of the enabled MSI vectors
+ * INTx must be 0
+ *
+ * Return: the Linux IRQ number, or -EINVAL if @nr is out of range
+ */
+int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
+{
+   unsigned int irq;
+
+   if (!dev->msi_enabled && !dev->msix_enabled)
+   return !nr ? dev->irq : -EINVAL;
+
+   irq = msi_get_virq(>dev, nr);
+   return irq ? irq : -EINVAL;
+}
+EXPORT_SYMBOL(pci_irq_vector);
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index a028774f438d..38ad2fe4b85c 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -900,30 +900,6 @@ void pci_free_irq_vectors(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_free_irq_vectors);
 
 /**
- * pci_irq_vector - return Linux IRQ number of a device vector
- * @dev:   PCI device to operate on
- * @nr:Interrupt vector index (0-based)
- *
- * @nr has the following meanings depending on the interrupt mode:
- *   MSI-X:The index in the MSI-X vector table
- *   MSI:  The index of the enabled MSI vectors
- *   INTx: Must be 0
- *
- * Return: The Linux interrupt number or -EINVAl if @nr is out of range.
- */
-int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
-{
-   unsigned int irq;
-
-   if (!dev->msi_enabled && !dev->msix_enabled)
-   return !nr ? dev->irq : -EINVAL;
-
-   irq = msi_get_virq(>dev, nr);
-   return irq ? irq : -EINVAL;
-}
-EXPORT_SYMBOL(pci_irq_vector);
-
-/**
  * pci_irq_get_affinity - return the affinity of a particular MSI vector
  * @dev:   PCI device to operate on
  * @nr:device-relative interrupt vector index (0-based).

[patch 34/39] PCI/MSI: Reject multi-MSI early

2022-11-11 Thread Thomas Gleixner

When hierarchical MSI interrupt domains are enabled then there is no point
to do tons of work and detect the missing support for multi-MSI late in the
allocation path.

Just query the domain feature flags right away. The query function is going
to be used for other purposes later and has a mode argument which influences
the result:

  ALLOW_LEGACY returns true when:
 - there is no irq domain attached (legacy support)
 - there is a irq domain attached which has the feature flag set

  DENY_LEGACY returns only true when:
 - there is a irq domain attached which has the feature flag set

This allows to use the function universally without ifdeffery in the
calling code.

Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/irqdomain.c |   22 ++
 drivers/pci/msi/msi.c   |4 
 drivers/pci/msi/msi.h   |9 +
 3 files changed, 35 insertions(+)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -187,6 +187,28 @@ struct irq_domain *pci_msi_create_irq_do
 }
 EXPORT_SYMBOL_GPL(pci_msi_create_irq_domain);
 
+/**
+ * pci_msi_domain_supports - Check for support of a particular feature flag
+ * @pdev:  The PCI device to operate on
+ * @feature_mask:  The feature mask to check for (full match)
+ * @mode:  If ALLOW_LEGACY this grants the feature when there is 
no irq domain
+ * associated to the device. If DENY_LEGACY the lack of an 
irq domain
+ * makes the feature unsupported
+ */
+bool pci_msi_domain_supports(struct pci_dev *pdev, unsigned int feature_mask,
+enum support_mode mode)
+{
+   struct msi_domain_info *info;
+   struct irq_domain *domain;
+
+   domain = dev_get_msi_domain(>dev);
+
+   if (!domain || !irq_domain_is_hierarchy(domain))
+   return mode == ALLOW_LEGACY;
+   info = domain->host_data;
+   return (info->flags & feature_mask) == feature_mask;
+}
+
 /*
  * Users of the generic MSI infrastructure expect a device to have a single ID,
  * so with DMA aliases we have to pick the least-worst compromise. Devices with
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -347,6 +347,10 @@ static int msi_capability_init(struct pc
struct msi_desc *entry;
int ret;
 
+   /* Reject multi-MSI early on irq domain enabled architectures */
+   if (nvec > 1 && !pci_msi_domain_supports(dev, MSI_FLAG_MULTI_PCI_MSI, 
ALLOW_LEGACY))
+   return 1;
+
/*
 * Disable MSI during setup in the hardware, but mark it enabled
 * so that setup code can evaluate it.
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -97,6 +97,15 @@ int __pci_enable_msix_range(struct pci_d
 void __pci_restore_msi_state(struct pci_dev *dev);
 void __pci_restore_msix_state(struct pci_dev *dev);
 
+/* irq_domain related functionality */
+
+enum support_mode {
+   ALLOW_LEGACY,
+   DENY_LEGACY,
+};
+
+bool pci_msi_domain_supports(struct pci_dev *dev, unsigned int feature_mask, 
enum support_mode mode);
+
 /* Legacy (!IRQDOMAIN) fallbacks */
 
 #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS

[patch 21/39] PCI/MSI: Move pci_enable_msix_range() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_enable_msix_range() and make its kernel-doc comprehensive.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 32 
 drivers/pci/msi/msi.c | 30 --
 drivers/pci/msi/msi.h |  3 +++
 3 files changed, 39 insertions(+), 26 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 63d7f8f6a284..d48050555d55 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -58,3 +58,35 @@ void pci_disable_msi(struct pci_dev *dev)
msi_unlock_descs(>dev);
 }
 EXPORT_SYMBOL(pci_disable_msi);
+
+/**
+ * pci_enable_msix_range() - Enable MSI-X interrupt mode on device
+ * @dev: the PCI device to operate on
+ * @entries: input/output parameter, array of MSI-X configuration entries
+ * @minvec:  minimum required number of MSI-X vectors
+ * @maxvec:  maximum desired number of MSI-X vectors
+ *
+ * Legacy device driver API to enable MSI-X interrupt mode on device and
+ * configure its MSI-X capability structure as appropriate.  The passed
+ * @entries array must have each of its members "entry" field set to a
+ * desired (valid) MSI-X vector number, where the range of valid MSI-X
+ * vector numbers can be queried through pci_msix_vec_count().  If
+ * successful, the driver must invoke pci_disable_msix() on cleanup.
+ *
+ * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API
+ * pair should, in general, be used instead.
+ *
+ * Return: number of MSI-X vectors allocated (which might be smaller
+ * than @maxvecs), where Linux IRQ numbers for such allocated vectors
+ * are saved back in the @entries array elements' "vector" field. Return
+ * -ENOSPC if less than @minvecs interrupt vectors are available.
+ * Return -EINVAL if one of the passed @entries members "entry" field
+ * was invalid or a duplicate, or if plain MSI interrupts mode was
+ * earlier enabled on device. Return other errnos otherwise.
+ */
+int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
+ int minvec, int maxvec)
+{
+   return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0);
+}
+EXPORT_SYMBOL(pci_enable_msix_range);
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 98f07ad9af62..6700ef1c734e 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -844,10 +844,10 @@ int __pci_enable_msi_range(struct pci_dev *dev, int 
minvec, int maxvec,
}
 }
 
-static int __pci_enable_msix_range(struct pci_dev *dev,
-  struct msix_entry *entries, int minvec,
-  int maxvec, struct irq_affinity *affd,
-  int flags)
+int __pci_enable_msix_range(struct pci_dev *dev,
+   struct msix_entry *entries, int minvec,
+   int maxvec, struct irq_affinity *affd,
+   int flags)
 {
int rc, nvec = maxvec;
 
@@ -887,28 +887,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 }
 
 /**
- * pci_enable_msix_range - configure device's MSI-X capability structure
- * @dev: pointer to the pci_dev data structure of MSI-X device function
- * @entries: pointer to an array of MSI-X entries
- * @minvec: minimum number of MSI-X IRQs requested
- * @maxvec: maximum number of MSI-X IRQs requested
- *
- * Setup the MSI-X capability structure of device function with a maximum
- * possible number of interrupts in the range between @minvec and @maxvec
- * upon its software driver call to request for MSI-X mode enabled on its
- * hardware device function. It returns a negative errno if an error occurs.
- * If it succeeds, it returns the actual number of interrupts allocated and
- * indicates the successful configuration of MSI-X capability structure
- * with new allocated MSI-X interrupts.
- **/
-int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
-   int minvec, int maxvec)
-{
-   return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0);
-}
-EXPORT_SYMBOL(pci_enable_msix_range);
-
-/**
  * pci_alloc_irq_vectors_affinity - allocate multiple IRQs for a device
  * @dev:   PCI device to operate on
  * @min_vecs:  minimum number of vectors required (must be >= 1)
diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h
index 00bb98d5bb0e..8c4a5289432d 100644
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -88,8 +88,11 @@ static inline __attribute_const__ u32 msi_multi_mask(struct 
msi_desc *desc)
 void pci_msi_shutdown(struct pci_dev *dev);
 void pci_free_msi_irqs(struct pci_dev *dev);
 int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct 
irq_affinity *affd);
+int __pci_enable_msix_range(st

[patch 30/39] PCI/MSI: Move pci_msi_restore_state() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_msi_enabled() and add kernel-doc for the function.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index ee9ed5ccd94d..8d1cf6db9bd7 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -308,6 +308,21 @@ void pci_free_irq_vectors(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_free_irq_vectors);
 
+/**
+ * pci_restore_msi_state() - Restore cached MSI(-X) state on device
+ * @dev: the PCI device to operate on
+ *
+ * Write the Linux-cached MSI(-X) state back on device. This is
+ * typically useful upon system resume, or after an error-recovery PCI
+ * adapter reset.
+ */
+void pci_restore_msi_state(struct pci_dev *dev)
+{
+   __pci_restore_msi_state(dev);
+   __pci_restore_msix_state(dev);
+}
+EXPORT_SYMBOL_GPL(pci_restore_msi_state);
+
 /**
  * pci_msi_enabled() - Are MSI(-X) interrupts enabled system-wide?
  *
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 59c33bc7fe81..a5d168c823ff 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -199,7 +199,7 @@ bool __weak arch_restore_msi_irqs(struct pci_dev *dev)
return true;
 }
 
-static void __pci_restore_msi_state(struct pci_dev *dev)
+void __pci_restore_msi_state(struct pci_dev *dev)
 {
struct msi_desc *entry;
u16 control;
@@ -231,7 +231,7 @@ static void pci_msix_clear_and_set_ctrl(struct pci_dev 
*dev, u16 clear, u16 set)
pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, ctrl);
 }
 
-static void __pci_restore_msix_state(struct pci_dev *dev)
+void __pci_restore_msix_state(struct pci_dev *dev)
 {
struct msi_desc *entry;
bool write_msg;
@@ -257,13 +257,6 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
 }
 
-void pci_restore_msi_state(struct pci_dev *dev)
-{
-   __pci_restore_msi_state(dev);
-   __pci_restore_msix_state(dev);
-}
-EXPORT_SYMBOL_GPL(pci_restore_msi_state);
-
 static void pcim_msi_release(void *pcidev)
 {
struct pci_dev *dev = pcidev;
diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h
index f3f4ede53171..8170ef2c5ad0 100644
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -94,6 +94,8 @@ void pci_free_msi_irqs(struct pci_dev *dev);
 int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct 
irq_affinity *affd);
 int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, 
int minvec,
int maxvec,  struct irq_affinity *affd, int flags);
+void __pci_restore_msi_state(struct pci_dev *dev);
+void __pci_restore_msix_state(struct pci_dev *dev);
 
 /* Legacy (!IRQDOMAIN) fallbacks */

[patch 29/39] PCI/MSI: Move pci_msi_enabled() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_msi_enabled() and make its kernel-doc comprehensive.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 12 
 drivers/pci/msi/msi.c | 14 +-
 drivers/pci/msi/msi.h |  3 +++
 3 files changed, 16 insertions(+), 13 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 473df7ba0584..ee9ed5ccd94d 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -307,3 +307,15 @@ void pci_free_irq_vectors(struct pci_dev *dev)
pci_disable_msi(dev);
 }
 EXPORT_SYMBOL(pci_free_irq_vectors);
+
+/**
+ * pci_msi_enabled() - Are MSI(-X) interrupts enabled system-wide?
+ *
+ * Return: true if MSI has not been globally disabled through ACPI FADT,
+ * PCI bridge quirks, or the "pci=nomsi" kernel command-line option.
+ */
+int pci_msi_enabled(void)
+{
+   return pci_msi_enable;
+}
+EXPORT_SYMBOL(pci_msi_enabled);
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index d78646d1c116..59c33bc7fe81 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -13,7 +13,7 @@
 #include "../pci.h"
 #include "msi.h"
 
-static int pci_msi_enable = 1;
+int pci_msi_enable = 1;
 int pci_msi_ignore_mask;
 
 void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
@@ -864,15 +864,3 @@ void pci_no_msi(void)
 {
pci_msi_enable = 0;
 }
-
-/**
- * pci_msi_enabled - is MSI enabled?
- *
- * Returns true if MSI has not been disabled by the command-line option
- * pci=nomsi.
- **/
-int pci_msi_enabled(void)
-{
-   return pci_msi_enable;
-}
-EXPORT_SYMBOL(pci_msi_enabled);
diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h
index 77e2587f7e4f..f3f4ede53171 100644
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -84,6 +84,9 @@ static inline __attribute_const__ u32 msi_multi_mask(struct 
msi_desc *desc)
return (1 << (1 << desc->pci.msi_attrib.multi_cap)) - 1;
 }
 
+/* Subsystem variables */
+extern int pci_msi_enable;
+
 /* MSI internal functions invoked from the public APIs */
 void pci_msi_shutdown(struct pci_dev *dev);
 void pci_msix_shutdown(struct pci_dev *dev);

[patch 38/39] genirq/msi: Remove msi_domain_ops::msi_check()

2022-11-11 Thread Thomas Gleixner

No more users.

Signed-off-by: Thomas Gleixner 
---
 include/linux/msi.h |4 
 kernel/irq/msi.c|   17 +
 2 files changed, 1 insertion(+), 20 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -288,7 +288,6 @@ struct msi_domain_info;
  * @get_hwirq: Retrieve the resulting hw irq number
  * @msi_init:  Domain specific init function for MSI interrupts
  * @msi_free:  Domain specific function to free a MSI interrupts
- * @msi_check: Callback for verification of the domain/info/dev data
  * @msi_prepare:   Prepare the allocation of the interrupts in the domain
  * @set_desc:  Set the msi descriptor for an interrupt
  * @domain_alloc_irqs: Optional function to override the default allocation
@@ -326,9 +325,6 @@ struct msi_domain_ops {
void(*msi_free)(struct irq_domain *domain,
struct msi_domain_info *info,
unsigned int virq);
-   int (*msi_check)(struct irq_domain *domain,
-struct msi_domain_info *info,
-struct device *dev);
int (*msi_prepare)(struct irq_domain *domain,
   struct device *dev, int nvec,
   msi_alloc_info_t *arg);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -617,17 +617,9 @@ static int msi_domain_ops_init(struct ir
return 0;
 }
 
-static int msi_domain_ops_check(struct irq_domain *domain,
-   struct msi_domain_info *info,
-   struct device *dev)
-{
-   return 0;
-}
-
 static struct msi_domain_ops msi_domain_ops_default = {
.get_hwirq  = msi_domain_ops_get_hwirq,
.msi_init   = msi_domain_ops_init,
-   .msi_check  = msi_domain_ops_check,
.msi_prepare= msi_domain_ops_prepare,
.set_desc   = msi_domain_ops_set_desc,
.domain_alloc_irqs  = __msi_domain_alloc_irqs,
@@ -655,8 +647,6 @@ static void msi_domain_update_dom_ops(st
ops->get_hwirq = msi_domain_ops_default.get_hwirq;
if (ops->msi_init == NULL)
ops->msi_init = msi_domain_ops_default.msi_init;
-   if (ops->msi_check == NULL)
-   ops->msi_check = msi_domain_ops_default.msi_check;
if (ops->msi_prepare == NULL)
ops->msi_prepare = msi_domain_ops_default.msi_prepare;
if (ops->set_desc == NULL)
@@ -707,13 +697,8 @@ int msi_domain_prepare_irqs(struct irq_d
 {
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
-   int ret;
-
-   ret = ops->msi_check(domain, info, dev);
-   if (ret == 0)
-   ret = ops->msi_prepare(domain, dev, nvec, arg);
 
-   return ret;
+   return ops->msi_prepare(domain, dev, nvec, arg);
 }
 
 int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,

[patch 26/39] PCI/MSI: Move pci_msix_vec_count() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Move pci_msix_vec_count() and make its kernel-doc comprehensive.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 20 
 drivers/pci/msi/msi.c | 20 
 2 files changed, 20 insertions(+), 20 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 2ff2a9ccfc47..83ea38ffa116 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -60,6 +60,26 @@ void pci_disable_msi(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_disable_msi);
 
 /**
+ * pci_msix_vec_count() - Get number of MSI-X interrupt vectors on device
+ * @dev: the PCI device to operate on
+ *
+ * Return: number of MSI-X interrupt vectors available on this device
+ * (i.e., the device's MSI-X capability structure "table size"), -EINVAL
+ * if the device is not MSI-X capable, other errnos otherwise.
+ */
+int pci_msix_vec_count(struct pci_dev *dev)
+{
+   u16 control;
+
+   if (!dev->msix_cap)
+   return -EINVAL;
+
+   pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, );
+   return msix_table_size(control);
+}
+EXPORT_SYMBOL(pci_msix_vec_count);
+
+/**
  * pci_enable_msix_range() - Enable MSI-X interrupt mode on device
  * @dev: the PCI device to operate on
  * @entries: input/output parameter, array of MSI-X configuration entries
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index ed8caf5ac99f..1226d66da992 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -701,26 +701,6 @@ void pci_msi_shutdown(struct pci_dev *dev)
pcibios_alloc_irq(dev);
 }
 
-/**
- * pci_msix_vec_count - return the number of device's MSI-X table entries
- * @dev: pointer to the pci_dev data structure of MSI-X device function
- * This function returns the number of device's MSI-X table entries and
- * therefore the number of MSI-X vectors device is capable of sending.
- * It returns a negative errno if the device is not capable of sending MSI-X
- * interrupts.
- **/
-int pci_msix_vec_count(struct pci_dev *dev)
-{
-   u16 control;
-
-   if (!dev->msix_cap)
-   return -EINVAL;
-
-   pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, );
-   return msix_table_size(control);
-}
-EXPORT_SYMBOL(pci_msix_vec_count);
-
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 int nvec, struct irq_affinity *affd, int flags)
 {

[patch 35/39] PCI/MSI: Reject MSI-X early

2022-11-11 Thread Thomas Gleixner

Similar to PCI multi-MSI reject MSI-X enablement when a irq domain is
attached to the device which does not support MSI-X.

Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/msi.c |4 
 1 file changed, 4 insertions(+)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -760,6 +760,10 @@ int __pci_enable_msix_range(struct pci_d
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;
 
+   /* Check MSI-X early on irq domain enabled architectures */
+   if (!pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX, ALLOW_LEGACY))
+   return -ENOTSUPP;
+
if (!pci_msi_supported(dev, nvec) || dev->current_state != PCI_D0)
return -EINVAL;

[patch 22/39] PCI/MSI: Move pci_alloc_irq_vectors() to api.c

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

To distangle the maze in msi.c, all exported device-driver MSI APIs are
now to be grouped in one file, api.c.

Make pci_alloc_irq_vectors() a real function instead of wrapper and add
proper kernel doc to it.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi/api.c | 33 +
 include/linux/pci.h   | 17 +
 2 files changed, 42 insertions(+), 8 deletions(-)
---
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index d48050555d55..1714905943fb 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -90,3 +90,36 @@ int pci_enable_msix_range(struct pci_dev *dev, struct 
msix_entry *entries,
return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0);
 }
 EXPORT_SYMBOL(pci_enable_msix_range);
+
+/**
+ * pci_alloc_irq_vectors() - Allocate multiple device interrupt vectors
+ * @dev:  the PCI device to operate on
+ * @min_vecs: minimum required number of vectors (must be >= 1)
+ * @max_vecs: maximum desired number of vectors
+ * @flags:One or more of:
+ *%PCI_IRQ_MSIX  Allow trying MSI-X vector allocations
+ *%PCI_IRQ_MSI   Allow trying MSI vector allocations
+ *%PCI_IRQ_LEGACYAllow trying legacy INTx interrupts, if
+ *   and only if @min_vecs == 1
+ *%PCI_IRQ_AFFINITY  Auto-manage IRQs affinity by spreading
+ *   the vectors around available CPUs
+ *
+ * Allocate up to @max_vecs interrupt vectors on device. MSI-X irq
+ * vector allocation has a higher precedence over plain MSI, which has a
+ * higher precedence over legacy INTx emulation.
+ *
+ * Upon a successful allocation, the caller should use pci_irq_vector()
+ * to get the Linux IRQ number to be passed to request_threaded_irq().
+ * The driver must call pci_free_irq_vectors() on cleanup.
+ *
+ * Return: number of allocated vectors (which might be smaller than
+ * @max_vecs), -ENOSPC if less than @min_vecs interrupt vectors are
+ * available, other errnos otherwise.
+ */
+int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+ unsigned int max_vecs, unsigned int flags)
+{
+   return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs,
+ flags, NULL);
+}
+EXPORT_SYMBOL(pci_alloc_irq_vectors);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2bda4a4e47e8..6a356a39ba94 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1553,6 +1553,8 @@ static inline int pci_enable_msix_exact(struct pci_dev 
*dev,
return rc;
return 0;
 }
+int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+ unsigned int max_vecs, unsigned int flags);
 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
   unsigned int max_vecs, unsigned int flags,
   struct irq_affinity *affd);
@@ -1586,6 +1588,13 @@ pci_alloc_irq_vectors_affinity(struct pci_dev *dev, 
unsigned int min_vecs,
return 1;
return -ENOSPC;
 }
+static inline int
+pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+ unsigned int max_vecs, unsigned int flags)
+{
+   return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs,
+ flags, NULL);
+}
 
 static inline void pci_free_irq_vectors(struct pci_dev *dev)
 {
@@ -1900,14 +1909,6 @@ pci_alloc_irq_vectors_affinity(struct pci_dev *dev, 
unsigned int min_vecs,
 }
 #endif /* CONFIG_PCI */
 
-static inline int
-pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
- unsigned int max_vecs, unsigned int flags)
-{
-   return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
- NULL);
-}
-
 /* Include architecture-dependent settings and functions */
 
 #include

[patch 17/39] PCI/MSI: Get rid of externs in msi.h

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

Follow the style of 

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/msi.h |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -5,12 +5,12 @@
 
 #define msix_table_size(flags) ((flags & PCI_MSIX_FLAGS_QSIZE) + 1)
 
-extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
-extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev);
+int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+void pci_msi_teardown_msi_irqs(struct pci_dev *dev);
 
 #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS
-extern int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int 
type);
-extern void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev);
+int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev);
 #else
 static inline int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, 
int type)
 {

[patch 37/39] PCI/MSI: Remove redundant msi_check() callback

2022-11-11 Thread Thomas Gleixner

All these sanity checks are now done _before_ any allocation work
happens. No point in doing it twice.

Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/irqdomain.c |   48 
 1 file changed, 48 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -64,51 +64,6 @@ static irq_hw_number_t pci_msi_domain_ca
(pci_domain_nr(dev->bus) & 0x) << 27;
 }
 
-static inline bool pci_msi_desc_is_multi_msi(struct msi_desc *desc)
-{
-   return !desc->pci.msi_attrib.is_msix && desc->nvec_used > 1;
-}
-
-/**
- * pci_msi_domain_check_cap - Verify that @domain supports the capabilities
- *   for @dev
- * @domain:The interrupt domain to check
- * @info:  The domain info for verification
- * @dev:   The device to check
- *
- * Returns:
- *  0 if the functionality is supported
- *  1 if Multi MSI is requested, but the domain does not support it
- *  -ENOTSUPP otherwise
- */
-static int pci_msi_domain_check_cap(struct irq_domain *domain,
-   struct msi_domain_info *info,
-   struct device *dev)
-{
-   struct msi_desc *desc = msi_first_desc(dev, MSI_DESC_ALL);
-
-   /* Special handling to support __pci_enable_msi_range() */
-   if (pci_msi_desc_is_multi_msi(desc) &&
-   !(info->flags & MSI_FLAG_MULTI_PCI_MSI))
-   return 1;
-
-   if (desc->pci.msi_attrib.is_msix) {
-   if (!(info->flags & MSI_FLAG_PCI_MSIX))
-   return -ENOTSUPP;
-
-   if (info->flags & MSI_FLAG_MSIX_CONTIGUOUS) {
-   unsigned int idx = 0;
-
-   /* Check for gaps in the entry indices */
-   msi_for_each_desc(desc, dev, MSI_DESC_ALL) {
-   if (desc->msi_index != idx++)
-   return -ENOTSUPP;
-   }
-   }
-   }
-   return 0;
-}
-
 static void pci_msi_domain_set_desc(msi_alloc_info_t *arg,
struct msi_desc *desc)
 {
@@ -118,7 +73,6 @@ static void pci_msi_domain_set_desc(msi_
 
 static struct msi_domain_ops pci_msi_domain_ops_default = {
.set_desc   = pci_msi_domain_set_desc,
-   .msi_check  = pci_msi_domain_check_cap,
 };
 
 static void pci_msi_domain_update_dom_ops(struct msi_domain_info *info)
@@ -130,8 +84,6 @@ static void pci_msi_domain_update_dom_op
} else {
if (ops->set_desc == NULL)
ops->set_desc = pci_msi_domain_set_desc;
-   if (ops->msi_check == NULL)
-   ops->msi_check = pci_msi_domain_check_cap;
}
 }

[patch 02/39] iommu/vt-d: Remove bogus check for multi MSI-X

2022-11-11 Thread Thomas Gleixner

PCI/Multi-MSI is MSI specific and not supported for MSI-X.

Signed-off-by: Thomas Gleixner 
---
 drivers/iommu/intel/irq_remapping.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1334,8 +1334,7 @@ static int intel_irq_remapping_alloc(str
 
if (!info || !iommu)
return -EINVAL;
-   if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI &&
-   info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
+   if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI)
return -EINVAL;
 
/*

[patch 16/39] genirq: Get rid of GENERIC_MSI_IRQ_DOMAIN

2022-11-11 Thread Thomas Gleixner

Adjust to reality and remove another layer of pointless Kconfig
indirection. CONFIG_GENERIC_MSI_IRQ is good enough to serve
all purposes.

Signed-off-by: Thomas Gleixner 
---
 drivers/base/Makefile   |2 +-
 drivers/bus/fsl-mc/Kconfig  |2 +-
 drivers/dma/Kconfig |2 +-
 drivers/dma/qcom/hidma.c|8 
 drivers/iommu/Kconfig   |2 +-
 drivers/irqchip/Kconfig |6 +++---
 drivers/mailbox/Kconfig |2 +-
 drivers/pci/Kconfig |1 -
 drivers/perf/Kconfig|2 +-
 drivers/soc/ti/Kconfig  |2 +-
 include/asm-generic/msi.h   |4 ++--
 include/linux/device.h  |8 +++-
 include/linux/gpio/driver.h |2 +-
 include/linux/msi.h |   10 ++
 kernel/irq/Kconfig  |7 +--
 kernel/irq/msi.c|3 ---
 16 files changed, 23 insertions(+), 40 deletions(-)

--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -22,7 +22,7 @@ obj-$(CONFIG_REGMAP)  += regmap/
 obj-$(CONFIG_SOC_BUS) += soc.o
 obj-$(CONFIG_PINCTRL) += pinctrl.o
 obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
-obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
+obj-$(CONFIG_GENERIC_MSI_IRQ) += platform-msi.o
 obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
 obj-$(CONFIG_GENERIC_ARCH_NUMA) += arch_numa.o
 obj-$(CONFIG_ACPI) += physical_location.o
--- a/drivers/bus/fsl-mc/Kconfig
+++ b/drivers/bus/fsl-mc/Kconfig
@@ -8,7 +8,7 @@
 config FSL_MC_BUS
bool "QorIQ DPAA2 fsl-mc bus driver"
depends on OF && (ARCH_LAYERSCAPE || (COMPILE_TEST && (ARM || ARM64 || 
X86_LOCAL_APIC || PPC)))
-   select GENERIC_MSI_IRQ_DOMAIN
+   select GENERIC_MSI_IRQ
help
  Driver to enable the bus infrastructure for the QorIQ DPAA2
  architecture.  The fsl-mc bus driver handles discovery of
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -462,7 +462,7 @@ config MV_XOR_V2
select DMA_ENGINE
select DMA_ENGINE_RAID
select ASYNC_TX_ENABLE_CHANNEL_SWITCH
-   select GENERIC_MSI_IRQ_DOMAIN
+   select GENERIC_MSI_IRQ
help
  Enable support for the Marvell version 2 XOR engine.
 
--- a/drivers/dma/qcom/hidma.c
+++ b/drivers/dma/qcom/hidma.c
@@ -610,7 +610,7 @@ static irqreturn_t hidma_chirq_handler(i
return hidma_ll_inthandler(chirq, lldev);
 }
 
-#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+#ifdef CONFIG_GENERIC_MSI_IRQ
 static irqreturn_t hidma_chirq_handler_msi(int chirq, void *arg)
 {
struct hidma_lldev **lldevp = arg;
@@ -671,7 +671,7 @@ static int hidma_sysfs_init(struct hidma
return device_create_file(dev->ddev.dev, dev->chid_attrs);
 }
 
-#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+#ifdef CONFIG_GENERIC_MSI_IRQ
 static void hidma_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 {
struct device *dev = msi_desc_to_dev(desc);
@@ -687,7 +687,7 @@ static void hidma_write_msi_msg(struct m
 
 static void hidma_free_msis(struct hidma_dev *dmadev)
 {
-#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+#ifdef CONFIG_GENERIC_MSI_IRQ
struct device *dev = dmadev->ddev.dev;
int i, virq;
 
@@ -704,7 +704,7 @@ static void hidma_free_msis(struct hidma
 static int hidma_request_msi(struct hidma_dev *dmadev,
 struct platform_device *pdev)
 {
-#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+#ifdef CONFIG_GENERIC_MSI_IRQ
int rc, i, virq;
 
rc = platform_msi_domain_alloc_irqs(>dev, HIDMA_MSI_INTS,
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -389,7 +389,7 @@ config ARM_SMMU_V3
depends on ARM64
select IOMMU_API
select IOMMU_IO_PGTABLE_LPAE
-   select GENERIC_MSI_IRQ_DOMAIN
+   select GENERIC_MSI_IRQ
help
  Support for implementations of the ARM System MMU architecture
  version 3 providing translation support to a PCIe root complex.
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -38,7 +38,7 @@ config ARM_GIC_V3
 
 config ARM_GIC_V3_ITS
bool
-   select GENERIC_MSI_IRQ_DOMAIN
+   select GENERIC_MSI_IRQ
default ARM_GIC_V3
 
 config ARM_GIC_V3_ITS_PCI
@@ -375,7 +375,7 @@ config MVEBU_ICU
 
 config MVEBU_ODMI
bool
-   select GENERIC_MSI_IRQ_DOMAIN
+   select GENERIC_MSI_IRQ
 
 config MVEBU_PIC
bool
@@ -488,7 +488,7 @@ config IMX_MU_MSI
default m if ARCH_MXC
select IRQ_DOMAIN
select IRQ_DOMAIN_HIERARCHY
-   select GENERIC_MSI_IRQ_DOMAIN
+   select GENERIC_MSI_IRQ
help
  Provide a driver for the i.MX Messaging Unit block used as a
  CPU-to-CPU MSI controller. This requires a specially crafted DT
--- a/drivers/mailbox/Kconfig
+++ b/drivers/mailbox/Kconfig
@@ -223,7 +223,7 @@ config BCM_FLEXRM_MBOX
tristate "Broadcom FlexRM Mailbox"
depends on ARM64
depends on ARCH_BCM_IPROC || COMPILE_TEST
-   select GENERIC_MSI_IRQ_DOMAIN
+   select GENERIC_MSI_IRQ

[patch 10/39] genirq/msi: Make __msi_domain_free_irqs() static

2022-11-11 Thread Thomas Gleixner

Now that the last user is gone, confine it to the core code.

Signed-off-by: Thomas Gleixner 
---
 include/linux/msi.h |4 
 kernel/irq/msi.c|3 ++-
 2 files changed, 2 insertions(+), 5 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -335,9 +335,6 @@ struct msi_domain_info;
  * are set to the default implementation if NULL and even when
  * MSI_FLAG_USE_DEF_DOM_OPS is not set to avoid breaking existing users and
  * because these callbacks are obviously mandatory.
- *
- * __msi_domain_free_irqs() is exposed for PPC pseries to handle extra
- * work after all interrupts and descriptors have been freed.
  */
 struct msi_domain_ops {
irq_hw_number_t (*get_hwirq)(struct msi_domain_info *info,
@@ -432,7 +429,6 @@ int msi_domain_alloc_irqs_descs_locked(s
   int nvec);
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
  int nvec);
-void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
 void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct 
device *dev);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -463,6 +463,7 @@ static inline void msi_sysfs_remove_desc
 
 #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
 static int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device 
*dev, int nvec);
+static void __msi_domain_free_irqs(struct irq_domain *domain, struct device 
*dev);
 
 static inline void irq_chip_write_msi_msg(struct irq_data *data,
  struct msi_msg *msg)
@@ -978,7 +979,7 @@ int msi_domain_alloc_irqs(struct irq_dom
return ret;
 }
 
-void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+static void __msi_domain_free_irqs(struct irq_domain *domain, struct device 
*dev)
 {
struct msi_domain_info *info = domain->host_data;
struct irq_data *irqd;

[patch 13/39] PCI/MSI: Use msi_domain_info::bus_token

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

Set the bus token in the msi_domain_info structure and let the core code
handle the update.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/irqdomain.c |   11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -162,8 +162,6 @@ struct irq_domain *pci_msi_create_irq_do
 struct msi_domain_info *info,
 struct irq_domain *parent)
 {
-   struct irq_domain *domain;
-
if (WARN_ON(info->flags & MSI_FLAG_LEVEL_CAPABLE))
info->flags &= ~MSI_FLAG_LEVEL_CAPABLE;
 
@@ -178,13 +176,10 @@ struct irq_domain *pci_msi_create_irq_do
 
/* PCI-MSI is oneshot-safe */
info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
+   /* Let the core update the bus token */
+   info->bus_token = DOMAIN_BUS_PCI_MSI;
 
-   domain = msi_create_irq_domain(fwnode, info, parent);
-   if (!domain)
-   return NULL;
-
-   irq_domain_update_bus_token(domain, DOMAIN_BUS_PCI_MSI);
-   return domain;
+   return msi_create_irq_domain(fwnode, info, parent);
 }
 EXPORT_SYMBOL_GPL(pci_msi_create_irq_domain);

[patch 00/39] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 1 cleanups

2022-11-11 Thread Thomas Gleixner

Hi!

This is a three part series which provides support for per device MSI
interrupt domains. This solves conceptual problems of the current PCI/MSI
design which are in the way of providing support for PCI/MSI[-X] and
the upcoming PCI/IMS mechanism on the same device.

IMS (Interrupt Message Store] is a new specification which allows device
manufactures to provide implementaton defined storage for MSI messages
contrary to the uniform and specification defined storage mechanisms for
PCI/MSI and PCI/MSI-X. IMS not only allows to overcome the size limitions
of the MSI-X table, but also gives the device manufacturer the freedom to
store the message in arbitrary places, even in host memory which is shared
with the device.

There have been several attempts to glue this into the current MSI code,
but after lengthy discussions in various threads:

https://lore.kernel.org/all/20211126230957.239391...@linutronix.de

https://lore.kernel.org/all/mwhpr11mb188603d0d809c1079f5817dc8c...@mwhpr11mb1886.namprd11.prod.outlook.com

https://lore.kernel.org/all/160408357912.912050.17005584526266191420.st...@djiang5-desk3.ch.intel.com

it turned out that there is a fundamental design problem in the current
PCI/MSI-X implementation. This needs some historical background.

When PCI/MSI[-X] support was added around 2003 interrupt management was
completely different from what we have today in the actively developed
architectures. Interrupt management was completely architecture specific
and while there were attempts to create common infrastructure the
commonalities were rudimental and just providing shared data structures and
interfaces so that drivers could be written in an architecture agnostic
way.

The initial PCI/MSI[-X] support obviously plugged into this model which
resulted in some basic shared infrastructure in the PCI core code for
setting up MSI descriptors, which are a pure software construct for holding
data relevant for a particular MSI interrupt, but the actual association to
Linux interrupts was completely architecture specific. This model is still
supported today to keep museum architectures and notorious stranglers
alive.

In 2013 Intel tried to add support for hotplugable IO/APICs to the kernel
which was creating yet another architecture specific mechanism and resulted
in an unholy mess on top of the existing horrors of x86 interrupt handling.
The x86 interrupt management code was already an uncomprehensible maze of
indirections between the CPU vector management, interrupt remapping and the
actual IO/APIC and PCI/MSI[-X] implementation.

At roughly the same time ARM struggled with the ever growing SoC specific
extensions which were glued on top of the architected GIC interrupt
controller.

This resulted in a fundamental redesign of interrupt management and
provided the today prevailing concept of hierarchical interrupt
domains. This allows to disentangle the interactions between x86 vector
domain and interrupt remapping and also allowed ARM to handle the zoo of
SoC specific interrupt components in a sane way.

The concept of hierarchical interrupt domains aims to encapsulate the
functionality of particual IP blocks which are involved in interrupt
delivery so that they become extensible and pluggable. The X86
encapsulation looks like this:

|--- device 1
[Vector]---[Remapping]---[PCI/MSI]--|...
|--- device N

where the remapping domain is an optional component and in case that it is
not available the PCI/MSI[-X] domains have the vector domain as their
parent. This reduced the required interaction between the domains pretty
much to the initialization phase where it is obviously required to
establish the proper parent relation ship in the components of the
hierarchy.

While in most cases the model is strictly representing the chain of IP
blocks and abstracting them so they can be plugged together to form a
hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware
it's clear that the actual PCI/MSI[-X] interrupt controller is not a global
entity, but strict a per PCI device entity.

Here we took a short cut on the hierarchical model and went for the easy
solution of providing "global" PCI/MSI domains which was possible because
the PCI/MSI[-X] handling is uniform accross the devices. This also allowed
to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in
turn made it simple to keep the existing architecture specific management
alive.

A similar problem was created in the ARM world with support for IP block
specific message storage. Instead of going all the way to stack a IP block
specific domain on top of the generic MSI domain this ended in a construct
which provides a "global" platform MSI domain which allows overriding the
irq_write_msi_msg() callback per allocation.

In course of the lengthy discussions we identified other abuse of the MSI
infrastructure in wireless drivers, NTB etc.

[patch 05/39] genirq/msi: Remove filter from msi_free_descs_free_range()

2022-11-11 Thread Thomas Gleixner

When a range of descriptors is freed then all of them are not associated to
a linux interrupt. Remove the filter and add a warning to the free function.

Signed-off-by: Thomas Gleixner 
---
 drivers/base/platform-msi.c |2 +-
 include/linux/msi.h |5 ++---
 kernel/irq/msi.c|   19 ++-
 3 files changed, 13 insertions(+), 13 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -325,7 +325,7 @@ void platform_msi_device_domain_free(str
 
msi_lock_descs(data->dev);
irq_domain_free_irqs_common(domain, virq, nr_irqs);
-   msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, virq, virq + nr_irqs 
- 1);
+   msi_free_msi_descs_range(data->dev, virq, virq + nr_irqs - 1);
msi_unlock_descs(data->dev);
 }
 
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -247,8 +247,7 @@ static inline void pci_write_msi_msg(uns
 #endif /* CONFIG_PCI_MSI */
 
 int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
-void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
- unsigned int first_index, unsigned int 
last_index);
+void msi_free_msi_descs_range(struct device *dev, unsigned int first_index, 
unsigned int last_index);
 
 /**
  * msi_free_msi_descs - Free MSI descriptors of a device
@@ -256,7 +255,7 @@ void msi_free_msi_descs_range(struct dev
  */
 static inline void msi_free_msi_descs(struct device *dev)
 {
-   msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, MSI_MAX_INDEX);
+   msi_free_msi_descs_range(dev, 0, MSI_MAX_INDEX);
 }
 
 void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -120,7 +120,7 @@ static int msi_add_simple_msi_descs(stru
 fail_mem:
ret = -ENOMEM;
 fail:
-   msi_free_msi_descs_range(dev, MSI_DESC_ALL, index, last);
+   msi_free_msi_descs_range(dev, index, last);
return ret;
 }
 
@@ -141,12 +141,11 @@ static bool msi_desc_match(struct msi_de
 /**
  * msi_free_msi_descs_range - Free MSI descriptors of a device
  * @dev:   Device to free the descriptors
- * @filter:Descriptor state filter
  * @first_index:   Index to start freeing from
  * @last_index:Last index to be freed
  */
-void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
- unsigned int first_index, unsigned int last_index)
+void msi_free_msi_descs_range(struct device *dev, unsigned int first_index,
+ unsigned int last_index)
 {
struct xarray *xa = >msi.data->__store;
struct msi_desc *desc;
@@ -155,10 +154,12 @@ void msi_free_msi_descs_range(struct dev
lockdep_assert_held(>msi.data->mutex);
 
xa_for_each_range(xa, idx, desc, first_index, last_index) {
-   if (msi_desc_match(desc, filter)) {
-   xa_erase(xa, idx);
-   msi_free_desc(desc);
-   }
+   xa_erase(xa, idx);
+
+   /* Leak the descriptor when it is still referenced */
+   if (WARN_ON_ONCE(msi_desc_match(desc, MSI_DESC_ASSOCIATED)))
+   continue;
+   msi_free_desc(desc);
}
 }
 
@@ -739,7 +740,7 @@ int msi_domain_populate_irqs(struct irq_
 fail:
for (--virq; virq >= virq_base; virq--)
irq_domain_free_irqs_common(domain, virq, 1);
-   msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, virq_base + nvec 
- 1);
+   msi_free_msi_descs_range(dev, virq_base, virq_base + nvec - 1);
 unlock:
msi_unlock_descs(dev);
return ret;

[patch 06/39] genirq/msi: Add missing kernel doc to msi_next_desc()

2022-11-11 Thread Thomas Gleixner

W=1 complains about this.

Signed-off-by: Thomas Gleixner 
---
 kernel/irq/msi.c |1 +
 1 file changed, 1 insertion(+)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -282,6 +282,7 @@ EXPORT_SYMBOL_GPL(msi_first_desc);
 /**
  * msi_next_desc - Get the next MSI descriptor of a device
  * @dev:   Device to operate on
+ * @filter:Descriptor state filter
  *
  * The first invocation of msi_next_desc() has to be preceeded by a
  * successful invocation of __msi_first_desc(). Consecutive invocations are

[patch 11/39] genirq/irqdomain: Move bus token enum into a seperate header

2022-11-11 Thread Thomas Gleixner

Split the bus token defines out into a seperate header file to avoid
inclusion of irqdomain.h in msi.h.

Signed-off-by: Thomas Gleixner 
---
 include/linux/irqdomain.h  |   22 +-
 include/linux/irqdomain_defs.h |   26 ++
 2 files changed, 27 insertions(+), 21 deletions(-)

--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -31,6 +31,7 @@
 #define _LINUX_IRQDOMAIN_H
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -68,27 +69,6 @@ struct irq_fwspec {
 void of_phandle_args_to_fwspec(struct device_node *np, const u32 *args,
   unsigned int count, struct irq_fwspec *fwspec);
 
-/*
- * Should several domains have the same device node, but serve
- * different purposes (for example one domain is for PCI/MSI, and the
- * other for wired IRQs), they can be distinguished using a
- * bus-specific token. Most domains are expected to only carry
- * DOMAIN_BUS_ANY.
- */
-enum irq_domain_bus_token {
-   DOMAIN_BUS_ANY  = 0,
-   DOMAIN_BUS_WIRED,
-   DOMAIN_BUS_GENERIC_MSI,
-   DOMAIN_BUS_PCI_MSI,
-   DOMAIN_BUS_PLATFORM_MSI,
-   DOMAIN_BUS_NEXUS,
-   DOMAIN_BUS_IPI,
-   DOMAIN_BUS_FSL_MC_MSI,
-   DOMAIN_BUS_TI_SCI_INTA_MSI,
-   DOMAIN_BUS_WAKEUP,
-   DOMAIN_BUS_VMD_MSI,
-};
-
 /**
  * struct irq_domain_ops - Methods for irq_domain objects
  * @match: Match an interrupt controller device node to a host, returns
--- /dev/null
+++ b/include/linux/irqdomain_defs.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IRQDOMAIN_DEFS_H
+#define _LINUX_IRQDOMAIN_DEFS_H
+
+/*
+ * Should several domains have the same device node, but serve
+ * different purposes (for example one domain is for PCI/MSI, and the
+ * other for wired IRQs), they can be distinguished using a
+ * bus-specific token. Most domains are expected to only carry
+ * DOMAIN_BUS_ANY.
+ */
+enum irq_domain_bus_token {
+   DOMAIN_BUS_ANY  = 0,
+   DOMAIN_BUS_WIRED,
+   DOMAIN_BUS_GENERIC_MSI,
+   DOMAIN_BUS_PCI_MSI,
+   DOMAIN_BUS_PLATFORM_MSI,
+   DOMAIN_BUS_NEXUS,
+   DOMAIN_BUS_IPI,
+   DOMAIN_BUS_FSL_MC_MSI,
+   DOMAIN_BUS_TI_SCI_INTA_MSI,
+   DOMAIN_BUS_WAKEUP,
+   DOMAIN_BUS_VMD_MSI,
+};
+
+#endif /* _LINUX_IRQDOMAIN_DEFS_H */

[patch 01/39] PCI/MSI: Check for MSI enabled in __pci_msix_enable()

2022-11-11 Thread Thomas Gleixner

PCI/MSI and PCI/MSI-X are mutually exclusive, but the MSI-X enable code
lacks a check for already enabled MSI.

Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/msi.c |5 +
 1 file changed, 5 insertions(+)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -935,6 +935,11 @@ static int __pci_enable_msix_range(struc
if (maxvec < minvec)
return -ERANGE;
 
+   if (dev->msi_enabled) {
+   pci_info(dev, "can't enable MSI-X (MSI already enabled)\n");
+   return -EINVAL;
+   }
+
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;

[patch 07/39] genirq/msi: Make __msi_domain_alloc_irqs() static

2022-11-11 Thread Thomas Gleixner

Nothing outside of the core code requires this.

Signed-off-by: Thomas Gleixner 
---
 include/linux/msi.h |7 ++-
 kernel/irq/msi.c|6 --
 2 files changed, 6 insertions(+), 7 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -334,9 +334,8 @@ struct msi_domain_info;
  * MSI_FLAG_USE_DEF_DOM_OPS is not set to avoid breaking existing users and
  * because these callbacks are obviously mandatory.
  *
- * This is NOT meant to be abused, but it can be useful to build wrappers
- * for specialized MSI irq domains which need extra work before and after
- * calling __msi_domain_alloc_irqs()/__msi_domain_free_irqs().
+ * __msi_domain_free_irqs() is exposed for PPC pseries to handle extra
+ * work after all interrupts and descriptors have been freed.
  */
 struct msi_domain_ops {
irq_hw_number_t (*get_hwirq)(struct msi_domain_info *info,
@@ -425,8 +424,6 @@ int msi_domain_set_affinity(struct irq_d
 struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
 struct msi_domain_info *info,
 struct irq_domain *parent);
-int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
-   int nvec);
 int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct 
device *dev,
   int nvec);
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -462,6 +462,8 @@ static inline void msi_sysfs_remove_desc
 #endif /* !CONFIG_SYSFS */
 
 #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+static int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device 
*dev, int nvec);
+
 static inline void irq_chip_write_msi_msg(struct irq_data *data,
  struct msi_msg *msg)
 {
@@ -852,8 +854,8 @@ static int msi_init_virq(struct irq_doma
return 0;
 }
 
-int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
-   int nvec)
+static int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device 
*dev,
+  int nvec)
 {
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;

[patch 04/39] genirq/msi: Use MSI_DESC_ALL in msi_add_simple_msi_descs()

2022-11-11 Thread Thomas Gleixner

There are no associated MSI descriptors in the requested range when the MSI
descriptor allocation fails. Use MSI_DESC_ALL as the filter which prepares
the next step to get rid of the filter for freeing.

Signed-off-by: Thomas Gleixner 
---
 kernel/irq/msi.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -120,7 +120,7 @@ static int msi_add_simple_msi_descs(stru
 fail_mem:
ret = -ENOMEM;
 fail:
-   msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, index, last);
+   msi_free_msi_descs_range(dev, MSI_DESC_ALL, index, last);
return ret;
 }

[patch 14/39] PCI/MSI: Let the MSI core free descriptors

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

Let the core do the freeing of descriptors and just keep it around for the
legacy case.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 
---
 drivers/pci/msi/irqdomain.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -24,11 +24,12 @@ void pci_msi_teardown_msi_irqs(struct pc
struct irq_domain *domain;
 
domain = dev_get_msi_domain(>dev);
-   if (domain && irq_domain_is_hierarchy(domain))
+   if (domain && irq_domain_is_hierarchy(domain)) {
msi_domain_free_irqs_descs_locked(domain, >dev);
-   else
+   } else {
pci_msi_legacy_teardown_msi_irqs(dev);
-   msi_free_msi_descs(>dev);
+   msi_free_msi_descs(>dev);
+   }
 }
 
 /**
@@ -170,6 +171,9 @@ struct irq_domain *pci_msi_create_irq_do
if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
pci_msi_domain_update_chip_ops(info);
 
+   /* Let the core code free MSI descriptors when freeing interrupts */
+   info->flags |= MSI_FLAG_FREE_MSI_DESCS;
+
info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS;
if (IS_ENABLED(CONFIG_GENERIC_IRQ_RESERVATION_MODE))
info->flags |= MSI_FLAG_MUST_REACTIVATE;

[patch 15/39] PCI/MSI: Get rid of PCI_MSI_IRQ_DOMAIN

2022-11-11 Thread Thomas Gleixner

What a zoo:

 PCI_MSI
select GENERIC_MSI_IRQ

 PCI_MSI_IRQ_DOMAIN
def_bool y
depends on PCI_MSI
select GENERIC_MSI_IRQ_DOMAIN

Ergo PCI_MSI enables PCI_MSI_IRQ_DOMAIN which in turn selects
GENERIC_MSI_IRQ_DOMAIN. So all the dependencies on PCI_MSI_IRQ_DOMAIN are
just an indirection to PCI_MSI.

Match the reality and just admit that PCI_MSI requires
GENERIC_MSI_IRQ_DOMAIN.

Signed-off-by: Thomas Gleixner 
---
 arch/um/drivers/Kconfig |1 
 arch/um/include/asm/pci.h   |2 -
 arch/x86/Kconfig|1 
 arch/x86/include/asm/pci.h  |4 +-
 drivers/pci/Kconfig |8 +
 drivers/pci/controller/Kconfig  |   30 +---
 drivers/pci/controller/dwc/Kconfig  |   48 
 drivers/pci/controller/mobiveil/Kconfig |6 ++--
 drivers/pci/msi/Makefile|2 -
 drivers/pci/probe.c |2 -
 include/linux/msi.h |   32 ++---
 11 files changed, 56 insertions(+), 80 deletions(-)

--- a/arch/um/drivers/Kconfig
+++ b/arch/um/drivers/Kconfig
@@ -381,7 +381,6 @@ config UML_PCI_OVER_VIRTIO
select UML_IOMEM_EMULATION
select UML_DMA_EMULATION
select PCI_MSI
-   select PCI_MSI_IRQ_DOMAIN
select PCI_LOCKLESS_CONFIG
 
 config UML_PCI_OVER_VIRTIO_DEVICE_ID
--- a/arch/um/include/asm/pci.h
+++ b/arch/um/include/asm/pci.h
@@ -7,7 +7,7 @@
 /* Generic PCI */
 #include 
 
-#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
+#ifdef CONFIG_PCI_MSI
 /*
  * This is a bit of an annoying hack, and it assumes we only have
  * the virt-pci (if anything). Which is true, but still.
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1109,7 +1109,6 @@ config X86_LOCAL_APIC
def_bool y
depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || 
PCI_MSI
select IRQ_DOMAIN_HIERARCHY
-   select PCI_MSI_IRQ_DOMAIN if PCI_MSI
 
 config X86_IO_APIC
def_bool y
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -21,7 +21,7 @@ struct pci_sysdata {
 #ifdef CONFIG_X86_64
void*iommu; /* IOMMU private data */
 #endif
-#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
+#ifdef CONFIG_PCI_MSI
void*fwnode;/* IRQ domain for MSI assignment */
 #endif
 #if IS_ENABLED(CONFIG_VMD)
@@ -52,7 +52,7 @@ static inline int pci_proc_domain(struct
 }
 #endif
 
-#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
+#ifdef CONFIG_PCI_MSI
 static inline void *_pci_root_bus_fwnode(struct pci_bus *bus)
 {
return to_pci_sysdata(bus)->fwnode;
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -38,6 +38,7 @@ source "drivers/pci/pcie/Kconfig"
 
 config PCI_MSI
bool "Message Signaled Interrupts (MSI and MSI-X)"
+   select GENERIC_MSI_IRQ_DOMAIN
select GENERIC_MSI_IRQ
help
   This allows device drivers to enable MSI (Message Signaled
@@ -51,11 +52,6 @@ config PCI_MSI
 
   If you don't know what to do here, say Y.
 
-config PCI_MSI_IRQ_DOMAIN
-   def_bool y
-   depends on PCI_MSI
-   select GENERIC_MSI_IRQ_DOMAIN
-
 config PCI_MSI_ARCH_FALLBACKS
bool
 
@@ -192,7 +188,7 @@ config PCI_LABEL
 
 config PCI_HYPERV
tristate "Hyper-V PCI Frontend"
-   depends on ((X86 && X86_64) || ARM64) && HYPERV && PCI_MSI && 
PCI_MSI_IRQ_DOMAIN && SYSFS
+   depends on ((X86 && X86_64) || ARM64) && HYPERV && PCI_MSI && SYSFS
select PCI_HYPERV_INTERFACE
help
  The PCI device frontend driver allows the kernel to import arbitrary
--- a/drivers/pci/controller/Kconfig
+++ b/drivers/pci/controller/Kconfig
@@ -19,7 +19,7 @@ config PCI_AARDVARK
tristate "Aardvark PCIe controller"
depends on (ARCH_MVEBU && ARM64) || COMPILE_TEST
depends on OF
-   depends on PCI_MSI_IRQ_DOMAIN
+   depends on PCI_MSI
select PCI_BRIDGE_EMUL
help
 Add support for Aardvark 64bit PCIe Host Controller. This
@@ -29,7 +29,7 @@ config PCI_AARDVARK
 config PCIE_XILINX_NWL
bool "NWL PCIe Core"
depends on ARCH_ZYNQMP || COMPILE_TEST
-   depends on PCI_MSI_IRQ_DOMAIN
+   depends on PCI_MSI
help
 Say 'Y' here if you want kernel support for Xilinx
 NWL PCIe controller. The controller can act as Root Port
@@ -53,7 +53,7 @@ config PCI_IXP4XX
 config PCI_TEGRA
bool "NVIDIA Tegra PCIe controller"
depends on ARCH_TEGRA || COMPILE_TEST
-   depends on PCI_MSI_IRQ_DOMAIN
+   depends on PCI_MSI
help
  Say Y here if you want support for the PCIe host controller found
  on NVIDIA Tegra SoCs.
@@ -70,7 +70,7 @@ config PCI_RCAR_GEN2
 config PCIE_RCAR_HOST
bool "Renesas R-Car PCIe host controller"

[patch 09/39] powerpc/pseries/msi: Use msi_domain_ops::msi_post_free()

2022-11-11 Thread Thomas Gleixner

Use the new msi_post_free() callback which is invoked after the interrupts
have been freed to tell the hypervisor about the shutdown.

This allows to remove the exposure of __msi_domain_free_irqs().

Signed-off-by: Thomas Gleixner 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/platforms/pseries/msi.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -447,21 +447,18 @@ static void pseries_msi_ops_msi_free(str
  * RTAS can not disable one MSI at a time. It's all or nothing. Do it
  * at the end after all IRQs have been freed.
  */
-static void pseries_msi_domain_free_irqs(struct irq_domain *domain,
-struct device *dev)
+static void pseries_msi_post_free(struct irq_domain *domain, struct device 
*dev)
 {
if (WARN_ON_ONCE(!dev_is_pci(dev)))
return;
 
-   __msi_domain_free_irqs(domain, dev);
-
rtas_disable_msi(to_pci_dev(dev));
 }
 
 static struct msi_domain_ops pseries_pci_msi_domain_ops = {
.msi_prepare= pseries_msi_ops_prepare,
.msi_free   = pseries_msi_ops_msi_free,
-   .domain_free_irqs = pseries_msi_domain_free_irqs,
+   .msi_post_free  = pseries_msi_post_free,
 };
 
 static void pseries_msi_shutdown(struct irq_data *d)

[patch 08/39] genirq/msi: Provide msi_domain_ops::post_free()

2022-11-11 Thread Thomas Gleixner

To prepare for removing the exposure of __msi_domain_free_irqs() provide a
post_free() callback in the MSI domain ops which can be used to solve
the problem of the only user of __msi_domain_free_irqs() in arch/powerpc.

Signed-off-by: Thomas Gleixner 
---
 include/linux/msi.h |4 
 kernel/irq/msi.c|2 ++
 2 files changed, 6 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -315,6 +315,8 @@ struct msi_domain_info;
  * function.
  * @domain_free_irqs:  Optional function to override the default free
  * function.
+ * @msi_post_free: Optional function which is invoked after freeing
+ * all interrupts.
  *
  * @get_hwirq, @msi_init and @msi_free are callbacks used by the underlying
  * irqdomain.
@@ -359,6 +361,8 @@ struct msi_domain_ops {
 struct device *dev, int nvec);
void(*domain_free_irqs)(struct irq_domain *domain,
struct device *dev);
+   void(*msi_post_free)(struct irq_domain *domain,
+struct device *dev);
 };
 
 /**
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -1026,6 +1026,8 @@ void msi_domain_free_irqs_descs_locked(s
lockdep_assert_held(>msi.data->mutex);
 
ops->domain_free_irqs(domain, dev);
+   if (ops->msi_post_free)
+   ops->msi_post_free(domain, dev);
msi_domain_free_msi_descs(info, dev);
 }

[patch 12/39] genirq/msi: Add bus token to struct msi_domain_info

2022-11-11 Thread Thomas Gleixner

From: Ahmed S. Darwish 

Add a bus token member to struct msi_domain_info and let
msi_create_irq_domain() set the bus token.

That allows to remove the bus token updates at the call sites.

Suggested-by: Thomas Gleixner 
Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Thomas Gleixner 
---
 include/linux/msi.h |   19 +++
 kernel/irq/msi.c|7 +--
 2 files changed, 16 insertions(+), 10 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -16,6 +16,7 @@
  * abuse. The only function which is relevant for drivers is msi_get_virq().
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -365,6 +366,7 @@ struct msi_domain_ops {
 /**
  * struct msi_domain_info - MSI interrupt domain data
  * @flags: Flags to decribe features and capabilities
+ * @bus_token: The domain bus token
  * @ops:   The callback data structure
  * @chip:  Optional: associated interrupt chip
  * @chip_data: Optional: associated interrupt chip data
@@ -374,14 +376,15 @@ struct msi_domain_ops {
  * @data:  Optional: domain specific data
  */
 struct msi_domain_info {
-   u32 flags;
-   struct msi_domain_ops   *ops;
-   struct irq_chip *chip;
-   void*chip_data;
-   irq_flow_handler_t  handler;
-   void*handler_data;
-   const char  *handler_name;
-   void*data;
+   u32 flags;
+   enum irq_domain_bus_token   bus_token;
+   struct msi_domain_ops   *ops;
+   struct irq_chip *chip;
+   void*chip_data;
+   irq_flow_handler_t  handler;
+   void*handler_data;
+   const char  *handler_name;
+   void*data;
 };
 
 /* Flags for msi_domain_info */
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -694,8 +694,11 @@ struct irq_domain *msi_create_irq_domain
domain = irq_domain_create_hierarchy(parent, IRQ_DOMAIN_FLAG_MSI, 0,
 fwnode, _domain_ops, info);
 
-   if (domain && !domain->name && info->chip)
-   domain->name = info->chip->name;
+   if (domain) {
+   if (!domain->name && info->chip)
+   domain->name = info->chip->name;
+   irq_domain_update_bus_token(domain, info->bus_token);
+   }
 
return domain;
 }

[patch 03/39] iommu/amd: Remove bogus check for multi MSI-X

2022-11-11 Thread Thomas Gleixner

PCI/Multi-MSI is MSI specific and not supported for MSI-X

Signed-off-by: Thomas Gleixner 
---
 drivers/iommu/amd/iommu.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3294,8 +3294,7 @@ static int irq_remapping_alloc(struct ir
 
if (!info)
return -EINVAL;
-   if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI &&
-   info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
+   if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI)
return -EINVAL;
 
/*

[patch 39/39] x86/apic: Remove X86_IRQ_ALLOC_CONTIGUOUS_VECTORS

2022-11-11 Thread Thomas Gleixner

Now that the PCI/MSI core code does early checking for multi-MSI support
X86_IRQ_ALLOC_CONTIGUOUS_VECTORS is not required anymore.

Remove the flag and rely on MSI_FLAG_MULTI_PCI_MSI.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/include/asm/irqdomain.h|4 +---
 arch/x86/kernel/apic/msi.c  |6 ++
 arch/x86/kernel/apic/vector.c   |4 
 drivers/iommu/amd/iommu.c   |7 ---
 drivers/iommu/intel/irq_remapping.c |7 ---
 drivers/pci/controller/pci-hyperv.c |   15 +--
 6 files changed, 4 insertions(+), 39 deletions(-)

--- a/arch/x86/include/asm/irqdomain.h
+++ b/arch/x86/include/asm/irqdomain.h
@@ -7,9 +7,7 @@
 
 #ifdef CONFIG_X86_LOCAL_APIC
 enum {
-   /* Allocate contiguous CPU vectors */
-   X86_IRQ_ALLOC_CONTIGUOUS_VECTORS= 0x1,
-   X86_IRQ_ALLOC_LEGACY= 0x2,
+   X86_IRQ_ALLOC_LEGACY= 0x1,
 };
 
 extern int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec);
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -161,12 +161,10 @@ int pci_msi_prepare(struct irq_domain *d
msi_alloc_info_t *arg)
 {
init_irq_alloc_info(arg, NULL);
-   if (to_pci_dev(dev)->msix_enabled) {
+   if (to_pci_dev(dev)->msix_enabled)
arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
-   } else {
+   else
arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
-   arg->flags |= X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
-   }
 
return 0;
 }
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -539,10 +539,6 @@ static int x86_vector_alloc_irqs(struct
if (disable_apic)
return -ENXIO;
 
-   /* Currently vector allocator can't guarantee contiguous allocations */
-   if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
-   return -ENOSYS;
-
/*
 * Catch any attempt to touch the cascade interrupt on a PIC
 * equipped system.
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3297,13 +3297,6 @@ static int irq_remapping_alloc(struct ir
if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI)
return -EINVAL;
 
-   /*
-* With IRQ remapping enabled, don't need contiguous CPU vectors
-* to support multiple MSI interrupts.
-*/
-   if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI)
-   info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
-
sbdf = get_devid(info);
if (sbdf < 0)
return -EINVAL;
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1337,13 +1337,6 @@ static int intel_irq_remapping_alloc(str
if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI)
return -EINVAL;
 
-   /*
-* With IRQ remapping enabled, don't need contiguous CPU vectors
-* to support multiple MSI interrupts.
-*/
-   if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI)
-   info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
-
ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
if (ret < 0)
return ret;
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -611,20 +611,7 @@ static unsigned int hv_msi_get_int_vecto
return cfg->vector;
 }
 
-static int hv_msi_prepare(struct irq_domain *domain, struct device *dev,
- int nvec, msi_alloc_info_t *info)
-{
-   int ret = pci_msi_prepare(domain, dev, nvec, info);
-
-   /*
-* By using the interrupt remapper in the hypervisor IOMMU, contiguous
-* CPU vectors is not needed for multi-MSI
-*/
-   if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI)
-   info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
-
-   return ret;
-}
+#define hv_msi_prepare pci_msi_prepare
 
 /**
  * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current

Re: [PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2022-05-14 Thread Thomas Gleixner

On Fri, May 13 2022 at 15:16, Ricardo Neri wrote:
> On Mon, May 09, 2022 at 04:03:39PM +0200, Thomas Gleixner wrote:
>> > +  /* If we are here, IPI shorthands are enabled. */
>> > +  apic->send_IPI_allbutself(NMI_VECTOR);
>> 
>> So if the monitored cpumask is a subset of online CPUs, which is the
>> case when isolation features are enabled, then you still send NMIs to
>> those isolated CPUs. I'm sure the isolation folks will be enthused.
>
> Yes, I acknowledged this limitation in the cover letter. I should also update
> Documentation/admin-guide/lockup-watchdogs.rst.
>
> This patchset proposes the HPET NMI watchdog as an opt-in feature.
>
> Perhaps the limitation might be mitigated by adding a check for 
> non-housekeeping
> and non-monitored CPUs in exc_nmi(). However, that will not eliminate the
> problem of isolated CPUs also getting the NMI.

Right. It's a mess...

Re: [PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()

2022-05-14 Thread Thomas Gleixner

On Fri, May 13 2022 at 14:19, Ricardo Neri wrote:
> On Fri, May 06, 2022 at 11:41:13PM +0200, Thomas Gleixner wrote:
>> The argument about not bloating the code
>> with an "obvious???" function which is quite small is slightly beyond my
>> comprehension level.
>
> That obvious function would look like this:
>
> void hpet_set_comparator_one_shot(int channel, u32 delta)
> {
>   u32 count;
>
>   count = hpet_readl(HPET_COUNTER);
>   count += delta;
>   hpet_writel(count, HPET_Tn_CMP(channel));
> }

This function only works reliably when the delta is large. See
hpet_clkevt_set_next_event().

Thanks,

tglx

Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

2022-05-14 Thread Thomas Gleixner

On Fri, May 13 2022 at 16:45, Ricardo Neri wrote:
> On Fri, May 13, 2022 at 10:50:09PM +0200, Thomas Gleixner wrote:
>> > Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips
>> > INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to check for 
>> > NULL.
>> > They currently unconditionally call the parent irq_chip's 
>> > irq_set_affinity().
>> > I see that there is a irq_chip_set_affinity_parent() function. Perhaps it 
>> > can
>> > be used for this check?
>> 
>> Yes, this lacks obviously a NMI specific set_affinity callback and this
>> can be very trivial and does not have any of the complexity of interrupt
>> affinity assignment. First online CPU in the mask with a fallback to any
>> online CPU.
>
> Why would we need a fallback to any online CPU? Shouldn't it fail if it cannot
> find an online CPU in the mask?

Might as well fail. Let me think about it.

>> I did not claim that this is complete. This was for illustration.
>
> In the reworked patch, may I add a Co-developed-by with your name and your 
> SOB?

Suggested-by is good enough.

Thanks,

tglx

Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

2022-05-13 Thread Thomas Gleixner

On Fri, May 13 2022 at 11:03, Ricardo Neri wrote:
> On Fri, May 06, 2022 at 11:12:20PM +0200, Thomas Gleixner wrote:
>> Why would a NMI ever end up in this code? There is no vector management
>> required and this find cpu exercise is pointless.
>
> But even if the NMI has a fixed vector, it is still necessary to determine
> which CPU will get the NMI. It is still necessary to determine what to
> write in the Destination ID field of the MSI message.
>
> irq_matrix_find_best_cpu() would find the CPU with the lowest number of
> managed vectors so that the NMI is directed to that CPU. 

What's the point to send it to the CPU with the lowest number of
interrupts. It's not that this NMI happens every 50 microseconds.
We pick one online CPU and are done.

> In today's code, an NMI would end up here because we rely on the existing
> interrupt management infrastructure... Unless, the check is done the entry
> points as you propose.

Correct. We don't want to call into functions which are not designed for
NMIs.
 
>> > +
>> > +  if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
>> > +  cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest);
>> > +  apicd->cpu = cpu;
>> > +  vector = 0;
>> > +  goto no_vector;
>> > +  }
>> 
>> This code can never be reached for a NMI delivery. If so, then it's a
>> bug.
>> 
>> This all is special purpose for that particular HPET NMI watchdog use
>> case and we are not exposing this to anything else at all.
>> 
>> So why are you sprinkling this NMI nonsense all over the place? Just
>> because? There are well defined entry points to all of this where this
>> can be fenced off.
>
> I put the NMI checks in these points because assign_vector_locked() and
> assign_managed_vector() are reached through multiple paths and these are
> the two places where the allocation of the vector is requested and the
> destination CPU is determined.
>
> I do observe this code being reached for an NMI, but that is because this
> code still does not know about NMIs... Unless the checks for NMI are put
> in the entry points as you pointed out.
>
> The intent was to refactor the code in a generic manner and not to focus
> only in the NMI watchdog. That would have looked hacky IMO.

We don't want to have more of this really. Supporting NMIs on x86 in a
broader way is simply not reasonable because there is only one NMI
vector and we have no sensible way to get to the cause of the NMI
without a massive overhead.

Even if we get multiple NMI vectors some shiny day, this will be
fundamentally different than regular interrupts and certainly not
exposed broadly. There will be 99.99% fixed vectors for simplicity sake.

>> +if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
>> +/*
>> + * NMIs have a fixed vector and need their own
>> + * interrupt chip so nothing can end up in the
>> + * regular local APIC management code except the
>> + * MSI message composing callback.
>> + */
>> +irqd->chip = _nmi_controller;
>> +/*
>> + * Don't allow affinity setting attempts for NMIs.
>> + * This cannot work with the regular affinity
>> + * mechanisms and for the intended HPET NMI
>> + * watchdog use case it's not required.
>
> But we do need the ability to set affinity, right? As stated above, we need
> to know what Destination ID to write in the MSI message or in the interrupt
> remapping table entry.
>
> It cannot be any CPU because only one specific CPU is supposed to handle the
> NMI from the HPET channel.
>
> We cannot hard-code a CPU for that because it may go offline (and ignore NMIs)
> or not be part of the monitored CPUs.
>
> Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips
> INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to check for 
> NULL.
> They currently unconditionally call the parent irq_chip's irq_set_affinity().
> I see that there is a irq_chip_set_affinity_parent() function. Perhaps it can
> be used for this check?

Yes, this lacks obviously a NMI specific set_affinity callback and this
can be very trivial and does not have any of the complexity of interrupt
affinity assignment. First online CPU in the mask with a fallback to any
online CPU.

I did not claim that this is complete. This was for illustration.

>> + */
>> +irqd_set_no_balance(irqd);
>
> This code does not set apicd->hw_irq_cfg.delivery_mode as NMI, right?
> I had to add that to make it work.

I assumed you can figure that out on your own :)

Thanks,

tglx

Re: [PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz

2022-05-10 Thread Thomas Gleixner

On Tue, May 10 2022 at 21:16, Nicholas Piggin wrote:
> Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am:
>> +/*
>> + * If in use, the HPET hardlockup detector relies on tsc_khz.
>> + * Reconfigure it to make use of the refined tsc_khz.
>> + */
>> +lockup_detector_reconfigure();
>
> I don't know if the API is conceptually good.
>
> You change something that the lockup detector is currently using, 
> *while* the detector is running asynchronously, and then reconfigure
> it. What happens in the window? If this code is only used for small
> adjustments maybe it does not really matter but in principle it's
> a bad API to export.
>
> lockup_detector_reconfigure as an internal API is okay because it
> reconfigures things while the watchdog is stopped [actually that
> looks untrue for soft dog which uses watchdog_thresh in
> is_softlockup(), but that should be fixed].
>
> You're the arch so you're allowed to stop the watchdog and configure
> it, e.g., hardlockup_detector_perf_stop() is called in arch/.
>
> So you want to disable HPET watchdog if it was enabled, then update
> wherever you're using tsc_khz, then re-enable.

The real question is whether making this refined tsc_khz value
immediately effective matters at all. IMO, it does not because up to
that point the watchdog was happily using the coarse calibrated value
and the whole use TSC to assess whether the HPET fired mechanism is just
a guestimate anyway. So what's the point of trying to guess 'more
correct'.

Thanks,

tglx

Re: [PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2022-05-09 Thread Thomas Gleixner

On Thu, May 05 2022 at 17:00, Ricardo Neri wrote:
> + if (is_hpet_hld_interrupt(hdata)) {
> + /*
> +  * Kick the timer first. If the HPET channel is periodic, it
> +  * helps to reduce the delta between the expected TSC value and
> +  * its actual value the next time the HPET channel fires.
> +  */
> + kick_timer(hdata, !(hdata->has_periodic));
> +
> + if (cpumask_weight(hld_data->monitored_cpumask) > 1) {
> + /*
> +  * Since we cannot know the source of an NMI, the best
> +  * we can do is to use a flag to indicate to all online
> +  * CPUs that they will get an NMI and that the source of
> +  * that NMI is the hardlockup detector. Offline CPUs
> +  * also receive the NMI but they ignore it.
> +  *
> +  * Even though we are in NMI context, we have concluded
> +  * that the NMI came from the HPET channel assigned to
> +  * the detector, an event that is infrequent and only
> +  * occurs in the handling CPU. There should not be races
> +  * with other NMIs.
> +  */
> + cpumask_copy(hld_data->inspect_cpumask,
> +  cpu_online_mask);
> +
> + /* If we are here, IPI shorthands are enabled. */
> + apic->send_IPI_allbutself(NMI_VECTOR);

So if the monitored cpumask is a subset of online CPUs, which is the
case when isolation features are enabled, then you still send NMIs to
those isolated CPUs. I'm sure the isolation folks will be enthused.

Thanks,

tglx

Re: [PATCH v6 21/29] x86/nmi: Add an NMI_WATCHDOG NMI handler category

2022-05-09 Thread Thomas Gleixner

On Thu, May 05 2022 at 17:00, Ricardo Neri wrote:
> Add a NMI_WATCHDOG as a new category of NMI handler. This new category
> is to be used with the HPET-based hardlockup detector. This detector
> does not have a direct way of checking if the HPET timer is the source of
> the NMI. Instead, it indirectly estimates it using the time-stamp counter.
>
> Therefore, we may have false-positives in case another NMI occurs within
> the estimated time window. For this reason, we want the handler of the
> detector to be called after all the NMI_LOCAL handlers. A simple way
> of achieving this with a new NMI handler category.
>
> @@ -379,6 +385,10 @@ static noinstr void default_do_nmi(struct pt_regs *regs)
>   }
>   raw_spin_unlock(_reason_lock);
>  
> + handled = nmi_handle(NMI_WATCHDOG, regs);
> + if (handled == NMI_HANDLED)
> + goto out;
> +

How is this supposed to work reliably?

If perf is active and the HPET NMI and the perf NMI come in around the
same time, then nmi_handle(LOCAL) can swallow the NMI and the watchdog
won't be checked. Because MSI is strictly edge and the message is only
sent once, this can result in a stale watchdog, no?

Thanks,

tglx

1 2 3 4 5 6 7 8 >

1 - 100 of 766 matches

Mail list logo