Re: [PATCH v2 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files
On Tue, May 07 2024 at 15:23, Vignesh Balasubramanian wrote: > +struct xfeat_component { > + u32 xfeat_type; > + u32 xfeat_sz; > + u32 xfeat_off; > + u32 xfeat_flags; > +} __packed; Why repeating xfeat_ for all member names? u32 type; u32 size; u32 offset; u32 flags; is sufficient and obvious, no? > +enum custom_feature { > + FEATURE_XSAVE_FP = 0, > + FEATURE_XSAVE_SSE = 1, > + FEATURE_XSAVE_YMM = 2, > + FEATURE_XSAVE_BNDREGS = 3, > + FEATURE_XSAVE_BNDCSR = 4, > + FEATURE_XSAVE_OPMASK = 5, > + FEATURE_XSAVE_ZMM_Hi256 = 6, > + FEATURE_XSAVE_Hi16_ZMM = 7, > + FEATURE_XSAVE_PT = 8, > + FEATURE_XSAVE_PKRU = 9, > + FEATURE_XSAVE_PASID = 10, > + FEATURE_XSAVE_CET_USER = 11, > + FEATURE_XSAVE_CET_SHADOW_STACK = 12, > + FEATURE_XSAVE_HDC = 13, > + FEATURE_XSAVE_UINTR = 14, > + FEATURE_XSAVE_LBR = 15, > + FEATURE_XSAVE_HWP = 16, > + FEATURE_XSAVE_XTILE_CFG = 17, > + FEATURE_XSAVE_XTILE_DATA = 18, > + FEATURE_MAX, > + FEATURE_XSAVE_EXTENDED_START = FEATURE_XSAVE_YMM, > + FEATURE_XSAVE_EXTENDED_END = FEATURE_XSAVE_XTILE_DATA, > +}; Why can't this use the existing 'enum xfeature' which is providing exactly the same information already? > +#ifdef CONFIG_COREDUMP > +static int get_sub_leaf(int custom_xfeat) > +{ > + switch (custom_xfeat) { > + case FEATURE_XSAVE_YMM: return XFEATURE_YMM; > + case FEATURE_XSAVE_BNDREGS: return XFEATURE_BNDREGS; > + case FEATURE_XSAVE_BNDCSR: return XFEATURE_BNDCSR; > + case FEATURE_XSAVE_OPMASK: return XFEATURE_OPMASK; > + case FEATURE_XSAVE_ZMM_Hi256: return XFEATURE_ZMM_Hi256; > + case FEATURE_XSAVE_Hi16_ZMM:return XFEATURE_Hi16_ZMM; > + case FEATURE_XSAVE_PT: return > XFEATURE_PT_UNIMPLEMENTED_SO_FAR; > + case FEATURE_XSAVE_PKRU:return XFEATURE_PKRU; > + case FEATURE_XSAVE_PASID: return XFEATURE_PASID; > + case FEATURE_XSAVE_CET_USER:return XFEATURE_CET_USER; > + case FEATURE_XSAVE_CET_SHADOW_STACK:return > XFEATURE_CET_KERNEL_UNUSED; > + case FEATURE_XSAVE_HDC: return XFEATURE_RSRVD_COMP_13; > + case FEATURE_XSAVE_UINTR: return XFEATURE_RSRVD_COMP_14; > + case FEATURE_XSAVE_LBR: return XFEATURE_LBR; > + case FEATURE_XSAVE_HWP: return XFEATURE_RSRVD_COMP_16; > + case FEATURE_XSAVE_XTILE_CFG: return XFEATURE_XTILE_CFG; > + case FEATURE_XSAVE_XTILE_DATA: return XFEATURE_XTILE_DATA; > + default: > + pr_warn_ratelimited("Not a valid XSAVE Feature."); > + return 0; > + } > +} This function then maps the identical enums one to one. The only actual "functionality" is the default case and that's completely pointless. > +/* > + * Dump type, size, offset and flag values for every xfeature that is > present. > + */ > +static int dump_xsave_layout_desc(struct coredump_params *cprm) > +{ > + u32 supported_features = 0; > + struct xfeat_component xc; > + u32 eax, ebx, ecx, edx; > + int num_records = 0; > + int sub_leaf = 0; > + int i; > + > + /* Find supported extended features */ > + cpuid_count(XSTATE_CPUID, 0, , , , ); > + supported_features = eax; Why does this need to re-evaluate CPUID instead of just using the existing fpu_user_cfg.max_features? > + for (i = FEATURE_XSAVE_EXTENDED_START; > + i <= FEATURE_XSAVE_EXTENDED_END; i++) { Please use the full 100 character line width. > + sub_leaf = get_sub_leaf(i); > + if (!sub_leaf) > + continue; > + if (supported_features & (1U << sub_leaf)) { > + cpuid_count(XSTATE_CPUID, sub_leaf, , , , > ); > + xc.xfeat_type = i; > + xc.xfeat_sz = eax; > + xc.xfeat_off = ebx; > + /* Reserved for future use */ > + xc.xfeat_flags = 0; > + > + if (!dump_emit(cprm, , > +sizeof(struct xfeat_component))) sizeof(xc), no? > + return 0; > + num_records++; > + } > + } This whole thing can be written as: for_each_extended_xfeature(i, fpu_user_cfg.max_features) { struct xfeat_component xc = { .type = i, .size = xstate_sizes[i], .offset = xstate_offsets[i], }; if (!dump_emit(cprm, , sizeof(xc))) return 0; num_records++; } It omits the features which are supported by the CPU, but not enabled by the kernel. That's perfectly fine because: 1) the corresponding xfeature
Re: [PATCHv12 1/4] genirq: Provide a snapshot mechanism for interrupt statistics
On Wed, Apr 10 2024 at 14:45, Bitao Hu wrote: > On 2024/4/9 17:58, Thomas Gleixner wrote: > By the way, what do you think of my reason for using printk() instead of > pr_crit()? Should I change this part of the code in v13? Either way is fine. Just put a proper explanation into the change log if you stick with printk(). Thanks, tglx
Re: [PATCHv12 1/4] genirq: Provide a snapshot mechanism for interrupt statistics
On Wed, Mar 06 2024 at 20:52, Bitao Hu wrote: > The soft lockup detector lacks a mechanism to identify interrupt storms > as root cause of a lockup. To enable this the detector needs a > mechanism to snapshot the interrupt count statistics on a CPU when the > detector observes a potential lockup scenario and compare that against > the interrupt count when it warns about the lockup later on. The number > of interrupts in that period give a hint whether the lockup might be > caused by an interrupt storm. > > Instead of having extra storage in the lockup detector and accessing > the internals of the interrupt descriptor directly, convert the per CPU > irq_desc::kstat_irq member to a data structure which contains the > counter plus a snapshot member and provide interfaces to take a > snapshot of all interrupts on the current CPU and to retrieve the delta > of a specific interrupt later on. > > Originally-by: Thomas Gleixner > Signed-off-by: Bitao Hu > Reviewed-by: Liu Song > Reviewed-by: Douglas Anderson This does not apply anymore. Also can you please split this apart to convert kstat_irqs to a struct with just the count in it and then add the snapshot mechanics on top. Thanks, tglx
Re: [PATCHv12 4/4] watchdog/softlockup: report the most frequent interrupts
On Wed, Mar 06 2024 at 20:52, Bitao Hu wrote: > + if (__this_cpu_read(snapshot_taken)) { > + for_each_active_irq(i) { > + count = kstat_get_irq_since_snapshot(i); > + tabulate_irq_count(irq_counts_sorted, i, count, > NUM_HARDIRQ_REPORT); > + } > + > + /* > + * We do not want the "watchdog: " prefix on every line, > + * hence we use "printk" instead of "pr_crit". > + */ You are not providing any justification why the prefix is not wanted. Just saying 'We do not want' does not cut it and who is 'We'. I certainly not. I really disagree because the prefixes are very useful for searching log files. So not having it makes it harder to filter out for no reason. Thanks, tglx
Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files
On Thu, Mar 14 2024 at 17:29, Borislav Petkov wrote: > On Thu, Mar 14, 2024 at 09:19:15AM -0700, Dave Hansen wrote: >> Are you envisioning an *XSAVE* state component that's not described by >> CPUID? > > I want to be prepared. You can imagine some of the short cuts and > corners cutting hw guys would do so I'd want to be prepared there and > not tie this to CPUID. Anything which is not enumerated in CPUID does not exist in XSTATE. Period and end of story. Stop paving the way for hardware people to have an excuse for being stupid. Aside of that XSTATE is a dead end as we all know. Thanks, tglx
Re: [v2 PATCH 0/3] arch: mm, vdso: consolidate PAGE_SIZE definition
On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote: > From: Arnd Bergmann > > Naresh noticed that the newly added usage of the PAGE_SIZE macro in > include/vdso/datapage.h introduced a build regression. I had an older > patch that I revived to have this defined through Kconfig rather than > through including asm/page.h, which is not allowed in vdso code. > > The vdso patch series now has a temporary workaround, but I still want to > get this into v6.9 so we can place the hack with CONFIG_PAGE_SIZE > in the vdso. Thank you for cleaning this up! tglx
Re: [PATCH v2 3/3] arch: define CONFIG_PAGE_SIZE_*KB on all architectures
On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote: > From: Arnd Bergmann > > Most architectures only support a single hardcoded page size. In order > to ensure that each one of these sets the corresponding Kconfig symbols, > change over the PAGE_SHIFT definition to the common one and allow > only the hardware page size to be selected. > > Acked-by: Guo Ren > Acked-by: Heiko Carstens > Acked-by: Stafford Horne > Acked-by: Johannes Berg > Signed-off-by: Arnd Bergmann Reviewed-by: Thomas Gleixner
Re: [PATCH v2 2/3] arch: simplify architecture specific page size configuration
On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote: > From: Arnd Bergmann > > arc, arm64, parisc and powerpc all have their own Kconfig symbols > in place of the common CONFIG_PAGE_SIZE_4KB symbols. Change these > so the common symbols are the ones that are actually used, while > leaving the arhcitecture specific ones as the user visible > place for configuring it, to avoid breaking user configs. > > Reviewed-by: Christophe Leroy (powerpc32) > Acked-by: Catalin Marinas > Acked-by: Helge Deller # parisc > Signed-off-by: Arnd Bergmann Reviewed-by: Thomas Gleixner
Re: [PATCH v2 1/3] arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions
On Wed, Mar 06 2024 at 15:14, Arnd Bergmann wrote: > From: Arnd Bergmann > > These four architectures define the same Kconfig symbols for configuring > the page size. Move the logic into a common place where it can be shared > with all other architectures. > > Signed-off-by: Arnd Bergmann Reviewed-by: Thomas Gleixner
Re: [PATCHv11 2/4] genirq: Provide a snapshot mechanism for interrupt statistics
On Tue, Mar 05 2024 at 18:57, Bitao Hu wrote: > On 2024/3/4 22:24, Thomas Gleixner wrote: > "GENERIC_IRQ_STAT_SNAPSHOT" visible to the user. However, after > analyzing the previous emails, it seems that what you were actually > proposing was to directly disable "GENERIC_IRQ_STAT_SNAPSHOT" when > "SOFTLOCKUP_DETECTOR_INTR_STORM" is not enabled, as a way to save > memory. If my current understanding is correct, then the code for that > part would look something like the following. Correct. > diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig > index 2531f3496ab6..a28e5ac5fc79 100644 > --- a/kernel/irq/Kconfig > +++ b/kernel/irq/Kconfig > @@ -108,6 +108,10 @@ config GENERIC_IRQ_MATRIX_ALLOCATOR > config GENERIC_IRQ_RESERVATION_MODE > bool > > +# Snapshot for interrupt statistics > +config GENERIC_IRQ_STAT_SNAPSHOT > + bool > + > # Support forced irq threading > config IRQ_FORCED_THREADING > bool > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug > index 49f652674bd8..899b69fcb598 100644 > --- a/lib/Kconfig.debug > +++ b/lib/Kconfig.debug > @@ -1032,6 +1032,7 @@ config SOFTLOCKUP_DETECTOR > config SOFTLOCKUP_DETECTOR_INTR_STORM > bool "Detect Interrupt Storm in Soft Lockups" > depends on SOFTLOCKUP_DETECTOR && IRQ_TIME_ACCOUNTING > + select GENERIC_IRQ_STAT_SNAPSHOT This goes into the patch which adds the lockup detector parts. Thanks, tglx
Re: [PATCHv11 2/4] genirq: Provide a snapshot mechanism for interrupt statistics
On Mon, Mar 04 2024 at 20:00, Bitao Hu wrote: >> +# Snapshot for interrupt statistics >> +config GENERIC_IRQ_STAT_SNAPSHOT >> +bool >> + >> # Support forced irq threading >> config IRQ_FORCED_THREADING >> bool > > I think we should follow Douglas's suggestion by making > "config GENERIC_IRQ_STAT_SNAPSHOT" automatically selectable by > "config SOFTLOCKUP_DETECTOR_INTR_STORM". This can prevent users > from inadvertently disabling "config GENERIC_IRQ_STAT_SNAPSHOT" > while enabling "config SOFTLOCKUP_DETECTOR_INTR_STORM". The above is not even configurable by the user. It's only selectable by some other config option. > +# Snapshot for interrupt statistics > +config GENERIC_IRQ_STAT_SNAPSHOT > + bool > + help > + > + Say Y here to enable the kernel to provide a snapshot mechanism > + for interrupt statistics. That makes is visible which is pointless because it's only relevant when there is an actual user. Thanks, tglx
Re: [PATCH v2 4/7] mm/x86: Drop two unnecessary pud_leaf() definitions
On Thu, Feb 29 2024 at 16:42, pet...@redhat.com wrote: > From: Peter Xu > > pud_leaf() has a fallback macro defined in include/linux/pgtable.h already. > Drop the extra two for x86. > > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: Dave Hansen > Cc: x...@kernel.org > Signed-off-by: Peter Xu Acked-by: Thomas Gleixner
Re: [PATCHv11 2/4] genirq: Provide a snapshot mechanism for interrupt statistics
Doug! On Wed, Feb 28 2024 at 14:44, Doug Anderson wrote: > I won't insist on it, but I continue to worry about memory > implications with large numbers of CPUs. With a 4-byte int, 8192 max > CPUs, and 100 IRQs the extra "ref" value takes up over 3MB of memory > (8192 * 4 bytes * 100). > > Technically, you could add a new symbol like "config > NEED_IRQ_SNAPSHOTS". This wouldn't be a symbol selectable by the end > user but would automatically be selected by "config > SOFTLOCKUP_DETECTOR_INTR_STORM". If the config wasn't defined then the > struct wouldn't contain "ref" and the snapshot routines would just be > static inline stubs. > > Maybe Thomas has an opinion about whether this is something to worry > about. Worst case it wouldn't be hard to do in a follow-up patch. I'd say it makes sense to give people a choice to save memory especially when the softlock detector code is not enabled. It's rather straight forward to do. Thanks, tglx --- --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -24,7 +24,9 @@ struct pt_regs; */ struct irqstat { unsigned intcnt; +#ifdef CONFIG_GENIRQ_STAT_SNAPSHOT unsigned intref; +#endif }; /** --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -978,6 +978,7 @@ static unsigned int kstat_irqs(unsigned return sum; } +#ifdef CONFIG_GENIRQ_STAT_SNAPSHOT void kstat_snapshot_irqs(void) { struct irq_desc *desc; @@ -998,6 +999,7 @@ unsigned int kstat_get_irq_since_snapsho return 0; return this_cpu_read(desc->kstat_irqs->cnt) - this_cpu_read(desc->kstat_irqs->ref); } +#endif /** * kstat_irqs_usr - Get the statistics for an interrupt from thread context --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -108,6 +108,10 @@ config GENERIC_IRQ_MATRIX_ALLOCATOR config GENERIC_IRQ_RESERVATION_MODE bool +# Snapshot for interrupt statistics +config GENERIC_IRQ_STAT_SNAPSHOT + bool + # Support forced irq threading config IRQ_FORCED_THREADING bool
Re: [PATCHv10 3/4] genirq: Avoid summation loops for /proc/interrupts
On Tue, Feb 27 2024 at 19:20, Bitao Hu wrote: > On 2024/2/27 17:26, Thomas Gleixner wrote: >> >> and then let kstat_irqs() and show_interrupts() use it. See? > > I have a concern. kstat_irqs() uses for_each_possible_cpu() for > summation. However, show_interrupts() uses for_each_online_cpu(), > which means it only outputs interrupt statistics for online cpus. > If we use for_each_possible_cpu() in show_interrupts() to calculate > 'any_count', there could be a problem with the following scenario: > If an interrupt has a count of zero on online cpus but a non-zero > count on possible cpus, then 'any_count' would not be zero, and the > statistics for that interrupt would be output, which is not the > desired behavior for show_interrupts(). Therefore, I think it's not > good to have kstat_irqs() and show_interrupts() both use the same > logic. What do you think? Good point. But you simply can have unsigned int kstat_irq_desc(struct irq_desc *desc, const struct cpumask *mask) and hand in the appropriate cpumask, which still shares the code, no? Thanks, tglx
Re: [PATCHv10 3/4] genirq: Avoid summation loops for /proc/interrupts
On Mon, Feb 26 2024 at 10:09, Bitao Hu wrote: > We could use the irq_desc::tot_count member to avoid the summation > loop for interrupts which are not marked as 'PER_CPU' interrupts in > 'show_interrupts'. This could reduce the time overhead of reading > /proc/interrupts. "Could" is not really a technical term. Either we do or we do not. Also please provide context for your change and avoid the 'We'. > --- a/include/linux/irqdesc.h > +++ b/include/linux/irqdesc.h > @@ -121,6 +121,8 @@ static inline void irq_unlock_sparse(void) { } > extern struct irq_desc irq_desc[NR_IRQS]; > #endif > > +extern bool irq_is_nmi(struct irq_desc *desc); > + If at all this wants to be in kernel/irq/internal.h. There is zero reason to expose this globally. > -static bool irq_is_nmi(struct irq_desc *desc) > +bool irq_is_nmi(struct irq_desc *desc) > { > return desc->istate & IRQS_NMI; > } If at all this really wants to be a static inline in internals.h, but instead of blindly copying code this can be done smarter: unsigned int kstat_irq_desc(struct irq_desc *desc) { unsigned int sum = 0; int cpu; if (!irq_settings_is_per_cpu_devid(desc) && !irq_settings_is_per_cpu(desc) && !irq_is_nmi(desc)) return data_race(desc->tot_count); for_each_possible_cpu(cpu) sum += data_race(*per_cpu_ptr(desc->kstat_irqs, cpu)); return sum; } and then let kstat_irqs() and show_interrupts() use it. See? With that a proper changelog would be: show_interrupts() unconditionally accumulates the per CPU interrupt statistics to determine whether an interrupt was ever raised. This can be avoided for all interrupts which are not strictly per CPU and not of type NMI because those interrupts provide already an accumulated counter. The required logic is already implemented in kstat_irqs(). Split the inner access logic out of kstat_irqs() and use it for kstat_irqs() and show_interrupts() to avoid the accumulation loop when possible. Thanks, tglx
Re: [PATCHv9 2/3] irq: use a struct for the kstat_irqs in the interrupt descriptor
On Fri, Feb 23 2024 at 15:18, Bitao Hu wrote: > On 2024/2/22 21:22, Thomas Gleixner wrote: >>> - if (desc->kstat_irqs) { >>> - for_each_online_cpu(j) >>> - any_count |= data_race(*per_cpu_ptr(desc->kstat_irqs, >>> j)); >>> - } >>> + if (desc->kstat_irqs) >>> + any_count = data_race(desc->tot_count); >> >> This is an unrelated change and needs to be split out into a separate >> patch with a proper changelog which explains why this is equivalent. >> > > Alright, I will remove this change witch is not related to the purpose > of this patch. > > I guess that the purpose of suggesting this change in your V1 response > was to speedup the 'show_interrupts'. However, after reviewing the > usage of 'desc->tot_count' in 'unsigned int kstat_irqs(unsigned int > irq)', I think the change might be as follows: > > diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c > index 623b8136e9af..53b8d6edd7ac 100644 > --- a/kernel/irq/proc.c > +++ b/kernel/irq/proc.c > @@ -489,8 +489,13 @@ int show_interrupts(struct seq_file *p, void *v) > goto outsparse; > > if (desc->kstat_irqs) { > - for_each_online_cpu(j) > - any_count |= > data_race(per_cpu(desc->kstat_irqs->cnt, j)); > + if (!irq_settings_is_per_cpu_devid(desc) && > + !irq_settings_is_per_cpu(desc) && > + !irq_is_nmi(desc)) > + any_count = data_race(desc->tot_count); > + else > + for_each_online_cpu(j) > + any_count |= > data_race(per_cpu(desc->kstat_irqs->cnt, j)); > } > > if ((!desc->action || irq_desc_is_chained(desc)) && !any_count) > > Is my idea correct? Yes.
Re: [PATCHv9 2/3] irq: use a struct for the kstat_irqs in the interrupt descriptor
On Thu, Feb 22 2024 at 17:34, Bitao Hu wrote: First of all the subsystem prefix is 'genirq:'. 'git log kernel/irq/' gives you a pretty good hint. It's documented Secondly the subject line does not match what this patch is about. It's not about using a struct, it's about providing a snapshot mechanism, no? > The current implementation uses an int for the kstat_irqs in the > interrupt descriptor. > > However, we need to know the number of interrupts which happened > since softlockup detection took a snapshot in order to analyze > the problem caused by an interrupt storm. > > Replacing an int with a struct and providing sensible interfaces > for the watchdog code can keep it self contained to the interrupt > core code. So something like this makes a useful change log for this: Subject: genirq: Provide a snapshot mechanism for interrupt statistics The soft lockup detector lacks a mechanism to identify interrupt storms as root cause of a lockup. To enable this the detector needs a mechanism to snapshot the interrupt count statistics on a CPU when the detector observes a potential lockup scenario and compare that against the interrupt count when it warns about the lockup later on. The number of interrupts in that period give a hint whether the lockup might be caused by an interrupt storm. Instead of having extra storage in the lockup detector and accessing the internals of the interrupt descriptor directly, convert the per CPU irq_desc::kstat_irq member to a data structure which contains the counter plus a snapshot member and provide interfaces to take a snapshot of all interrupts on the current CPU and to retrieve the delta of a specific interrupt later on. Hmm? > Signed-off-by: Bitao Hu Interesting. You fully authored the patch? That's not how it works. You cannot take work from others and claim that it is yours. The minimal courtesy is to add a 'Originally-by:' tag. > diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c > index 623b8136e9af..3ad40cf30c66 100644 > --- a/kernel/irq/proc.c > +++ b/kernel/irq/proc.c > @@ -488,18 +488,15 @@ int show_interrupts(struct seq_file *p, void *v) > if (!desc || irq_settings_is_hidden(desc)) > goto outsparse; > > - if (desc->kstat_irqs) { > - for_each_online_cpu(j) > - any_count |= data_race(*per_cpu_ptr(desc->kstat_irqs, > j)); > - } > + if (desc->kstat_irqs) > + any_count = data_race(desc->tot_count); This is an unrelated change and needs to be split out into a separate patch with a proper changelog which explains why this is equivalent. Thanks, tglx
Re: [PATCHv8 2/2] watchdog/softlockup: report the most frequent interrupts
On Tue, Feb 20 2024 at 00:19, Bitao Hu wrote: > arch/mips/dec/setup.c| 2 +- > arch/parisc/kernel/smp.c | 2 +- > arch/powerpc/kvm/book3s_hv_rm_xics.c | 2 +- > include/linux/irqdesc.h | 9 ++- > include/linux/kernel_stat.h | 4 + > kernel/irq/internals.h | 2 +- > kernel/irq/irqdesc.c | 34 ++-- > kernel/irq/proc.c| 9 +-- This really wants to be split into two patches. Interrupt infrastructure first and then the actual usage site in the watchdog code. Thanks, tglx
Re: [PATCH tty v1 00/74] serial: wrappers for uart port lock
On Thu, Sep 14 2023 at 20:01, Maciej W. Rozycki wrote: > On Thu, 14 Sep 2023, John Ogness wrote: > >> Patches 2-74 switch all uart port locking call sites to use the new >> wrappers. These patches were automatically generated using coccinelle. > > Hmm, no need to do this for drivers/tty/serial/zs.c? zs.c does not use port lock at all. It has like a couple of other drivers a local homebrewn spinlock. Thanks, tglx
Re: [PATCH v4 00/10] Introduce SMT level and add PowerPC support
On Fri, Jul 28 2023 at 14:23, Rui Zhang wrote: > On Fri, 2023-07-28 at 09:40 +0200, Thomas Gleixner wrote: >> On Sun, Jul 09 2023 at 15:25, Rui Zhang wrote: >> > I ran into a boot hang regression with latest upstream code, and it >> > took me a while to bisect the offending commit and workaround it. >> >> Where is the bug report and the analysis? And what's the workaround? > > As it is an iwlwifi regression, I didn't paste the link here. > > The regression was reported at > https://lore.kernel.org/all/b533071f38804247f06da9e52a04f15cce7a3836.ca...@intel.com/ > > And it was fixed later by below commit in 6.5-rc2. Ah, ok. I was worried that you ran into issues with the parallel bootup muck.
Re: [PATCH v4 00/10] Introduce SMT level and add PowerPC support
Laurent, Michael! On Wed, Jul 05 2023 at 16:51, Laurent Dufour wrote: > I'm taking over the series Michael sent previously [1] which is smartly > reviewing the initial series I sent [2]. This series is addressing the > comments sent by Thomas and me on the Michael's one. Thanks for getting this into shape. I've merged it into: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/core and tagged it at patch 7 for consumption into the powerpc tree, so the powerpc specific changes can be applied there on top: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp-core-for-ppc-23-07-28 Thanks, tglx
Re: [PATCH v4 00/10] Introduce SMT level and add PowerPC support
Rui! On Sun, Jul 09 2023 at 15:25, Rui Zhang wrote: > I ran into a boot hang regression with latest upstream code, and it > took me a while to bisect the offending commit and workaround it. Where is the bug report and the analysis? And what's the workaround? Thanks, tglx
Re: [PATCH 07/10] cpu/SMT: Allow enabling partial SMT states via sysfs
On Thu, Jun 15 2023 at 17:46, Laurent Dufour wrote: > > - if (ctrlval != cpu_smt_control) { > + orig_threads = cpu_smt_num_threads; > + cpu_smt_num_threads = num_threads; > + > + if (num_threads > orig_threads) { > + ret = cpuhp_smt_enable(); > + } else if (num_threads < orig_threads) { > + ret = cpuhp_smt_disable(ctrlval); > + } else if (ctrlval != cpu_smt_control) { > switch (ctrlval) { > case CPU_SMT_ENABLED: > ret = cpuhp_smt_enable(); This switch() is still as pointless as in the previous version. OFF -> ON, ON -> OFF, ON -> FORCE_OFF are covered by the num_threads comparisons. So the only case where (ctrlval != cpu_smt_control) is relevant is the OFF -> FORCE_OFF transition because in that case the number of threads is not changing. force_off = ctrlval != cpu_smt_control && ctrval == CPU_SMT_FORCE_DISABLED; if (num_threads > orig_threads) ret = cpuhp_smt_enable(); else if (num_threads < orig_threads || force_off) ret = cpuhp_smt_disable(ctrlval); Should just work, no? Thanks, tglx
Re: [PATCH 01/10] cpu/SMT: Move SMT prototypes into cpu_smt.h
On Thu, Jun 15 2023 at 17:46, Laurent Dufour wrote: > From: Michael Ellerman > > A subsequent patch would like to use the cpuhp_smt_control enum as part > of the interface between generic and arch code. This still has the 'patch' and 'arch' style which I pointed out before. It seems you fixed it only for one patch in the series. Thanks, tglx
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
Kent! On Sun, Jun 18 2023 at 19:14, Kent Overstreet wrote: > On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote: > > Thomas, you're confusing an internal interface with external No. I am not. Whether that's an internal function or not does not make any difference at all. Having a function growing _eight_ parameters where _six_ of them are derived from a well defined data structure is a clear sign of design fail. It's not rocket science to do: struct generic_allocation_info { }; struct execmem_info { struct generic_allocation_info alloc_info; ; struct execmem_param { ... struct execmem_info[NTYPES]; }; and having a function which can either operate on execmem_param and type or on generic_allocation_info itself. It does not matter as long as the data structure which is handed into this internal function is describing it completely or needs a supplementary argument, i.e. flags. Having tons of wrappers which do: a = generic_info.a; b = generic_info.b; n = generic_info.n; internal_func(a, b, ,, n); is just hillarious and to repeat myself tasteless and therefore disgusting. That's CS course first semester hackery, but TBH, I can only tell from imagination because I did not take CS courses - maybe that's the problem... Data structure driven design works not from the usage site down to the internals. It's the other way round: 1) Define a data structure which describes what the internal function needs to know 2) Implement use case specific variants which describe that 3) Hand the use case specific variant to the internal function eventually with some minimal supplementary information. Object oriented basics, right? There is absolutely nothing wrong with having internal_func(generic_info *, modifier); but having internal_func(a, b, ... n) is fundamentally wrong in the context of an extensible and future proof internal function. Guess what. Today it's sufficient to have _eight_ arguments and we are happy to have 10 nonsensical wrappers around this internal function. Tomorrow there happens to be a use case which needs another argument so you end up: Changing 10 wrappers plus the function declaration and definition in one go instead of adding One new naturally 0 initialized member to generic_info and be done with it. Look at the evolution of execmem_alloc() in this very pathset which I pointed out. That very patchset covers _two_ of at least _six_ cases Song and myself identified. It already had _three_ steps of evolution from _one_ to _five_ to _eight_ parameters. C is not like e.g. python where you can "solve" that problem by simply doing: - internal_func(a, b, c): + internal_func(a, b, c, d=None, ..., n=None): But that's not a solution either. That's a horrible workaround even in python once your parameter space gets sufficiently large. The way out in python is to have **kwargs. But that's not an option in C, and not necessarily the best option for python either. Even in python or any other object oriented language you get to the point where you have to rethink your approach, go back to the drawing board and think about data representation. But creating a new interface based on "let's see what we need over time and add parameters as we see fit" is simply wrong to begin with independent of the programming language. Even if the _eight_ parameters are the end of the range, then they are beyond justifyable because that's way beyond the natural register argument space of any architecture and you are offloading your lazyness to wrappers and the compiler to emit pointlessly horrible code. There is a reason why new syscalls which need more than a few parameters are based on 'struct DATA_WHICH_I_NEED_TO_KNOW' and 'flags'. We've got burned on the non-extensibilty often enough. Why would a new internal function have any different requirements especially as it is neither implemented to the full extent nor a hotpath function? Now you might argue that it _is_ a "hotpath" due to the BPF usage, but then even more so as any intermediate wrapper which converts from one data representation to another data representation is not going to increase performance, right? > ... I made the same mistake reviewing Song's patchset... Songs series had rough edges, but was way more data structure driven and palatable than this hackery. The fact that you made a mistake while reviewing Songs series has absolutely nothing to do with the above or my previous reply to Mike. Thanks, tglx
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
Mike! Sorry for being late on this ... On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote: > > +void *execmem_data_alloc(size_t size) > +{ > + unsigned long start = execmem_params.modules.data.start; > + unsigned long end = execmem_params.modules.data.end; > + pgprot_t pgprot = execmem_params.modules.data.pgprot; > + unsigned int align = execmem_params.modules.data.alignment; > + unsigned long fallback_start = > execmem_params.modules.data.fallback_start; > + unsigned long fallback_end = execmem_params.modules.data.fallback_end; > + bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW; While I know for sure that you read up on the discussion I had with Song about data structures, it seems you completely failed to understand it. > + return execmem_alloc(size, start, end, align, pgprot, > + fallback_start, fallback_end, kasan); Having _seven_ intermediate variables to fill _eight_ arguments of a function instead of handing in @size and a proper struct pointer is tasteless and disgusting at best. Six out of those seven parameters are from: execmem_params.module.data while the KASAN shadow part is retrieved from execmem_params.module.flags So what prevents you from having a uniform data structure, which is extensible and decribes _all_ types of allocations? Absolutely nothing. The flags part can either be in the type dependend part or you make the type configs an array as I had suggested originally and then execmem_alloc() becomes: void *execmem_alloc(type, size) and static inline void *execmem_data_alloc(size_t size) { return execmem_alloc(EXECMEM_TYPE_DATA, size); } which gets the type independent parts from @execmem_param. Just read through your own series and watch the evolution of execmem_alloc(): static void *execmem_alloc(size_t size) static void *execmem_alloc(size_t size, unsigned long start, unsigned long end, unsigned int align, pgprot_t pgprot) static void *execmem_alloc(size_t len, unsigned long start, unsigned long end, unsigned int align, pgprot_t pgprot, unsigned long fallback_start, unsigned long fallback_end, bool kasan) In a month from now this function will have _ten_ parameters and tons of horrible wrappers which convert an already existing data structure into individual function arguments. Seriously? If you want this function to be [ab]used outside of the exec_param configuration space for whatever non-sensical reasons then this still can be either: void *execmem_alloc(params, type, size) static inline void *execmem_data_alloc(size_t size) { return execmem_alloc(_param, EXECMEM_TYPE_DATA, size); } or void *execmem_alloc(type_params, size); static inline void *execmem_data_alloc(size_t size) { return execmem_alloc(_param.data, size); } which both allows you to provide alternative params, right? Coming back to my conversation with Song: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships." You might want to reread: https://lore.kernel.org/r/87lenuukj0.ffs@tglx and the subsequent thread. The fact that my suggestions had a 'mod_' namespace prefix does not make any of my points moot. Song did an extremly good job in abstracting things out, but you decided to ditch his ground work instead of building on it and keeping the good parts. That's beyond sad. Worse, you went the full 'not invented here' path with an outcome which is even worse than the original hackery from Song which started the above referenced thread. I don't know what caused you to decide that ad hoc hackery is better than proper data structure based design patterns. I actually don't want to know. As much as my voice counts: NAK-ed-by: Thomas Gleixner Thanks, tglx
Re: [PATCH 3/9] cpu/SMT: Store the current/max number of threads
On Tue, Jun 13 2023 at 19:16, Laurent Dufour wrote: > On 10/06/2023 23:26:18, Thomas Gleixner wrote: >> On Thu, May 25 2023 at 01:56, Michael Ellerman wrote: >>> #ifdef CONFIG_HOTPLUG_SMT >>> enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED; >>> +static unsigned int cpu_smt_max_threads __ro_after_init; >>> +unsigned int cpu_smt_num_threads; >> >> Why needs this to be global? cpu_smt_control is pointlessly global already. > > I agree that cpu_smt_*_threads should be static. > > Howwever, regarding cpu_smt_control, it is used in 2 places in the x86 code: > - arch/x86/power/hibernate.c in arch_resume_nosmt() > - arch/x86/kernel/cpu/bugs.c in spectre_v2_user_select_mitigation() Bah. I must have fatfingered the grep then. > An accessor function may be introduced to read that value in these 2 > functions, but I'm wondering if that's really the best option. > > Unless there is a real need to change this through this series, I think > cpu_smt_control can remain global. That's fine. Thanks, tglx
Re: [PATCH 8/9] powerpc: Add HOTPLUG_SMT support
On Mon, Jun 12 2023 at 17:20, Laurent Dufour wrote: > On 10/06/2023 23:10:02, Thomas Gleixner wrote: >> This needs more thoughts to avoid a completely inconsistent duct tape >> mess. > > Despite the test against smt_enabled_at_boot, mentioned above, I can't see > anything else to rework. Am I missing something? See my comments on the core code changes. >> Btw, the command line parser and the variable smt_enabled_at_boot being >> type int allow negative number of threads too... Maybe not what you want. > > I do agree, it should an unsigned type. I assume you'll fix that yourself. :) Thanks, tglx
Re: [PATCH 5/9] cpu/SMT: Create topology_smt_thread_allowed()
On Thu, May 25 2023 at 01:56, Michael Ellerman wrote: > A subsequent patch will enable partial SMT states, ie. when not all SMT > threads are brought online. Nitpick. I stumbled over this 'subsequent patch' theme a couple of times now because it's very similar to the 'This patch does' phrase. Just explain what you want to achieve at the end. > #else > #define topology_max_packages() (1) > static inline int > @@ -159,6 +160,7 @@ static inline int topology_max_smt_threads(void) { return > 1; } > static inline bool topology_is_primary_thread(unsigned int cpu) { return > true; } > static inline bool topology_smt_supported(void) { return false; } > static inline bool topology_smt_threads_supported(unsigned int threads) { > return false; } > +static inline bool topology_smt_thread_allowed(unsigned int cpu) { return > false; } Not all these functions need a !SMP stub. Think about the context in which they are called. There is probably precedence for pointless ones, but that does not make an argument to add more. > +/** > + * topology_smt_thread_allowed - When enabling SMT check whether this > particular > + *CPU thread is allowed to be brought online. > + * @cpu: CPU to check > + */ > +bool topology_smt_thread_allowed(unsigned int cpu) > +{ > + /* > + * No extra logic s required here to support different thread values > + * because threads will always == 1 or smp_num_siblings because of > + * topology_smt_threads_supported(). > + */ > + return true; > +} > + As x86 only supoorts the on/off model there is no need for this function if you pick up the CONFIG_SMT_NUM_THREADS_DYNAMIC idea. You still need something like that for your PPC use case, but that reduces the overall impact, right? Thanks, tglx
Re: [PATCH 4/9] cpu/SMT: Create topology_smt_threads_supported()
On Thu, May 25 2023 at 01:56, Michael Ellerman wrote: > +/** > + * topology_smt_threads_supported - Check if the given number of SMT threads > + * is supported. > + * > + * @threads: The number of SMT threads. > + */ > +bool topology_smt_threads_supported(unsigned int threads) > +{ > + // Only support a single thread or all threads. > + return threads == 1 || threads == smp_num_siblings; > +} You can make that a simple core function when cpu_smt_*_threads is consistent along the lines of my previous reply. static bool cpu_smt_num_threads_valid(unsigned int threads) { if (IS_ENABLED(CONFIG_SMT_NUM_THREADS_DYNAMIC)) return threads >= 1 && threads <= cpu_smt_max_threads; return threads == 1 || threads == cpu_smt_max_threads; } Or something like that. Thanks, tglx
Re: [PATCH 3/9] cpu/SMT: Store the current/max number of threads
On Sat, Jun 10 2023 at 23:26, Thomas Gleixner wrote: > On Thu, May 25 2023 at 01:56, Michael Ellerman wrote: > /* > * The decision whether SMT is supported can only be done after the full > * CPU identification. Called from architecture code. > */ > -void __init cpu_smt_check_topology(void) > +void __init cpu_smt_set_num_threads(unsigned int max_threads, unsigned int > num_threads) > { > - if (!topology_smt_supported()) > + if (max_threads == 1) Which makes topology_smt_supported() redundant, i.e. it can be removed. Thanks, tglx
Re: [PATCH 3/9] cpu/SMT: Store the current/max number of threads
On Thu, May 25 2023 at 01:56, Michael Ellerman wrote: > #ifdef CONFIG_HOTPLUG_SMT > enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED; > +static unsigned int cpu_smt_max_threads __ro_after_init; > +unsigned int cpu_smt_num_threads; Why needs this to be global? cpu_smt_control is pointlessly global already. > void __init cpu_smt_disable(bool force) > { > @@ -433,10 +435,18 @@ void __init cpu_smt_disable(bool force) > * The decision whether SMT is supported can only be done after the full > * CPU identification. Called from architecture code. > */ > -void __init cpu_smt_check_topology(void) > +void __init cpu_smt_check_topology(unsigned int num_threads) > { > if (!topology_smt_supported()) > cpu_smt_control = CPU_SMT_NOT_SUPPORTED; > + > + cpu_smt_max_threads = num_threads; > + > + // May already be disabled by nosmt command line parameter > + if (cpu_smt_control != CPU_SMT_ENABLED) > + cpu_smt_num_threads = 1; > + else > + cpu_smt_num_threads = num_threads; Taking Laurents findings into account this should be something like the incomplete below. x86 would simply invoke cpu_smt_set_num_threads() with both arguments as smp_num_siblings while PPC can funnel its command line parameter through the num_threads argument. Thanks, tglx --- --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -414,6 +414,8 @@ void __weak arch_smt_update(void) { } #ifdef CONFIG_HOTPLUG_SMT enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED; +static unsigned int cpu_smt_max_threads __ro_after_init; +static unsigned int cpu_smt_num_threads = UINT_MAX; void __init cpu_smt_disable(bool force) { @@ -427,24 +429,31 @@ void __init cpu_smt_disable(bool force) pr_info("SMT: disabled\n"); cpu_smt_control = CPU_SMT_DISABLED; } + cpu_smt_num_threads = 1; } /* * The decision whether SMT is supported can only be done after the full * CPU identification. Called from architecture code. */ -void __init cpu_smt_check_topology(void) +void __init cpu_smt_set_num_threads(unsigned int max_threads, unsigned int num_threads) { - if (!topology_smt_supported()) + if (max_threads == 1) cpu_smt_control = CPU_SMT_NOT_SUPPORTED; -} -static int __init smt_cmdline_disable(char *str) -{ - cpu_smt_disable(str && !strcmp(str, "force")); - return 0; + cpu_smt_max_threads = max_threads; + + /* +* If SMT has been disabled via the kernel command line or SMT is +* not supported, set cpu_smt_num_threads to 1 for consistency. +* If enabled, take the architecture requested number of threads +* to bring up into account. +*/ + if (cpu_smt_control != CPU_SMT_ENABLED) + cpu_smt_num_threads = 1; + else if (num_threads < cpu_smt_num_threads) + cpu_smt_num_threads = num_threads; } -early_param("nosmt", smt_cmdline_disable); static inline bool cpu_smt_allowed(unsigned int cpu) { @@ -463,6 +472,13 @@ static inline bool cpu_smt_allowed(unsig return !cpumask_test_cpu(cpu, _booted_once_mask); } +static int __init smt_cmdline_disable(char *str) +{ + cpu_smt_disable(str && !strcmp(str, "force")); + return 0; +} +early_param("nosmt", smt_cmdline_disable); + /* Returns true if SMT is not supported of forcefully (irreversibly) disabled */ bool cpu_smt_possible(void) {
Re: [PATCH 8/9] powerpc: Add HOTPLUG_SMT support
On Thu, Jun 01 2023 at 18:19, Laurent Dufour wrote: > @@ -435,12 +435,17 @@ void __init cpu_smt_disable(bool force) > * The decision whether SMT is supported can only be done after the full > * CPU identification. Called from architecture code. > */ > -void __init cpu_smt_check_topology(unsigned int num_threads) > +void __init cpu_smt_check_topology(unsigned int num_threads, > +unsigned int max_threads) > { > if (!topology_smt_supported()) > cpu_smt_control = CPU_SMT_NOT_SUPPORTED; > > - cpu_smt_max_threads = num_threads; > + cpu_smt_max_threads = max_threads; > + > + WARN_ON(num_threads > max_threads); > + if (num_threads > max_threads) > + num_threads = max_threads; This does not work. The call site does: > + cpu_smt_check_topology(smt_enabled_at_boot, threads_per_core); smt_enabled_at_boot is 0 when 'smt-enabled=off', which is not what the hotplug core expects. If SMT is disabled it brings up the primary thread, which means cpu_smt_num_threads = 1. This needs more thoughts to avoid a completely inconsistent duct tape mess. Btw, the command line parser and the variable smt_enabled_at_boot being type int allow negative number of threads too... Maybe not what you want. Thanks, tglx
Re: [PATCH 6/9] cpu/SMT: Allow enabling partial SMT states via sysfs
On Thu, May 25 2023 at 01:56, Michael Ellerman wrote: > There is a hook which allows arch code to control how many threads per Can you please write out architecture in changelogs and comments? I know 'arch' is commonly used but while my brain parser tolerates 'arch_' prefixes it raises an exception on 'arch' in prose as 'arch' is a regular word with a completely different meaning. Changelogs and comments are not space constraint. > @@ -2505,20 +2505,38 @@ __store_smt_control(struct device *dev, struct > device_attribute *attr, > if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED) > return -ENODEV; > > - if (sysfs_streq(buf, "on")) > + if (sysfs_streq(buf, "on")) { > ctrlval = CPU_SMT_ENABLED; > - else if (sysfs_streq(buf, "off")) > + num_threads = cpu_smt_max_threads; > + } else if (sysfs_streq(buf, "off")) { > ctrlval = CPU_SMT_DISABLED; > - else if (sysfs_streq(buf, "forceoff")) > + num_threads = 1; > + } else if (sysfs_streq(buf, "forceoff")) { > ctrlval = CPU_SMT_FORCE_DISABLED; > - else > + num_threads = 1; > + } else if (kstrtoint(buf, 10, _threads) == 0) { > + if (num_threads == 1) > + ctrlval = CPU_SMT_DISABLED; > + else if (num_threads > 1 && > topology_smt_threads_supported(num_threads)) > + ctrlval = CPU_SMT_ENABLED; > + else > + return -EINVAL; > + } else { > return -EINVAL; > + } > > ret = lock_device_hotplug_sysfs(); > if (ret) > return ret; > > - if (ctrlval != cpu_smt_control) { > + orig_threads = cpu_smt_num_threads; > + cpu_smt_num_threads = num_threads; > + > + if (num_threads > orig_threads) { > + ret = cpuhp_smt_enable(); > + } else if (num_threads < orig_threads) { > + ret = cpuhp_smt_disable(ctrlval); > + } else if (ctrlval != cpu_smt_control) { > switch (ctrlval) { > case CPU_SMT_ENABLED: > ret = cpuhp_smt_enable(); This switch case does not make sense anymore. The only situation which reaches this is when the control value goes from CPU_SMT_DISABLED to CPU_SMT_FORCE_DISABLED because that's not changing the number of threads. So something like this is completely sufficient: if (num_threads > orig_threads) ret = cpuhp_smt_enable(); else if (num_threads < orig_threads || ctrval == CPU_SMT_FORCE_DISABLED) ret = cpuhp_smt_disable(ctrlval); No? Thanks, tglx
Re: [PATCH 6/9] cpu/SMT: Allow enabling partial SMT states via sysfs
On Sat, Jun 10 2023 at 22:09, Thomas Gleixner wrote: > On Thu, May 25 2023 at 01:56, Michael Ellerman wrote: >> There is a hook which allows arch code to control how many threads per > > Can you please write out architecture in changelogs and comments? > > I know 'arch' is commonly used but while my brain parser tolerates > 'arch_' prefixes it raises an exception on 'arch' in prose as 'arch' is > a regular word with a completely different meaning. Changelogs and > comments are not space constraint. > >> @@ -2505,20 +2505,38 @@ __store_smt_control(struct device *dev, struct >> device_attribute *attr, >> if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED) >> return -ENODEV; >> >> -if (sysfs_streq(buf, "on")) >> +if (sysfs_streq(buf, "on")) { >> ctrlval = CPU_SMT_ENABLED; >> -else if (sysfs_streq(buf, "off")) >> +num_threads = cpu_smt_max_threads; >> +} else if (sysfs_streq(buf, "off")) { >> ctrlval = CPU_SMT_DISABLED; >> -else if (sysfs_streq(buf, "forceoff")) >> +num_threads = 1; >> +} else if (sysfs_streq(buf, "forceoff")) { >> ctrlval = CPU_SMT_FORCE_DISABLED; >> -else >> +num_threads = 1; >> +} else if (kstrtoint(buf, 10, _threads) == 0) { >> +if (num_threads == 1) >> +ctrlval = CPU_SMT_DISABLED; >> +else if (num_threads > 1 && >> topology_smt_threads_supported(num_threads)) Why does this not simply check cpu_smt_max_threads? else if (num_threads > 1 && num_threads <= cpu_smt_max_threads) cpu_smt_max_threads should have been established already, no? Thanks, tglx
Re: [patch 05/39] genirq/msi: Remove filter from msi_free_descs_free_range()
Miquel! On Wed, Mar 01 2023 at 11:55, Miquel Raynal wrote: > t...@linutronix.de wrote on Fri, 11 Nov 2022 14:54:22 +0100 (CET): > >> When a range of descriptors is freed then all of them are not associated to >> a linux interrupt. Remove the filter and add a warning to the free function. >> +/* Leak the descriptor when it is still referenced */ >> +if (WARN_ON_ONCE(msi_desc_match(desc, MSI_DESC_ASSOCIATED))) >> +continue; >> +msi_free_desc(desc); >> } >> } > > It looks like since this commit I am getting warnings upon EPROBE_DEFER > errors in the mvpp2 Marvell Ethernet driver. I looked a bit at the > internals to understand why this warning was shown, and it seems that > nothing "de-references" the descriptors, which would mean here: > resetting desc->irq to 0. Correct. This platform-msi ^(*&!@&^ hack really needs to die ASAP. Marc, where are we on that? Is this still in limbo? > I am wondering how useful thisd WARN_ON() is, or otherwise where the It is useful as it caught bugs already. > desc->irq entry should be zeroed (if I understand that correctly), any > help will be appreciated. Untested workaround below. I hate it with a passion, but *shrug*. Thanks, tglx --- drivers/base/platform-msi.c |1 + include/linux/msi.h |2 ++ kernel/irq/msi.c| 23 ++- 3 files changed, 25 insertions(+), 1 deletion(-) --- a/drivers/base/platform-msi.c +++ b/drivers/base/platform-msi.c @@ -324,6 +324,7 @@ void platform_msi_device_domain_free(str struct platform_msi_priv_data *data = domain->host_data; msi_lock_descs(data->dev); + msi_domain_depopulate_descs(data->dev, virq, nr_irqs); irq_domain_free_irqs_common(domain, virq, nr_irqs); msi_free_msi_descs_range(data->dev, virq, virq + nr_irqs - 1); msi_unlock_descs(data->dev); --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -631,6 +631,8 @@ int msi_domain_prepare_irqs(struct irq_d int nvec, msi_alloc_info_t *args); int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev, int virq, int nvec, msi_alloc_info_t *args); +void msi_domain_depopulate_descs(struct device *dev, int virq, int nvec); + struct irq_domain * __platform_msi_create_device_domain(struct device *dev, unsigned int nvec, --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -1109,14 +1109,35 @@ int msi_domain_populate_irqs(struct irq_ return 0; fail: - for (--virq; virq >= virq_base; virq--) + for (--virq; virq >= virq_base; virq--) { + msi_domain_depopulate_descs(dev, virq, 1); irq_domain_free_irqs_common(domain, virq, 1); + } msi_domain_free_descs(dev, ); unlock: msi_unlock_descs(dev); return ret; } +void msi_domain_depopulate_descs(struct device *dev, int virq_base, int nvec) +{ + struct msi_ctrl ctrl = { + .domid = MSI_DEFAULT_DOMAIN, + .first = virq_base, + .last = virq_base + nvec - 1, + }; + struct msi_desc *desc; + struct xarray *xa; + unsigned long idx; + + if (!msi_ctrl_valid(dev, )) + return; + + xa = >msi.data->__domains[ctrl.domid].store; + xa_for_each_range(xa, idx, desc, ctrl.first, ctrl.last) + desc->irq = 0; +} + /* * Carefully check whether the device can use reservation mode. If * reservation mode is enabled then the early activation will assign a
Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu
On Sun, Nov 27 2022 at 09:53, Paul E. McKenney wrote: > On Sun, Nov 27, 2022 at 01:40:28PM +0100, Thomas Gleixner wrote: >> There are quite some reasons why a CPU-hotplug or a hot-unplug operation >> can fail, which is not a fatal problem, really. >> >> So if a CPU hotplug operation fails, then why can't the torture test >> just move on and validate that the system still behaves correctly? >> >> That gives us more coverage than just testing the good case and giving >> up when something unexpected happens. > > Agreed, with access to a function like the tick_nohz_full_timekeeper() > suggested earlier in this email thread, then yes, it would make sense to > try to offline the CPU anyway, then forgive the failure in cases where > the CPU matches that indicated by tick_nohz_full_timekeeper(). Why special casing this? There are other valid reasons why offlining can fail. So we special case timekeeper today and then next week we special case something else just because. That does not make sense. If it fails there is a reason and you can log it. The important part is that the system is functional and stable after the fail and the rollback. >> I even argue that the torture test should inject random failures into >> the hotplug state machine to achieve extended code coverage. > > I could imagine torture_onoff() telling various CPU-hotplug notifiers > to refuse the transition using some TBD interface. There is already an interface which is exposed to sysfs which allows you to enforce a "fail" at a defined hotplug state. > That would better test the CPU-hotplug common code's ability to deal > with failures. Correct. > Or did you have something else/additional in mind? No. Thanks, tglx
Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu
Zhouyi, On Sun, Nov 27 2022 at 10:45, Zhouyi Zhou wrote: > On Sun, Nov 27, 2022 at 1:05 AM Thomas Gleixner wrote: > > So, I should construct my patch as: > We avoid ... by ... Not "We avoid". Avoid this behaviour by >> No. We are not exporting this just to make a bogus test case happy. >> >> Fix the torture code to handle -EBUSY correctly. > I am going to do a study on this, for now, I do a grep in the kernel tree: > find . -name "*.c"|xargs grep cpuhp_setup_state|wc -l > The result of the grep command shows that there are 268 > cpuhp_setup_state* cases. > which may make our task more complicated. Why? The whole point of this torture thing is to stress the infrastructure. There are quite some reasons why a CPU-hotplug or a hot-unplug operation can fail, which is not a fatal problem, really. So if a CPU hotplug operation fails, then why can't the torture test just move on and validate that the system still behaves correctly? That gives us more coverage than just testing the good case and giving up when something unexpected happens. I even argue that the torture test should inject random failures into the hotplug state machine to achieve extended code coverage. Thanks, tglx
Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu
On Mon, Nov 21 2022 at 11:51, Zhouyi Zhou wrote: > During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to > offline tick_do_timer_cpu, the operation will fail because in > function tick_nohz_cpu_down: > ``` > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > return -EBUSY; > ``` > Above bug was first discovered in torture tests performed in PPC VM How is this a bug? > of Open Source Lab of Oregon State University, and reproducable in RISC-V > and X86-64 (with additional kernel commandline cpu0_hotplug). > > In this patch, we avoid offline tick_do_timer_cpu by distribute > the offlining cpu among remaining cpus. Please read Documentation/process. Search for 'this patch'... > > Signed-off-by: Zhouyi Zhou > --- > include/linux/tick.h| 1 + > kernel/time/tick-common.c | 1 + > kernel/time/tick-internal.h | 1 - > kernel/torture.c| 10 ++ > 4 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/include/linux/tick.h b/include/linux/tick.h > index bfd571f18cfd..23cc0b205853 100644 > --- a/include/linux/tick.h > +++ b/include/linux/tick.h > @@ -14,6 +14,7 @@ > #include > > #ifdef CONFIG_GENERIC_CLOCKEVENTS > +extern int tick_do_timer_cpu __read_mostly; > extern void __init tick_init(void); > /* Should be core only, but ARM BL switcher requires it */ > extern void tick_suspend_local(void); > diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c > index 46789356f856..87b9b9afa320 100644 > --- a/kernel/time/tick-common.c > +++ b/kernel/time/tick-common.c > @@ -48,6 +48,7 @@ ktime_t tick_next_period; > *procedure also covers cpu hotplug. > */ > int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT; > +EXPORT_SYMBOL_GPL(tick_do_timer_cpu); No. We are not exporting this just to make a bogus test case happy. Fix the torture code to handle -EBUSY correctly. Thanks, tglx
Re: [patch 39/39] x86/apic: Remove X86_IRQ_ALLOC_CONTIGUOUS_VECTORS
Jason, Bjorn, Ashok! On Wed, Nov 16 2022 at 14:05, Jason Gunthorpe wrote: > On Fri, Nov 11, 2022 at 02:55:17PM +0100, Thomas Gleixner wrote: >> Now that the PCI/MSI core code does early checking for multi-MSI support >> X86_IRQ_ALLOC_CONTIGUOUS_VECTORS is not required anymore. > >> Remove the flag and rely on MSI_FLAG_MULTI_PCI_MSI. > > Reviewed-by: Jason Gunthorpe Thanks for taking the time to go through this pile! tglx
Re: [patch 01/39] PCI/MSI: Check for MSI enabled in __pci_msix_enable()
On Wed, Nov 16 2022 at 07:39, Ashok Raj wrote: > On Fri, Nov 11, 2022 at 02:54:15PM +0100, Thomas Gleixner wrote: > > Can the pre-enabled checks for msi and msix be moved up before any vector > range check? > > not that it matters for how it fails, does EBUSY sound better? Does any caller care about the error code or about the ordering in which the caller stupity is detected?
Re: [patch 34/39] PCI/MSI: Reject multi-MSI early
On Wed, Nov 16 2022 at 10:31, Bjorn Helgaas wrote: > On Fri, Nov 11, 2022 at 02:55:09PM +0100, Thomas Gleixner wrote: >> >> +/** >> + * pci_msi_domain_supports - Check for support of a particular feature flag >> + * @pdev: The PCI device to operate on >> + * @feature_mask: The feature mask to check for (full match) >> + * @mode: If ALLOW_LEGACY this grants the feature when there is >> no irq domain >> + * associated to the device. If DENY_LEGACY the lack of an >> irq domain >> + * makes the feature unsupported > > Looks like some of these might be wider than 80 columns, which I think > was the typical width of this file. I accommodated to the general sentiment that 80 columns is yesterday. My new cutoff is 100. Thanks, tglx
Re: [patch 13/39] PCI/MSI: Use msi_domain_info::bus_token
On Wed, Nov 16 2022 at 13:51, Jason Gunthorpe wrote: > On Fri, Nov 11, 2022 at 02:54:35PM +0100, Thomas Gleixner wrote: >> /* PCI-MSI is oneshot-safe */ >> info->chip->flags |= IRQCHIP_ONESHOT_SAFE; >> +/* Let the core update the bus token */ >> +info->bus_token = DOMAIN_BUS_PCI_MSI; > > comment seems a bit obvious :) > Reviewed-by: Jason Gunthorpe > > Should the callers be updated to set this in their "struct > msi_domain_info" ? For PCI/MSI we can handle that in the core for all of them. :) The other msi_domain_info usage in various places needs obviously special care. Thanks, tglx
Re: [patch 12/39] genirq/msi: Add bus token to struct msi_domain_info
On Wed, Nov 16 2022 at 13:49, Jason Gunthorpe wrote: > On Fri, Nov 11, 2022 at 02:54:33PM +0100, Thomas Gleixner wrote: >> From: Ahmed S. Darwish >> >> Add a bus token member to struct msi_domain_info and let >> msi_create_irq_domain() set the bus token. >> >> That allows to remove the bus token updates at the call sites. >> >> Suggested-by: Thomas Gleixner >> Signed-off-by: Ahmed S. Darwish >> Signed-off-by: Thomas Gleixner >> --- >> include/linux/msi.h | 19 +++ >> kernel/irq/msi.c|7 +-- >> 2 files changed, 16 insertions(+), 10 deletions(-) > > Reviewed-by: Jason Gunthorpe > >> struct msi_domain_info { >> -u32 flags; >> -struct msi_domain_ops *ops; >> -struct irq_chip *chip; >> -void*chip_data; >> -irq_flow_handler_t handler; >> -void*handler_data; >> -const char *handler_name; >> -void*data; >> +u32 flags; >> +enum irq_domain_bus_token bus_token; >> +struct msi_domain_ops *ops; >> +struct irq_chip *chip; >> +void*chip_data; >> +irq_flow_handler_t handler; >> +void*handler_data; >> +const char *handler_name; >> +void*data; >> }; > > This is why I've been frowning on horizontal alignment :( Yes, it's annoying when you have to adjust it, but it's fundamentaly simpler to parse than the clogged together word salad.
Re: [patch 08/39] genirq/msi: Provide msi_domain_ops::post_free()
On Wed, Nov 16 2022 at 13:44, Jason Gunthorpe wrote: > On Fri, Nov 11, 2022 at 02:54:27PM +0100, Thomas Gleixner wrote: >> To prepare for removing the exposure of __msi_domain_free_irqs() provide a >> post_free() callback in the MSI domain ops which can be used to solve >> the problem of the only user of __msi_domain_free_irqs() in arch/powerpc. >> >> Signed-off-by: Thomas Gleixner >> --- >> include/linux/msi.h |4 >> kernel/irq/msi.c|2 ++ >> 2 files changed, 6 insertions(+) > > Make sense, but I do wonder why PPC needs this ordering.. Some hypervisor thing.
Re: [patch 32/39] PCI/MSI: Reorder functions in msi.c
On Wed, Nov 16 2022 at 10:28, Bjorn Helgaas wrote: > On Fri, Nov 11, 2022 at 02:55:06PM +0100, Thomas Gleixner wrote: >> From: Ahmed S. Darwish >> >> There is no way to navigate msi.c without banging the head against the wall >> every now and then because MSI and MSI-X specific functions are >> intermingled and the code flow is completely non-obvious. >> >> Reorder everthing so common helpers, MSI and MSI-X specific functions are >> grouped together. > > s/everthing/everything/ > >> Suggested-by: Thomas Gleixner >> Signed-off-by: Ahmed S. Darwish >> Signed-off-by: Thomas Gleixner > > Acked-by: Bjorn Helgaas > > I assume this is pure code movement, so I didn't even look at the > text below. It is.
Re: [patch 27/39] PCI/MSI: Move pci_disable_msix() to api.c
On Wed, Nov 16 2022 at 10:26, Bjorn Helgaas wrote: >> /** >> + * pci_disable_msix() - Disable MSI-X interrupt mode on device >> + * @dev: the PCI device to operate on >> + * >> + * Legacy device driver API to disable MSI-X interrupt mode on device, >> + * free earlier-allocated interrupt vectors, and restore INTx emulation. > > Isn't INTx *emulation* a PCIe implementation detail? Doesn't seem > relevant to callers that it's emulated. Don't know how the emulation ended up there. It restores INTx which is obvioulsy not a wire on PCIe... But yes, it's uninteresting.
Re: [patch 23/39] PCI/MSI: Move pci_alloc_irq_vectors_affinity() to api.c
On Wed, Nov 16 2022 at 10:23, Bjorn Helgaas wrote: > On Fri, Nov 11, 2022 at 02:54:51PM +0100, Thomas Gleixner wrote: >> + * Same as pci_alloc_irq_vectors(), but with the extra @affd parameter. >> + * Check that function docs, and irq_affinity, for more details. > > Is " irq_affinity" some kernel-doc syntax, or is the "&" > superfluous? The latter.
Re: [patch 22/39] PCI/MSI: Move pci_alloc_irq_vectors() to api.c
On Wed, Nov 16 2022 at 10:22, Bjorn Helgaas wrote: >> + * Allocate up to @max_vecs interrupt vectors on device. MSI-X irq > > s/irq/IRQ/ > >> + * vector allocation has a higher precedence over plain MSI, which has a >> + * higher precedence over legacy INTx emulation. >> + * >> + * Upon a successful allocation, the caller should use pci_irq_vector() >> + * to get the Linux IRQ number to be passed to request_threaded_irq(). >> + * The driver must call pci_free_irq_vectors() on cleanup. >> + * >> + * Return: number of allocated vectors (which might be smaller than >> + * @max_vecs), -ENOSPC if less than @min_vecs interrupt vectors are > > s/less/fewer/ (also in some previous patches, IIRC) Will fix.
Re: [patch 20/39] PCI/MSI: Move pci_enable_msi() API to api.c
On Wed, Nov 16 2022 at 10:18, Bjorn Helgaas wrote: > On Fri, Nov 11, 2022 at 02:54:46PM +0100, Thomas Gleixner wrote: >> From: Ahmed S. Darwish >> >> To distangle the maze in msi.c all exported device-driver MSI APIs are now >> to be grouped in one file, api.c. >> >> Move pci_enable_msi() and make its kernel-doc comprehensive. >> >> Signed-off-by: Ahmed S. Darwish >> Signed-off-by: Thomas Gleixner > > Acked-by: Bjorn Helgaas > > Nit: suggest "disentangle" or "untangle" for "distangle" here and in > subsequent patches. My fault. I suggested the word to Ahmed well knowing that this is one of the words I never get spelled correctly :)
Re: [patch 15/39] PCI/MSI: Get rid of PCI_MSI_IRQ_DOMAIN
On Wed, Nov 16 2022 at 10:12, Bjorn Helgaas wrote: > On Fri, Nov 11, 2022 at 02:54:38PM +0100, Thomas Gleixner wrote: >> What a zoo: >> >> PCI_MSI >> select GENERIC_MSI_IRQ >> >> PCI_MSI_IRQ_DOMAIN >> def_bool y >> depends on PCI_MSI >> select GENERIC_MSI_IRQ_DOMAIN >> >> Ergo PCI_MSI enables PCI_MSI_IRQ_DOMAIN which in turn selects >> GENERIC_MSI_IRQ_DOMAIN. So all the dependencies on PCI_MSI_IRQ_DOMAIN are >> just an indirection to PCI_MSI. >> >> Match the reality and just admit that PCI_MSI requires >> GENERIC_MSI_IRQ_DOMAIN. >> >> Signed-off-by: Thomas Gleixner > > Acked-by: Bjorn Helgaas > > Just FYI, this will conflict with my work-in-progress to add more > COMPILE_TEST coverage: > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?id=72b5e7c401a1 > > No *actual* conflicts, just textually next door, so should be sipmle > to resolve. Worst case I can postpone my patch until the next cycle. Linus should be able to resolve that conflict :)
Re: [patch 03/39] iommu/amd: Remove bogus check for multi MSI-X
On Wed, Nov 16 2022 at 08:02, Ashok Raj wrote: > On Fri, Nov 11, 2022 at 02:54:19PM +0100, Thomas Gleixner wrote: >> PCI/Multi-MSI is MSI specific and not supported for MSI-X >> >> Signed-off-by: Thomas Gleixner >> --- >> drivers/iommu/amd/iommu.c |3 +-- >> 1 file changed, 1 insertion(+), 2 deletions(-) >> >> --- a/drivers/iommu/amd/iommu.c >> +++ b/drivers/iommu/amd/iommu.c >> @@ -3294,8 +3294,7 @@ static int irq_remapping_alloc(struct ir >> >> if (!info) >> return -EINVAL; >> -if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI && >> -info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX) >> +if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI) >> return -EINVAL; >> >> /* >> > > nit: > > maybe better to merge patch2/3 since both seem related? What's better about it? It's two different drivers.
Re: [patch 02/39] iommu/vt-d: Remove bogus check for multi MSI-X
On Wed, Nov 16 2022 at 07:52, Ashok Raj wrote: > On Fri, Nov 11, 2022 at 02:54:17PM +0100, Thomas Gleixner wrote: >> PCI/Multi-MSI is MSI specific and not supported for MSI-X. >> >> Signed-off-by: Thomas Gleixner >> --- >> drivers/iommu/intel/irq_remapping.c |3 +-- >> 1 file changed, 1 insertion(+), 2 deletions(-) >> >> --- a/drivers/iommu/intel/irq_remapping.c >> +++ b/drivers/iommu/intel/irq_remapping.c >> @@ -1334,8 +1334,7 @@ static int intel_irq_remapping_alloc(str >> >> if (!info || !iommu) >> return -EINVAL; >> -if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI && >> -info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX) >> +if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI) >> return -EINVAL; >> >> /* >> > > This check is only making sure that when multi-msi is requested that the > type has to be either MSI/MSIX. MSI-X does not support multi vector allocations on a single entry. > Wouldn't this change return -EINVAL when type = MSIX? Rightfully so. MSIX vectors are allocated one by one. Has been that way forever. Thanks, tglx
[patch 32/39] PCI/MSI: Reorder functions in msi.c
From: Ahmed S. Darwish There is no way to navigate msi.c without banging the head against the wall every now and then because MSI and MSI-X specific functions are intermingled and the code flow is completely non-obvious. Reorder everthing so common helpers, MSI and MSI-X specific functions are grouped together. Suggested-by: Thomas Gleixner Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/msi.c | 577 +- 1 file changed, 295 insertions(+), 282 deletions(-) --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -16,6 +16,97 @@ int pci_msi_enable = 1; int pci_msi_ignore_mask; +/** + * pci_msi_supported - check whether MSI may be enabled on a device + * @dev: pointer to the pci_dev data structure of MSI device function + * @nvec: how many MSIs have been requested? + * + * Look at global flags, the device itself, and its parent buses + * to determine if MSI/-X are supported for the device. If MSI/-X is + * supported return 1, else return 0. + **/ +static int pci_msi_supported(struct pci_dev *dev, int nvec) +{ + struct pci_bus *bus; + + /* MSI must be globally enabled and supported by the device */ + if (!pci_msi_enable) + return 0; + + if (!dev || dev->no_msi) + return 0; + + /* +* You can't ask to have 0 or less MSIs configured. +* a) it's stupid .. +* b) the list manipulation code assumes nvec >= 1. +*/ + if (nvec < 1) + return 0; + + /* +* Any bridge which does NOT route MSI transactions from its +* secondary bus to its primary bus must set NO_MSI flag on +* the secondary pci_bus. +* +* The NO_MSI flag can either be set directly by: +* - arch-specific PCI host bus controller drivers (deprecated) +* - quirks for specific PCI bridges +* +* or indirectly by platform-specific PCI host bridge drivers by +* advertising the 'msi_domain' property, which results in +* the NO_MSI flag when no MSI domain is found for this bridge +* at probe time. +*/ + for (bus = dev->bus; bus; bus = bus->parent) + if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI) + return 0; + + return 1; +} + +static void pcim_msi_release(void *pcidev) +{ + struct pci_dev *dev = pcidev; + + dev->is_msi_managed = false; + pci_free_irq_vectors(dev); +} + +/* + * Needs to be separate from pcim_release to prevent an ordering problem + * vs. msi_device_data_release() in the MSI core code. + */ +static int pcim_setup_msi_release(struct pci_dev *dev) +{ + int ret; + + if (!pci_is_managed(dev) || dev->is_msi_managed) + return 0; + + ret = devm_add_action(>dev, pcim_msi_release, dev); + if (!ret) + dev->is_msi_managed = true; + return ret; +} + +/* + * Ordering vs. devres: msi device data has to be installed first so that + * pcim_msi_release() is invoked before it on device release. + */ +static int pci_setup_msi_context(struct pci_dev *dev) +{ + int ret = msi_setup_device_data(>dev); + + if (!ret) + ret = pcim_setup_msi_release(dev); + return ret; +} + +/* + * Helper functions for mask/unmask and MSI message handling + */ + void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set) { raw_spinlock_t *lock = _pci_dev(desc->dev)->msi_lock; @@ -163,15 +254,8 @@ void pci_write_msi_msg(unsigned int irq, } EXPORT_SYMBOL_GPL(pci_write_msi_msg); -void pci_free_msi_irqs(struct pci_dev *dev) -{ - pci_msi_teardown_msi_irqs(dev); - if (dev->msix_base) { - iounmap(dev->msix_base); - dev->msix_base = NULL; - } -} +/* PCI/MSI specific functionality */ static void pci_intx_for_msi(struct pci_dev *dev, int enable) { @@ -190,111 +274,6 @@ static void pci_msi_set_enable(struct pc pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control); } -/* - * Architecture override returns true when the PCI MSI message should be - * written by the generic restore function. - */ -bool __weak arch_restore_msi_irqs(struct pci_dev *dev) -{ - return true; -} - -void __pci_restore_msi_state(struct pci_dev *dev) -{ - struct msi_desc *entry; - u16 control; - - if (!dev->msi_enabled) - return; - - entry = irq_get_msi_desc(dev->irq); - - pci_intx_for_msi(dev, 0); - pci_msi_set_enable(dev, 0); - if (arch_restore_msi_irqs(dev)) - __pci_write_msi_msg(entry, >msg); - - pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, ); - pci_msi_update_mask(entry, 0, 0); - control &= ~PCI_MSI_FLAGS_QSIZE; - control |= (entry->pci.msi_attrib.multiple << 4) | PCI_MSI_FLAGS_
[patch 36/39] PCI/MSI: Validate MSIX contiguous restriction early
With interrupt domains the sanity check for MSI-X vector validation can be done _before_ any allocation happens. The sanity check only applies to the allocation functions which have an 'entries' array argument. The entries array is filled by the caller with the requested MSI-X indicies. Some drivers have gaps in the index space which is not supported on all architectures. The PCI/MSI irqdomain has a 'feature' bit to enforce this validation late during the allocation phase. Just do it right away before doing any other work along with the other sanity checks on that array. Signed-off-by: Thomas Gleixner --- drivers/pci/msi/msi.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -725,13 +725,17 @@ static int msix_capability_init(struct p return ret; } -static bool pci_msix_validate_entries(struct msix_entry *entries, int nvec, int hwsize) +static bool pci_msix_validate_entries(struct pci_dev *dev, struct msix_entry *entries, + int nvec, int hwsize) { + bool nogap; int i, j; if (!entries) return true; + nogap = pci_msi_domain_supports(dev, MSI_FLAG_MSIX_CONTIGUOUS, DENY_LEGACY); + for (i = 0; i < nvec; i++) { /* Entry within hardware limit? */ if (entries[i].entry >= hwsize) @@ -742,6 +746,9 @@ static bool pci_msix_validate_entries(st if (entries[i].entry == entries[j].entry) return false; } + /* Check for unsupported gaps */ + if (nogap && entries[i].entry != i) + return false; } return true; } @@ -773,7 +780,7 @@ int __pci_enable_msix_range(struct pci_d if (hwsize < 0) return hwsize; - if (!pci_msix_validate_entries(entries, nvec, hwsize)) + if (!pci_msix_validate_entries(dev, entries, nvec, hwsize)) return -EINVAL; /* PCI_IRQ_VIRTUAL is a horrible hack! */
[patch 23/39] PCI/MSI: Move pci_alloc_irq_vectors_affinity() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_alloc_irq_vectors_affinity() and let its kernel-doc reference pci_alloc_irq_vectors() documentation added in parent commit. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 59 +++- drivers/pci/msi/msi.c | 65 + 2 files changed, 59 insertions(+), 65 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 1714905943fb..8546749afa6e 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -123,3 +123,62 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, flags, NULL); } EXPORT_SYMBOL(pci_alloc_irq_vectors); + +/** + * pci_alloc_irq_vectors_affinity() - Allocate multiple device interrupt + *vectors with affinity requirements + * @dev: the PCI device to operate on + * @min_vecs: minimum required number of vectors (must be >= 1) + * @max_vecs: maximum desired number of vectors + * @flags:allocation flags, as in pci_alloc_irq_vectors() + * @affd: affinity requirements (can be %NULL). + * + * Same as pci_alloc_irq_vectors(), but with the extra @affd parameter. + * Check that function docs, and irq_affinity, for more details. + */ +int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, + unsigned int max_vecs, unsigned int flags, + struct irq_affinity *affd) +{ + struct irq_affinity msi_default_affd = {0}; + int nvecs = -ENOSPC; + + if (flags & PCI_IRQ_AFFINITY) { + if (!affd) + affd = _default_affd; + } else { + if (WARN_ON(affd)) + affd = NULL; + } + + if (flags & PCI_IRQ_MSIX) { + nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs, + affd, flags); + if (nvecs > 0) + return nvecs; + } + + if (flags & PCI_IRQ_MSI) { + nvecs = __pci_enable_msi_range(dev, min_vecs, max_vecs, affd); + if (nvecs > 0) + return nvecs; + } + + /* use legacy IRQ if allowed */ + if (flags & PCI_IRQ_LEGACY) { + if (min_vecs == 1 && dev->irq) { + /* +* Invoke the affinity spreading logic to ensure that +* the device driver can adjust queue configuration +* for the single interrupt case. +*/ + if (affd) + irq_create_affinity_masks(1, affd); + pci_intx(dev, 1); + return 1; + } + } + + return nvecs; +} +EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity); diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 6700ef1c734e..a028774f438d 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -887,71 +887,6 @@ int __pci_enable_msix_range(struct pci_dev *dev, } /** - * pci_alloc_irq_vectors_affinity - allocate multiple IRQs for a device - * @dev: PCI device to operate on - * @min_vecs: minimum number of vectors required (must be >= 1) - * @max_vecs: maximum (desired) number of vectors - * @flags: flags or quirks for the allocation - * @affd: optional description of the affinity requirements - * - * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X or MSI - * vectors if available, and fall back to a single legacy vector - * if neither is available. Return the number of vectors allocated, - * (which might be smaller than @max_vecs) if successful, or a negative - * error code on error. If less than @min_vecs interrupt vectors are - * available for @dev the function will fail with -ENOSPC. - * - * To get the Linux IRQ number used for a vector that can be passed to - * request_irq() use the pci_irq_vector() helper. - */ -int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, - unsigned int max_vecs, unsigned int flags, - struct irq_affinity *affd) -{ - struct irq_affinity msi_default_affd = {0}; - int nvecs = -ENOSPC; - - if (flags & PCI_IRQ_AFFINITY) { - if (!affd) - affd = _default_affd; - } else { - if (WARN_ON(affd)) - affd = NULL; - } - - if (flags & PCI_IRQ_MSIX) { - nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs, -
[patch 31/39] Documentation: PCI: Add reference to PCI/MSI device driver APIs
From: Ahmed S. Darwish All exported device-driver MSI APIs are now grouped in one place at drivers/pci/msi/api.c with comprehensive kernel-docs added. Reference these kernel-docs in the official PCI/MSI howto. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- Documentation/PCI/msi-howto.rst | 10 ++ 1 file changed, 10 insertions(+) --- --- a/Documentation/PCI/msi-howto.rst +++ b/Documentation/PCI/msi-howto.rst @@ -285,3 +285,13 @@ to bridges between the PCI root and the It is also worth checking the device driver to see whether it supports MSIs. For example, it may contain calls to pci_alloc_irq_vectors() with the PCI_IRQ_MSI or PCI_IRQ_MSIX flags. + + +List of device drivers MSI(-X) APIs +=== + +The PCI/MSI subystem has a dedicated C file for its exported device driver +APIs — `drivers/pci/msi/api.c`. The following functions are exported: + +.. kernel-doc:: drivers/pci/msi/api.c + :export:
[patch 27/39] PCI/MSI: Move pci_disable_msix() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_disable_msix() and make its kernel-doc comprehensive. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 24 drivers/pci/msi/msi.c | 14 +- drivers/pci/msi/msi.h | 1 + 3 files changed, 26 insertions(+), 13 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 83ea38ffa116..653a61868ae6 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -112,6 +112,30 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, EXPORT_SYMBOL(pci_enable_msix_range); /** + * pci_disable_msix() - Disable MSI-X interrupt mode on device + * @dev: the PCI device to operate on + * + * Legacy device driver API to disable MSI-X interrupt mode on device, + * free earlier-allocated interrupt vectors, and restore INTx emulation. + * The PCI device Linux IRQ (@dev->irq) is restored to its default pin + * assertion IRQ. This is the cleanup pair of pci_enable_msix_range(). + * + * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API + * pair should, in general, be used instead. + */ +void pci_disable_msix(struct pci_dev *dev) +{ + if (!pci_msi_enabled() || !dev || !dev->msix_enabled) + return; + + msi_lock_descs(>dev); + pci_msix_shutdown(dev); + pci_free_msi_irqs(dev); + msi_unlock_descs(>dev); +} +EXPORT_SYMBOL(pci_disable_msix); + +/** * pci_alloc_irq_vectors() - Allocate multiple device interrupt vectors * @dev: the PCI device to operate on * @min_vecs: minimum required number of vectors (must be >= 1) diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 1226d66da992..6fa90d07d2e4 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -736,7 +736,7 @@ static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, return msix_capability_init(dev, entries, nvec, affd); } -static void pci_msix_shutdown(struct pci_dev *dev) +void pci_msix_shutdown(struct pci_dev *dev) { struct msi_desc *desc; @@ -758,18 +758,6 @@ static void pci_msix_shutdown(struct pci_dev *dev) pcibios_alloc_irq(dev); } -void pci_disable_msix(struct pci_dev *dev) -{ - if (!pci_msi_enable || !dev || !dev->msix_enabled) - return; - - msi_lock_descs(>dev); - pci_msix_shutdown(dev); - pci_free_msi_irqs(dev); - msi_unlock_descs(>dev); -} -EXPORT_SYMBOL(pci_disable_msix); - int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct irq_affinity *affd) { diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h index 8c4a5289432d..77e2587f7e4f 100644 --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -86,6 +86,7 @@ static inline __attribute_const__ u32 msi_multi_mask(struct msi_desc *desc) /* MSI internal functions invoked from the public APIs */ void pci_msi_shutdown(struct pci_dev *dev); +void pci_msix_shutdown(struct pci_dev *dev); void pci_free_msi_irqs(struct pci_dev *dev); int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct irq_affinity *affd); int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int minvec,
[patch 33/39] PCI/MSI: Sanitize MSI-X checks
There is no point in doing the same sanity checks over and over in a loop during MSI-X enablement. Put them in front of the loop and return early when they fail. Signed-off-by: Thomas Gleixner --- drivers/pci/msi/msi.c | 67 +- 1 file changed, 34 insertions(+), 33 deletions(-) --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -721,47 +721,31 @@ static int msix_capability_init(struct p return ret; } -static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, -int nvec, struct irq_affinity *affd, int flags) +static bool pci_msix_validate_entries(struct msix_entry *entries, int nvec, int hwsize) { - int nr_entries; int i, j; - if (!pci_msi_supported(dev, nvec) || dev->current_state != PCI_D0) - return -EINVAL; + if (!entries) + return true; - nr_entries = pci_msix_vec_count(dev); - if (nr_entries < 0) - return nr_entries; - if (nvec > nr_entries && !(flags & PCI_IRQ_VIRTUAL)) - return nr_entries; - - if (entries) { - /* Check for any invalid entries */ - for (i = 0; i < nvec; i++) { - if (entries[i].entry >= nr_entries) - return -EINVAL; /* invalid entry */ - for (j = i + 1; j < nvec; j++) { - if (entries[i].entry == entries[j].entry) - return -EINVAL; /* duplicate entry */ - } + for (i = 0; i < nvec; i++) { + /* Entry within hardware limit? */ + if (entries[i].entry >= hwsize) + return false; + + /* Check for duplicate entries */ + for (j = i + 1; j < nvec; j++) { + if (entries[i].entry == entries[j].entry) + return false; } } - - /* Check whether driver already requested for MSI IRQ */ - if (dev->msi_enabled) { - pci_info(dev, "can't enable MSI-X (MSI IRQ already assigned)\n"); - return -EINVAL; - } - return msix_capability_init(dev, entries, nvec, affd); + return true; } -int __pci_enable_msix_range(struct pci_dev *dev, - struct msix_entry *entries, int minvec, - int maxvec, struct irq_affinity *affd, - int flags) +int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int minvec, + int maxvec, struct irq_affinity *affd, int flags) { - int rc, nvec = maxvec; + int hwsize, rc, nvec = maxvec; if (maxvec < minvec) return -ERANGE; @@ -774,6 +758,23 @@ int __pci_enable_msix_range(struct pci_d if (WARN_ON_ONCE(dev->msix_enabled)) return -EINVAL; + if (!pci_msi_supported(dev, nvec) || dev->current_state != PCI_D0) + return -EINVAL; + + hwsize = pci_msix_vec_count(dev); + if (hwsize < 0) + return hwsize; + + if (!pci_msix_validate_entries(entries, nvec, hwsize)) + return -EINVAL; + + /* PCI_IRQ_VIRTUAL is a horrible hack! */ + if (nvec > hwsize && !(flags & PCI_IRQ_VIRTUAL)) + nvec = hwsize; + + if (nvec < minvec) + return -ENOSPC; + rc = pci_setup_msi_context(dev); if (rc) return rc; @@ -785,7 +786,7 @@ int __pci_enable_msix_range(struct pci_d return -ENOSPC; } - rc = __pci_enable_msix(dev, entries, nvec, affd, flags); + rc = msix_capability_init(dev, entries, nvec, affd); if (rc == 0) return nvec;
[patch 18/39] PCI/MSI: Move mask and unmask helpers to msi.h
From: Ahmed S. Darwish The upcoming support for per device MSI interrupt domains needs to share some of the inline helpers with the MSI implementation. Move them to the header file. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/msi.c | 61 +-- drivers/pci/msi/msi.h | 83 +--- 2 files changed, 74 insertions(+), 70 deletions(-) --- diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 160af9f01669..5c310df55d0d 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -16,7 +16,7 @@ static int pci_msi_enable = 1; int pci_msi_ignore_mask; -static noinline void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set) +void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set) { raw_spinlock_t *lock = _pci_dev(desc->dev)->msi_lock; unsigned long flags; @@ -32,65 +32,6 @@ static noinline void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 s raw_spin_unlock_irqrestore(lock, flags); } -static inline void pci_msi_mask(struct msi_desc *desc, u32 mask) -{ - pci_msi_update_mask(desc, 0, mask); -} - -static inline void pci_msi_unmask(struct msi_desc *desc, u32 mask) -{ - pci_msi_update_mask(desc, mask, 0); -} - -static inline void __iomem *pci_msix_desc_addr(struct msi_desc *desc) -{ - return desc->pci.mask_base + desc->msi_index * PCI_MSIX_ENTRY_SIZE; -} - -/* - * This internal function does not flush PCI writes to the device. All - * users must ensure that they read from the device before either assuming - * that the device state is up to date, or returning out of this file. - * It does not affect the msi_desc::msix_ctrl cache either. Use with care! - */ -static void pci_msix_write_vector_ctrl(struct msi_desc *desc, u32 ctrl) -{ - void __iomem *desc_addr = pci_msix_desc_addr(desc); - - if (desc->pci.msi_attrib.can_mask) - writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL); -} - -static inline void pci_msix_mask(struct msi_desc *desc) -{ - desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT; - pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl); - /* Flush write to device */ - readl(desc->pci.mask_base); -} - -static inline void pci_msix_unmask(struct msi_desc *desc) -{ - desc->pci.msix_ctrl &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT; - pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl); -} - -static void __pci_msi_mask_desc(struct msi_desc *desc, u32 mask) -{ - if (desc->pci.msi_attrib.is_msix) - pci_msix_mask(desc); - else - pci_msi_mask(desc, mask); -} - -static void __pci_msi_unmask_desc(struct msi_desc *desc, u32 mask) -{ - if (desc->pci.msi_attrib.is_msix) - pci_msix_unmask(desc); - else - pci_msi_unmask(desc, mask); -} - /** * pci_msi_mask_irq - Generic IRQ chip callback to mask PCI/MSI interrupts * @data: pointer to irqdata associated to that interrupt diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h index fc92603b33e1..d8f62d911f08 100644 --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -8,21 +8,67 @@ int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); void pci_msi_teardown_msi_irqs(struct pci_dev *dev); -#ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS -int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); -void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev); -#else -static inline int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) +/* Mask/unmask helpers */ +void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set); + +static inline void pci_msi_mask(struct msi_desc *desc, u32 mask) { - WARN_ON_ONCE(1); - return -ENODEV; + pci_msi_update_mask(desc, 0, mask); } -static inline void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev) +static inline void pci_msi_unmask(struct msi_desc *desc, u32 mask) { - WARN_ON_ONCE(1); + pci_msi_update_mask(desc, mask, 0); +} + +static inline void __iomem *pci_msix_desc_addr(struct msi_desc *desc) +{ + return desc->pci.mask_base + desc->msi_index * PCI_MSIX_ENTRY_SIZE; +} + +/* + * This internal function does not flush PCI writes to the device. All + * users must ensure that they read from the device before either assuming + * that the device state is up to date, or returning out of this file. + * It does not affect the msi_desc::msix_ctrl cache either. Use with care! + */ +static inline void pci_msix_write_vector_ctrl(struct msi_desc *desc, u32 ctrl) +{ + void __iomem *desc_addr = pci_msix_desc_addr(desc); + + if (desc->pci.msi_attrib.can_mask) + writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL); +} + +static inline void pci_msix_mask(struct msi_desc *desc) +{ + d
[patch 19/39] PCI/MSI: Move pci_disable_msi() to api.c
From: Ahmed S. Darwish msi.c is a maze of randomly sorted functions which makes the code unreadable. As a first step split the driver visible API and the internal implementation which also allows proper API documentation via one file. Create drivers/pci/msi/api.c to group all exported device-driver PCI/MSI APIs in one C file. Begin by moving pci_disable_msi() there and add kernel-doc for the function as appropriate. Suggested-by: Thomas Gleixner Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/Makefile | 3 +-- drivers/pci/msi/api.c| 37 + drivers/pci/msi/msi.c| 22 +- drivers/pci/msi/msi.h| 4 4 files changed, 47 insertions(+), 19 deletions(-) create mode 100644 drivers/pci/msi/api.c --- diff --git a/drivers/pci/msi/Makefile b/drivers/pci/msi/Makefile index 4e0a7e07965e..839ff72d72a8 100644 --- a/drivers/pci/msi/Makefile +++ b/drivers/pci/msi/Makefile @@ -2,6 +2,5 @@ # # Makefile for the PCI/MSI obj-$(CONFIG_PCI) += pcidev_msi.o -obj-$(CONFIG_PCI_MSI) += msi.o -obj-$(CONFIG_PCI_MSI) += irqdomain.o +obj-$(CONFIG_PCI_MSI) += api.o msi.o irqdomain.o obj-$(CONFIG_PCI_MSI_ARCH_FALLBACKS) += legacy.o diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c new file mode 100644 index ..7485942cbe5d --- /dev/null +++ b/drivers/pci/msi/api.c @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * PCI MSI/MSI-X — Exported APIs for device drivers + * + * Copyright (C) 2003-2004 Intel + * Copyright (C) Tom Long Nguyen (tom.l.ngu...@intel.com) + * Copyright (C) 2016 Christoph Hellwig. + * Copyright (C) 2022 Linutronix GmbH + */ + +#include + +#include "msi.h" + +/** + * pci_disable_msi() - Disable MSI interrupt mode on device + * @dev: the PCI device to operate on + * + * Legacy device driver API to disable MSI interrupt mode on device, + * free earlier allocated interrupt vectors, and restore INTx emulation. + * The PCI device Linux IRQ (@dev->irq) is restored to its default + * pin-assertion IRQ. This is the cleanup pair of pci_enable_msi(). + * + * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API + * pair should, in general, be used instead. + */ +void pci_disable_msi(struct pci_dev *dev) +{ + if (!pci_msi_enabled() || !dev || !dev->msi_enabled) + return; + + msi_lock_descs(>dev); + pci_msi_shutdown(dev); + pci_free_msi_irqs(dev); + msi_unlock_descs(>dev); +} +EXPORT_SYMBOL(pci_disable_msi); diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 5c310df55d0d..4a1300b74518 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -163,7 +163,7 @@ void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg) } EXPORT_SYMBOL_GPL(pci_write_msi_msg); -static void free_msi_irqs(struct pci_dev *dev) +void pci_free_msi_irqs(struct pci_dev *dev) { pci_msi_teardown_msi_irqs(dev); @@ -413,7 +413,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec, err: pci_msi_unmask(entry, msi_multi_mask(entry)); - free_msi_irqs(dev); + pci_free_msi_irqs(dev); fail: dev->msi_enabled = 0; unlock: @@ -531,7 +531,7 @@ static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base, goto out_unlock; out_free: - free_msi_irqs(dev); + pci_free_msi_irqs(dev); out_unlock: msi_unlock_descs(>dev); kfree(masks); @@ -680,7 +680,7 @@ int pci_msi_vec_count(struct pci_dev *dev) } EXPORT_SYMBOL(pci_msi_vec_count); -static void pci_msi_shutdown(struct pci_dev *dev) +void pci_msi_shutdown(struct pci_dev *dev) { struct msi_desc *desc; @@ -701,18 +701,6 @@ static void pci_msi_shutdown(struct pci_dev *dev) pcibios_alloc_irq(dev); } -void pci_disable_msi(struct pci_dev *dev) -{ - if (!pci_msi_enable || !dev || !dev->msi_enabled) - return; - - msi_lock_descs(>dev); - pci_msi_shutdown(dev); - free_msi_irqs(dev); - msi_unlock_descs(>dev); -} -EXPORT_SYMBOL(pci_disable_msi); - /** * pci_msix_vec_count - return the number of device's MSI-X table entries * @dev: pointer to the pci_dev data structure of MSI-X device function @@ -797,7 +785,7 @@ void pci_disable_msix(struct pci_dev *dev) msi_lock_descs(>dev); pci_msix_shutdown(dev); - free_msi_irqs(dev); + pci_free_msi_irqs(dev); msi_unlock_descs(>dev); } EXPORT_SYMBOL(pci_disable_msix); diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h index d8f62d911f08..634879277349 100644 --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -84,6 +84,10 @@ static inline __attribute_const__ u32 msi_multi_mask(struct msi_desc *desc) return (1 << (1 << desc->pci.msi_attrib.multi_cap)) - 1; } +/* MSI internal functions invoked fro
[patch 28/39] PCI/MSI: Move pci_irq_get_affinity() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_irq_get_affinity() and let its kernel-doc match rest of the file. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 43 +++ drivers/pci/msi/msi.c | 38 -- 2 files changed, 43 insertions(+), 38 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 653a61868ae6..473df7ba0584 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -9,6 +9,7 @@ */ #include +#include #include "msi.h" @@ -251,6 +252,48 @@ int pci_irq_vector(struct pci_dev *dev, unsigned int nr) EXPORT_SYMBOL(pci_irq_vector); /** + * pci_irq_get_affinity() - Get a device interrupt vector affinity + * @dev: the PCI device to operate on + * @nr: device-relative interrupt vector index (0-based); has different + * meanings, depending on interrupt mode + * MSI-Xthe index in the MSI-X vector table + * MSI the index of the enabled MSI vectors + * INTx must be 0 + * + * Return: MSI/MSI-X vector affinity, NULL if @nr is out of range or if + * the MSI(-X) vector was allocated without explicit affinity + * requirements (e.g., by pci_enable_msi(), pci_enable_msix_range(), or + * pci_alloc_irq_vectors() without the %PCI_IRQ_AFFINITY flag). Return a + * generic set of CPU ids representing all possible CPUs available + * during system boot if the device is in legacy INTx mode. + */ +const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr) +{ + int idx, irq = pci_irq_vector(dev, nr); + struct msi_desc *desc; + + if (WARN_ON_ONCE(irq <= 0)) + return NULL; + + desc = irq_get_msi_desc(irq); + /* Non-MSI does not have the information handy */ + if (!desc) + return cpu_possible_mask; + + /* MSI[X] interrupts can be allocated without affinity descriptor */ + if (!desc->affinity) + return NULL; + + /* +* MSI has a mask array in the descriptor. +* MSI-X has a single mask. +*/ + idx = dev->msi_enabled ? nr : 0; + return >affinity[idx].mask; +} +EXPORT_SYMBOL(pci_irq_get_affinity); + +/** * pci_free_irq_vectors() - Free previously allocated IRQs for a device * @dev: the PCI device to operate on * diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 6fa90d07d2e4..d78646d1c116 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -854,44 +854,6 @@ int __pci_enable_msix_range(struct pci_dev *dev, } } -/** - * pci_irq_get_affinity - return the affinity of a particular MSI vector - * @dev: PCI device to operate on - * @nr:device-relative interrupt vector index (0-based). - * - * @nr has the following meanings depending on the interrupt mode: - * MSI-X:The index in the MSI-X vector table - * MSI: The index of the enabled MSI vectors - * INTx: Must be 0 - * - * Return: A cpumask pointer or NULL if @nr is out of range - */ -const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr) -{ - int idx, irq = pci_irq_vector(dev, nr); - struct msi_desc *desc; - - if (WARN_ON_ONCE(irq <= 0)) - return NULL; - - desc = irq_get_msi_desc(irq); - /* Non-MSI does not have the information handy */ - if (!desc) - return cpu_possible_mask; - - /* MSI[X] interrupts can be allocated without affinity descriptor */ - if (!desc->affinity) - return NULL; - - /* -* MSI has a mask array in the descriptor. -* MSI-X has a single mask. -*/ - idx = dev->msi_enabled ? nr : 0; - return >affinity[idx].mask; -} -EXPORT_SYMBOL(pci_irq_get_affinity); - struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc) { return to_pci_dev(desc->dev);
[patch 20/39] PCI/MSI: Move pci_enable_msi() API to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_enable_msi() and make its kernel-doc comprehensive. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 23 +++ drivers/pci/msi/msi.c | 14 ++ drivers/pci/msi/msi.h | 1 + 3 files changed, 26 insertions(+), 12 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 7485942cbe5d..63d7f8f6a284 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -13,6 +13,29 @@ #include "msi.h" /** + * pci_enable_msi() - Enable MSI interrupt mode on device + * @dev: the PCI device to operate on + * + * Legacy device driver API to enable MSI interrupts mode on device and + * allocate a single interrupt vector. On success, the allocated vector + * Linux IRQ will be saved at @dev->irq. The driver must invoke + * pci_disable_msi() on cleanup. + * + * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API + * pair should, in general, be used instead. + * + * Return: 0 on success, errno otherwise + */ +int pci_enable_msi(struct pci_dev *dev) +{ + int rc = __pci_enable_msi_range(dev, 1, 1, NULL); + if (rc < 0) + return rc; + return 0; +} +EXPORT_SYMBOL(pci_enable_msi); + +/** * pci_disable_msi() - Disable MSI interrupt mode on device * @dev: the PCI device to operate on * diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 4a1300b74518..98f07ad9af62 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -790,8 +790,8 @@ void pci_disable_msix(struct pci_dev *dev) } EXPORT_SYMBOL(pci_disable_msix); -static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, - struct irq_affinity *affd) +int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, + struct irq_affinity *affd) { int nvec; int rc; @@ -844,16 +844,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, } } -/* deprecated, don't use */ -int pci_enable_msi(struct pci_dev *dev) -{ - int rc = __pci_enable_msi_range(dev, 1, 1, NULL); - if (rc < 0) - return rc; - return 0; -} -EXPORT_SYMBOL(pci_enable_msi); - static int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int minvec, int maxvec, struct irq_affinity *affd, diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h index 634879277349..00bb98d5bb0e 100644 --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -87,6 +87,7 @@ static inline __attribute_const__ u32 msi_multi_mask(struct msi_desc *desc) /* MSI internal functions invoked from the public APIs */ void pci_msi_shutdown(struct pci_dev *dev); void pci_free_msi_irqs(struct pci_dev *dev); +int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct irq_affinity *affd); /* Legacy (!IRQDOMAIN) fallbacks */ #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS
[patch 25/39] PCI/MSI: Move pci_free_irq_vectors() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_free_irq_vectors() and make its kernel-doc comprehensive. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 15 +++ drivers/pci/msi/msi.c | 13 - 2 files changed, 15 insertions(+), 13 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 0f1ec87e3f9f..2ff2a9ccfc47 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -205,3 +205,18 @@ int pci_irq_vector(struct pci_dev *dev, unsigned int nr) return irq ? irq : -EINVAL; } EXPORT_SYMBOL(pci_irq_vector); + +/** + * pci_free_irq_vectors() - Free previously allocated IRQs for a device + * @dev: the PCI device to operate on + * + * Undo the interrupt vector allocations and possible device MSI/MSI-X + * enablement earlier done through pci_alloc_irq_vectors_affinity() or + * pci_alloc_irq_vectors(). + */ +void pci_free_irq_vectors(struct pci_dev *dev) +{ + pci_disable_msix(dev); + pci_disable_msi(dev); +} +EXPORT_SYMBOL(pci_free_irq_vectors); diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 38ad2fe4b85c..ed8caf5ac99f 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -887,19 +887,6 @@ int __pci_enable_msix_range(struct pci_dev *dev, } /** - * pci_free_irq_vectors - free previously allocated IRQs for a device - * @dev: PCI device to operate on - * - * Undoes the allocations and enabling in pci_alloc_irq_vectors(). - */ -void pci_free_irq_vectors(struct pci_dev *dev) -{ - pci_disable_msix(dev); - pci_disable_msi(dev); -} -EXPORT_SYMBOL(pci_free_irq_vectors); - -/** * pci_irq_get_affinity - return the affinity of a particular MSI vector * @dev: PCI device to operate on * @nr:device-relative interrupt vector index (0-based).
[patch 24/39] PCI/MSI: Move pci_irq_vector() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_irq_vector() and let its kernel-doc match the rest of the file. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 23 +++ drivers/pci/msi/msi.c | 24 2 files changed, 23 insertions(+), 24 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 8546749afa6e..0f1ec87e3f9f 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -182,3 +182,26 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, return nvecs; } EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity); + +/** + * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector + * @dev: the PCI device to operate on + * @nr: device-relative interrupt vector index (0-based); has different + * meanings, depending on interrupt mode + * MSI-Xthe index in the MSI-X vector table + * MSI the index of the enabled MSI vectors + * INTx must be 0 + * + * Return: the Linux IRQ number, or -EINVAL if @nr is out of range + */ +int pci_irq_vector(struct pci_dev *dev, unsigned int nr) +{ + unsigned int irq; + + if (!dev->msi_enabled && !dev->msix_enabled) + return !nr ? dev->irq : -EINVAL; + + irq = msi_get_virq(>dev, nr); + return irq ? irq : -EINVAL; +} +EXPORT_SYMBOL(pci_irq_vector); diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index a028774f438d..38ad2fe4b85c 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -900,30 +900,6 @@ void pci_free_irq_vectors(struct pci_dev *dev) EXPORT_SYMBOL(pci_free_irq_vectors); /** - * pci_irq_vector - return Linux IRQ number of a device vector - * @dev: PCI device to operate on - * @nr:Interrupt vector index (0-based) - * - * @nr has the following meanings depending on the interrupt mode: - * MSI-X:The index in the MSI-X vector table - * MSI: The index of the enabled MSI vectors - * INTx: Must be 0 - * - * Return: The Linux interrupt number or -EINVAl if @nr is out of range. - */ -int pci_irq_vector(struct pci_dev *dev, unsigned int nr) -{ - unsigned int irq; - - if (!dev->msi_enabled && !dev->msix_enabled) - return !nr ? dev->irq : -EINVAL; - - irq = msi_get_virq(>dev, nr); - return irq ? irq : -EINVAL; -} -EXPORT_SYMBOL(pci_irq_vector); - -/** * pci_irq_get_affinity - return the affinity of a particular MSI vector * @dev: PCI device to operate on * @nr:device-relative interrupt vector index (0-based).
[patch 34/39] PCI/MSI: Reject multi-MSI early
When hierarchical MSI interrupt domains are enabled then there is no point to do tons of work and detect the missing support for multi-MSI late in the allocation path. Just query the domain feature flags right away. The query function is going to be used for other purposes later and has a mode argument which influences the result: ALLOW_LEGACY returns true when: - there is no irq domain attached (legacy support) - there is a irq domain attached which has the feature flag set DENY_LEGACY returns only true when: - there is a irq domain attached which has the feature flag set This allows to use the function universally without ifdeffery in the calling code. Signed-off-by: Thomas Gleixner --- drivers/pci/msi/irqdomain.c | 22 ++ drivers/pci/msi/msi.c |4 drivers/pci/msi/msi.h |9 + 3 files changed, 35 insertions(+) --- a/drivers/pci/msi/irqdomain.c +++ b/drivers/pci/msi/irqdomain.c @@ -187,6 +187,28 @@ struct irq_domain *pci_msi_create_irq_do } EXPORT_SYMBOL_GPL(pci_msi_create_irq_domain); +/** + * pci_msi_domain_supports - Check for support of a particular feature flag + * @pdev: The PCI device to operate on + * @feature_mask: The feature mask to check for (full match) + * @mode: If ALLOW_LEGACY this grants the feature when there is no irq domain + * associated to the device. If DENY_LEGACY the lack of an irq domain + * makes the feature unsupported + */ +bool pci_msi_domain_supports(struct pci_dev *pdev, unsigned int feature_mask, +enum support_mode mode) +{ + struct msi_domain_info *info; + struct irq_domain *domain; + + domain = dev_get_msi_domain(>dev); + + if (!domain || !irq_domain_is_hierarchy(domain)) + return mode == ALLOW_LEGACY; + info = domain->host_data; + return (info->flags & feature_mask) == feature_mask; +} + /* * Users of the generic MSI infrastructure expect a device to have a single ID, * so with DMA aliases we have to pick the least-worst compromise. Devices with --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -347,6 +347,10 @@ static int msi_capability_init(struct pc struct msi_desc *entry; int ret; + /* Reject multi-MSI early on irq domain enabled architectures */ + if (nvec > 1 && !pci_msi_domain_supports(dev, MSI_FLAG_MULTI_PCI_MSI, ALLOW_LEGACY)) + return 1; + /* * Disable MSI during setup in the hardware, but mark it enabled * so that setup code can evaluate it. --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -97,6 +97,15 @@ int __pci_enable_msix_range(struct pci_d void __pci_restore_msi_state(struct pci_dev *dev); void __pci_restore_msix_state(struct pci_dev *dev); +/* irq_domain related functionality */ + +enum support_mode { + ALLOW_LEGACY, + DENY_LEGACY, +}; + +bool pci_msi_domain_supports(struct pci_dev *dev, unsigned int feature_mask, enum support_mode mode); + /* Legacy (!IRQDOMAIN) fallbacks */ #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS
[patch 21/39] PCI/MSI: Move pci_enable_msix_range() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_enable_msix_range() and make its kernel-doc comprehensive. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 32 drivers/pci/msi/msi.c | 30 -- drivers/pci/msi/msi.h | 3 +++ 3 files changed, 39 insertions(+), 26 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 63d7f8f6a284..d48050555d55 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -58,3 +58,35 @@ void pci_disable_msi(struct pci_dev *dev) msi_unlock_descs(>dev); } EXPORT_SYMBOL(pci_disable_msi); + +/** + * pci_enable_msix_range() - Enable MSI-X interrupt mode on device + * @dev: the PCI device to operate on + * @entries: input/output parameter, array of MSI-X configuration entries + * @minvec: minimum required number of MSI-X vectors + * @maxvec: maximum desired number of MSI-X vectors + * + * Legacy device driver API to enable MSI-X interrupt mode on device and + * configure its MSI-X capability structure as appropriate. The passed + * @entries array must have each of its members "entry" field set to a + * desired (valid) MSI-X vector number, where the range of valid MSI-X + * vector numbers can be queried through pci_msix_vec_count(). If + * successful, the driver must invoke pci_disable_msix() on cleanup. + * + * NOTE: The newer pci_alloc_irq_vectors() / pci_free_irq_vectors() API + * pair should, in general, be used instead. + * + * Return: number of MSI-X vectors allocated (which might be smaller + * than @maxvecs), where Linux IRQ numbers for such allocated vectors + * are saved back in the @entries array elements' "vector" field. Return + * -ENOSPC if less than @minvecs interrupt vectors are available. + * Return -EINVAL if one of the passed @entries members "entry" field + * was invalid or a duplicate, or if plain MSI interrupts mode was + * earlier enabled on device. Return other errnos otherwise. + */ +int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, + int minvec, int maxvec) +{ + return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0); +} +EXPORT_SYMBOL(pci_enable_msix_range); diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 98f07ad9af62..6700ef1c734e 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -844,10 +844,10 @@ int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, } } -static int __pci_enable_msix_range(struct pci_dev *dev, - struct msix_entry *entries, int minvec, - int maxvec, struct irq_affinity *affd, - int flags) +int __pci_enable_msix_range(struct pci_dev *dev, + struct msix_entry *entries, int minvec, + int maxvec, struct irq_affinity *affd, + int flags) { int rc, nvec = maxvec; @@ -887,28 +887,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev, } /** - * pci_enable_msix_range - configure device's MSI-X capability structure - * @dev: pointer to the pci_dev data structure of MSI-X device function - * @entries: pointer to an array of MSI-X entries - * @minvec: minimum number of MSI-X IRQs requested - * @maxvec: maximum number of MSI-X IRQs requested - * - * Setup the MSI-X capability structure of device function with a maximum - * possible number of interrupts in the range between @minvec and @maxvec - * upon its software driver call to request for MSI-X mode enabled on its - * hardware device function. It returns a negative errno if an error occurs. - * If it succeeds, it returns the actual number of interrupts allocated and - * indicates the successful configuration of MSI-X capability structure - * with new allocated MSI-X interrupts. - **/ -int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, - int minvec, int maxvec) -{ - return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0); -} -EXPORT_SYMBOL(pci_enable_msix_range); - -/** * pci_alloc_irq_vectors_affinity - allocate multiple IRQs for a device * @dev: PCI device to operate on * @min_vecs: minimum number of vectors required (must be >= 1) diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h index 00bb98d5bb0e..8c4a5289432d 100644 --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -88,8 +88,11 @@ static inline __attribute_const__ u32 msi_multi_mask(struct msi_desc *desc) void pci_msi_shutdown(struct pci_dev *dev); void pci_free_msi_irqs(struct pci_dev *dev); int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct irq_affinity *affd); +int __pci_enable_msix_range(st
[patch 30/39] PCI/MSI: Move pci_msi_restore_state() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_msi_enabled() and add kernel-doc for the function. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index ee9ed5ccd94d..8d1cf6db9bd7 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -308,6 +308,21 @@ void pci_free_irq_vectors(struct pci_dev *dev) } EXPORT_SYMBOL(pci_free_irq_vectors); +/** + * pci_restore_msi_state() - Restore cached MSI(-X) state on device + * @dev: the PCI device to operate on + * + * Write the Linux-cached MSI(-X) state back on device. This is + * typically useful upon system resume, or after an error-recovery PCI + * adapter reset. + */ +void pci_restore_msi_state(struct pci_dev *dev) +{ + __pci_restore_msi_state(dev); + __pci_restore_msix_state(dev); +} +EXPORT_SYMBOL_GPL(pci_restore_msi_state); + /** * pci_msi_enabled() - Are MSI(-X) interrupts enabled system-wide? * diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index 59c33bc7fe81..a5d168c823ff 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -199,7 +199,7 @@ bool __weak arch_restore_msi_irqs(struct pci_dev *dev) return true; } -static void __pci_restore_msi_state(struct pci_dev *dev) +void __pci_restore_msi_state(struct pci_dev *dev) { struct msi_desc *entry; u16 control; @@ -231,7 +231,7 @@ static void pci_msix_clear_and_set_ctrl(struct pci_dev *dev, u16 clear, u16 set) pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, ctrl); } -static void __pci_restore_msix_state(struct pci_dev *dev) +void __pci_restore_msix_state(struct pci_dev *dev) { struct msi_desc *entry; bool write_msg; @@ -257,13 +257,6 @@ static void __pci_restore_msix_state(struct pci_dev *dev) pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0); } -void pci_restore_msi_state(struct pci_dev *dev) -{ - __pci_restore_msi_state(dev); - __pci_restore_msix_state(dev); -} -EXPORT_SYMBOL_GPL(pci_restore_msi_state); - static void pcim_msi_release(void *pcidev) { struct pci_dev *dev = pcidev; diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h index f3f4ede53171..8170ef2c5ad0 100644 --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -94,6 +94,8 @@ void pci_free_msi_irqs(struct pci_dev *dev); int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, struct irq_affinity *affd); int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int minvec, int maxvec, struct irq_affinity *affd, int flags); +void __pci_restore_msi_state(struct pci_dev *dev); +void __pci_restore_msix_state(struct pci_dev *dev); /* Legacy (!IRQDOMAIN) fallbacks */
[patch 29/39] PCI/MSI: Move pci_msi_enabled() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_msi_enabled() and make its kernel-doc comprehensive. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 12 drivers/pci/msi/msi.c | 14 +- drivers/pci/msi/msi.h | 3 +++ 3 files changed, 16 insertions(+), 13 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 473df7ba0584..ee9ed5ccd94d 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -307,3 +307,15 @@ void pci_free_irq_vectors(struct pci_dev *dev) pci_disable_msi(dev); } EXPORT_SYMBOL(pci_free_irq_vectors); + +/** + * pci_msi_enabled() - Are MSI(-X) interrupts enabled system-wide? + * + * Return: true if MSI has not been globally disabled through ACPI FADT, + * PCI bridge quirks, or the "pci=nomsi" kernel command-line option. + */ +int pci_msi_enabled(void) +{ + return pci_msi_enable; +} +EXPORT_SYMBOL(pci_msi_enabled); diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index d78646d1c116..59c33bc7fe81 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -13,7 +13,7 @@ #include "../pci.h" #include "msi.h" -static int pci_msi_enable = 1; +int pci_msi_enable = 1; int pci_msi_ignore_mask; void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set) @@ -864,15 +864,3 @@ void pci_no_msi(void) { pci_msi_enable = 0; } - -/** - * pci_msi_enabled - is MSI enabled? - * - * Returns true if MSI has not been disabled by the command-line option - * pci=nomsi. - **/ -int pci_msi_enabled(void) -{ - return pci_msi_enable; -} -EXPORT_SYMBOL(pci_msi_enabled); diff --git a/drivers/pci/msi/msi.h b/drivers/pci/msi/msi.h index 77e2587f7e4f..f3f4ede53171 100644 --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -84,6 +84,9 @@ static inline __attribute_const__ u32 msi_multi_mask(struct msi_desc *desc) return (1 << (1 << desc->pci.msi_attrib.multi_cap)) - 1; } +/* Subsystem variables */ +extern int pci_msi_enable; + /* MSI internal functions invoked from the public APIs */ void pci_msi_shutdown(struct pci_dev *dev); void pci_msix_shutdown(struct pci_dev *dev);
[patch 38/39] genirq/msi: Remove msi_domain_ops::msi_check()
No more users. Signed-off-by: Thomas Gleixner --- include/linux/msi.h |4 kernel/irq/msi.c| 17 + 2 files changed, 1 insertion(+), 20 deletions(-) --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -288,7 +288,6 @@ struct msi_domain_info; * @get_hwirq: Retrieve the resulting hw irq number * @msi_init: Domain specific init function for MSI interrupts * @msi_free: Domain specific function to free a MSI interrupts - * @msi_check: Callback for verification of the domain/info/dev data * @msi_prepare: Prepare the allocation of the interrupts in the domain * @set_desc: Set the msi descriptor for an interrupt * @domain_alloc_irqs: Optional function to override the default allocation @@ -326,9 +325,6 @@ struct msi_domain_ops { void(*msi_free)(struct irq_domain *domain, struct msi_domain_info *info, unsigned int virq); - int (*msi_check)(struct irq_domain *domain, -struct msi_domain_info *info, -struct device *dev); int (*msi_prepare)(struct irq_domain *domain, struct device *dev, int nvec, msi_alloc_info_t *arg); --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -617,17 +617,9 @@ static int msi_domain_ops_init(struct ir return 0; } -static int msi_domain_ops_check(struct irq_domain *domain, - struct msi_domain_info *info, - struct device *dev) -{ - return 0; -} - static struct msi_domain_ops msi_domain_ops_default = { .get_hwirq = msi_domain_ops_get_hwirq, .msi_init = msi_domain_ops_init, - .msi_check = msi_domain_ops_check, .msi_prepare= msi_domain_ops_prepare, .set_desc = msi_domain_ops_set_desc, .domain_alloc_irqs = __msi_domain_alloc_irqs, @@ -655,8 +647,6 @@ static void msi_domain_update_dom_ops(st ops->get_hwirq = msi_domain_ops_default.get_hwirq; if (ops->msi_init == NULL) ops->msi_init = msi_domain_ops_default.msi_init; - if (ops->msi_check == NULL) - ops->msi_check = msi_domain_ops_default.msi_check; if (ops->msi_prepare == NULL) ops->msi_prepare = msi_domain_ops_default.msi_prepare; if (ops->set_desc == NULL) @@ -707,13 +697,8 @@ int msi_domain_prepare_irqs(struct irq_d { struct msi_domain_info *info = domain->host_data; struct msi_domain_ops *ops = info->ops; - int ret; - - ret = ops->msi_check(domain, info, dev); - if (ret == 0) - ret = ops->msi_prepare(domain, dev, nvec, arg); - return ret; + return ops->msi_prepare(domain, dev, nvec, arg); } int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
[patch 26/39] PCI/MSI: Move pci_msix_vec_count() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Move pci_msix_vec_count() and make its kernel-doc comprehensive. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 20 drivers/pci/msi/msi.c | 20 2 files changed, 20 insertions(+), 20 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index 2ff2a9ccfc47..83ea38ffa116 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -60,6 +60,26 @@ void pci_disable_msi(struct pci_dev *dev) EXPORT_SYMBOL(pci_disable_msi); /** + * pci_msix_vec_count() - Get number of MSI-X interrupt vectors on device + * @dev: the PCI device to operate on + * + * Return: number of MSI-X interrupt vectors available on this device + * (i.e., the device's MSI-X capability structure "table size"), -EINVAL + * if the device is not MSI-X capable, other errnos otherwise. + */ +int pci_msix_vec_count(struct pci_dev *dev) +{ + u16 control; + + if (!dev->msix_cap) + return -EINVAL; + + pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, ); + return msix_table_size(control); +} +EXPORT_SYMBOL(pci_msix_vec_count); + +/** * pci_enable_msix_range() - Enable MSI-X interrupt mode on device * @dev: the PCI device to operate on * @entries: input/output parameter, array of MSI-X configuration entries diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c index ed8caf5ac99f..1226d66da992 100644 --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -701,26 +701,6 @@ void pci_msi_shutdown(struct pci_dev *dev) pcibios_alloc_irq(dev); } -/** - * pci_msix_vec_count - return the number of device's MSI-X table entries - * @dev: pointer to the pci_dev data structure of MSI-X device function - * This function returns the number of device's MSI-X table entries and - * therefore the number of MSI-X vectors device is capable of sending. - * It returns a negative errno if the device is not capable of sending MSI-X - * interrupts. - **/ -int pci_msix_vec_count(struct pci_dev *dev) -{ - u16 control; - - if (!dev->msix_cap) - return -EINVAL; - - pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, ); - return msix_table_size(control); -} -EXPORT_SYMBOL(pci_msix_vec_count); - static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec, struct irq_affinity *affd, int flags) {
[patch 35/39] PCI/MSI: Reject MSI-X early
Similar to PCI multi-MSI reject MSI-X enablement when a irq domain is attached to the device which does not support MSI-X. Signed-off-by: Thomas Gleixner --- drivers/pci/msi/msi.c |4 1 file changed, 4 insertions(+) --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -760,6 +760,10 @@ int __pci_enable_msix_range(struct pci_d if (WARN_ON_ONCE(dev->msix_enabled)) return -EINVAL; + /* Check MSI-X early on irq domain enabled architectures */ + if (!pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX, ALLOW_LEGACY)) + return -ENOTSUPP; + if (!pci_msi_supported(dev, nvec) || dev->current_state != PCI_D0) return -EINVAL;
[patch 22/39] PCI/MSI: Move pci_alloc_irq_vectors() to api.c
From: Ahmed S. Darwish To distangle the maze in msi.c, all exported device-driver MSI APIs are now to be grouped in one file, api.c. Make pci_alloc_irq_vectors() a real function instead of wrapper and add proper kernel doc to it. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/api.c | 33 + include/linux/pci.h | 17 + 2 files changed, 42 insertions(+), 8 deletions(-) --- diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index d48050555d55..1714905943fb 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -90,3 +90,36 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0); } EXPORT_SYMBOL(pci_enable_msix_range); + +/** + * pci_alloc_irq_vectors() - Allocate multiple device interrupt vectors + * @dev: the PCI device to operate on + * @min_vecs: minimum required number of vectors (must be >= 1) + * @max_vecs: maximum desired number of vectors + * @flags:One or more of: + *%PCI_IRQ_MSIX Allow trying MSI-X vector allocations + *%PCI_IRQ_MSI Allow trying MSI vector allocations + *%PCI_IRQ_LEGACYAllow trying legacy INTx interrupts, if + * and only if @min_vecs == 1 + *%PCI_IRQ_AFFINITY Auto-manage IRQs affinity by spreading + * the vectors around available CPUs + * + * Allocate up to @max_vecs interrupt vectors on device. MSI-X irq + * vector allocation has a higher precedence over plain MSI, which has a + * higher precedence over legacy INTx emulation. + * + * Upon a successful allocation, the caller should use pci_irq_vector() + * to get the Linux IRQ number to be passed to request_threaded_irq(). + * The driver must call pci_free_irq_vectors() on cleanup. + * + * Return: number of allocated vectors (which might be smaller than + * @max_vecs), -ENOSPC if less than @min_vecs interrupt vectors are + * available, other errnos otherwise. + */ +int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, + unsigned int max_vecs, unsigned int flags) +{ + return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, + flags, NULL); +} +EXPORT_SYMBOL(pci_alloc_irq_vectors); diff --git a/include/linux/pci.h b/include/linux/pci.h index 2bda4a4e47e8..6a356a39ba94 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1553,6 +1553,8 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev, return rc; return 0; } +int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, + unsigned int max_vecs, unsigned int flags); int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, unsigned int max_vecs, unsigned int flags, struct irq_affinity *affd); @@ -1586,6 +1588,13 @@ pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, return 1; return -ENOSPC; } +static inline int +pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, + unsigned int max_vecs, unsigned int flags) +{ + return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, + flags, NULL); +} static inline void pci_free_irq_vectors(struct pci_dev *dev) { @@ -1900,14 +1909,6 @@ pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, } #endif /* CONFIG_PCI */ -static inline int -pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, - unsigned int max_vecs, unsigned int flags) -{ - return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags, - NULL); -} - /* Include architecture-dependent settings and functions */ #include
[patch 17/39] PCI/MSI: Get rid of externs in msi.h
From: Ahmed S. Darwish Follow the style of Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/msi.h |8 1 file changed, 4 insertions(+), 4 deletions(-) --- a/drivers/pci/msi/msi.h +++ b/drivers/pci/msi/msi.h @@ -5,12 +5,12 @@ #define msix_table_size(flags) ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) -extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); -extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev); +int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); +void pci_msi_teardown_msi_irqs(struct pci_dev *dev); #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS -extern int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); -extern void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev); +int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); +void pci_msi_legacy_teardown_msi_irqs(struct pci_dev *dev); #else static inline int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) {
[patch 37/39] PCI/MSI: Remove redundant msi_check() callback
All these sanity checks are now done _before_ any allocation work happens. No point in doing it twice. Signed-off-by: Thomas Gleixner --- drivers/pci/msi/irqdomain.c | 48 1 file changed, 48 deletions(-) --- a/drivers/pci/msi/irqdomain.c +++ b/drivers/pci/msi/irqdomain.c @@ -64,51 +64,6 @@ static irq_hw_number_t pci_msi_domain_ca (pci_domain_nr(dev->bus) & 0x) << 27; } -static inline bool pci_msi_desc_is_multi_msi(struct msi_desc *desc) -{ - return !desc->pci.msi_attrib.is_msix && desc->nvec_used > 1; -} - -/** - * pci_msi_domain_check_cap - Verify that @domain supports the capabilities - * for @dev - * @domain:The interrupt domain to check - * @info: The domain info for verification - * @dev: The device to check - * - * Returns: - * 0 if the functionality is supported - * 1 if Multi MSI is requested, but the domain does not support it - * -ENOTSUPP otherwise - */ -static int pci_msi_domain_check_cap(struct irq_domain *domain, - struct msi_domain_info *info, - struct device *dev) -{ - struct msi_desc *desc = msi_first_desc(dev, MSI_DESC_ALL); - - /* Special handling to support __pci_enable_msi_range() */ - if (pci_msi_desc_is_multi_msi(desc) && - !(info->flags & MSI_FLAG_MULTI_PCI_MSI)) - return 1; - - if (desc->pci.msi_attrib.is_msix) { - if (!(info->flags & MSI_FLAG_PCI_MSIX)) - return -ENOTSUPP; - - if (info->flags & MSI_FLAG_MSIX_CONTIGUOUS) { - unsigned int idx = 0; - - /* Check for gaps in the entry indices */ - msi_for_each_desc(desc, dev, MSI_DESC_ALL) { - if (desc->msi_index != idx++) - return -ENOTSUPP; - } - } - } - return 0; -} - static void pci_msi_domain_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc) { @@ -118,7 +73,6 @@ static void pci_msi_domain_set_desc(msi_ static struct msi_domain_ops pci_msi_domain_ops_default = { .set_desc = pci_msi_domain_set_desc, - .msi_check = pci_msi_domain_check_cap, }; static void pci_msi_domain_update_dom_ops(struct msi_domain_info *info) @@ -130,8 +84,6 @@ static void pci_msi_domain_update_dom_op } else { if (ops->set_desc == NULL) ops->set_desc = pci_msi_domain_set_desc; - if (ops->msi_check == NULL) - ops->msi_check = pci_msi_domain_check_cap; } }
[patch 02/39] iommu/vt-d: Remove bogus check for multi MSI-X
PCI/Multi-MSI is MSI specific and not supported for MSI-X. Signed-off-by: Thomas Gleixner --- drivers/iommu/intel/irq_remapping.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -1334,8 +1334,7 @@ static int intel_irq_remapping_alloc(str if (!info || !iommu) return -EINVAL; - if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI && - info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX) + if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI) return -EINVAL; /*
[patch 16/39] genirq: Get rid of GENERIC_MSI_IRQ_DOMAIN
Adjust to reality and remove another layer of pointless Kconfig indirection. CONFIG_GENERIC_MSI_IRQ is good enough to serve all purposes. Signed-off-by: Thomas Gleixner --- drivers/base/Makefile |2 +- drivers/bus/fsl-mc/Kconfig |2 +- drivers/dma/Kconfig |2 +- drivers/dma/qcom/hidma.c|8 drivers/iommu/Kconfig |2 +- drivers/irqchip/Kconfig |6 +++--- drivers/mailbox/Kconfig |2 +- drivers/pci/Kconfig |1 - drivers/perf/Kconfig|2 +- drivers/soc/ti/Kconfig |2 +- include/asm-generic/msi.h |4 ++-- include/linux/device.h |8 +++- include/linux/gpio/driver.h |2 +- include/linux/msi.h | 10 ++ kernel/irq/Kconfig |7 +-- kernel/irq/msi.c|3 --- 16 files changed, 23 insertions(+), 40 deletions(-) --- a/drivers/base/Makefile +++ b/drivers/base/Makefile @@ -22,7 +22,7 @@ obj-$(CONFIG_REGMAP) += regmap/ obj-$(CONFIG_SOC_BUS) += soc.o obj-$(CONFIG_PINCTRL) += pinctrl.o obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o -obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o +obj-$(CONFIG_GENERIC_MSI_IRQ) += platform-msi.o obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o obj-$(CONFIG_GENERIC_ARCH_NUMA) += arch_numa.o obj-$(CONFIG_ACPI) += physical_location.o --- a/drivers/bus/fsl-mc/Kconfig +++ b/drivers/bus/fsl-mc/Kconfig @@ -8,7 +8,7 @@ config FSL_MC_BUS bool "QorIQ DPAA2 fsl-mc bus driver" depends on OF && (ARCH_LAYERSCAPE || (COMPILE_TEST && (ARM || ARM64 || X86_LOCAL_APIC || PPC))) - select GENERIC_MSI_IRQ_DOMAIN + select GENERIC_MSI_IRQ help Driver to enable the bus infrastructure for the QorIQ DPAA2 architecture. The fsl-mc bus driver handles discovery of --- a/drivers/dma/Kconfig +++ b/drivers/dma/Kconfig @@ -462,7 +462,7 @@ config MV_XOR_V2 select DMA_ENGINE select DMA_ENGINE_RAID select ASYNC_TX_ENABLE_CHANNEL_SWITCH - select GENERIC_MSI_IRQ_DOMAIN + select GENERIC_MSI_IRQ help Enable support for the Marvell version 2 XOR engine. --- a/drivers/dma/qcom/hidma.c +++ b/drivers/dma/qcom/hidma.c @@ -610,7 +610,7 @@ static irqreturn_t hidma_chirq_handler(i return hidma_ll_inthandler(chirq, lldev); } -#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN +#ifdef CONFIG_GENERIC_MSI_IRQ static irqreturn_t hidma_chirq_handler_msi(int chirq, void *arg) { struct hidma_lldev **lldevp = arg; @@ -671,7 +671,7 @@ static int hidma_sysfs_init(struct hidma return device_create_file(dev->ddev.dev, dev->chid_attrs); } -#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN +#ifdef CONFIG_GENERIC_MSI_IRQ static void hidma_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg) { struct device *dev = msi_desc_to_dev(desc); @@ -687,7 +687,7 @@ static void hidma_write_msi_msg(struct m static void hidma_free_msis(struct hidma_dev *dmadev) { -#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN +#ifdef CONFIG_GENERIC_MSI_IRQ struct device *dev = dmadev->ddev.dev; int i, virq; @@ -704,7 +704,7 @@ static void hidma_free_msis(struct hidma static int hidma_request_msi(struct hidma_dev *dmadev, struct platform_device *pdev) { -#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN +#ifdef CONFIG_GENERIC_MSI_IRQ int rc, i, virq; rc = platform_msi_domain_alloc_irqs(>dev, HIDMA_MSI_INTS, --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -389,7 +389,7 @@ config ARM_SMMU_V3 depends on ARM64 select IOMMU_API select IOMMU_IO_PGTABLE_LPAE - select GENERIC_MSI_IRQ_DOMAIN + select GENERIC_MSI_IRQ help Support for implementations of the ARM System MMU architecture version 3 providing translation support to a PCIe root complex. --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -38,7 +38,7 @@ config ARM_GIC_V3 config ARM_GIC_V3_ITS bool - select GENERIC_MSI_IRQ_DOMAIN + select GENERIC_MSI_IRQ default ARM_GIC_V3 config ARM_GIC_V3_ITS_PCI @@ -375,7 +375,7 @@ config MVEBU_ICU config MVEBU_ODMI bool - select GENERIC_MSI_IRQ_DOMAIN + select GENERIC_MSI_IRQ config MVEBU_PIC bool @@ -488,7 +488,7 @@ config IMX_MU_MSI default m if ARCH_MXC select IRQ_DOMAIN select IRQ_DOMAIN_HIERARCHY - select GENERIC_MSI_IRQ_DOMAIN + select GENERIC_MSI_IRQ help Provide a driver for the i.MX Messaging Unit block used as a CPU-to-CPU MSI controller. This requires a specially crafted DT --- a/drivers/mailbox/Kconfig +++ b/drivers/mailbox/Kconfig @@ -223,7 +223,7 @@ config BCM_FLEXRM_MBOX tristate "Broadcom FlexRM Mailbox" depends on ARM64 depends on ARCH_BCM_IPROC || COMPILE_TEST - select GENERIC_MSI_IRQ_DOMAIN + select GENERIC_MSI_IRQ
[patch 10/39] genirq/msi: Make __msi_domain_free_irqs() static
Now that the last user is gone, confine it to the core code. Signed-off-by: Thomas Gleixner --- include/linux/msi.h |4 kernel/irq/msi.c|3 ++- 2 files changed, 2 insertions(+), 5 deletions(-) --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -335,9 +335,6 @@ struct msi_domain_info; * are set to the default implementation if NULL and even when * MSI_FLAG_USE_DEF_DOM_OPS is not set to avoid breaking existing users and * because these callbacks are obviously mandatory. - * - * __msi_domain_free_irqs() is exposed for PPC pseries to handle extra - * work after all interrupts and descriptors have been freed. */ struct msi_domain_ops { irq_hw_number_t (*get_hwirq)(struct msi_domain_info *info, @@ -432,7 +429,6 @@ int msi_domain_alloc_irqs_descs_locked(s int nvec); int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec); -void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev); void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev); void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev); struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain); --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -463,6 +463,7 @@ static inline void msi_sysfs_remove_desc #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN static int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec); +static void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev); static inline void irq_chip_write_msi_msg(struct irq_data *data, struct msi_msg *msg) @@ -978,7 +979,7 @@ int msi_domain_alloc_irqs(struct irq_dom return ret; } -void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev) +static void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev) { struct msi_domain_info *info = domain->host_data; struct irq_data *irqd;
[patch 13/39] PCI/MSI: Use msi_domain_info::bus_token
From: Ahmed S. Darwish Set the bus token in the msi_domain_info structure and let the core code handle the update. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/irqdomain.c | 11 +++ 1 file changed, 3 insertions(+), 8 deletions(-) --- a/drivers/pci/msi/irqdomain.c +++ b/drivers/pci/msi/irqdomain.c @@ -162,8 +162,6 @@ struct irq_domain *pci_msi_create_irq_do struct msi_domain_info *info, struct irq_domain *parent) { - struct irq_domain *domain; - if (WARN_ON(info->flags & MSI_FLAG_LEVEL_CAPABLE)) info->flags &= ~MSI_FLAG_LEVEL_CAPABLE; @@ -178,13 +176,10 @@ struct irq_domain *pci_msi_create_irq_do /* PCI-MSI is oneshot-safe */ info->chip->flags |= IRQCHIP_ONESHOT_SAFE; + /* Let the core update the bus token */ + info->bus_token = DOMAIN_BUS_PCI_MSI; - domain = msi_create_irq_domain(fwnode, info, parent); - if (!domain) - return NULL; - - irq_domain_update_bus_token(domain, DOMAIN_BUS_PCI_MSI); - return domain; + return msi_create_irq_domain(fwnode, info, parent); } EXPORT_SYMBOL_GPL(pci_msi_create_irq_domain);
[patch 00/39] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 1 cleanups
Hi! This is a three part series which provides support for per device MSI interrupt domains. This solves conceptual problems of the current PCI/MSI design which are in the way of providing support for PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device. IMS (Interrupt Message Store] is a new specification which allows device manufactures to provide implementaton defined storage for MSI messages contrary to the uniform and specification defined storage mechanisms for PCI/MSI and PCI/MSI-X. IMS not only allows to overcome the size limitions of the MSI-X table, but also gives the device manufacturer the freedom to store the message in arbitrary places, even in host memory which is shared with the device. There have been several attempts to glue this into the current MSI code, but after lengthy discussions in various threads: https://lore.kernel.org/all/20211126230957.239391...@linutronix.de https://lore.kernel.org/all/mwhpr11mb188603d0d809c1079f5817dc8c...@mwhpr11mb1886.namprd11.prod.outlook.com https://lore.kernel.org/all/160408357912.912050.17005584526266191420.st...@djiang5-desk3.ch.intel.com it turned out that there is a fundamental design problem in the current PCI/MSI-X implementation. This needs some historical background. When PCI/MSI[-X] support was added around 2003 interrupt management was completely different from what we have today in the actively developed architectures. Interrupt management was completely architecture specific and while there were attempts to create common infrastructure the commonalities were rudimental and just providing shared data structures and interfaces so that drivers could be written in an architecture agnostic way. The initial PCI/MSI[-X] support obviously plugged into this model which resulted in some basic shared infrastructure in the PCI core code for setting up MSI descriptors, which are a pure software construct for holding data relevant for a particular MSI interrupt, but the actual association to Linux interrupts was completely architecture specific. This model is still supported today to keep museum architectures and notorious stranglers alive. In 2013 Intel tried to add support for hotplugable IO/APICs to the kernel which was creating yet another architecture specific mechanism and resulted in an unholy mess on top of the existing horrors of x86 interrupt handling. The x86 interrupt management code was already an uncomprehensible maze of indirections between the CPU vector management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X] implementation. At roughly the same time ARM struggled with the ever growing SoC specific extensions which were glued on top of the architected GIC interrupt controller. This resulted in a fundamental redesign of interrupt management and provided the today prevailing concept of hierarchical interrupt domains. This allows to disentangle the interactions between x86 vector domain and interrupt remapping and also allowed ARM to handle the zoo of SoC specific interrupt components in a sane way. The concept of hierarchical interrupt domains aims to encapsulate the functionality of particual IP blocks which are involved in interrupt delivery so that they become extensible and pluggable. The X86 encapsulation looks like this: |--- device 1 [Vector]---[Remapping]---[PCI/MSI]--|... |--- device N where the remapping domain is an optional component and in case that it is not available the PCI/MSI[-X] domains have the vector domain as their parent. This reduced the required interaction between the domains pretty much to the initialization phase where it is obviously required to establish the proper parent relation ship in the components of the hierarchy. While in most cases the model is strictly representing the chain of IP blocks and abstracting them so they can be plugged together to form a hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware it's clear that the actual PCI/MSI[-X] interrupt controller is not a global entity, but strict a per PCI device entity. Here we took a short cut on the hierarchical model and went for the easy solution of providing "global" PCI/MSI domains which was possible because the PCI/MSI[-X] handling is uniform accross the devices. This also allowed to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in turn made it simple to keep the existing architecture specific management alive. A similar problem was created in the ARM world with support for IP block specific message storage. Instead of going all the way to stack a IP block specific domain on top of the generic MSI domain this ended in a construct which provides a "global" platform MSI domain which allows overriding the irq_write_msi_msg() callback per allocation. In course of the lengthy discussions we identified other abuse of the MSI infrastructure in wireless drivers, NTB etc.
[patch 05/39] genirq/msi: Remove filter from msi_free_descs_free_range()
When a range of descriptors is freed then all of them are not associated to a linux interrupt. Remove the filter and add a warning to the free function. Signed-off-by: Thomas Gleixner --- drivers/base/platform-msi.c |2 +- include/linux/msi.h |5 ++--- kernel/irq/msi.c| 19 ++- 3 files changed, 13 insertions(+), 13 deletions(-) --- a/drivers/base/platform-msi.c +++ b/drivers/base/platform-msi.c @@ -325,7 +325,7 @@ void platform_msi_device_domain_free(str msi_lock_descs(data->dev); irq_domain_free_irqs_common(domain, virq, nr_irqs); - msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, virq, virq + nr_irqs - 1); + msi_free_msi_descs_range(data->dev, virq, virq + nr_irqs - 1); msi_unlock_descs(data->dev); } --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -247,8 +247,7 @@ static inline void pci_write_msi_msg(uns #endif /* CONFIG_PCI_MSI */ int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc); -void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter, - unsigned int first_index, unsigned int last_index); +void msi_free_msi_descs_range(struct device *dev, unsigned int first_index, unsigned int last_index); /** * msi_free_msi_descs - Free MSI descriptors of a device @@ -256,7 +255,7 @@ void msi_free_msi_descs_range(struct dev */ static inline void msi_free_msi_descs(struct device *dev) { - msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, MSI_MAX_INDEX); + msi_free_msi_descs_range(dev, 0, MSI_MAX_INDEX); } void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg); --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -120,7 +120,7 @@ static int msi_add_simple_msi_descs(stru fail_mem: ret = -ENOMEM; fail: - msi_free_msi_descs_range(dev, MSI_DESC_ALL, index, last); + msi_free_msi_descs_range(dev, index, last); return ret; } @@ -141,12 +141,11 @@ static bool msi_desc_match(struct msi_de /** * msi_free_msi_descs_range - Free MSI descriptors of a device * @dev: Device to free the descriptors - * @filter:Descriptor state filter * @first_index: Index to start freeing from * @last_index:Last index to be freed */ -void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter, - unsigned int first_index, unsigned int last_index) +void msi_free_msi_descs_range(struct device *dev, unsigned int first_index, + unsigned int last_index) { struct xarray *xa = >msi.data->__store; struct msi_desc *desc; @@ -155,10 +154,12 @@ void msi_free_msi_descs_range(struct dev lockdep_assert_held(>msi.data->mutex); xa_for_each_range(xa, idx, desc, first_index, last_index) { - if (msi_desc_match(desc, filter)) { - xa_erase(xa, idx); - msi_free_desc(desc); - } + xa_erase(xa, idx); + + /* Leak the descriptor when it is still referenced */ + if (WARN_ON_ONCE(msi_desc_match(desc, MSI_DESC_ASSOCIATED))) + continue; + msi_free_desc(desc); } } @@ -739,7 +740,7 @@ int msi_domain_populate_irqs(struct irq_ fail: for (--virq; virq >= virq_base; virq--) irq_domain_free_irqs_common(domain, virq, 1); - msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, virq_base + nvec - 1); + msi_free_msi_descs_range(dev, virq_base, virq_base + nvec - 1); unlock: msi_unlock_descs(dev); return ret;
[patch 06/39] genirq/msi: Add missing kernel doc to msi_next_desc()
W=1 complains about this. Signed-off-by: Thomas Gleixner --- kernel/irq/msi.c |1 + 1 file changed, 1 insertion(+) --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -282,6 +282,7 @@ EXPORT_SYMBOL_GPL(msi_first_desc); /** * msi_next_desc - Get the next MSI descriptor of a device * @dev: Device to operate on + * @filter:Descriptor state filter * * The first invocation of msi_next_desc() has to be preceeded by a * successful invocation of __msi_first_desc(). Consecutive invocations are
[patch 11/39] genirq/irqdomain: Move bus token enum into a seperate header
Split the bus token defines out into a seperate header file to avoid inclusion of irqdomain.h in msi.h. Signed-off-by: Thomas Gleixner --- include/linux/irqdomain.h | 22 +- include/linux/irqdomain_defs.h | 26 ++ 2 files changed, 27 insertions(+), 21 deletions(-) --- a/include/linux/irqdomain.h +++ b/include/linux/irqdomain.h @@ -31,6 +31,7 @@ #define _LINUX_IRQDOMAIN_H #include +#include #include #include #include @@ -68,27 +69,6 @@ struct irq_fwspec { void of_phandle_args_to_fwspec(struct device_node *np, const u32 *args, unsigned int count, struct irq_fwspec *fwspec); -/* - * Should several domains have the same device node, but serve - * different purposes (for example one domain is for PCI/MSI, and the - * other for wired IRQs), they can be distinguished using a - * bus-specific token. Most domains are expected to only carry - * DOMAIN_BUS_ANY. - */ -enum irq_domain_bus_token { - DOMAIN_BUS_ANY = 0, - DOMAIN_BUS_WIRED, - DOMAIN_BUS_GENERIC_MSI, - DOMAIN_BUS_PCI_MSI, - DOMAIN_BUS_PLATFORM_MSI, - DOMAIN_BUS_NEXUS, - DOMAIN_BUS_IPI, - DOMAIN_BUS_FSL_MC_MSI, - DOMAIN_BUS_TI_SCI_INTA_MSI, - DOMAIN_BUS_WAKEUP, - DOMAIN_BUS_VMD_MSI, -}; - /** * struct irq_domain_ops - Methods for irq_domain objects * @match: Match an interrupt controller device node to a host, returns --- /dev/null +++ b/include/linux/irqdomain_defs.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_IRQDOMAIN_DEFS_H +#define _LINUX_IRQDOMAIN_DEFS_H + +/* + * Should several domains have the same device node, but serve + * different purposes (for example one domain is for PCI/MSI, and the + * other for wired IRQs), they can be distinguished using a + * bus-specific token. Most domains are expected to only carry + * DOMAIN_BUS_ANY. + */ +enum irq_domain_bus_token { + DOMAIN_BUS_ANY = 0, + DOMAIN_BUS_WIRED, + DOMAIN_BUS_GENERIC_MSI, + DOMAIN_BUS_PCI_MSI, + DOMAIN_BUS_PLATFORM_MSI, + DOMAIN_BUS_NEXUS, + DOMAIN_BUS_IPI, + DOMAIN_BUS_FSL_MC_MSI, + DOMAIN_BUS_TI_SCI_INTA_MSI, + DOMAIN_BUS_WAKEUP, + DOMAIN_BUS_VMD_MSI, +}; + +#endif /* _LINUX_IRQDOMAIN_DEFS_H */
[patch 01/39] PCI/MSI: Check for MSI enabled in __pci_msix_enable()
PCI/MSI and PCI/MSI-X are mutually exclusive, but the MSI-X enable code lacks a check for already enabled MSI. Signed-off-by: Thomas Gleixner --- drivers/pci/msi/msi.c |5 + 1 file changed, 5 insertions(+) --- a/drivers/pci/msi/msi.c +++ b/drivers/pci/msi/msi.c @@ -935,6 +935,11 @@ static int __pci_enable_msix_range(struc if (maxvec < minvec) return -ERANGE; + if (dev->msi_enabled) { + pci_info(dev, "can't enable MSI-X (MSI already enabled)\n"); + return -EINVAL; + } + if (WARN_ON_ONCE(dev->msix_enabled)) return -EINVAL;
[patch 07/39] genirq/msi: Make __msi_domain_alloc_irqs() static
Nothing outside of the core code requires this. Signed-off-by: Thomas Gleixner --- include/linux/msi.h |7 ++- kernel/irq/msi.c|6 -- 2 files changed, 6 insertions(+), 7 deletions(-) --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -334,9 +334,8 @@ struct msi_domain_info; * MSI_FLAG_USE_DEF_DOM_OPS is not set to avoid breaking existing users and * because these callbacks are obviously mandatory. * - * This is NOT meant to be abused, but it can be useful to build wrappers - * for specialized MSI irq domains which need extra work before and after - * calling __msi_domain_alloc_irqs()/__msi_domain_free_irqs(). + * __msi_domain_free_irqs() is exposed for PPC pseries to handle extra + * work after all interrupts and descriptors have been freed. */ struct msi_domain_ops { irq_hw_number_t (*get_hwirq)(struct msi_domain_info *info, @@ -425,8 +424,6 @@ int msi_domain_set_affinity(struct irq_d struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode, struct msi_domain_info *info, struct irq_domain *parent); -int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, - int nvec); int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev, int nvec); int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -462,6 +462,8 @@ static inline void msi_sysfs_remove_desc #endif /* !CONFIG_SYSFS */ #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN +static int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec); + static inline void irq_chip_write_msi_msg(struct irq_data *data, struct msi_msg *msg) { @@ -852,8 +854,8 @@ static int msi_init_virq(struct irq_doma return 0; } -int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, - int nvec) +static int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, + int nvec) { struct msi_domain_info *info = domain->host_data; struct msi_domain_ops *ops = info->ops;
[patch 04/39] genirq/msi: Use MSI_DESC_ALL in msi_add_simple_msi_descs()
There are no associated MSI descriptors in the requested range when the MSI descriptor allocation fails. Use MSI_DESC_ALL as the filter which prepares the next step to get rid of the filter for freeing. Signed-off-by: Thomas Gleixner --- kernel/irq/msi.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -120,7 +120,7 @@ static int msi_add_simple_msi_descs(stru fail_mem: ret = -ENOMEM; fail: - msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, index, last); + msi_free_msi_descs_range(dev, MSI_DESC_ALL, index, last); return ret; }
[patch 14/39] PCI/MSI: Let the MSI core free descriptors
From: Ahmed S. Darwish Let the core do the freeing of descriptors and just keep it around for the legacy case. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- drivers/pci/msi/irqdomain.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) --- a/drivers/pci/msi/irqdomain.c +++ b/drivers/pci/msi/irqdomain.c @@ -24,11 +24,12 @@ void pci_msi_teardown_msi_irqs(struct pc struct irq_domain *domain; domain = dev_get_msi_domain(>dev); - if (domain && irq_domain_is_hierarchy(domain)) + if (domain && irq_domain_is_hierarchy(domain)) { msi_domain_free_irqs_descs_locked(domain, >dev); - else + } else { pci_msi_legacy_teardown_msi_irqs(dev); - msi_free_msi_descs(>dev); + msi_free_msi_descs(>dev); + } } /** @@ -170,6 +171,9 @@ struct irq_domain *pci_msi_create_irq_do if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS) pci_msi_domain_update_chip_ops(info); + /* Let the core code free MSI descriptors when freeing interrupts */ + info->flags |= MSI_FLAG_FREE_MSI_DESCS; + info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS; if (IS_ENABLED(CONFIG_GENERIC_IRQ_RESERVATION_MODE)) info->flags |= MSI_FLAG_MUST_REACTIVATE;
[patch 15/39] PCI/MSI: Get rid of PCI_MSI_IRQ_DOMAIN
What a zoo: PCI_MSI select GENERIC_MSI_IRQ PCI_MSI_IRQ_DOMAIN def_bool y depends on PCI_MSI select GENERIC_MSI_IRQ_DOMAIN Ergo PCI_MSI enables PCI_MSI_IRQ_DOMAIN which in turn selects GENERIC_MSI_IRQ_DOMAIN. So all the dependencies on PCI_MSI_IRQ_DOMAIN are just an indirection to PCI_MSI. Match the reality and just admit that PCI_MSI requires GENERIC_MSI_IRQ_DOMAIN. Signed-off-by: Thomas Gleixner --- arch/um/drivers/Kconfig |1 arch/um/include/asm/pci.h |2 - arch/x86/Kconfig|1 arch/x86/include/asm/pci.h |4 +- drivers/pci/Kconfig |8 + drivers/pci/controller/Kconfig | 30 +--- drivers/pci/controller/dwc/Kconfig | 48 drivers/pci/controller/mobiveil/Kconfig |6 ++-- drivers/pci/msi/Makefile|2 - drivers/pci/probe.c |2 - include/linux/msi.h | 32 ++--- 11 files changed, 56 insertions(+), 80 deletions(-) --- a/arch/um/drivers/Kconfig +++ b/arch/um/drivers/Kconfig @@ -381,7 +381,6 @@ config UML_PCI_OVER_VIRTIO select UML_IOMEM_EMULATION select UML_DMA_EMULATION select PCI_MSI - select PCI_MSI_IRQ_DOMAIN select PCI_LOCKLESS_CONFIG config UML_PCI_OVER_VIRTIO_DEVICE_ID --- a/arch/um/include/asm/pci.h +++ b/arch/um/include/asm/pci.h @@ -7,7 +7,7 @@ /* Generic PCI */ #include -#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN +#ifdef CONFIG_PCI_MSI /* * This is a bit of an annoying hack, and it assumes we only have * the virt-pci (if anything). Which is true, but still. --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1109,7 +1109,6 @@ config X86_LOCAL_APIC def_bool y depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI select IRQ_DOMAIN_HIERARCHY - select PCI_MSI_IRQ_DOMAIN if PCI_MSI config X86_IO_APIC def_bool y --- a/arch/x86/include/asm/pci.h +++ b/arch/x86/include/asm/pci.h @@ -21,7 +21,7 @@ struct pci_sysdata { #ifdef CONFIG_X86_64 void*iommu; /* IOMMU private data */ #endif -#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN +#ifdef CONFIG_PCI_MSI void*fwnode;/* IRQ domain for MSI assignment */ #endif #if IS_ENABLED(CONFIG_VMD) @@ -52,7 +52,7 @@ static inline int pci_proc_domain(struct } #endif -#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN +#ifdef CONFIG_PCI_MSI static inline void *_pci_root_bus_fwnode(struct pci_bus *bus) { return to_pci_sysdata(bus)->fwnode; --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -38,6 +38,7 @@ source "drivers/pci/pcie/Kconfig" config PCI_MSI bool "Message Signaled Interrupts (MSI and MSI-X)" + select GENERIC_MSI_IRQ_DOMAIN select GENERIC_MSI_IRQ help This allows device drivers to enable MSI (Message Signaled @@ -51,11 +52,6 @@ config PCI_MSI If you don't know what to do here, say Y. -config PCI_MSI_IRQ_DOMAIN - def_bool y - depends on PCI_MSI - select GENERIC_MSI_IRQ_DOMAIN - config PCI_MSI_ARCH_FALLBACKS bool @@ -192,7 +188,7 @@ config PCI_LABEL config PCI_HYPERV tristate "Hyper-V PCI Frontend" - depends on ((X86 && X86_64) || ARM64) && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && SYSFS + depends on ((X86 && X86_64) || ARM64) && HYPERV && PCI_MSI && SYSFS select PCI_HYPERV_INTERFACE help The PCI device frontend driver allows the kernel to import arbitrary --- a/drivers/pci/controller/Kconfig +++ b/drivers/pci/controller/Kconfig @@ -19,7 +19,7 @@ config PCI_AARDVARK tristate "Aardvark PCIe controller" depends on (ARCH_MVEBU && ARM64) || COMPILE_TEST depends on OF - depends on PCI_MSI_IRQ_DOMAIN + depends on PCI_MSI select PCI_BRIDGE_EMUL help Add support for Aardvark 64bit PCIe Host Controller. This @@ -29,7 +29,7 @@ config PCI_AARDVARK config PCIE_XILINX_NWL bool "NWL PCIe Core" depends on ARCH_ZYNQMP || COMPILE_TEST - depends on PCI_MSI_IRQ_DOMAIN + depends on PCI_MSI help Say 'Y' here if you want kernel support for Xilinx NWL PCIe controller. The controller can act as Root Port @@ -53,7 +53,7 @@ config PCI_IXP4XX config PCI_TEGRA bool "NVIDIA Tegra PCIe controller" depends on ARCH_TEGRA || COMPILE_TEST - depends on PCI_MSI_IRQ_DOMAIN + depends on PCI_MSI help Say Y here if you want support for the PCIe host controller found on NVIDIA Tegra SoCs. @@ -70,7 +70,7 @@ config PCI_RCAR_GEN2 config PCIE_RCAR_HOST bool "Renesas R-Car PCIe host controller"
[patch 09/39] powerpc/pseries/msi: Use msi_domain_ops::msi_post_free()
Use the new msi_post_free() callback which is invoked after the interrupts have been freed to tell the hypervisor about the shutdown. This allows to remove the exposure of __msi_domain_free_irqs(). Signed-off-by: Thomas Gleixner Cc: Michael Ellerman Cc: Christophe Leroy Cc: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/platforms/pseries/msi.c |7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) --- a/arch/powerpc/platforms/pseries/msi.c +++ b/arch/powerpc/platforms/pseries/msi.c @@ -447,21 +447,18 @@ static void pseries_msi_ops_msi_free(str * RTAS can not disable one MSI at a time. It's all or nothing. Do it * at the end after all IRQs have been freed. */ -static void pseries_msi_domain_free_irqs(struct irq_domain *domain, -struct device *dev) +static void pseries_msi_post_free(struct irq_domain *domain, struct device *dev) { if (WARN_ON_ONCE(!dev_is_pci(dev))) return; - __msi_domain_free_irqs(domain, dev); - rtas_disable_msi(to_pci_dev(dev)); } static struct msi_domain_ops pseries_pci_msi_domain_ops = { .msi_prepare= pseries_msi_ops_prepare, .msi_free = pseries_msi_ops_msi_free, - .domain_free_irqs = pseries_msi_domain_free_irqs, + .msi_post_free = pseries_msi_post_free, }; static void pseries_msi_shutdown(struct irq_data *d)
[patch 08/39] genirq/msi: Provide msi_domain_ops::post_free()
To prepare for removing the exposure of __msi_domain_free_irqs() provide a post_free() callback in the MSI domain ops which can be used to solve the problem of the only user of __msi_domain_free_irqs() in arch/powerpc. Signed-off-by: Thomas Gleixner --- include/linux/msi.h |4 kernel/irq/msi.c|2 ++ 2 files changed, 6 insertions(+) --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -315,6 +315,8 @@ struct msi_domain_info; * function. * @domain_free_irqs: Optional function to override the default free * function. + * @msi_post_free: Optional function which is invoked after freeing + * all interrupts. * * @get_hwirq, @msi_init and @msi_free are callbacks used by the underlying * irqdomain. @@ -359,6 +361,8 @@ struct msi_domain_ops { struct device *dev, int nvec); void(*domain_free_irqs)(struct irq_domain *domain, struct device *dev); + void(*msi_post_free)(struct irq_domain *domain, +struct device *dev); }; /** --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -1026,6 +1026,8 @@ void msi_domain_free_irqs_descs_locked(s lockdep_assert_held(>msi.data->mutex); ops->domain_free_irqs(domain, dev); + if (ops->msi_post_free) + ops->msi_post_free(domain, dev); msi_domain_free_msi_descs(info, dev); }
[patch 12/39] genirq/msi: Add bus token to struct msi_domain_info
From: Ahmed S. Darwish Add a bus token member to struct msi_domain_info and let msi_create_irq_domain() set the bus token. That allows to remove the bus token updates at the call sites. Suggested-by: Thomas Gleixner Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner --- include/linux/msi.h | 19 +++ kernel/irq/msi.c|7 +-- 2 files changed, 16 insertions(+), 10 deletions(-) --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -16,6 +16,7 @@ * abuse. The only function which is relevant for drivers is msi_get_virq(). */ +#include #include #include #include @@ -365,6 +366,7 @@ struct msi_domain_ops { /** * struct msi_domain_info - MSI interrupt domain data * @flags: Flags to decribe features and capabilities + * @bus_token: The domain bus token * @ops: The callback data structure * @chip: Optional: associated interrupt chip * @chip_data: Optional: associated interrupt chip data @@ -374,14 +376,15 @@ struct msi_domain_ops { * @data: Optional: domain specific data */ struct msi_domain_info { - u32 flags; - struct msi_domain_ops *ops; - struct irq_chip *chip; - void*chip_data; - irq_flow_handler_t handler; - void*handler_data; - const char *handler_name; - void*data; + u32 flags; + enum irq_domain_bus_token bus_token; + struct msi_domain_ops *ops; + struct irq_chip *chip; + void*chip_data; + irq_flow_handler_t handler; + void*handler_data; + const char *handler_name; + void*data; }; /* Flags for msi_domain_info */ --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -694,8 +694,11 @@ struct irq_domain *msi_create_irq_domain domain = irq_domain_create_hierarchy(parent, IRQ_DOMAIN_FLAG_MSI, 0, fwnode, _domain_ops, info); - if (domain && !domain->name && info->chip) - domain->name = info->chip->name; + if (domain) { + if (!domain->name && info->chip) + domain->name = info->chip->name; + irq_domain_update_bus_token(domain, info->bus_token); + } return domain; }
[patch 03/39] iommu/amd: Remove bogus check for multi MSI-X
PCI/Multi-MSI is MSI specific and not supported for MSI-X Signed-off-by: Thomas Gleixner --- drivers/iommu/amd/iommu.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -3294,8 +3294,7 @@ static int irq_remapping_alloc(struct ir if (!info) return -EINVAL; - if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI && - info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX) + if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI) return -EINVAL; /*
[patch 39/39] x86/apic: Remove X86_IRQ_ALLOC_CONTIGUOUS_VECTORS
Now that the PCI/MSI core code does early checking for multi-MSI support X86_IRQ_ALLOC_CONTIGUOUS_VECTORS is not required anymore. Remove the flag and rely on MSI_FLAG_MULTI_PCI_MSI. Signed-off-by: Thomas Gleixner --- arch/x86/include/asm/irqdomain.h|4 +--- arch/x86/kernel/apic/msi.c |6 ++ arch/x86/kernel/apic/vector.c |4 drivers/iommu/amd/iommu.c |7 --- drivers/iommu/intel/irq_remapping.c |7 --- drivers/pci/controller/pci-hyperv.c | 15 +-- 6 files changed, 4 insertions(+), 39 deletions(-) --- a/arch/x86/include/asm/irqdomain.h +++ b/arch/x86/include/asm/irqdomain.h @@ -7,9 +7,7 @@ #ifdef CONFIG_X86_LOCAL_APIC enum { - /* Allocate contiguous CPU vectors */ - X86_IRQ_ALLOC_CONTIGUOUS_VECTORS= 0x1, - X86_IRQ_ALLOC_LEGACY= 0x2, + X86_IRQ_ALLOC_LEGACY= 0x1, }; extern int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec); --- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -161,12 +161,10 @@ int pci_msi_prepare(struct irq_domain *d msi_alloc_info_t *arg) { init_irq_alloc_info(arg, NULL); - if (to_pci_dev(dev)->msix_enabled) { + if (to_pci_dev(dev)->msix_enabled) arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX; - } else { + else arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSI; - arg->flags |= X86_IRQ_ALLOC_CONTIGUOUS_VECTORS; - } return 0; } --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -539,10 +539,6 @@ static int x86_vector_alloc_irqs(struct if (disable_apic) return -ENXIO; - /* Currently vector allocator can't guarantee contiguous allocations */ - if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1) - return -ENOSYS; - /* * Catch any attempt to touch the cascade interrupt on a PIC * equipped system. --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -3297,13 +3297,6 @@ static int irq_remapping_alloc(struct ir if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI) return -EINVAL; - /* -* With IRQ remapping enabled, don't need contiguous CPU vectors -* to support multiple MSI interrupts. -*/ - if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI) - info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS; - sbdf = get_devid(info); if (sbdf < 0) return -EINVAL; --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -1337,13 +1337,6 @@ static int intel_irq_remapping_alloc(str if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI) return -EINVAL; - /* -* With IRQ remapping enabled, don't need contiguous CPU vectors -* to support multiple MSI interrupts. -*/ - if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI) - info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS; - ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg); if (ret < 0) return ret; --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -611,20 +611,7 @@ static unsigned int hv_msi_get_int_vecto return cfg->vector; } -static int hv_msi_prepare(struct irq_domain *domain, struct device *dev, - int nvec, msi_alloc_info_t *info) -{ - int ret = pci_msi_prepare(domain, dev, nvec, info); - - /* -* By using the interrupt remapper in the hypervisor IOMMU, contiguous -* CPU vectors is not needed for multi-MSI -*/ - if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI) - info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS; - - return ret; -} +#define hv_msi_prepare pci_msi_prepare /** * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current
Re: [PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
On Fri, May 13 2022 at 15:16, Ricardo Neri wrote: > On Mon, May 09, 2022 at 04:03:39PM +0200, Thomas Gleixner wrote: >> > + /* If we are here, IPI shorthands are enabled. */ >> > + apic->send_IPI_allbutself(NMI_VECTOR); >> >> So if the monitored cpumask is a subset of online CPUs, which is the >> case when isolation features are enabled, then you still send NMIs to >> those isolated CPUs. I'm sure the isolation folks will be enthused. > > Yes, I acknowledged this limitation in the cover letter. I should also update > Documentation/admin-guide/lockup-watchdogs.rst. > > This patchset proposes the HPET NMI watchdog as an opt-in feature. > > Perhaps the limitation might be mitigated by adding a check for > non-housekeeping > and non-monitored CPUs in exc_nmi(). However, that will not eliminate the > problem of isolated CPUs also getting the NMI. Right. It's a mess...
Re: [PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()
On Fri, May 13 2022 at 14:19, Ricardo Neri wrote: > On Fri, May 06, 2022 at 11:41:13PM +0200, Thomas Gleixner wrote: >> The argument about not bloating the code >> with an "obvious???" function which is quite small is slightly beyond my >> comprehension level. > > That obvious function would look like this: > > void hpet_set_comparator_one_shot(int channel, u32 delta) > { > u32 count; > > count = hpet_readl(HPET_COUNTER); > count += delta; > hpet_writel(count, HPET_Tn_CMP(channel)); > } This function only works reliably when the delta is large. See hpet_clkevt_set_next_event(). Thanks, tglx
Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs
On Fri, May 13 2022 at 16:45, Ricardo Neri wrote: > On Fri, May 13, 2022 at 10:50:09PM +0200, Thomas Gleixner wrote: >> > Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips >> > INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to check for >> > NULL. >> > They currently unconditionally call the parent irq_chip's >> > irq_set_affinity(). >> > I see that there is a irq_chip_set_affinity_parent() function. Perhaps it >> > can >> > be used for this check? >> >> Yes, this lacks obviously a NMI specific set_affinity callback and this >> can be very trivial and does not have any of the complexity of interrupt >> affinity assignment. First online CPU in the mask with a fallback to any >> online CPU. > > Why would we need a fallback to any online CPU? Shouldn't it fail if it cannot > find an online CPU in the mask? Might as well fail. Let me think about it. >> I did not claim that this is complete. This was for illustration. > > In the reworked patch, may I add a Co-developed-by with your name and your > SOB? Suggested-by is good enough. Thanks, tglx
Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs
On Fri, May 13 2022 at 11:03, Ricardo Neri wrote: > On Fri, May 06, 2022 at 11:12:20PM +0200, Thomas Gleixner wrote: >> Why would a NMI ever end up in this code? There is no vector management >> required and this find cpu exercise is pointless. > > But even if the NMI has a fixed vector, it is still necessary to determine > which CPU will get the NMI. It is still necessary to determine what to > write in the Destination ID field of the MSI message. > > irq_matrix_find_best_cpu() would find the CPU with the lowest number of > managed vectors so that the NMI is directed to that CPU. What's the point to send it to the CPU with the lowest number of interrupts. It's not that this NMI happens every 50 microseconds. We pick one online CPU and are done. > In today's code, an NMI would end up here because we rely on the existing > interrupt management infrastructure... Unless, the check is done the entry > points as you propose. Correct. We don't want to call into functions which are not designed for NMIs. >> > + >> > + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) { >> > + cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest); >> > + apicd->cpu = cpu; >> > + vector = 0; >> > + goto no_vector; >> > + } >> >> This code can never be reached for a NMI delivery. If so, then it's a >> bug. >> >> This all is special purpose for that particular HPET NMI watchdog use >> case and we are not exposing this to anything else at all. >> >> So why are you sprinkling this NMI nonsense all over the place? Just >> because? There are well defined entry points to all of this where this >> can be fenced off. > > I put the NMI checks in these points because assign_vector_locked() and > assign_managed_vector() are reached through multiple paths and these are > the two places where the allocation of the vector is requested and the > destination CPU is determined. > > I do observe this code being reached for an NMI, but that is because this > code still does not know about NMIs... Unless the checks for NMI are put > in the entry points as you pointed out. > > The intent was to refactor the code in a generic manner and not to focus > only in the NMI watchdog. That would have looked hacky IMO. We don't want to have more of this really. Supporting NMIs on x86 in a broader way is simply not reasonable because there is only one NMI vector and we have no sensible way to get to the cause of the NMI without a massive overhead. Even if we get multiple NMI vectors some shiny day, this will be fundamentally different than regular interrupts and certainly not exposed broadly. There will be 99.99% fixed vectors for simplicity sake. >> +if (info->flags & X86_IRQ_ALLOC_AS_NMI) { >> +/* >> + * NMIs have a fixed vector and need their own >> + * interrupt chip so nothing can end up in the >> + * regular local APIC management code except the >> + * MSI message composing callback. >> + */ >> +irqd->chip = _nmi_controller; >> +/* >> + * Don't allow affinity setting attempts for NMIs. >> + * This cannot work with the regular affinity >> + * mechanisms and for the intended HPET NMI >> + * watchdog use case it's not required. > > But we do need the ability to set affinity, right? As stated above, we need > to know what Destination ID to write in the MSI message or in the interrupt > remapping table entry. > > It cannot be any CPU because only one specific CPU is supposed to handle the > NMI from the HPET channel. > > We cannot hard-code a CPU for that because it may go offline (and ignore NMIs) > or not be part of the monitored CPUs. > > Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips > INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to check for > NULL. > They currently unconditionally call the parent irq_chip's irq_set_affinity(). > I see that there is a irq_chip_set_affinity_parent() function. Perhaps it can > be used for this check? Yes, this lacks obviously a NMI specific set_affinity callback and this can be very trivial and does not have any of the complexity of interrupt affinity assignment. First online CPU in the mask with a fallback to any online CPU. I did not claim that this is complete. This was for illustration. >> + */ >> +irqd_set_no_balance(irqd); > > This code does not set apicd->hw_irq_cfg.delivery_mode as NMI, right? > I had to add that to make it work. I assumed you can figure that out on your own :) Thanks, tglx
Re: [PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz
On Tue, May 10 2022 at 21:16, Nicholas Piggin wrote: > Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am: >> +/* >> + * If in use, the HPET hardlockup detector relies on tsc_khz. >> + * Reconfigure it to make use of the refined tsc_khz. >> + */ >> +lockup_detector_reconfigure(); > > I don't know if the API is conceptually good. > > You change something that the lockup detector is currently using, > *while* the detector is running asynchronously, and then reconfigure > it. What happens in the window? If this code is only used for small > adjustments maybe it does not really matter but in principle it's > a bad API to export. > > lockup_detector_reconfigure as an internal API is okay because it > reconfigures things while the watchdog is stopped [actually that > looks untrue for soft dog which uses watchdog_thresh in > is_softlockup(), but that should be fixed]. > > You're the arch so you're allowed to stop the watchdog and configure > it, e.g., hardlockup_detector_perf_stop() is called in arch/. > > So you want to disable HPET watchdog if it was enabled, then update > wherever you're using tsc_khz, then re-enable. The real question is whether making this refined tsc_khz value immediately effective matters at all. IMO, it does not because up to that point the watchdog was happily using the coarse calibrated value and the whole use TSC to assess whether the HPET fired mechanism is just a guestimate anyway. So what's the point of trying to guess 'more correct'. Thanks, tglx
Re: [PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
On Thu, May 05 2022 at 17:00, Ricardo Neri wrote: > + if (is_hpet_hld_interrupt(hdata)) { > + /* > + * Kick the timer first. If the HPET channel is periodic, it > + * helps to reduce the delta between the expected TSC value and > + * its actual value the next time the HPET channel fires. > + */ > + kick_timer(hdata, !(hdata->has_periodic)); > + > + if (cpumask_weight(hld_data->monitored_cpumask) > 1) { > + /* > + * Since we cannot know the source of an NMI, the best > + * we can do is to use a flag to indicate to all online > + * CPUs that they will get an NMI and that the source of > + * that NMI is the hardlockup detector. Offline CPUs > + * also receive the NMI but they ignore it. > + * > + * Even though we are in NMI context, we have concluded > + * that the NMI came from the HPET channel assigned to > + * the detector, an event that is infrequent and only > + * occurs in the handling CPU. There should not be races > + * with other NMIs. > + */ > + cpumask_copy(hld_data->inspect_cpumask, > + cpu_online_mask); > + > + /* If we are here, IPI shorthands are enabled. */ > + apic->send_IPI_allbutself(NMI_VECTOR); So if the monitored cpumask is a subset of online CPUs, which is the case when isolation features are enabled, then you still send NMIs to those isolated CPUs. I'm sure the isolation folks will be enthused. Thanks, tglx
Re: [PATCH v6 21/29] x86/nmi: Add an NMI_WATCHDOG NMI handler category
On Thu, May 05 2022 at 17:00, Ricardo Neri wrote: > Add a NMI_WATCHDOG as a new category of NMI handler. This new category > is to be used with the HPET-based hardlockup detector. This detector > does not have a direct way of checking if the HPET timer is the source of > the NMI. Instead, it indirectly estimates it using the time-stamp counter. > > Therefore, we may have false-positives in case another NMI occurs within > the estimated time window. For this reason, we want the handler of the > detector to be called after all the NMI_LOCAL handlers. A simple way > of achieving this with a new NMI handler category. > > @@ -379,6 +385,10 @@ static noinstr void default_do_nmi(struct pt_regs *regs) > } > raw_spin_unlock(_reason_lock); > > + handled = nmi_handle(NMI_WATCHDOG, regs); > + if (handled == NMI_HANDLED) > + goto out; > + How is this supposed to work reliably? If perf is active and the HPET NMI and the perf NMI come in around the same time, then nmi_handle(LOCAL) can swallow the NMI and the watchdog won't be checked. Because MSI is strictly edge and the message is only sent once, this can result in a stale watchdog, no? Thanks, tglx