Re: [PATCH v2 9/9] iommu/vt-d: Use pci core's DVSEC functionality
On Thu, Sep 23, 2021 at 10:26:47AM -0700, Ben Widawsky wrote:
>  */
> static int siov_find_pci_dvsec(struct pci_dev *pdev)
> {
> +	return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5);
> }

I think the siov_find_pci_dvsec helper is pretty pointless now and can be folded into its only caller.

And independent of that: this capability really needs a symbolic name. Especially for a vendor like Intel, which might have a few, there should be a list of them somewhere.
Re: [PATCH] powerpc: don't select KFENCE on platform PPC_FSL_BOOK3E
Le 24/09/2021 à 08:39, Liu Shixin a écrit :
> On platform PPC_FSL_BOOK3E, all lowmem is managed by tlbcam. That means
> we didn't really map the kfence pool with page granularity. Therefore,
> if KFENCE is enabled, the system will hit the following panic:

Could you please explain a bit more what the problem is?

KFENCE has been implemented with the same logic as DEBUG_PAGEALLOC, and DEBUG_PAGEALLOC is enabled on FSL_BOOK3E.

In MMU_setup(), __map_without_ltlbs is set to 1 when KFENCE is enabled, and __map_without_ltlbs should disable the use of tlbcam.

So what's wrong really? Does DEBUG_PAGEALLOC work on FSL_BOOK3E?

Thanks
Christophe

> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc01de598
> Oops: Kernel access of bad area, sig: 11 [#1]
> BE PAGE_SIZE=4K SMP NR_CPUS=4 MPC8544 DS
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3+ #298
> NIP: c01de598 LR: c08ae9c4 CTR:
> REGS: c0b4bea0 TRAP: 0300 Not tainted (5.12.0-rc3+)
> MSR: 00021000 CR: 24000228 XER: 2000
> DEAR: ESR:
> GPR00: c08ae9c4 c0b4bf60 c0ad64e0 ef72 00021000 0200
> GPR08: c0ad5000 0004 008fbb30
> GPR16: c000
> GPR24: c08ca004 c08ca004 c0b6a0e0 c0b6 c0b58f00 c085 c08ca000 ef72
> NIP [c01de598] kfence_protect+0x44/0x6c
> LR [c08ae9c4] kfence_init+0xfc/0x2a4
> Call Trace:
> [c0b4bf60] [efffe160] 0xefffe160 (unreliable)
> [c0b4bf70] [c08ae9c4] kfence_init+0xfc/0x2a4
> [c0b4bfb0] [c0894d3c] start_kernel+0x3bc/0x574
> [c0b4bff0] [c470] set_ivor+0x14c/0x188
> Instruction dump:
> 7c0802a6 8109d594 546a653a 90010014 54630026 3920 7d48502e 2c0a
> 41820010 554a0026 5469b53a 7d295214 <8149> 38831000 554a003c 9149
> random: get_random_bytes called from print_oops_end_marker+0x40/0x78 with crng_init=0
> ---[ end trace ]---
>
> Signed-off-by: Liu Shixin
> ---
>  arch/powerpc/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index d46db0bfb998..cffd57bcb5e4 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -185,7 +185,7 @@ config PPC
>  	select HAVE_ARCH_KASAN			if PPC32 && PPC_PAGE_SHIFT <= 14
>  	select HAVE_ARCH_KASAN_VMALLOC		if PPC32 && PPC_PAGE_SHIFT <= 14
>  	select HAVE_ARCH_KGDB
> -	select HAVE_ARCH_KFENCE			if PPC32
> +	select HAVE_ARCH_KFENCE			if PPC32 && !PPC_FSL_BOOK3E
>  	select HAVE_ARCH_MMAP_RND_BITS
>  	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if COMPAT
>  	select HAVE_ARCH_NVRAM_OPS
[PATCH] powerpc: don't select KFENCE on platform PPC_FSL_BOOK3E
On platform PPC_FSL_BOOK3E, all lowmem is managed by tlbcam. That means
we didn't really map the kfence pool with page granularity. Therefore,
if KFENCE is enabled, the system will hit the following panic:

BUG: Kernel NULL pointer dereference on read at 0x
Faulting instruction address: 0xc01de598
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K SMP NR_CPUS=4 MPC8544 DS
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3+ #298
NIP: c01de598 LR: c08ae9c4 CTR:
REGS: c0b4bea0 TRAP: 0300 Not tainted (5.12.0-rc3+)
MSR: 00021000 CR: 24000228 XER: 2000
DEAR: ESR:
GPR00: c08ae9c4 c0b4bf60 c0ad64e0 ef72 00021000 0200
GPR08: c0ad5000 0004 008fbb30
GPR16: c000
GPR24: c08ca004 c08ca004 c0b6a0e0 c0b6 c0b58f00 c085 c08ca000 ef72
NIP [c01de598] kfence_protect+0x44/0x6c
LR [c08ae9c4] kfence_init+0xfc/0x2a4
Call Trace:
[c0b4bf60] [efffe160] 0xefffe160 (unreliable)
[c0b4bf70] [c08ae9c4] kfence_init+0xfc/0x2a4
[c0b4bfb0] [c0894d3c] start_kernel+0x3bc/0x574
[c0b4bff0] [c470] set_ivor+0x14c/0x188
Instruction dump:
7c0802a6 8109d594 546a653a 90010014 54630026 3920 7d48502e 2c0a
41820010 554a0026 5469b53a 7d295214 <8149> 38831000 554a003c 9149
random: get_random_bytes called from print_oops_end_marker+0x40/0x78 with crng_init=0
---[ end trace ]---

Signed-off-by: Liu Shixin
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d46db0bfb998..cffd57bcb5e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -185,7 +185,7 @@ config PPC
 	select HAVE_ARCH_KASAN			if PPC32 && PPC_PAGE_SHIFT <= 14
 	select HAVE_ARCH_KASAN_VMALLOC		if PPC32 && PPC_PAGE_SHIFT <= 14
 	select HAVE_ARCH_KGDB
-	select HAVE_ARCH_KFENCE			if PPC32
+	select HAVE_ARCH_KFENCE			if PPC32 && !PPC_FSL_BOOK3E
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if COMPAT
 	select HAVE_ARCH_NVRAM_OPS
--
2.18.0.huawei.25
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On Thu, Sep 23, 2021 at 03:54:46PM +0200, Christophe Leroy wrote:
> Le 23/09/2021 à 14:01, Mike Rapoport a écrit :
> > On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote:
> > > Le 23/09/2021 à 09:43, Mike Rapoport a écrit :
> > > > From: Mike Rapoport
> > > >
> > > > For ages memblock_free() interface dealt with physical addresses even
> > > > despite the existence of memblock_alloc_xx() functions that return a
> > > > virtual pointer.
> > > >
> > > > Introduce memblock_phys_free() for freeing physical ranges and repurpose
> > > > memblock_free() to free virtual pointers to make the following pairing
> > > > abundantly clear:
> > > >
> > > > 	int memblock_phys_free(phys_addr_t base, phys_addr_t size);
> > > > 	phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size);
> > > >
> > > > 	void *memblock_alloc(phys_addr_t size, phys_addr_t align);
> > > > 	void memblock_free(void *ptr, size_t size);
> > > >
> > > > Replace intermediate memblock_free_ptr() with memblock_free() and drop
> > > > unnecessary aliases memblock_free_early() and memblock_free_early_nid().
> > > >
> > > > Suggested-by: Linus Torvalds
> > > > Signed-off-by: Mike Rapoport
> > > > ---
> > > >
> > > > diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
> > > > index 1a04e5bdf655..37826d8c4f74 100644
> > > > --- a/arch/s390/kernel/smp.c
> > > > +++ b/arch/s390/kernel/smp.c
> > > > @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void)
> > > >  		/* Get the CPU registers */
> > > >  		smp_save_cpu_regs(sa, addr, is_boot_cpu, page);
> > > >  	}
> > > > -	memblock_free(page, PAGE_SIZE);
> > > > +	memblock_phys_free(page, PAGE_SIZE);
> > > >  	diag_amode31_ops.diag308_reset();
> > > >  	pcpu_set_smt(0);
> > > > }
> > > > @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void)
> > > >  	/* Add CPUs present at boot */
> > > >  	__smp_rescan_cpus(info, true);
> > > > -	memblock_free_early((unsigned long)info, sizeof(*info));
> > > > +	memblock_free(info, sizeof(*info));
> > > > }
> > > > /*
> > >
> > > I'm a bit lost. IIUC memblock_free_early() and memblock_free() where
> > > identical.
> >
> > Yes, they were, but all calls to memblock_free_early() were using
> > __pa(vaddr) because they had a virtual address at hand.
>
> I'm still not following. In the above memblock_free_early() was taking
> (unsigned long)info . Was it a bug ?

Not really, because s390 has pa == va:

https://elixir.bootlin.com/linux/latest/source/arch/s390/include/asm/page.h#L169

--
Sincerely yours,
Mike.
Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
Srikar Dronamraju writes:
> * Michael Ellerman [2021-09-23 17:29:32]:
>
>> Nathan Lynch writes:
>> > Srikar Dronamraju writes:
>> >
>> >> * Nathan Lynch [2021-09-22 11:01:12]:
>> >>
>> >>> Srikar Dronamraju writes:
...
>> >> Or can I understand how debug_smp_processor_id() is useful if
>> >> __smp_processor_id() is defined as raw_smp_processor_id()?
>>
>> debug_smp_processor_id() is useful on powerpc, as well as other arches,
>> because it checks that we're in a context where the processor id won't
>> change out from under us.
>>
>> eg. something like this is unsafe:
>>
>> 	int counts[NR_CPUS];
>> 	int tmp, cpu;
>>
>> 	cpu = smp_processor_id();
>> 	tmp = counts[cpu];
>> 	<- preempted here and migrated to another CPU
>> 	counts[cpu] = tmp + 1;
>
> If, let's say, the above call was replaced by raw_smp_processor_id(), how
> would it avoid the preemption / migration to another CPU?
>
> Replacing it with raw_smp_processor_id() may avoid the debug splat, but the
> underlying issue would still remain as is. No?

Correct. Using raw_smp_processor_id() is generally the wrong answer.

For this example the correct fix is to disable preemption around the code, eg:

	int counts[NR_CPUS];
	int tmp, cpu;

	preempt_disable();
	cpu = smp_processor_id();
	tmp = counts[cpu];
	counts[cpu] = tmp + 1;
	preempt_enable();

For the original issue I think it is OK to use raw_smp_processor_id(), because we're already doing a racy check of whether another CPU has been preempted by the hypervisor.
	if (!is_kvm_guest()) {
		int first_cpu = cpu_first_thread_sibling(smp_processor_id());

		if (cpu_first_thread_sibling(cpu) == first_cpu)
			return false;
	}

We could disable preemption around that, eg:

	if (!is_kvm_guest()) {
		int first_cpu;
		bool is_sibling;

		preempt_disable();
		first_cpu = cpu_first_thread_sibling(smp_processor_id());
		is_sibling = (cpu_first_thread_sibling(cpu) == first_cpu);
		preempt_enable();

		// Can be preempted here

		if (is_sibling)
			return false;
	}

But before we return we could be preempted, and then is_sibling is no longer necessarily correct. So that doesn't really gain us anything.

The only way to make that result stable is to disable preemption in the caller, but most callers don't want to AFAICS, because they know they're doing a racy check to begin with.

cheers
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 9/23/21 9:43 AM, Mike Rapoport wrote:
> From: Mike Rapoport
>
> For ages memblock_free() interface dealt with physical addresses even
> despite the existence of memblock_alloc_xx() functions that return a
> virtual pointer.
>
> Introduce memblock_phys_free() for freeing physical ranges and repurpose
> memblock_free() to free virtual pointers to make the following pairing
> abundantly clear:
>
> 	int memblock_phys_free(phys_addr_t base, phys_addr_t size);
> 	phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size);
>
> 	void *memblock_alloc(phys_addr_t size, phys_addr_t align);
> 	void memblock_free(void *ptr, size_t size);
>
> Replace intermediate memblock_free_ptr() with memblock_free() and drop
> unnecessary aliases memblock_free_early() and memblock_free_early_nid().
>
> Suggested-by: Linus Torvalds
> Signed-off-by: Mike Rapoport

arch/arc part:

Reviewed-by: Shahab Vahedi

Thanks,
Shahab
Re: [PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC
On 9/23/2021 1:26 PM, Ben Widawsky wrote:
> Add pci_find_dvsec_capability to locate a Designated Vendor-Specific
> Extended Capability with the specified DVSEC ID.
>
> The Designated Vendor-Specific Extended Capability (DVSEC) allows one or
> more vendor specific capabilities that aren't tied to the vendor ID of
> the PCI component. DVSEC is critical for both the Compute Express Link
> (CXL) driver as well as the driver for OpenCAPI coherent accelerator
> (OCXL).
>
> Cc: David E. Box
> Cc: Jonathan Cameron
> Cc: Bjorn Helgaas
> Cc: Dan Williams
> Cc: linux-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Andrew Donnellan
> Cc: Lu Baolu
> Reviewed-by: Frederic Barrat
> Signed-off-by: Ben Widawsky

Applied the interface for the perf uncore driver as below. The interface works properly.

Tested-by: Kan Liang

From ebb69ba386dca91fb372522b13af9feb84adcbc0 Mon Sep 17 00:00:00 2001
From: Kan Liang
Date: Thu, 23 Sep 2021 13:59:24 -0700
Subject: [PATCH] perf/x86/intel/uncore: Use pci core's DVSEC functionality

Apply the standard interface pci_find_dvsec_capability in the perf uncore
driver and remove unused macros. This reduces the maintenance burden of
the DVSEC query implementation.

Signed-off-by: Kan Liang
---
 arch/x86/events/intel/uncore_discovery.c | 41 +++-
 arch/x86/events/intel/uncore_discovery.h |  6 -
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/arch/x86/events/intel/uncore_discovery.c b/arch/x86/events/intel/uncore_discovery.c
index 3049c64..f8ea092 100644
--- a/arch/x86/events/intel/uncore_discovery.c
+++ b/arch/x86/events/intel/uncore_discovery.c
@@ -21,7 +21,7 @@ static bool has_generic_discovery_table(void)
 		return false;

 	/* A discovery table device has the unique capability ID.
 	 */
-	dvsec = pci_find_next_ext_capability(dev, 0, UNCORE_EXT_CAP_ID_DISCOVERY);
+	dvsec = pci_find_next_ext_capability(dev, 0, PCI_EXT_CAP_ID_DVSEC);
 	pci_dev_put(dev);
 	if (dvsec)
 		return true;
@@ -260,7 +260,7 @@ static int parse_discovery_table(struct pci_dev *dev, int die,

 bool intel_uncore_has_discovery_tables(void)
 {
-	u32 device, val, entry_id, bar_offset;
+	u32 device, val, bar_offset;
 	int die, dvsec = 0, ret = true;
 	struct pci_dev *dev = NULL;
 	bool parsed = false;
@@ -275,27 +275,24 @@ bool intel_uncore_has_discovery_tables(void)
 	 * the discovery table devices.
 	 */
 	while ((dev = pci_get_device(PCI_VENDOR_ID_INTEL, device, dev)) != NULL) {
-		while ((dvsec = pci_find_next_ext_capability(dev, dvsec, UNCORE_EXT_CAP_ID_DISCOVERY))) {
-			pci_read_config_dword(dev, dvsec + UNCORE_DISCOVERY_DVSEC_OFFSET, &val);
-			entry_id = val & UNCORE_DISCOVERY_DVSEC_ID_MASK;
-			if (entry_id != UNCORE_DISCOVERY_DVSEC_ID_PMON)
-				continue;
-
-			pci_read_config_dword(dev, dvsec + UNCORE_DISCOVERY_DVSEC2_OFFSET, &val);
-
-			if (val & ~UNCORE_DISCOVERY_DVSEC2_BIR_MASK) {
-				ret = false;
-				goto err;
-			}
-			bar_offset = UNCORE_DISCOVERY_BIR_BASE +
-				     (val & UNCORE_DISCOVERY_DVSEC2_BIR_MASK) * UNCORE_DISCOVERY_BIR_STEP;
-
-			die = get_device_die_id(dev);
-			if (die < 0)
-				continue;
-
-			parse_discovery_table(dev, die, bar_offset, &parsed);
+		dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_INTEL, UNCORE_DISCOVERY_DVSEC_ID_PMON);
+		if (!dvsec)
+			continue;
+
+		pci_read_config_dword(dev, dvsec + UNCORE_DISCOVERY_DVSEC2_OFFSET, &val);
+
+		if (val & ~UNCORE_DISCOVERY_DVSEC2_BIR_MASK) {
+			ret = false;
+			goto err;
 		}
+		bar_offset = UNCORE_DISCOVERY_BIR_BASE +
+			     (val & UNCORE_DISCOVERY_DVSEC2_BIR_MASK) * UNCORE_DISCOVERY_BIR_STEP;
+
+		die = get_device_die_id(dev);
+		if (die < 0)
+			continue;
+
+		parse_discovery_table(dev, die, bar_offset, &parsed);
 	}

 	/* None of the discovery tables are available */
diff --git a/arch/x86/events/intel/uncore_discovery.h b/arch/x86/events/intel/uncore_discovery.h
index
6d735611..84d56e5 100644
--- a/arch/x86/events/intel/uncore_discovery.h
+++ b/arch/x86/events/intel/uncore_discovery.h
@@ -2,12 +2,6 @@

 /* Generic device ID of a discovery table device */
 #define UNCORE_DISCOVERY_TABLE_DEVICE		0x09a7
-/* Capability ID for a discovery table device */
-#define UNCORE_EXT_CAP_ID_DISCOVERY		0x23
-/* First DVSEC offset */
-#define UNCORE_DISCOVERY_DVSEC_OFFSET		0x8
-/* Mask of the supported discovery entry type */
-#define UNCORE_DISCOVERY_DVSEC_ID_MASK		0x
 /* PMON discovery ent
Re: [PATCH v1] powerpc/64s: Fix unrecoverable MCE crash
On 9/22/21 7:32 AM, Nicholas Piggin wrote:
> The machine check handler is not considered NMI on 64s. The early
> handler is the true NMI handler, and then it schedules the
> machine_check_exception handler to run when interrupts are enabled.
>
> This works fine except the case of an unrecoverable MCE, where the true
> NMI is taken when MSR[RI] is clear, it can not recover to schedule the
> next handler, so it calls machine_check_exception directly so something
> might be done about it.
>
> Calling an async handler from NMI context can result in irq state and
> other things getting corrupted. This can also trigger the BUG at
> arch/powerpc/include/asm/interrupt.h:168.
>
> Fix this by just making the 64s machine_check_exception handler an NMI
> like it is on other subarchs.
>
> Signed-off-by: Nicholas Piggin
> ---

Hi Nick,

If I inject a control memory access error in an LPAR on top of this patch
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210906084303.183921-1-ganes...@linux.ibm.com/
I see the following warning trace:

WARNING: CPU: 130 PID: 7122 at arch/powerpc/include/asm/interrupt.h:319 machine_check_exception+0x310/0x340
Modules linked in:
CPU: 130 PID: 7122 Comm: inj_access_err Kdump: loaded Tainted: G M 5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty #22
NIP: c002f980 LR: c002f7e8 CTR: c0a31860
REGS: c039fe51bb20 TRAP: 0700 Tainted: G M (5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty)
MSR: 80029033 CR: 88000222 XER: 2004
CFAR: c002f844 IRQMASK: 0
GPR00: c002f798 c039fe51bdc0 c20d 0001
GPR04: 4002 4000 19af
GPR08: 0077e5ad c077ee16c700 0080
GPR12: 88000222 c077ee16c700
GPR16:
GPR20:
GPR24: c20fecd8
GPR28: 0001 0001 c039fe51be80
NIP [c002f980] machine_check_exception+0x310/0x340
LR [c002f7e8] machine_check_exception+0x178/0x340
Call Trace:
[c039fe51bdc0] [c002f798] machine_check_exception+0x128/0x340 (unreliable)
[c039fe51be10] [c00086ec] machine_check_common+0x1ac/0x1b0
--- interrupt: 200 at 0x1968
NIP: 1968 LR: 1958 CTR:
REGS: c039fe51be80 TRAP: 0200 Tainted: G M
(5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty)
MSR: 82a0f033 CR: 22000824 XER:
CFAR: 021c DAR: 7fffb00c DSISR: 0208 IRQMASK: 0
GPR00: 22000824 7fffc9647770 10027f00 7fffb00c
GPR04:
GPR08: 7fffb00c 0001
GPR12: 7fffb015a330
GPR16:
GPR20:
GPR24: 185c
GPR28: 7fffc9647d18 0001 19b0 7fffc9647770
NIP [1968] 0x1968
LR [1958] 0x1958
--- interrupt: 200
Re: [PATCH 0/3] memblock: cleanup memblock_free interface
Hi Linus,

On Thu, Sep 23, 2021 at 09:01:46AM -0700, Linus Torvalds wrote:
> On Thu, Sep 23, 2021 at 12:43 AM Mike Rapoport wrote:
>
> You need to be a LOT more careful.
>
> From a trivial check - exactly because I looked at doing it with a
> script, and decided it's not so easy - I found cases like this:
>
>    - memblock_free(__pa(paca_ptrs) + new_ptrs_size,
>    + memblock_free(paca_ptrs + new_ptrs_size,
>
> which is COMPLETELY wrong.

I did use a coccinelle script that's slightly more robust than the sed you've sent, but then I did a manual review, hence the two small patches with fixes.

Indeed I missed this one, so to be on the safe side I'll rename only the obvious cases where coccinelle can be used reliably and leave all the rest as it is now. If somebody cares enough they can update it later.

> And no, making the scripting just replace '__pa(x)' with '(void *)(x)'

These were actually manual, and they are required for variables that are used as virtual addresses but have unsigned long type, like e.g. initrd_start. So it's either __pa(x) or (void *).

--
Sincerely yours,
Mike.
Re: [PATCH v4] lockdown,selinux: fix wrong subject in some SELinux lockdown checks
On Wed, Sep 15, 2021 at 10:59 PM Paul Moore wrote:
> On Mon, Sep 13, 2021 at 5:05 PM Paul Moore wrote:
> > On Mon, Sep 13, 2021 at 10:02 AM Ondrej Mosnacek wrote:
> > >
> > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > > lockdown") added an implementation of the locked_down LSM hook to
> > > SELinux, with the aim to restrict which domains are allowed to perform
> > > operations that would breach lockdown.
> > >
> > > However, in several places the security_locked_down() hook is called in
> > > situations where the current task isn't doing any action that would
> > > directly breach lockdown, leading to SELinux checks that are basically
> > > bogus.
> > >
> > > To fix this, add an explicit struct cred pointer argument to
> > > security_lockdown() and define NULL as a special value to pass instead
> > > of current_cred() in such situations. LSMs that take the subject
> > > credentials into account can then fall back to some default or ignore
> > > such calls altogether. In the SELinux lockdown hook implementation, use
> > > SECINITSID_KERNEL in case the cred argument is NULL.
> > >
> > > Most of the callers are updated to pass current_cred() as the cred
> > > pointer, thus maintaining the same behavior. The following callers are
> > > modified to pass NULL as the cred pointer instead:
> > >
> > > 1. arch/powerpc/xmon/xmon.c
> > >    Seems to be some interactive debugging facility. It appears that
> > >    the lockdown hook is called from interrupt context here, so it
> > >    should be more appropriate to request a global lockdown decision.
> > >
> > > 2. fs/tracefs/inode.c:tracefs_create_file()
> > >    Here the call is used to prevent creating new tracefs entries when
> > >    the kernel is locked down. Assumes that locking down is one-way -
> > >    i.e. if the hook returns non-zero once, it will never return zero
> > >    again, thus no point in creating these files.
> > >    Also, the hook is
> > >    often called by a module's init function when it is loaded by
> > >    userspace, where it doesn't make much sense to do a check against
> > >    the current task's creds, since the task itself doesn't actually
> > >    use the tracing functionality (i.e. doesn't breach lockdown), just
> > >    indirectly makes some new tracepoints available to whoever is
> > >    authorized to use them.
> > >
> > > 3. net/xfrm/xfrm_user.c:copy_to_user_*()
> > >    Here a cryptographic secret is redacted based on the value returned
> > >    from the hook. There are two possible actions that may lead here:
> > >    a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> > >       task context is relevant, since the dumped data is sent back to
> > >       the current task.
> > >    b) When adding/deleting/updating an SA via XFRM_MSG_xxxSA, the
> > >       dumped SA is broadcasted to tasks subscribed to XFRM events -
> > >       here the current task context is not relevant as it doesn't
> > >       represent the tasks that could potentially see the secret.
> > >    It doesn't seem worth it to try to keep using the current task's
> > >    context in the a) case, since the eventual data leak can be
> > >    circumvented anyway via b), plus there is no way for the task to
> > >    indicate that it doesn't care about the actual key value, so the
> > >    check could generate a lot of "false alert" denials with SELinux.
> > >    Thus, let's pass NULL instead of current_cred() here faute de
> > >    mieux.
> > >
> > > Improvements-suggested-by: Casey Schaufler
> > > Improvements-suggested-by: Paul Moore
> > > Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> > > Acked-by: Dan Williams [cxl]
> > > Acked-by: Steffen Klassert [xfrm]
> > > Signed-off-by: Ondrej Mosnacek
> > > ---
> > >
> > > v4:
> > > - rebase on top of TODO
> > > - fix rebase conflicts:
> > >   * drivers/cxl/pci.c
> > >     - trivial: the lockdown reason was corrected in mainline
> > >   * kernel/bpf/helpers.c, kernel/trace/bpf_trace.c
> > >     - trivial: LOCKDOWN_BPF_READ was renamed to LOCKDOWN_BPF_READ_KERNEL
> > >       in mainline
> > >   * kernel/power/hibernate.c
> > >     - trivial: !secretmem_active() was added to the condition in
> > >       hibernation_available()
> > > - cover new security_locked_down() call in kernel/bpf/helpers.c
> > >   (LOCKDOWN_BPF_WRITE_USER in BPF_FUNC_probe_write_user case)
> > >
> > > v3: https://lore.kernel.org/lkml/20210616085118.1141101-1-omosn...@redhat.com/
> > > - add the cred argument to security_locked_down() and adapt all callers
> > > - keep using current_cred() in BPF, as the hook calls have been shifted
> > >   to program load time (commit ff40e51043af ("bpf, lockdown, audit: Fix
> > >   buggy SELinux lockdown permission checks"))
> > > - in SELinux, don't ignore hook calls where cred == NULL, but use
> > >   SECINITSID_KERNEL as the subject instead
> > > - update explanations
Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()
On Thu, Sep 23, 2021 at 12:05:58AM +0300, Kirill A. Shutemov wrote:
> Unless we find other way to guarantee RIP-relative access, we must use
> fixup_pointer() to access any global variables.

Yah, I've asked compiler folks about any guarantees we have wrt rip-relative addresses but it doesn't look good. Worst case, we'd have to do the fixup_pointer() thing.

In the meantime, Tom and I did some more poking at this and here's a diff ontop. The direction being that we'll stick both the AMD and Intel *cc_platform_has() calls into cc_platform.c, for which instrumentation will be disabled so no issues with that. And that will keep all that querying all together in a single file.

---
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index a73712b6ee0e..2d4f5c17d79c 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -51,7 +51,6 @@ void __init mem_encrypt_free_decrypted_mem(void);

 void __init mem_encrypt_init(void);
 void __init sev_es_init_vc_handling(void);
-bool amd_cc_platform_has(enum cc_attr attr);

 #define __bss_decrypted __section(".bss..decrypted")

@@ -74,7 +73,6 @@ static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
 static inline void __init sme_enable(struct boot_params *bp) { }

 static inline void sev_es_init_vc_handling(void) { }
-static inline bool amd_cc_platform_has(enum cc_attr attr) { return false; }

 static inline int __init
 early_set_memory_decrypted(unsigned long vaddr, unsigned long size) { return 0; }
@@ -103,12 +101,6 @@ static inline u64 sme_get_me_mask(void)
 	return sme_me_mask;
 }

-#if defined(CONFIG_CPU_SUP_INTEL) && defined(CONFIG_ARCH_HAS_CC_PLATFORM)
-bool intel_cc_platform_has(enum cc_attr attr);
-#else
-static inline bool intel_cc_platform_has(enum cc_attr attr) { return false; }
-#endif
-
 #endif	/* __ASSEMBLY__ */

 #endif	/* __X86_MEM_ENCRYPT_H__ */
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index da54a1805211..97ede7052f77
100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -13,6 +13,52 @@

 #include

+static bool intel_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_INTEL_TDX_GUEST
+	return false;
+#else
+	return false;
+#endif
+}
+
+/*
+ * SME and SEV are very similar but they are not the same, so there are
+ * times that the kernel will need to distinguish between SME and SEV. The
+ * cc_platform_has() function is used for this. When a distinction isn't
+ * needed, the CC_ATTR_MEM_ENCRYPT attribute can be used.
+ *
+ * The trampoline code is a good example for this requirement. Before
+ * paging is activated, SME will access all memory as decrypted, but SEV
+ * will access all memory as encrypted. So, when APs are being brought
+ * up under SME the trampoline area cannot be encrypted, whereas under SEV
+ * the trampoline area must be encrypted.
+ */
+static bool amd_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	switch (attr) {
+	case CC_ATTR_MEM_ENCRYPT:
+		return sme_me_mask;
+
+	case CC_ATTR_HOST_MEM_ENCRYPT:
+		return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);
+
+	case CC_ATTR_GUEST_MEM_ENCRYPT:
+		return sev_status & MSR_AMD64_SEV_ENABLED;
+
+	case CC_ATTR_GUEST_STATE_ENCRYPT:
+		return sev_status & MSR_AMD64_SEV_ES_ENABLED;
+
+	default:
+		return false;
+	}
+#else
+	return false;
+#endif
+}
+
 bool cc_platform_has(enum cc_attr attr)
 {
 	if (sme_me_mask)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 53756ff12295..8321c43554a1 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -60,13 +60,6 @@ static u64 msr_test_ctrl_cache __ro_after_init;
  */
 static bool cpu_model_supports_sld __ro_after_init;

-#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
-bool intel_cc_platform_has(enum cc_attr attr)
-{
-	return false;
-}
-#endif
-
 /*
  * Processors which have self-snooping capability can handle conflicting
  * memory type across CPUs by snooping its own cache.
However, there exists
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9417d404ea92..23d54b810f08 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -361,38 +361,6 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size)
 	return early_set_memory_enc_dec(vaddr, size, true);
 }

-/*
- * SME and SEV are very similar but they are not the same, so there are
- * times that the kernel will need to distinguish between SME and SEV. The
- * cc_platform_has() function is used for this. When a distinction isn't
- * needed, the CC_ATTR_MEM_ENCRYPT attribute can be used.
- *
- * The trampoline code is a good example for this requirement. Before
- * paging is activated, SME will access all memory as decrypted, but SEV
- * will access all memory a
Re: [PATCH 5/9] xen/x86: make "earlyprintk=xen" work for HVM/PVH DomU
On 07.09.21 12:10, Jan Beulich wrote:
> xenboot_write_console() is dealing with these quite fine so I don't see
> why xenboot_console_setup() would return -ENOENT in this case. Adjust
> documentation accordingly.
>
> Signed-off-by: Jan Beulich

Reviewed-by: Juergen Gross

Juergen
Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
* Michael Ellerman [2021-09-23 17:29:32]:

> Nathan Lynch writes:
> > Srikar Dronamraju writes:
> >
> >> * Nathan Lynch [2021-09-22 11:01:12]:
> >>
> >>> Srikar Dronamraju writes:
> >>> > * Nathan Lynch [2021-09-20 22:12:13]:
> >>> >
> >>> >> vcpu_is_preempted() can be used outside of preempt-disabled critical
> >>> >> sections, yielding warnings such as:
> >>> >>
> >>> >> BUG: using smp_processor_id() in preemptible [] code: systemd-udevd/185
> >>> >> caller is rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
> >>> >> Call Trace:
> >>> >> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
> >>> >> [c00012907b00] [c1371f70] check_preemption_disabled+0x150/0x160
> >>> >> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> [c00012907be0] [c01e1408] rwsem_down_write_slowpath+0x478/0x9a0
> >>> >> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
> >>> >> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
> >>> >> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
> >>> >> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
> >>> >> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
> >>> >>
> >>> >> The result of vcpu_is_preempted() is always subject to invalidation by
> >>> >> events inside and outside of Linux; it's just a best guess at a point in
> >>> >> time. Use raw_smp_processor_id() to avoid such warnings.
> >>> >
> >>> > Typically smp_processor_id() and raw_smp_processor_id() except for the
> >>> > CONFIG_DEBUG_PREEMPT.
> >>>
> >>> Sorry, I don't follow...
> >>
> >> I meant, unless CONFIG_DEBUG_PREEMPT, smp_processor_id() is defined as
> >> raw_processor_id().
> >>
> >>> > In the CONFIG_DEBUG_PREEMPT case, smp_processor_id()
> >>> > is actually debug_smp_processor_id(), which does all the checks.
> >>>
> >>> Yes, OK.
> >>> > I believe these checks in debug_smp_processor_id() are only valid for
> >>> > the x86 case (aka cases where they have __smp_processor_id() defined.)
> >>>
> >>> Hmm, I am under the impression that the checks in
> >>> debug_smp_processor_id() are valid regardless of whether the arch
> >>> overrides __smp_processor_id().
> >>
> >> From include/linux/smp.h
> >>
> >> /*
> >>  * Allow the architecture to differentiate between a stable and unstable read.
> >>  * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> >>  * regular asm read for the stable.
> >>  */
> >> #ifndef __smp_processor_id
> >> #define __smp_processor_id(x) raw_smp_processor_id(x)
> >> #endif
> >>
> >> As far as I see, only x86 has a definition of __smp_processor_id.
> >> So for archs like powerpc, __smp_processor_id() is always
> >> defined as raw_smp_processor_id(). Right?
> >
> > Sure, yes.
> >
> >> I would think debug_smp_processor_id() would be useful if
> >> __smp_processor_id() is different from raw_smp_processor_id(). Do note
> >> debug_smp_processor_id() calls raw_smp_processor_id().
>
> Agree.
>
> > I do not think the utility of debug_smp_processor_id() is related to
> > whether the arch defines __smp_processor_id().
> >
> >> Or can I understand how debug_smp_processor_id() is useful if
> >> __smp_processor_id() is defined as raw_smp_processor_id()?
>
> debug_smp_processor_id() is useful on powerpc, as well as other arches,
> because it checks that we're in a context where the processor id won't
> change out from under us.
>
> eg. something like this is unsafe:
>
> 	int counts[NR_CPUS];
> 	int tmp, cpu;
>
> 	cpu = smp_processor_id();
> 	tmp = counts[cpu];
> 	<- preempted here and migrated to another CPU
> 	counts[cpu] = tmp + 1;

If, let's say, the above call was replaced by raw_smp_processor_id(), how
would it avoid the preemption / migration to another CPU?
Replacing it with raw_smp_processor_id() may avoid the debug splat, but the underlying issue would still remain as is. No? > > > So, for powerpc with DEBUG_PREEMPT unset, a call to smp_processor_id() > > expands to __smp_processor_id() which expands to raw_smp_processor_id(), > > avoiding the preempt safety checks. This is working as intended. > > > > For powerpc with DEBUG_PREEMPT=y, a call to smp_processor_id() expands > > to the out of line call to debug_smp_processor_id(), which calls > > raw_smp_processor_id() and performs the checks, warning if called in an > > inappropriate context, as seen here. Also working as intended. > > > > AFAICT __smp_processor_id() is a performance/codegen-oriented hook, and > > not really related to the debug facility. Please see 9ed7d75b2f09d > > ("x86/percpu: Relax smp_processor_id()"). > > Yeah good find. > > cheers -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 3/9] xen/x86: make "earlyprintk=xen" work better for PVH Dom0
On 07.09.21 12:09, Jan Beulich wrote: The xen_hvm_early_write() path better wouldn't be taken in this case; while port 0xE9 can be used, the hypercall path is quite a bit more efficient. Put that first, as it may also work for DomU-s (see also xen_raw_console_write()). While there also bail from the function when the first domU_write_console() failed - later ones aren't going to succeed. Signed-off-by: Jan Beulich Reviewed-by: Juergen Gross Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: OpenPGP public key OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
* Michael Ellerman [2021-09-23 21:17:25]: > Srikar Dronamraju writes: > > * Michael Ellerman [2021-08-26 23:36:53]: > > > >> Srikar Dronamraju writes: > >> > Scheduler expects unique number of node distances to be available at > >> > boot. > ... > > > >> > Fake the offline node's distance_lookup_table entries so that all > >> > possible node distances are updated. > >> > >> Does this work if we have a single node offline at boot? > >> > > > > It should. > > > >> Say we start with: > >> > >> node distances: > >> node 0 1 > >> 0: 10 20 > >> 1: 20 10 > >> > >> And node 2 is offline at boot. We can only initialise that nodes entries > >> in the distance_lookup_table: > >> > >>while (i--) > >>distance_lookup_table[node][i] = node; > >> > >> By filling them all with 2 that causes node_distance(2, X) to return the > >> maximum distance for all other nodes X, because we won't break out of > >> the loop in __node_distance(): > >> > >>for (i = 0; i < distance_ref_points_depth; i++) { > >>if (distance_lookup_table[a][i] == distance_lookup_table[b][i]) > >>break; > >> > >>/* Double the distance for each NUMA level */ > >>distance *= 2; > >>} > >> > >> If distance_ref_points_depth was 4 we'd return 160. > > > > As you already know, distance 10, 20, .. are defined by Powerpc, form1 > > affinity. PAPR doesn't define actual distances, it only provides us the > > associativity. If there are distance_ref_points_depth is 4, > > (distance_ref_points_depth doesn't take local distance into consideration) > > 10, 20, 40, 80, 160. > > > >> > >> That'd leave us with 3 unique distances at boot, 10, 20, 160. > >> > > > > So if there are unique distances, then the distances as per the current > > code has to be 10, 20, 40, 80.. I dont see a way in which we have a break in > > the series. like having 160 without 80. > > I'm confused what you mean there. > At the outset, if we have a better probable solution, do let me know, I am willing to try that too. 
> If we have a node that's offline at boot then we get 160 for that node, > that's just the result of having no info for it, so we never break out > of the for loop. > > So if we have two nodes, one hop apart, and then an offline node we get > 10, 20, 160. > > Or if you're using depth = 3 then it's 10, 20, 80. > My understanding is as below: device-tree provides the max hops by way of ibm,associativity-reference-points. This is mapped to distance_ref_points_depth in Linux-powerpc. Now Linux-powerpc encodes hops as (disregarding local distance) 20, 40, 80, 160, 320 ... So if the distance_ref_points_depth is 3, then the hops are 20, 40, 80. Do you disagree? > >> But when node 2 comes online it might introduce more than 1 new distance > >> value, eg. it could be that the actual distances are: > >> > >> node distances: > >> node 0 1 2 > >> 0: 10 20 40 > >> 1: 20 10 80 > >> 2: 40 80 10 > >> > >> ie. we now have 4 distances, 10, 20, 40, 80. > >> > >> What am I missing? > > > > As I said above, I am not sure how we can have a break in the series. > > If distance_ref_points_depth is 3, the distances has to be 10,20,40,80 as > > atleast for form1 affinity. > > I agree for depth 3 we have to see 10, 20, 40, 80. But nothing > guarantees we see each value (other than 10). The hop distances are not from the device-tree, the device-tree only gives us the max hops possible. Linux-powerpc is actually hard-coding the distances, with each hop distance being 2x the previous. So we may not see any nodes at a particular hop, but we know maximum hops. And if distance_ref_points_depth is 3, then hops are 20, 40, 80 only. > > We can have two nodes one hop apart, so we have 10 and 20, then a third > node is added 3 hops away, so we get 10, 20, 80. > > The real problem is that the third node could be 3 hops from node 0 > and 2 hops from node 1, and so the addition of the third node causes > two new distance values (40 & 80) to be required. So here the max hops as given by device-tree is 3.
So we know that we are looking for a max distance of 80 by way of distance_ref_points_depth. Even if the 3rd node was at 4 hops, we would already know the max distance of 160, by way of distance_ref_points_depth. However, in the unlikely scenario where the number of possible nodes is less than the distance_ref_points_depth (aka max hops) and there are CPUless/memoryless nodes, we may not have initialized the right distances. > > I think maybe what you're saying is that in practice we don't see setups > like that. But I don't know if I'm happy with a solution that doesn't > work in the general case, and relies on the particular properties of our > current set of systems. > But our current set of systems is having a problem (systems can likely crash on adding a CPU to a node). The only other way I can think of is the previous approach where we ask the scheduler hook which tells ho
[PATCH v2 9/9] iommu/vt-d: Use pci core's DVSEC functionality
Reduce maintenance burden of DVSEC query implementation by using the centralized PCI core implementation. Cc: io...@lists.linux-foundation.org Cc: David Woodhouse Cc: Lu Baolu Signed-off-by: Ben Widawsky --- drivers/iommu/intel/iommu.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index d75f59ae28e6..30c97181f0ae 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5398,20 +5398,7 @@ static int intel_iommu_disable_sva(struct device *dev) */ static int siov_find_pci_dvsec(struct pci_dev *pdev) { - int pos; - u16 vendor, id; - - pos = pci_find_next_ext_capability(pdev, 0, 0x23); - while (pos) { - pci_read_config_word(pdev, pos + 4, &vendor); - pci_read_config_word(pdev, pos + 8, &id); - if (vendor == PCI_VENDOR_ID_INTEL && id == 5) - return pos; - - pos = pci_find_next_ext_capability(pdev, pos, 0x23); - } - - return 0; + return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5); } static bool -- 2.33.0
[PATCH v2 8/9] ocxl: Use pci core's DVSEC functionality
Reduce maintenance burden of DVSEC query implementation by using the centralized PCI core implementation. There are two obvious places to simply drop in the new core implementation. There remains find_dvsec_from_pos() which would benefit from using a core implementation. As that change is less trivial it is reserved for later. Cc: linuxppc-dev@lists.ozlabs.org Cc: Andrew Donnellan Acked-by: Frederic Barrat (v1) Signed-off-by: Ben Widawsky --- arch/powerpc/platforms/powernv/ocxl.c | 3 ++- drivers/misc/ocxl/config.c| 13 + 2 files changed, 3 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c index 9105efcf242a..28b009b46464 100644 --- a/arch/powerpc/platforms/powernv/ocxl.c +++ b/arch/powerpc/platforms/powernv/ocxl.c @@ -107,7 +107,8 @@ static int get_max_afu_index(struct pci_dev *dev, int *afu_idx) int pos; u32 val; - pos = find_dvsec_from_pos(dev, OCXL_DVSEC_FUNC_ID, 0); + pos = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, + OCXL_DVSEC_FUNC_ID); if (!pos) return -ESRCH; diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c index a68738f38252..e401a51596b9 100644 --- a/drivers/misc/ocxl/config.c +++ b/drivers/misc/ocxl/config.c @@ -33,18 +33,7 @@ static int find_dvsec(struct pci_dev *dev, int dvsec_id) { - int vsec = 0; - u16 vendor, id; - - while ((vsec = pci_find_next_ext_capability(dev, vsec, - OCXL_EXT_CAP_ID_DVSEC))) { - pci_read_config_word(dev, vsec + OCXL_DVSEC_VENDOR_OFFSET, - &vendor); - pci_read_config_word(dev, vsec + OCXL_DVSEC_ID_OFFSET, &id); - if (vendor == PCI_VENDOR_ID_IBM && id == dvsec_id) - return vsec; - } - return 0; + return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, dvsec_id); } static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx) -- 2.33.0
[PATCH v2 7/9] cxl/pci: Use pci core's DVSEC functionality
Reduce maintenance burden of DVSEC query implementation by using the centralized PCI core implementation. Signed-off-by: Ben Widawsky --- drivers/cxl/pci.c | 20 +--- 1 file changed, 1 insertion(+), 19 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 5eaf2736f779..79d4d9b16d83 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -340,25 +340,7 @@ static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec) { - int pos; - - pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC); - if (!pos) - return 0; - - while (pos) { - u16 vendor, id; - - pci_read_config_word(pdev, pos + PCI_DVSEC_HEADER1, &vendor); - pci_read_config_word(pdev, pos + PCI_DVSEC_HEADER2, &id); - if (vendor == PCI_DVSEC_VENDOR_ID_CXL && dvsec == id) - return pos; - - pos = pci_find_next_ext_capability(pdev, pos, - PCI_EXT_CAP_ID_DVSEC); - } - - return 0; + return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL, dvsec); } static int cxl_probe_regs(struct cxl_mem *cxlm, struct cxl_register_map *map) -- 2.33.0
[PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC
Add pci_find_dvsec_capability to locate a Designated Vendor-Specific Extended Capability with the specified DVSEC ID. The Designated Vendor-Specific Extended Capability (DVSEC) allows one or more vendor specific capabilities that aren't tied to the vendor ID of the PCI component. DVSEC is critical for both the Compute Express Link (CXL) driver as well as the driver for OpenCAPI coherent accelerator (OCXL). Cc: David E. Box Cc: Jonathan Cameron Cc: Bjorn Helgaas Cc: Dan Williams Cc: linux-...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Andrew Donnellan Cc: Lu Baolu Reviewed-by: Frederic Barrat Signed-off-by: Ben Widawsky --- drivers/pci/pci.c | 32 include/linux/pci.h | 1 + 2 files changed, 33 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index ce2ab62b64cf..94ac86ff28b0 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -732,6 +732,38 @@ u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap) } EXPORT_SYMBOL_GPL(pci_find_vsec_capability); +/** + * pci_find_dvsec_capability - Find DVSEC for vendor + * @dev: PCI device to query + * @vendor: Vendor ID to match for the DVSEC + * @dvsec: Designated Vendor-specific capability ID + * + * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability + * offset in config space; otherwise return 0. 
+ */ +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec) +{ + int pos; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC); + if (!pos) + return 0; + + while (pos) { + u16 v, id; + + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &v); + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &id); + if (vendor == v && dvsec == id) + return pos; + + pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC); + } + + return 0; +} +EXPORT_SYMBOL_GPL(pci_find_dvsec_capability); + /** * pci_find_parent_resource - return resource region of parent bus of given * region diff --git a/include/linux/pci.h b/include/linux/pci.h index cd8aa6fce204..c93ccfa4571b 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1130,6 +1130,7 @@ u16 pci_find_ext_capability(struct pci_dev *dev, int cap); u16 pci_find_next_ext_capability(struct pci_dev *dev, u16 pos, int cap); struct pci_bus *pci_find_next_bus(const struct pci_bus *from); u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap); +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec); u64 pci_get_dsn(struct pci_dev *dev); -- 2.33.0
[PATCH v2 5/9] cxl/pci: Make more use of cxl_register_map
The structure exists to pass around information about register mapping. Using it more extensively cleans up many existing functions. Signed-off-by: Ben Widawsky --- drivers/cxl/cxl.h | 1 + drivers/cxl/pci.c | 36 +--- 2 files changed, 18 insertions(+), 19 deletions(-) diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 7d6b011dd963..3b128ce71564 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -140,6 +140,7 @@ struct cxl_device_reg_map { }; struct cxl_register_map { + void __iomem *base; u64 block_offset; u8 reg_type; u8 barno; diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index bbbacbc94fbf..5eaf2736f779 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -306,35 +306,36 @@ static int cxl_pci_setup_mailbox(struct cxl_mem *cxlm) return 0; } -static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, - u8 bar, u64 offset) +static int cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) { - void __iomem *addr; + int bar = map->barno; struct device *dev = cxlm->dev; struct pci_dev *pdev = to_pci_dev(dev); + resource_size_t offset = map->block_offset; /* Basic sanity check that BAR is big enough */ if (pci_resource_len(pdev, bar) < offset) { dev_err(dev, "BAR%d: %pr: too small (offset: %#llx)\n", bar, &pdev->resource[bar], (unsigned long long)offset); - return IOMEM_ERR_PTR(-ENXIO); + return -ENXIO; } - addr = pci_iomap(pdev, bar, 0); - if (!addr) { + map->base = pci_iomap(pdev, bar, 0); + if (!map->base) { dev_err(dev, "failed to map registers\n"); - return addr; + return PTR_ERR(map->base); } dev_dbg(dev, "Mapped CXL Memory Device resource bar %u @ %#llx\n", bar, offset); - return addr; + return 0; } -static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, void __iomem *base) +static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) { - pci_iounmap(to_pci_dev(cxlm->dev), base); + pci_iounmap(to_pci_dev(cxlm->dev), map->base); + map->base = 0; } static int cxl_pci_dvsec(struct pci_dev 
*pdev, int dvsec) @@ -360,9 +361,9 @@ static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec) return 0; } -static int cxl_probe_regs(struct cxl_mem *cxlm, void __iomem *base, - struct cxl_register_map *map) +static int cxl_probe_regs(struct cxl_mem *cxlm, struct cxl_register_map *map) { + void __iomem *base = map->base + map->block_offset; struct cxl_component_reg_map *comp_map; struct cxl_device_reg_map *dev_map; struct device *dev = cxlm->dev; @@ -487,7 +488,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) for (i = 0; i < ARRAY_SIZE(types); i++) { struct cxl_register_map map; - void __iomem *base; rc = find_register_block(pdev, types[i], &map); if (rc) { @@ -498,14 +498,12 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) break; } - base = cxl_pci_map_regblock(cxlm, map.barno, map.block_offset); - if (!base) { - rc = -ENOMEM; + rc = cxl_pci_map_regblock(cxlm, &map); + if (rc) break; - } - rc = cxl_probe_regs(cxlm, base + map.block_offset, &map); - cxl_pci_unmap_regblock(cxlm, base); + rc = cxl_probe_regs(cxlm, &map); + cxl_pci_unmap_regblock(cxlm, &map); if (rc) break; -- 2.33.0
[PATCH v2 4/9] cxl/pci: Refactor cxl_pci_setup_regs
In preparation for moving parts of register mapping to cxl_core, the cxl_pci driver is refactored to utilize a new helper to find register blocks by type. Previously, cxl_pci scanned through all register blocks and mapped the ones that the driver will use. This logic is inverted so that the driver specifically requests the register blocks from a new helper. Under the hood, the same implementation of scanning through all register locator DVSEC entries exists. There are 2 behavioral changes (#2 is arguable): 1. A dev_err is introduced if cxl_map_regs fails. 2. The previous logic would try to map component registers and device registers multiple times if they were present, and keep the mapping of the last one found (furthest offset in the register locator). While this is disallowed by the spec (CXL 2.0 8.1.9: "Each register block identifier shall only occur once in the Register Locator DVSEC structure"), it was how the driver would respond to the spec violation. The new logic will take the first found register block by type and move on.
Signed-off-by: Ben Widawsky --- Changes since v1: --- drivers/cxl/pci.c | 126 +- 1 file changed, 70 insertions(+), 56 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 7256c236fdb3..bbbacbc94fbf 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -428,6 +428,45 @@ static void cxl_decode_register_block(u32 reg_lo, u32 reg_hi, *reg_type = FIELD_GET(CXL_REGLOC_RBI_MASK, reg_lo); } +static int find_register_block(struct pci_dev *pdev, enum cxl_regloc_type type, + struct cxl_register_map *map) +{ + int regloc, i, rc = -ENODEV; + u32 regloc_size, regblocks; + + regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC_DVSEC_ID); + if (!regloc) + return -ENXIO; + + pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, ®loc_size); + regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size); + + regloc += PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET; + regblocks = (regloc_size - PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET) / 8; + + for (i = 0; i < regblocks; i++, regloc += 8) { + u32 reg_lo, reg_hi; + u8 reg_type, bar; + u64 offset; + + pci_read_config_dword(pdev, regloc, ®_lo); + pci_read_config_dword(pdev, regloc + 4, ®_hi); + + cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset, + ®_type); + + if (reg_type == type) { + map->barno = bar; + map->block_offset = offset; + map->reg_type = reg_type; + rc = 0; + break; + } + } + + return rc; +} + /** * cxl_pci_setup_regs() - Setup necessary MMIO. * @cxlm: The CXL memory device to communicate with. 
@@ -440,69 +479,44 @@ static void cxl_decode_register_block(u32 reg_lo, u32 reg_hi, */ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) { - void __iomem *base; - u32 regloc_size, regblocks; - int regloc, i, n_maps, ret = 0; + int rc, i; struct device *dev = cxlm->dev; struct pci_dev *pdev = to_pci_dev(dev); - struct cxl_register_map *map, maps[CXL_REGLOC_RBI_TYPES]; + const enum cxl_regloc_type types[] = { CXL_REGLOC_RBI_MEMDEV, + CXL_REGLOC_RBI_COMPONENT }; - regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC_DVSEC_ID); - if (!regloc) { - dev_err(dev, "register location dvsec not found\n"); - return -ENXIO; - } + for (i = 0; i < ARRAY_SIZE(types); i++) { + struct cxl_register_map map; + void __iomem *base; - /* Get the size of the Register Locator DVSEC */ - pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, ®loc_size); - regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size); - - regloc += PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET; - regblocks = (regloc_size - PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET) / 8; - - for (i = 0, n_maps = 0; i < regblocks; i++, regloc += 8) { - u32 reg_lo, reg_hi; - u8 reg_type; - u64 offset; - u8 bar; - - pci_read_config_dword(pdev, regloc, ®_lo); - pci_read_config_dword(pdev, regloc + 4, ®_hi); - - cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset, - ®_type); - - /* Ignore unknown register block types */ - if (reg_type > CXL_REGLOC_RBI_MEMDEV) - continue; - - base = cxl_pci_map_regblock(cxlm, bar, offset); - if (!base) - return -ENOMEM; - - map = &maps[n_maps]; -
[PATCH v2 3/9] cxl/pci: Remove pci request/release regions
Quoting Dan, "... the request + release regions should probably just be dropped. It's not like any of the register enumeration would collide with someone else who already has the registers mapped. The collision only comes when the registers are mapped for their final usage, and that will have more precision in the request." Recommended-by: Dan Williams Signed-off-by: Ben Widawsky --- drivers/cxl/pci.c | 5 - 1 file changed, 5 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index ccc7c2573ddc..7256c236fdb3 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -453,9 +453,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) return -ENXIO; } - if (pci_request_mem_regions(pdev, pci_name(pdev))) - return -ENODEV; - /* Get the size of the Register Locator DVSEC */ pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, ®loc_size); regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size); @@ -499,8 +496,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) n_maps++; } - pci_release_mem_regions(pdev); - for (i = 0; i < n_maps; i++) { ret = cxl_map_regs(cxlm, &maps[i]); if (ret) -- 2.33.0
[PATCH v2 2/9] cxl/pci: Remove dev_dbg for unknown register blocks
While interesting to driver developers, the dev_dbg message doesn't do much except clutter up logs. This information should be attainable through sysfs, and someday lspci like utilities. This change additionally helps reduce the LOC in a subsequent patch to refactor some of cxl_pci register mapping. Signed-off-by: Ben Widawsky --- drivers/cxl/pci.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 64180f46c895..ccc7c2573ddc 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -475,9 +475,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset, ®_type); - dev_dbg(dev, "Found register block in bar %u @ 0x%llx of type %u\n", - bar, offset, reg_type); - /* Ignore unknown register block types */ if (reg_type > CXL_REGLOC_RBI_MEMDEV) continue; -- 2.33.0
[PATCH v2 1/9] cxl: Convert "RBI" to enum
In preparation for passing around the Register Block Indicator (RBI) as a parameter, it is desirable to convert the type to an enum so that the interface can use a well-defined, type-checked parameter. As a result of this change, should future versions of the spec add sparsely defined identifiers, it could become a problem when checking for invalid identifiers, since the code currently checks against the max identifier. This is not an issue with the current spec, and the algorithm to obtain the register blocks will change before those possible additions are made. Signed-off-by: Ben Widawsky --- drivers/cxl/pci.h | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/cxl/pci.h b/drivers/cxl/pci.h index 8c1a58813816..7d3e4bf06b45 100644 --- a/drivers/cxl/pci.h +++ b/drivers/cxl/pci.h @@ -20,13 +20,15 @@ #define CXL_REGLOC_BIR_MASK GENMASK(2, 0) /* Register Block Identifier (RBI) */ -#define CXL_REGLOC_RBI_MASK GENMASK(15, 8) -#define CXL_REGLOC_RBI_EMPTY 0 -#define CXL_REGLOC_RBI_COMPONENT 1 -#define CXL_REGLOC_RBI_VIRT 2 -#define CXL_REGLOC_RBI_MEMDEV 3 -#define CXL_REGLOC_RBI_TYPES CXL_REGLOC_RBI_MEMDEV + 1 +enum cxl_regloc_type { + CXL_REGLOC_RBI_EMPTY = 0, + CXL_REGLOC_RBI_COMPONENT, + CXL_REGLOC_RBI_VIRT, + CXL_REGLOC_RBI_MEMDEV, + CXL_REGLOC_RBI_TYPES +}; +#define CXL_REGLOC_RBI_MASK GENMASK(15, 8) #define CXL_REGLOC_ADDR_MASK GENMASK(31, 16) #endif /* __CXL_PCI_H__ */ -- 2.33.0
[PATCH v2 0/9] cxl_pci refactor for reusability
Changes since v1 [3]: - get_max_afu_index() updated to new interface (Dan) - siov_find_pci_dvsec updated to new interface (Dan) - remove unnecessary pci request/release regions with refactor (Dan) - move @base into cxl_register_map (Dan) - Expanded the Cc list to include other users of PCIe DVSEC. (Dan) Three spots are not converted to use the new PCI core DVSEC helper as the changes are non-trivial. - find_dvsec_afu_ctrl (ocxl) - pmt_pci_probe (David) - intel_uncore_has_discovery_tables (Kan) The interdiff is below. A range-diff can be generated against v1[3] if desired. Discussion occurred partially offlist between Dan and myself. To summarize, Dan contends that patch 4, "cxl/pci: Refactor cxl_pci_setup_regs" should either be rewritten, or have a precursor patch so that cxl_pci will still scan all register blocks, but only claim the specified types. While the current diff is somewhat complex, I contend that this is unneeded churn because we can easily test this change in our existing regression testing. It also ultimately results in a simpler function in the cxl_port patch series when cxl_pci stops trying to map the component register block (loop is removed entirely). A tiebreak here would be great :-) --- Original commit message: Provide the ability to obtain CXL register blocks as discrete functionality. This functionality will become useful for other CXL drivers that need access to CXL register blocks. It is also in line with other additions to core which move register mapping functionality. At the introduction of the CXL driver the only user of CXL MMIO was cxl_pci (then known as cxl_mem). As the driver has evolved it is clear that cxl_pci will not be the only entity that needs access to CXL MMIO. This series stops short of moving the generalized functionality into cxl_core for the sake of getting eyes on the important foundational bits sooner rather than later. The ultimate plan is to move much of the code into cxl_core.
Via review of two previous patches [1] & [2] it has been suggested that the bits which are being used for DVSEC enumeration move into PCI core. As CXL core is soon going to require these, let's try to get the ball rolling now on making that happen. --- [1]: https://lore.kernel.org/linux-pci/20210913190131.xiiszmno46qie...@intel.com/ [2]: https://lore.kernel.org/linux-cxl/20210920225638.1729482-1-ben.widaw...@intel.com/ [3]: https://lore.kernel.org/linux-cxl/20210921220459.2437386-1-ben.widaw...@intel.com/ Ben Widawsky (9): cxl: Convert "RBI" to enum cxl/pci: Remove dev_dbg for unknown register blocks cxl/pci: Remove pci request/release regions cxl/pci: Refactor cxl_pci_setup_regs cxl/pci: Make more use of cxl_register_map PCI: Add pci_find_dvsec_capability to find designated VSEC cxl/pci: Use pci core's DVSEC functionality ocxl: Use pci core's DVSEC functionality iommu/vt-d: Use pci core's DVSEC functionality arch/powerpc/platforms/powernv/ocxl.c | 3 +- drivers/cxl/cxl.h | 1 + drivers/cxl/pci.c | 176 -- drivers/cxl/pci.h | 14 +- drivers/iommu/intel/iommu.c | 15 +-- drivers/misc/ocxl/config.c| 13 +- drivers/pci/pci.c | 32 + include/linux/pci.h | 1 + 8 files changed, 127 insertions(+), 128 deletions(-) Interdiff against v1: diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c index 9105efcf242a..28b009b46464 100644 --- a/arch/powerpc/platforms/powernv/ocxl.c +++ b/arch/powerpc/platforms/powernv/ocxl.c @@ -107,7 +107,8 @@ static int get_max_afu_index(struct pci_dev *dev, int *afu_idx) int pos; u32 val; - pos = find_dvsec_from_pos(dev, OCXL_DVSEC_FUNC_ID, 0); + pos = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, + OCXL_DVSEC_FUNC_ID); if (!pos) return -ESRCH; diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 7d6b011dd963..3b128ce71564 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -140,6 +140,7 @@ struct cxl_device_reg_map { }; struct cxl_register_map { + void __iomem *base; u64 block_offset; u8 reg_type; u8 
barno; diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 040379f727ad..79d4d9b16d83 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -306,9 +306,8 @@ static int cxl_pci_setup_mailbox(struct cxl_mem *cxlm) return 0; } -static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) +static int cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) { - void __iomem *addr; int bar = map->barno; struct device *dev = cxlm->dev; struct pci_dev *pdev = to_pci_dev(dev); @@ -318,24 +317,25 @@ static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_regis if (pci_resource_len(pde
[Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
https://bugzilla.kernel.org/show_bug.cgi?id=213837 --- Comment #9 from Erhard F. (erhar...@mailbox.org) --- Created attachment 298933 --> https://bugzilla.kernel.org/attachment.cgi?id=298933&action=edit System.map (5.15-rc2 + patch, PowerMac G5 11,2) (In reply to mpe from comment #8) > So it looks like you have actually overran your stack, rather than > something else clobbering your stack. > > Can you attach your System.map for that exact kernel? We might be able > to work out what functions we were in when we overran. > > You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep > the system running a bit longer and give us some other clues. > > cheers Hm, interesting... What I do to trigger this bug is building llvm-12 on the G5 via distcc (on the other side is a 16-core Opteron) and MAKEOPTS="-j10 -l3". As the G5 got 16 GiB RAM building runs in a zstd-compressed ext2 filesystem (/sbin/zram-init -d1 -s2 -azstd -text2 -orelatime -m1777 -Lvar_tmp_dir 49152 /var/tmp). Most of the time the bug is triggered very shortly after the actual building starts via meson. At this time the build directory /var/tmp/portage occupies about 800 MiB. Also sometimes I don't get a proper stack trace via netconsole but this: BUG: unable to handle kernel data access on write at 0xc00037c82040 BUG: unable to handle kernel data access on write at 0xc00037c8 Please find the relevant System.map attached. I'll do another kernel build with CONFIG_THREAD_SHIFT=15 and see if anything changes. Thanks for investigating this! -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug. You are watching someone on the CC list of the bug.
Re: [PATCH 0/2] kvm: fix KVM_MAX_VCPU_ID handling
On 13/09/21 15:57, Juergen Gross wrote: Revert commit 76b4f357d0e7d8f6f00 which was based on wrong reasoning and rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS in order to avoid the same issue in future. Juergen Gross (2): x86/kvm: revert commit 76b4f357d0e7d8f6f00 kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS Documentation/virt/kvm/devices/xics.rst| 2 +- Documentation/virt/kvm/devices/xive.rst| 2 +- arch/mips/kvm/mips.c | 2 +- arch/powerpc/include/asm/kvm_book3s.h | 2 +- arch/powerpc/include/asm/kvm_host.h| 4 ++-- arch/powerpc/kvm/book3s_xive.c | 2 +- arch/powerpc/kvm/powerpc.c | 2 +- arch/x86/include/asm/kvm_host.h| 2 +- arch/x86/kvm/ioapic.c | 2 +- arch/x86/kvm/ioapic.h | 4 ++-- arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 4 ++-- tools/testing/selftests/kvm/kvm_create_max_vcpus.c | 2 +- virt/kvm/kvm_main.c| 2 +- 14 files changed, 17 insertions(+), 17 deletions(-) Queued, thanks. Paolo
Re: [PATCH 0/3] memblock: cleanup memblock_free interface
On Thu, Sep 23, 2021 at 12:43 AM Mike Rapoport wrote: > > The core change is in the third patch that makes memblock_free() a > counterpart of memblock_alloc() and adds memblock_phys_alloc() to be a ^^^ > counterpart of memblock_phys_alloc(). That should be 'memblock_phys_free()' HOWEVER. The real reason I'm replying is that this patch is horribly buggy, and will cause subtle problems that are nasty to debug. You need to be a LOT more careful. From a trivial check - exactly because I looked at doing it with a script, and decided it's not so easy - I found cases like this: - memblock_free(__pa(paca_ptrs) + new_ptrs_size, + memblock_free(paca_ptrs + new_ptrs_size, which is COMPLETELY wrong. Why? Because now that addition is done as _pointer_ addition, not as an integer addition, and the end result is something completely different. paca_ptrs is of type 'struct paca_struct **', so when you add new_ptrs_size to it, it will add it in terms of that many pointers, not that many bytes. You need to use some smarter scripting, or some way to validate it. And no, making the scripting just replace '__pa(x)' with '(void *)(x)' - which _would_ be mindless and get the same result - is not acceptable either, because it avoids one of the big improvements from using the right interface, namely having compiler type checking (and saner code that people understand). So NAK. No broken automated scripting patches. Linus
[PATCH] KVM: PPC: Book3S HV: Use GLOBAL_TOC for kvmppc_h_set_dabr/xdabr()
kvmppc_h_set_dabr(), and kvmppc_h_set_xdabr() which jumps into it, need to use _GLOBAL_TOC to set up the kernel TOC pointer, because kvmppc_h_set_dabr() uses LOAD_REG_ADDR() to load dawr_force_enable. When called from hcall_try_real_mode() we have the kernel TOC in r2, established near the start of kvmppc_interrupt_hv(), so there is no issue. But they can also be called from kvmppc_pseries_do_hcall() which is module code, so the access ends up happening with the kvm-hv module's r2, which will not point at dawr_force_enable and could even cause a fault. With the current code layout and compilers we haven't observed a fault in practice, the load hits somewhere in kvm-hv.ko and silently returns some bogus value. Note that we expect p8/p9 guests to use the DAWR, but SLOF uses h_set_dabr() to test if sc1 works correctly, see SLOF's lib/libhvcall/brokensc1.c. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Ellerman --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 90484425a1e6..30a8a07cff18 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1999,7 +1999,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) .globl hcall_real_table_end hcall_real_table_end: -_GLOBAL(kvmppc_h_set_xdabr) +_GLOBAL_TOC(kvmppc_h_set_xdabr) EXPORT_SYMBOL_GPL(kvmppc_h_set_xdabr) andi. r0, r5, DABRX_USER | DABRX_KERNEL beq 6f @@ -2009,7 +2009,7 @@ EXPORT_SYMBOL_GPL(kvmppc_h_set_xdabr) 6: li r3, H_PARAMETER blr -_GLOBAL(kvmppc_h_set_dabr) +_GLOBAL_TOC(kvmppc_h_set_dabr) EXPORT_SYMBOL_GPL(kvmppc_h_set_dabr) li r5, DABRX_USER | DABRX_KERNEL 3: -- 2.25.1
[Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
https://bugzilla.kernel.org/show_bug.cgi?id=213837 --- Comment #8 from m...@ellerman.id.au --- bugzilla-dae...@bugzilla.kernel.org writes: > https://bugzilla.kernel.org/show_bug.cgi?id=213837 > > --- Comment #7 from Erhard F. (erhar...@mailbox.org) --- > Created attachment 298919 > --> https://bugzilla.kernel.org/attachment.cgi?id=298919&action=edit > dmesg (5.15-rc2 + patch, PowerMac G5 11,2) > > (In reply to mpe from comment #6) >> Can you try this patch, it might help us work out what is corrupting the >> stack. > With your patch applied to recent v5.15-rc2 the output looks like this: > > [...] > stack corrupted? stack end = 0xc00029fdc000 > stack: c00029fdbc00: 5a5a5a5a 5a5a5a5a > ... > Can't make much sense out of it but hopefully you can. ;) Thanks. Obvious isn't it? ;) stack corrupted? stack end = 0xc00029fdc000 stack: c00029fdbc00: 5a5a5a5a 5a5a5a5a stack: c00029fdbc10: 0ddc 7c10 |... stack: c00029fdbc20: 29fc4e41 673d4bb3 5a5a5a5a 5a5a5a5a ).NAg=K. stack: c00029fdbc30: 0ddc 8e10 stack: c00029fdbc40: 41fc4e41 673d41a3 A.NAg=A. stack: c00029fdbc50: 5a5a5a5a 5a5a5a5a stack: c00029fdbc60: 0ddc 8e0c stack: c00029fdbc70: 79fc4e41 673d4dab 5a5a5a5a 5a5a5a5a y.NAg=M. stack: c00029fdbc80: 0ddc 9008 stack: c00029fdbc90: 91fc4e41 673d4573 ..NAg=Es stack: c00029fdbca0: 5a5a5a5a 5a5a5a5a stack: c00029fdbcb0: 0dd7 ac16 stack: c00029fdbcc0: c9fc4e41 673d4203 5a5a5a5a 5a5a5a5a ..NAg=B. stack: c00029fdbcd0: 0ddc 6c04 l... stack: c00029fdbce0: e1fc4e41 673d474b ..NAg=GK stack: c00029fdbcf0: 5a5a5a5a 5a5a5a5a stack: c00029fdbd00: 0ddc 8800 stack: c00029fdbd10: 19fd4e41 673d4143 5a5a5a5a 5a5a5a5a ..NAg=AC stack: c00029fdbd20: 0ddb 6c0e l... stack: c00029fdbd30: 31fd4e41 673d4f43 1.NAg=OC stack: c00029fdbd40: 5a5a5a5a 5a5a5a5a stack: c00029fdbd50: 0ddc 8e08 stack: c00029fdbd60: 69fd4e41 673d407b 5a5a5a5a 5a5a5a5a i.NAg=@{ stack: c00029fdbd70: 0ddc 9208 stack: c00029fdbd80: 81fd4e41 673d4633 ..NAg=F3 stack: c00029fdbd90: 5a5a5a5a 5a5a5a5a stack: c00029fdbda0: 0ddb 4218 B... 
stack: c00029fdbdb0: b9fd4e41 673d42fb 5a5a5a5a 5a5a5a5a ..NAg=B. stack: c00029fdbdc0: 0ddc 7e18 ~... stack: c00029fdbdd0: d1fd4e41 673d4a1b ..NAg=J. stack: c00029fdbde0: 5a5a5a5a 5a5a5a5a stack: c00029fdbdf0: 0ddc 8e04 stack: c00029fdbe00: 09fe4e41 673d4ee3 5a5a5a5a 5a5a5a5a ..NAg=N. stack: c00029fdbe10: 0dd9 721c r... stack: c00029fdbe20: 21fe4e41 673d4fa3 !.NAg=O. That's slab data. It's not clear what the actual data is, but because you booted with slub_debug=FZP we can see the red zones and poison. The 0xcc bytes are SLUB_RED_ACTIVE, and 5a5a5a5a is POISON_INUSE (see poison.h) stack: c00029fdbe30: c000 29fdbeb0 )... But then here we have an obvious pointer (big endian FTW). And it points nearby, just slightly higher in memory, so that looks suspiciously like a stack back chain pointer. There are more similar values if you look further. But we shouldn't be seeing the stack yet, it's meant to start (end) at c00029fdc000 ... stack: c00029fdbe40: 0ddc 9400 stack: c00029fdbe50: 59fe4e41 673d4933 5a5a5a5a 5a5a5a5a Y.NAg=I3 stack: c00029fdbe60: 0dd9 6024 `..$ stack: c00029fdbe70: 71fe4e41 673d416b q.NAg=Ak stack: c00029fdbe80: 5a5a5a5a 5a5a5a5a stack: c00029fdbe90: 0ddc 600c `... stack: c00029fdbea0: c000 29fdbf20 0002 ).. stack: c00029fdbeb0: c000 29fdbf30 0ddc 7e1c )..0~... <---
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 23/09/2021 at 14:01, Mike Rapoport wrote: On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote: On 23/09/2021 at 09:43, Mike Rapoport wrote: From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport --- diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 1a04e5bdf655..37826d8c4f74 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void) /* Get the CPU registers */ smp_save_cpu_regs(sa, addr, is_boot_cpu, page); } - memblock_free(page, PAGE_SIZE); + memblock_phys_free(page, PAGE_SIZE); diag_amode31_ops.diag308_reset(); pcpu_set_smt(0); } @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void) /* Add CPUs present at boot */ __smp_rescan_cpus(info, true); - memblock_free_early((unsigned long)info, sizeof(*info)); + memblock_free(info, sizeof(*info)); } /* I'm a bit lost. IIUC memblock_free_early() and memblock_free() were identical. Yes, they were, but all calls to memblock_free_early() were using __pa(vaddr) because they had a virtual address at hand. I'm still not following. In the above memblock_free_early() was taking (unsigned long)info. Was it a bug? It looks odd to hide bug fixes in such a big patch, should that bug fix go in patch 2?
In the first hunk memblock_free() gets replaced by memblock_phys_free() In the second hunk memblock_free_early() gets replaced by memblock_free() In the first hunk the memory is allocated with memblock_phys_alloc() and we have a physical range to free. In the second hunk the memory is allocated with memblock_alloc() and we are freeing a virtual pointer. I think it would be easier to follow if you could split it in several patches: It was an explicit request from Linus to make it a single commit: but the actual commit can and should be just a single commit that just fixes 'memblock_free()' to have sane interfaces. I don't feel strongly about splitting it (except my laziness really objects), but I don't think doing the conversion in several steps worth the churn. The commit is quite big (55 files changed, approx 100 lines modified). If done in the right order the change should be minimal. It is rather not easy to follow and review when an existing function (namely memblock_free()) disappears and re-appears in the same commit but to do something different. You do: - memblock_free() ==> memblock_phys_free() - memblock_free_ptr() ==> memblock_free() At least you could split it in two patches; the advantage would be that between the first and second patch memblock_free() doesn't exist anymore, so you can check you really don't have any more users. - First patch: Create memblock_phys_free() and change all relevant memblock_free() to memblock_phys_free() - Or change memblock_free() to memblock_phys_free() and make memblock_free() an alias of it.
- Second patch: Make memblock_free_ptr() become memblock_free() and change all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr)) becomes memblock_free(ptr)) and make memblock_free_ptr() an alias of memblock_free() - Third patch: Replace and drop memblock_free_ptr() - Fourth patch: Drop memblock_free_early() and memblock_free_early_nid() (All users should have been upgraded to memblock_phys_free() in patch 1 or memblock_free() in patch 2) Christophe
Re: [PATCH v2 1/2] powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
On 23/09/2021 15:21, Michael Ellerman wrote: > Krzysztof Kozlowski writes: >> g5_phy_disable_cpu1() is used outside of platforms/powermac/feature.c, >> so it should have a declaration to fix W=1 warning: >> >> arch/powerpc/platforms/powermac/feature.c:1533:6: >> error: no previous prototype for ‘g5_phy_disable_cpu1’ >> [-Werror=missing-prototypes] >> >> Signed-off-by: Krzysztof Kozlowski >> >> --- >> >> Changes since v1: >> 1. Drop declaration in powermac/smp.c > > Sorry I missed v1, so I'm going to ask for more changes :} > >> arch/powerpc/include/asm/pmac_feature.h | 4 > > Putting it here exposes it to the whole kernel, but it's only needed > inside arch/powerpc/platforms/powermac. > > The right place for the prototype is arch/powerpc/platforms/powermac/pmac.h, > which is for platform internal prototypes. I'll fix it up. > >> arch/powerpc/platforms/powermac/smp.c | 2 -- >> 2 files changed, 4 insertions(+), 2 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/pmac_feature.h >> b/arch/powerpc/include/asm/pmac_feature.h >> index e08e829261b6..7703e5bf1203 100644 >> --- a/arch/powerpc/include/asm/pmac_feature.h >> +++ b/arch/powerpc/include/asm/pmac_feature.h >> @@ -143,6 +143,10 @@ >> */ >> struct device_node; >> >> +#ifdef CONFIG_PPC64 >> +void g5_phy_disable_cpu1(void); >> +#endif /* CONFIG_PPC64 */ > > The ifdef around the prototype doesn't gain much, and is extra visual > noise, so I'd rather without it. Sure. Best regards, Krzysztof
Re: [PATCH v2 1/2] powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
Krzysztof Kozlowski writes: > g5_phy_disable_cpu1() is used outside of platforms/powermac/feature.c, > so it should have a declaration to fix W=1 warning: > > arch/powerpc/platforms/powermac/feature.c:1533:6: > error: no previous prototype for ‘g5_phy_disable_cpu1’ > [-Werror=missing-prototypes] > > Signed-off-by: Krzysztof Kozlowski > > --- > > Changes since v1: > 1. Drop declaration in powermac/smp.c Sorry I missed v1, so I'm going to ask for more changes :} > arch/powerpc/include/asm/pmac_feature.h | 4 Putting it here exposes it to the whole kernel, but it's only needed inside arch/powerpc/platforms/powermac. The right place for the prototype is arch/powerpc/platforms/powermac/pmac.h, which is for platform internal prototypes. > arch/powerpc/platforms/powermac/smp.c | 2 -- > 2 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/pmac_feature.h > b/arch/powerpc/include/asm/pmac_feature.h > index e08e829261b6..7703e5bf1203 100644 > --- a/arch/powerpc/include/asm/pmac_feature.h > +++ b/arch/powerpc/include/asm/pmac_feature.h > @@ -143,6 +143,10 @@ > */ > struct device_node; > > +#ifdef CONFIG_PPC64 > +void g5_phy_disable_cpu1(void); > +#endif /* CONFIG_PPC64 */ The ifdef around the prototype doesn't gain much, and is extra visual noise, so I'd rather without it. cheers
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 23.09.21 09:43, Mike Rapoport wrote: From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport arch/x86/xen/ parts: Reviewed-by: Juergen Gross Juergen
Re: [PATCH 2/3] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer
On 23.09.21 09:43, Mike Rapoport wrote: From: Mike Rapoport free_p2m_page() wrongly passes a virtual pointer to memblock_free() that treats it as a physical address. Call memblock_free_ptr() instead that gets a virtual address to free the memory. Signed-off-by: Mike Rapoport Reviewed-by: Juergen Gross Juergen
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote: > > > Le 23/09/2021 à 09:43, Mike Rapoport a écrit : > > From: Mike Rapoport > > > > For ages memblock_free() interface dealt with physical addresses even > > despite the existence of memblock_alloc_xx() functions that return a > > virtual pointer. > > > > Introduce memblock_phys_free() for freeing physical ranges and repurpose > > memblock_free() to free virtual pointers to make the following pairing > > abundantly clear: > > > > int memblock_phys_free(phys_addr_t base, phys_addr_t size); > > phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); > > > > void *memblock_alloc(phys_addr_t size, phys_addr_t align); > > void memblock_free(void *ptr, size_t size); > > > > Replace intermediate memblock_free_ptr() with memblock_free() and drop > > unnecessary aliases memblock_free_early() and memblock_free_early_nid(). > > > > Suggested-by: Linus Torvalds > > Signed-off-by: Mike Rapoport > > --- > > > diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c > > index 1a04e5bdf655..37826d8c4f74 100644 > > --- a/arch/s390/kernel/smp.c > > +++ b/arch/s390/kernel/smp.c > > @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void) > > /* Get the CPU registers */ > > smp_save_cpu_regs(sa, addr, is_boot_cpu, page); > > } > > - memblock_free(page, PAGE_SIZE); > > + memblock_phys_free(page, PAGE_SIZE); > > diag_amode31_ops.diag308_reset(); > > pcpu_set_smt(0); > > } > > @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void) > > /* Add CPUs present at boot */ > > __smp_rescan_cpus(info, true); > > - memblock_free_early((unsigned long)info, sizeof(*info)); > > + memblock_free(info, sizeof(*info)); > > } > > /* > > I'm a bit lost. IIUC memblock_free_early() and memblock_free() where > identical. Yes, they were, but all calls to memblock_free_early() were using __pa(vaddr) because they had a virtual address at hand. 
> In the first hunk memblock_free() gets replaced by memblock_phys_free() > In the second hunk memblock_free_early() gets replaced by memblock_free() In the first hunk the memory is allocated with memblock_phys_alloc() and we have a physical range to free. In the second hunk the memory is allocated with memblock_alloc() and we are freeing a virtual pointer. > I think it would be easier to follow if you could split it in several > patches: It was an explicit request from Linus to make it a single commit: but the actual commit can and should be just a single commit that just fixes 'memblock_free()' to have sane interfaces. I don't feel strongly about splitting it (except my laziness really objects), but I don't think doing the conversion in several steps worth the churn. > - First patch: Create memblock_phys_free() and change all relevant > memblock_free() to memblock_phys_free() - Or change memblock_free() to > memblock_phys_free() and make memblock_free() an alias of it. > - Second patch: Make memblock_free_ptr() become memblock_free() and change > all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr)) > becomes memblock_free(ptr) and make memblock_free_ptr() an alias of > memblock_free() > - Fourth patch: Replace and drop memblock_free_ptr() > - Fifth patch: Drop memblock_free_early() and memblock_free_early_nid() (All > users should have been upgraded to memblock_free_phys() in patch 1 or > memblock_free() in patch 2) > > Christophe -- Sincerely yours, Mike.
Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
Srikar Dronamraju writes: > * Michael Ellerman [2021-08-26 23:36:53]: > >> Srikar Dronamraju writes: >> > Scheduler expects unique number of node distances to be available at >> > boot. ... > >> > Fake the offline node's distance_lookup_table entries so that all >> > possible node distances are updated. >> >> Does this work if we have a single node offline at boot? >> > > It should. > >> Say we start with: >> >> node distances: >> node 0 1 >> 0: 10 20 >> 1: 20 10 >> >> And node 2 is offline at boot. We can only initialise that node's entries >> in the distance_lookup_table: >> >> while (i--) >> distance_lookup_table[node][i] = node; >> >> By filling them all with 2 that causes node_distance(2, X) to return the >> maximum distance for all other nodes X, because we won't break out of >> the loop in __node_distance(): >> >> for (i = 0; i < distance_ref_points_depth; i++) { >> if (distance_lookup_table[a][i] == distance_lookup_table[b][i]) >> break; >> >> /* Double the distance for each NUMA level */ >> distance *= 2; >> } >> >> If distance_ref_points_depth was 4 we'd return 160. > > As you already know, distance 10, 20, .. are defined by Powerpc, form1 > affinity. PAPR doesn't define actual distances, it only provides us the > associativity. If distance_ref_points_depth is 4, > (distance_ref_points_depth doesn't take local distance into consideration) > 10, 20, 40, 80, 160. > >> >> That'd leave us with 3 unique distances at boot, 10, 20, 160. >> > > So if there are unique distances, then the distances as per the current > code has to be 10, 20, 40, 80.. I don't see a way in which we have a break in > the series, like having 160 without 80. I'm confused what you mean there. If we have a node that's offline at boot then we get 160 for that node, that's just the result of having no info for it, so we never break out of the for loop. So if we have two nodes, one hop apart, and then an offline node we get 10, 20, 160.
Or if you're using depth = 3 then it's 10, 20, 80. >> But when node 2 comes online it might introduce more than 1 new distance >> value, eg. it could be that the actual distances are: >> >> node distances: >> node 0 1 2 >> 0: 10 20 40 >> 1: 20 10 80 >> 2: 40 80 10 >> >> ie. we now have 4 distances, 10, 20, 40, 80. >> >> What am I missing? > > As I said above, I am not sure how we can have a break in the series. > If distance_ref_points_depth is 3, the distances have to be 10,20,40,80, at > least for form1 affinity. I agree for depth 3 we have to see 10, 20, 40, 80. But nothing guarantees we see each value (other than 10). We can have two nodes one hop apart, so we have 10 and 20, then a third node is added 3 hops away, so we get 10, 20, 80. The real problem is that the third node could be 3 hops from node 0 and 2 hops from node 1, and so the addition of the third node causes two new distance values (40 & 80) to be required. I think maybe what you're saying is that in practice we don't see setups like that. But I don't know if I'm happy with a solution that doesn't work in the general case, and relies on the particular properties of our current set of systems. Possibly we just need to detect that case and WARN about it. The only problem is we won't know until the system is already up and running, ie. we can't know at boot that the onlining of the third node will cause 2 new distance values to be added. cheers
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 23/09/2021 at 09:43, Mike Rapoport wrote: From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport --- diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 1a04e5bdf655..37826d8c4f74 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void) /* Get the CPU registers */ smp_save_cpu_regs(sa, addr, is_boot_cpu, page); } - memblock_free(page, PAGE_SIZE); + memblock_phys_free(page, PAGE_SIZE); diag_amode31_ops.diag308_reset(); pcpu_set_smt(0); } @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void) /* Add CPUs present at boot */ __smp_rescan_cpus(info, true); - memblock_free_early((unsigned long)info, sizeof(*info)); + memblock_free(info, sizeof(*info)); } /* I'm a bit lost. IIUC memblock_free_early() and memblock_free() were identical. In the first hunk memblock_free() gets replaced by memblock_phys_free() In the second hunk memblock_free_early() gets replaced by memblock_free() I think it would be easier to follow if you could split it in several patches: - First patch: Create memblock_phys_free() and change all relevant memblock_free() to memblock_phys_free() - Or change memblock_free() to memblock_phys_free() and make memblock_free() an alias of it.
- Second patch: Make memblock_free_ptr() become memblock_free() and change all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr)) becomes memblock_free(ptr)) and make memblock_free_ptr() an alias of memblock_free() - Third patch: Replace and drop memblock_free_ptr() - Fourth patch: Drop memblock_free_early() and memblock_free_early_nid() (All users should have been upgraded to memblock_phys_free() in patch 1 or memblock_free() in patch 2) Christophe
[PATCH 3/3] memblock: cleanup memblock_free interface
From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport --- arch/alpha/kernel/core_irongate.c | 2 +- arch/arc/mm/init.c| 2 +- arch/arm/mach-hisi/platmcpm.c | 2 +- arch/arm/mm/init.c| 2 +- arch/arm64/mm/mmu.c | 4 ++-- arch/mips/mm/init.c | 2 +- arch/mips/sgi-ip30/ip30-setup.c | 6 +++--- arch/powerpc/kernel/dt_cpu_ftrs.c | 2 +- arch/powerpc/kernel/paca.c| 4 ++-- arch/powerpc/kernel/setup-common.c| 2 +- arch/powerpc/kernel/setup_64.c| 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 2 +- arch/powerpc/platforms/pseries/svm.c | 4 +--- arch/riscv/kernel/setup.c | 4 ++-- arch/s390/kernel/setup.c | 8 arch/s390/kernel/smp.c| 4 ++-- arch/s390/kernel/uv.c | 2 +- arch/s390/mm/kasan_init.c | 2 +- arch/sh/boards/mach-ap325rxa/setup.c | 2 +- arch/sh/boards/mach-ecovec24/setup.c | 4 ++-- arch/sh/boards/mach-kfr2r09/setup.c | 2 +- arch/sh/boards/mach-migor/setup.c | 2 +- arch/sh/boards/mach-se/7724/setup.c | 4 ++-- arch/sparc/kernel/smp_64.c| 2 +- arch/um/kernel/mem.c | 2 +- arch/x86/kernel/setup.c | 4 ++-- arch/x86/kernel/setup_percpu.c| 2 +- arch/x86/mm/init.c| 2 +- arch/x86/mm/kasan_init_64.c | 4 ++-- arch/x86/mm/numa.c| 2 +- arch/x86/mm/numa_emulation.c | 2 +- arch/x86/xen/mmu_pv.c | 6 +++--- arch/x86/xen/p2m.c| 2 +- arch/x86/xen/setup.c | 6 +++--- drivers/base/arch_numa.c | 
4 ++-- drivers/firmware/efi/memmap.c | 2 +- drivers/macintosh/smu.c | 2 +- drivers/of/kexec.c| 2 +- drivers/of/of_reserved_mem.c | 4 ++-- drivers/s390/char/sclp_early.c| 2 +- drivers/usb/early/xhci-dbc.c | 10 +- drivers/xen/swiotlb-xen.c | 2 +- include/linux/memblock.h | 16 ++-- init/initramfs.c | 2 +- init/main.c | 2 +- kernel/dma/swiotlb.c | 2 +- kernel/printk/printk.c| 4 ++-- lib/bootconfig.c | 2 +- lib/cpumask.c | 2 +- mm/cma.c | 2 +- mm/memblock.c | 20 ++-- mm/memory_hotplug.c | 2 +- mm/percpu.c | 8 mm/sparse.c | 2 +- tools/bootconfig/include/linux/memblock.h | 2 +- 55 files changed, 92 insertions(+), 106 deletions(-) diff --git a/arch/alpha/kernel/core_irongate.c b/arch/alpha/kernel/core_irongate.c index 72af1e72d833..6b8ed12936b6 100644 --- a/arch/alpha/kernel/core_irongate.c +++ b/arch/alpha/kernel/core_irongate.c @@ -233,7 +233,7 @@ albacore_init_arch(void) unsigned long size; size = initrd_end - initrd_start; - memblock_free(__pa(initrd_start), PAGE_ALIGN(size)); + memblock_free((void *)initrd_start, PAGE_ALIGN(size)); if (!move_initrd(pci_mem)) printk("irongate_init_arch: initrd too big " "(%ldK)\ndisabling initrd\n", diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index 699ecf119641..59408f6a02d4 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -173,7 +173,7 @@ static void __init highmem_init(void) #ifdef CONFIG_HIGHMEM unsigned long tmp; - memblock_free(high_mem_start, high_mem_sz); + memblock_phys_free(high_mem_start, high_mem_sz); for (tmp = min_high_pfn; tmp < max_high_pfn; tmp++) free_highmem_page(pfn_to_page(tmp)); #endif diff --git a/arch/arm/mach-hisi/platmcpm.c b/
[PATCH 2/3] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer
From: Mike Rapoport free_p2m_page() wrongly passes a virtual pointer to memblock_free() that treats it as a physical address. Call memblock_free_ptr() instead that gets a virtual address to free the memory. Signed-off-by: Mike Rapoport --- arch/x86/xen/p2m.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c index 5e6e236977c7..141bb9dbd2fb 100644 --- a/arch/x86/xen/p2m.c +++ b/arch/x86/xen/p2m.c @@ -197,7 +197,7 @@ static void * __ref alloc_p2m_page(void) static void __ref free_p2m_page(void *p) { if (unlikely(!slab_is_available())) { - memblock_free((unsigned long)p, PAGE_SIZE); + memblock_free_ptr(p, PAGE_SIZE); return; } -- 2.28.0
[PATCH 1/3] arch_numa: simplify numa_distance allocation
From: Mike Rapoport Memory allocation of numa_distance uses memblock_phys_alloc_range() without actual range limits, converts the returned physical address to virtual and then only uses the virtual address for further initialization. Simplify this by replacing memblock_phys_alloc_range() with memblock_alloc(). Signed-off-by: Mike Rapoport --- drivers/base/arch_numa.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index 00fb4120a5b3..f6d0efd01188 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -275,15 +275,13 @@ void __init numa_free_distance(void) static int __init numa_alloc_distance(void) { size_t size; - u64 phys; int i, j; size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]); - phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0, PFN_PHYS(max_pfn)); - if (WARN_ON(!phys)) + numa_distance = memblock_alloc(size, PAGE_SIZE); + if (WARN_ON(!numa_distance)) return -ENOMEM; - numa_distance = __va(phys); numa_distance_cnt = nr_node_ids; /* fill with the default distances */ -- 2.28.0
[PATCH 0/3] memblock: cleanup memblock_free interface
From: Mike Rapoport

Hi,

Following the discussion on [1] this is the fix for the memblock freeing
APIs mismatch.

The first patch is a cleanup of numa_distance allocation in arch_numa
I've spotted during the conversion.
The second patch is a fix for Xen memory freeing on some of the error
paths.
The core change is in the third patch that makes memblock_free() a
counterpart of memblock_alloc() and adds memblock_phys_free() to be a
counterpart of memblock_phys_alloc().

Since scripts/get_maintainer.pl returned more than 100 addresses I've
trimmed the distribution list only to the relevant lists.

[1] https://lore.kernel.org/all/CAHk-=wj9k4LZTz+svCxLYs5Y1=+ykrbauarh1+ghyg3old8...@mail.gmail.com

Mike Rapoport (3):
  arch_numa: simplify numa_distance allocation
  xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual
    pointer
  memblock: cleanup memblock_free interface

 arch/alpha/kernel/core_irongate.c         |  2 +-
 arch/arc/mm/init.c                        |  2 +-
 arch/arm/mach-hisi/platmcpm.c             |  2 +-
 arch/arm/mm/init.c                        |  2 +-
 arch/arm64/mm/mmu.c                       |  4 ++--
 arch/mips/mm/init.c                       |  2 +-
 arch/mips/sgi-ip30/ip30-setup.c           |  6 +++---
 arch/powerpc/kernel/dt_cpu_ftrs.c         |  2 +-
 arch/powerpc/kernel/paca.c                |  4 ++--
 arch/powerpc/kernel/setup-common.c        |  2 +-
 arch/powerpc/kernel/setup_64.c            |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  2 +-
 arch/powerpc/platforms/pseries/svm.c      |  4 +---
 arch/riscv/kernel/setup.c                 |  4 ++--
 arch/s390/kernel/setup.c                  |  8
 arch/s390/kernel/smp.c                    |  4 ++--
 arch/s390/kernel/uv.c                     |  2 +-
 arch/s390/mm/kasan_init.c                 |  2 +-
 arch/sh/boards/mach-ap325rxa/setup.c      |  2 +-
 arch/sh/boards/mach-ecovec24/setup.c      |  4 ++--
 arch/sh/boards/mach-kfr2r09/setup.c       |  2 +-
 arch/sh/boards/mach-migor/setup.c         |  2 +-
 arch/sh/boards/mach-se/7724/setup.c       |  4 ++--
 arch/sparc/kernel/smp_64.c                |  2 +-
 arch/um/kernel/mem.c                      |  2 +-
 arch/x86/kernel/setup.c                   |  4 ++--
 arch/x86/kernel/setup_percpu.c            |  2 +-
 arch/x86/mm/init.c                        |  2 +-
 arch/x86/mm/kasan_init_64.c               |  4 ++--
 arch/x86/mm/numa.c                        |  2 +-
 arch/x86/mm/numa_emulation.c              |  2 +-
 arch/x86/xen/mmu_pv.c                     |  6 +++---
 arch/x86/xen/p2m.c                        |  2 +-
 arch/x86/xen/setup.c                      |  6 +++---
 drivers/base/arch_numa.c                  | 10 --
 drivers/firmware/efi/memmap.c             |  2 +-
 drivers/macintosh/smu.c                   |  2 +-
 drivers/of/kexec.c                        |  2 +-
 drivers/of/of_reserved_mem.c              |  4 ++--
 drivers/s390/char/sclp_early.c            |  2 +-
 drivers/usb/early/xhci-dbc.c              | 10 +-
 drivers/xen/swiotlb-xen.c                 |  2 +-
 include/linux/memblock.h                  | 16 ++--
 init/initramfs.c                          |  2 +-
 init/main.c                               |  2 +-
 kernel/dma/swiotlb.c                      |  2 +-
 kernel/printk/printk.c                    |  4 ++--
 lib/bootconfig.c                          |  2 +-
 lib/cpumask.c                             |  2 +-
 mm/cma.c                                  |  2 +-
 mm/memblock.c                             | 20 ++--
 mm/memory_hotplug.c                       |  2 +-
 mm/percpu.c                               |  8
 mm/sparse.c                               |  2 +-
 tools/bootconfig/include/linux/memblock.h |  2 +-
 55 files changed, 94 insertions(+), 110 deletions(-)

base-commit: e4e737bb5c170df6135a127739a9e6148ee3da82
-- 
2.28.0
Re: [PATCH AUTOSEL 5.10 01/26] ibmvnic: check failover_pending in login response
Hi!

Something went wrong with this series. I only see the first 7 patches. I
thought it might be a local problem, but I only see 7 patches on lore...

https://lore.kernel.org/lkml/20210923033839.1421034-1-sas...@kernel.org/

Best regards,
							Pavel
-- 
http://www.livejournal.com/~pavelmachek
Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
Nathan Lynch writes:
> Srikar Dronamraju writes:
>
>> * Nathan Lynch [2021-09-22 11:01:12]:
>>
>>> Srikar Dronamraju writes:
>>> > * Nathan Lynch [2021-09-20 22:12:13]:
>>> >
>>> >> vcpu_is_preempted() can be used outside of preempt-disabled critical
>>> >> sections, yielding warnings such as:
>>> >>
>>> >> BUG: using smp_processor_id() in preemptible [] code: systemd-udevd/185
>>> >> caller is rwsem_spin_on_owner+0x1cc/0x2d0
>>> >> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
>>> >> Call Trace:
>>> >> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
>>> >> [c00012907b00] [c1371f70] check_preemption_disabled+0x150/0x160
>>> >> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
>>> >> [c00012907be0] [c01e1408] rwsem_down_write_slowpath+0x478/0x9a0
>>> >> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
>>> >> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
>>> >> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
>>> >> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
>>> >> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
>>> >>
>>> >> The result of vcpu_is_preempted() is always subject to invalidation by
>>> >> events inside and outside of Linux; it's just a best guess at a point in
>>> >> time. Use raw_smp_processor_id() to avoid such warnings.
>>> >
>>> > Typically smp_processor_id() and raw_smp_processor_id() except for the
>>> > CONFIG_DEBUG_PREEMPT.
>>>
>>> Sorry, I don't follow...
>>
>> I meant, unless CONFIG_DEBUG_PREEMPT, smp_processor_id() is defined as
>> raw_smp_processor_id().
>>
>>> > In the CONFIG_DEBUG_PREEMPT case, smp_processor_id() is actually
>>> > debug_smp_processor_id(), which does all the checks.
>>>
>>> Yes, OK.
>>>
>>> > I believe these checks in debug_smp_processor_id() are only valid for
>>> > the x86 case (aka cases where they have __smp_processor_id() defined.)
>>>
>>> Hmm, I am under the impression that the checks in
>>> debug_smp_processor_id() are valid regardless of whether the arch
>>> overrides __smp_processor_id().
>>
>> From include/linux/smp.h
>>
>> /*
>>  * Allow the architecture to differentiate between a stable and unstable read.
>>  * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
>>  * regular asm read for the stable.
>>  */
>> #ifndef __smp_processor_id
>> #define __smp_processor_id(x) raw_smp_processor_id(x)
>> #endif
>>
>> As far as I see, only x86 has a definition of __smp_processor_id.
>> So for archs like Powerpc, __smp_processor_id() is always
>> defined as raw_smp_processor_id(). Right?
>
> Sure, yes.
>
>> I would think debug_smp_processor_id() would be useful if
>> __smp_processor_id() is different from raw_smp_processor_id(). Do note
>> debug_smp_processor_id() calls raw_smp_processor_id(). Agree.
>
> I do not think the utility of debug_smp_processor_id() is related to
> whether the arch defines __smp_processor_id().
>
>> Or can I understand how debug_smp_processor_id() is useful if
>> __smp_processor_id() is defined as raw_smp_processor_id()?

debug_smp_processor_id() is useful on powerpc, as well as other arches,
because it checks that we're in a context where the processor id won't
change out from under us. eg. something like this is unsafe:

	int counts[NR_CPUS];
	int tmp, cpu;

	cpu = smp_processor_id();
	tmp = counts[cpu];	<- preempted here and migrated to another CPU
	counts[cpu] = tmp + 1;

> So, for powerpc with DEBUG_PREEMPT unset, a call to smp_processor_id()
> expands to __smp_processor_id() which expands to raw_smp_processor_id(),
> avoiding the preempt safety checks. This is working as intended.
>
> For powerpc with DEBUG_PREEMPT=y, a call to smp_processor_id() expands
> to the out of line call to debug_smp_processor_id(), which calls
> raw_smp_processor_id() and performs the checks, warning if called in an
> inappropriate context, as seen here.
Also working as intended.

> AFAICT __smp_processor_id() is a performance/codegen-oriented hook, and
> not really related to the debug facility. Please see 9ed7d75b2f09d
> ("x86/percpu: Relax smp_processor_id()").

Yeah good find.

cheers