Re: [PATCH] powerpc/64s/hash: Fix hash_preload running with interrupts enabled
> On 28-Jul-2020, at 6:14 AM, Michael Ellerman wrote: > > Athira Rajeev writes: >>> On 27-Jul-2020, at 6:05 PM, Michael Ellerman wrote: >>> >>> Athira Rajeev writes: > On 27-Jul-2020, at 11:39 AM, Nicholas Piggin wrote: > > Commit 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from > the > caller") removed the local_irq_disable from hash_preload, but it was > required for more than just the page table walk: the hash pte busy bit is > effectively a lock which may be taken in interrupt context, and the local > update flag test must not be preempted before it's used. > > This solves apparent lockups with perf interrupting __hash_page_64K. If > get_perf_callchain then also takes a hash fault on the same page while it > is already locked, it will loop forever taking hash faults, which looks > like > this: > > cpu 0x49e: Vector: 100 (System Reset) at [c0001a4f7d70] > pc: c0072dc8: hash_page_mm+0x8/0x800 > lr: c000c5a4: do_hash_page+0x24/0x38 > sp: c0002ac1cc69ac70 > msr: 80081033 > current = 0xc0002ac1cc602e00 > paca= 0xc0001de1f280 irqmask: 0x03 irq_happened: 0x01 > pid = 20118, comm = pread2_processe > Linux version 5.8.0-rc6-00345-g1fad14f18bc6 > 49e:mon> t > [c0002ac1cc69ac70] c000c5a4 do_hash_page+0x24/0x38 (unreliable) > --- Exception: 300 (Data Access) at c008fa60 > __copy_tofrom_user_power7+0x20c/0x7ac > [link register ] c0335d10 copy_from_user_nofault+0xf0/0x150 > [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable) > [c0002ac1cc69afa0] c0109df0 read_user_stack_64+0x70/0xf0 > [c0002ac1cc69afd0] c0109fcc perf_callchain_user_64+0x15c/0x410 > [c0002ac1cc69b060] c0109c00 perf_callchain_user+0x20/0x40 > [c0002ac1cc69b080] c031c6cc get_perf_callchain+0x25c/0x360 > [c0002ac1cc69b120] c0316b50 perf_callchain+0x70/0xa0 > [c0002ac1cc69b140] c0316ddc perf_prepare_sample+0x25c/0x790 > [c0002ac1cc69b1a0] c0317350 perf_event_output_forward+0x40/0xb0 > [c0002ac1cc69b220] c0306138 __perf_event_overflow+0x88/0x1a0 > [c0002ac1cc69b270] c010cf70 
record_and_restart+0x230/0x750 > [c0002ac1cc69b620] c010d69c perf_event_interrupt+0x20c/0x510 > [c0002ac1cc69b730] c0027d9c > performance_monitor_exception+0x4c/0x60 > [c0002ac1cc69b750] c000b2f8 > performance_monitor_common_virt+0x1b8/0x1c0 > --- Exception: f00 (Performance Monitor) at c00cb5b0 > pSeries_lpar_hpte_insert+0x0/0x160 > [link register ] c00846f0 __hash_page_64K+0x210/0x540 > [c0002ac1cc69ba50] (unreliable) > [c0002ac1cc69bb00] c0073ae0 update_mmu_cache+0x390/0x3a0 > [c0002ac1cc69bb70] c037f024 wp_page_copy+0x364/0xce0 > [c0002ac1cc69bc20] c038272c do_wp_page+0xdc/0xa60 > [c0002ac1cc69bc70] c03857bc handle_mm_fault+0xb9c/0x1b60 > [c0002ac1cc69bd50] c006c434 __do_page_fault+0x314/0xc90 > [c0002ac1cc69be20] c000c5c8 handle_page_fault+0x10/0x2c > --- Exception: 300 (Data Access) at 7fff8c861fe8 > SP (76b19660) is in userspace > > Reported-by: Athira Rajeev > Reported-by: Anton Blanchard > Reviewed-by: Aneesh Kumar K.V > Fixes: 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from > the > caller") > Signed-off-by: Nicholas Piggin Hi, Tested with the patch and it fixes the lockups I was seeing with my test run. Thanks for the fix. Tested-by: Athira Rajeev >>> >>> Thanks for testing. >>> >>> What test are you running? >> >> Hi Michael >> >> I was running “perf record” and Unixbench tests ( >> https://github.com/kdlucas/byte-unixbench ) in parallel where we were >> getting soft lockups >> >> 1. Perf command run: >> # perf record -a -g -c 1000 -o sleep 60 >> >> 2. Unixbench tests >> # Run -q -c spawn > > Thanks, I can reproduce it with that. Sure Michael > > cheers
[PATCH] powerpc: Fix MMCRA_BHRB_DISABLE define to work with binutils version < 2.28
commit 9908c826d5ed ("powerpc/perf: Add Power10 PMU feature to DT CPU features") defines MMCRA_BHRB_DISABLE as `0x20UL`. Binutils versions older than 2.28 do not support the `UL` suffix, so the build fails:

linux-ppc/arch/powerpc/kernel/cpu_setup_power.S: Assembler messages:
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: operand out of range (0x0020 is not between 0x8000 and 0x)

Fix this by wrapping it in the `_UL()` macro.

Fixes: 9908c826d5ed ("powerpc/perf: Add Power10 PMU feature to DT CPU features")
Signed-off-by: Athira Rajeev
Suggested-by: Michael Ellerman
---
 arch/powerpc/include/asm/reg.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index ae71027..41419f1 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -12,6 +12,7 @@
 #ifdef __KERNEL__
 #include
+#include
 #include
 #include
 #include
@@ -888,7 +889,7 @@
 #define MMCRA_SLOT		0x0700UL /* SLOT bits (37-39) */
 #define MMCRA_SLOT_SHIFT	24
 #define MMCRA_SAMPLE_ENABLE	0x0001UL /* enable sampling */
-#define MMCRA_BHRB_DISABLE	0x20UL // BHRB disable bit for ISA v3.1
+#define MMCRA_BHRB_DISABLE	_UL(0x20) // BHRB disable bit for ISA v3.1
 #define POWER6_MMCRA_SDSYNC	0x0800ULL/* SDAR/SIAR synced */
 #define POWER6_MMCRA_SIHV	0x0400ULL
 #define POWER6_MMCRA_SIPR	0x0200ULL
--
1.8.3.1
[PATCH] powerpc/configs: Add BLK_DEV_NVME to pseries_defconfig
I've been forgetting to manually enable NVMe when building pseries kernels for machines with NVMe adapters. Since it's a reasonably common configuration, enable it by default.

Signed-off-by: Anton Blanchard
---
 arch/powerpc/configs/pseries_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
index dfa4a726333b..358642d6f46d 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -94,6 +94,7 @@ CONFIG_BLK_DEV_NBD=m
 CONFIG_BLK_DEV_RAM=y
 CONFIG_BLK_DEV_RAM_SIZE=65536
 CONFIG_VIRTIO_BLK=m
+CONFIG_BLK_DEV_NVME=y
 CONFIG_BLK_DEV_SD=y
 CONFIG_CHR_DEV_ST=m
 CONFIG_BLK_DEV_SR=y
--
2.26.2
Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls
Hari Bathini writes: > On 28/07/20 7:16 pm, Michael Ellerman wrote: >> Hari Bathini writes: >>> Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9 >>> to be filled with OPAL base & entry addresses respectively. Setting >>> these registers allows the kernel to perform OPAL calls before the >>> device tree is parsed. >> >> I'm not convinced we want to do this. >> >> If we do it becomes part of the kexec ABI and we have to honour it into >> the future. >> >> And in practice there are no non-development kernels built with OPAL early >> debugging enabled, so it's not clear it actually helps anyone other than >> developers. >> > > Hmmm.. kexec-tools does it since commit d58ad564852c ("kexec/ppc64 > Enable early kernel's OPAL calls") for kexec_load syscall. So, we would > be breaking kexec ABI either way, I guess. Ugh, OK. > Let me put this patch at the end of the series in the respin to let you > decide whether to have it or not.. Thanks. cheers
Re: [PATCH 13/15] arch, drivers: replace for_each_membock() with for_each_mem_range()
On Tue, 28 Jul 2020, 07:16 Mike Rapoport, wrote: > From: Mike Rapoport > > There are several occurrences of the following pattern: > > for_each_memblock(memory, reg) { > start = __pfn_to_phys(memblock_region_memory_base_pfn(reg); > end = __pfn_to_phys(memblock_region_memory_end_pfn(reg)); > > /* do something with start and end */ > } > > Using for_each_mem_range() iterator is more appropriate in such cases and > allows simpler and cleaner code. > > Signed-off-by: Mike Rapoport > --- > arch/arm/kernel/setup.c | 18 +++ > arch/arm/mm/mmu.c| 39 > arch/arm/mm/pmsa-v7.c| 20 ++-- > arch/arm/mm/pmsa-v8.c| 17 +-- > arch/arm/xen/mm.c| 7 +++-- > arch/arm64/mm/kasan_init.c | 8 ++--- > arch/arm64/mm/mmu.c | 11 ++- > arch/c6x/kernel/setup.c | 9 +++--- > arch/microblaze/mm/init.c| 9 +++--- > arch/mips/cavium-octeon/dma-octeon.c | 12 > arch/mips/kernel/setup.c | 31 +-- > arch/openrisc/mm/init.c | 8 +++-- > arch/powerpc/kernel/fadump.c | 27 +++- > arch/powerpc/mm/book3s64/hash_utils.c| 16 +- > arch/powerpc/mm/book3s64/radix_pgtable.c | 11 +++ > arch/powerpc/mm/kasan/kasan_init_32.c| 8 ++--- > arch/powerpc/mm/mem.c| 16 ++ > arch/powerpc/mm/pgtable_32.c | 8 ++--- > arch/riscv/mm/init.c | 24 ++- > arch/riscv/mm/kasan_init.c | 10 +++--- > arch/s390/kernel/setup.c | 27 ++-- > arch/s390/mm/vmem.c | 16 +- > arch/sparc/mm/init_64.c | 12 +++- > drivers/bus/mvebu-mbus.c | 12 > drivers/s390/char/zcore.c| 9 +++--- > 25 files changed, 187 insertions(+), 198 deletions(-) > > diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c > index d8e18cdd96d3..3f65d0ac9f63 100644 > --- a/arch/arm/kernel/setup.c > +++ b/arch/arm/kernel/setup.c > @@ -843,19 +843,25 @@ early_param("mem", early_mem); > > static void __init request_standard_resources(const struct machine_desc > *mdesc) > { > - struct memblock_region *region; > + phys_addr_t start, end, res_end; > struct resource *res; > + u64 i; > > kernel_code.start = virt_to_phys(_text); > kernel_code.end = virt_to_phys(__init_begin - 1); > 
kernel_data.start = virt_to_phys(_sdata); > kernel_data.end = virt_to_phys(_end - 1); > > - for_each_memblock(memory, region) { > - phys_addr_t start = > __pfn_to_phys(memblock_region_memory_base_pfn(region)); > - phys_addr_t end = > __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1; > + for_each_mem_range(i, , ) { > unsigned long boot_alias_start; > > + /* > +* In memblock, end points to the first byte after the > +* range while in resourses, end points to the last byte in > +* the range. > +*/ > + res_end = end - 1; > + > /* > * Some systems have a special memory alias which is only > * used for booting. We need to advertise this region to > @@ -869,7 +875,7 @@ static void __init request_standard_resources(const > struct machine_desc *mdesc) > __func__, sizeof(*res)); > res->name = "System RAM (boot alias)"; > res->start = boot_alias_start; > - res->end = phys_to_idmap(end); > + res->end = phys_to_idmap(res_end); > res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; > request_resource(_resource, res); > } > @@ -880,7 +886,7 @@ static void __init request_standard_resources(const > struct machine_desc *mdesc) > sizeof(*res)); > res->name = "System RAM"; > res->start = start; > - res->end = end; > + res->end = res_end; > res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; > > request_resource(_resource, res); > diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c > index 628028bfbb92..a149d9cb4fdb 100644 > --- a/arch/arm/mm/mmu.c > +++ b/arch/arm/mm/mmu.c > @@ -1155,9 +1155,8 @@ phys_addr_t arm_lowmem_limit __initdata = 0; > > void __init adjust_lowmem_bounds(void) > { > - phys_addr_t memblock_limit = 0; > - u64 vmalloc_limit; > - struct memblock_region *reg; > + phys_addr_t block_start, block_end, memblock_limit = 0; > +
Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Excerpts from Linus Torvalds's message of July 29, 2020 5:02 am: > On Tue, Jul 28, 2020 at 3:53 AM Nicholas Piggin wrote: >> >> The quirk is a problem with coprocessor where it's supposed to >> invalidate the translation after a fault but it doesn't, so we can get a >> read-only TLB stuck after something else does a RO->RW upgrade on the >> TLB. Something like that IIRC. Coprocessors have their own MMU which >> lives in the nest not the core, so you need a global TLB flush to >> invalidate that thing. > > So I assumed, but it does seem confused. > > Why? Because if there are stale translations on the co-processor, > there's no guarantee that one of the CPU's will have them and take a > fault. > > So I'm not seeing why a core CPU doing spurious TLB invalidation would > follow from "stale TLB in the Nest". If the nest MMU access faults, it sends an interrupt to the CPU and the driver tries to handle the page fault for it (I think that's how it works). > If anything, I think "we have a coprocessor that needs to never have > stale TLB entries" would impact the _regular_ TLB invalidates (by > update_mmu_cache()) and perhaps make those more aggressive, exactly > because the coprocessor may not handle the fault as gracefully. It could be done that way... Hmm although we do have something similar also in radix__ptep_set_access_flags for the relaxing permissions case so maybe this is required for not-present faults as well? I'm not actually sure now. But it's a bit weird and awkward because it's working around quirks in the hardware which aren't regular, not because we're _completely_ confused (I hope). Thanks, Nick
Re: [PATCH] powerpc/powernv/pci: Fix build of pci-ioda.o
On Wed, Jul 29, 2020 at 8:35 AM Gustavo Romero wrote: > > Currently pnv_ioda_setup_bus_dma() is outside of a CONFIG_IOMMU_API guard > and if CONFIG_IOMMU_API=n the build can fail if the compiler sets > -Werror=unused-function, because pnv_ioda_setup_bus_dma() is only used in > functions guarded by a CONFIG_IOMMU_API guard. > > That issue can be easily reproduced using the skiroot_defconfig. For other > configs, like powernv_defconfig, that issue is hidden by the fact that > if CONFIG_IOMMU_SUPPORT is enabled plus other common IOMMU options, like > CONFIG_OF_IOMMU, by default CONFIG_IOMMU_API is enabled as well. Hence, for > powernv_defconfig, it's necessary to set CONFIG_IOMMU_SUPPORT=n to make the > build fail, because CONFIG_PCI=y and pci-ioda.c is included in the build, > but since CONFIG_IOMMU_SUPPORT=n the CONFIG_IOMMU_API is disabled, breaking > the build. > > This commit fixes that build issue by moving the pnv_ioda_setup_bus_dma() > inside a CONFIG_IOMMU_API guard, so when CONFIG_IOMMU_API is disabled that > function is not defined. I think a fix for this is already in -next.
[PATCH] powerpc/powernv/pci: Fix build of pci-ioda.o
Currently pnv_ioda_setup_bus_dma() is outside of a CONFIG_IOMMU_API guard and if CONFIG_IOMMU_API=n the build can fail if the compiler sets -Werror=unused-function, because pnv_ioda_setup_bus_dma() is only used in functions guarded by a CONFIG_IOMMU_API guard. That issue can be easily reproduced using the skiroot_defconfig. For other configs, like powernv_defconfig, that issue is hidden by the fact that if CONFIG_IOMMU_SUPPORT is enabled plus other common IOMMU options, like CONFIG_OF_IOMMU, by default CONFIG_IOMMU_API is enabled as well. Hence, for powernv_defconfig, it's necessary to set CONFIG_IOMMU_SUPPORT=n to make the build fail, because CONFIG_PCI=y and pci-ioda.c is included in the build, but since CONFIG_IOMMU_SUPPORT=n the CONFIG_IOMMU_API is disabled, breaking the build. This commit fixes that build issue by moving the pnv_ioda_setup_bus_dma() inside a CONFIG_IOMMU_API guard, so when CONFIG_IOMMU_API is disabled that function is not defined. Signed-off-by: Gustavo Romero --- arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++ 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 73a63efcf855..743d840712da 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; } -static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) -{ - struct pci_dev *dev; - - list_for_each_entry(dev, >devices, bus_list) { - set_iommu_table_base(>dev, pe->table_group.tables[0]); - dev->dev.archdata.dma_offset = pe->tce_bypass_base; - - if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) - pnv_ioda_setup_bus_dma(pe, dev->subordinate); - } -} - static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb, bool real_mode) { @@ -2501,6 +2488,19 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group, #endif #ifdef CONFIG_IOMMU_API +static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, >devices, bus_list) { + set_iommu_table_base(>dev, pe->table_group.tables[0]); + dev->dev.archdata.dma_offset = pe->tce_bypass_base; + + if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) + pnv_ioda_setup_bus_dma(pe, dev->subordinate); + } +} + unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, __u64 window_size, __u32 levels) { -- 2.17.1
Re: [RESEND PATCH v5 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel
On 28/07/20 7:14 pm, Michael Ellerman wrote:
> Hari Bathini writes:
>> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
>> index 2df6f4273ddd..8df085a22fd7 100644
>> --- a/arch/powerpc/kexec/file_load_64.c
>> +++ b/arch/powerpc/kexec/file_load_64.c
>> @@ -17,9 +17,21 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>  #include
>> +#include
>> +#include
>>  #include
>>
>> +struct umem_info {
>> +	uint64_t *buf;		/* data buffer for usable-memory property */
>> +	uint32_t idx;		/* current index */
>> +	uint32_t size;		/* size allocated for the data buffer */
>
> Use kernel types please, u64, u32.

>> +	/* usable memory ranges to look up */
>> +	const struct crash_mem *umrngs;
>
> "umrngs". Given it's part of the umem_info struct could it just be "ranges"?

True. Actually, having crash_mem_range *ranges + u32 nr_ranges and populating them seems better. Will do that..

>> +		return NULL;
>> +	}
>>  	um_info->size = new_size;
>> +
>> +	memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);
>
> Just pass __GFP_ZERO to krealloc?

There are patches submitted to stable fixing a few modules that use krealloc with __GFP_ZERO. Also, this zeroing is not really needed. I will drop the memset instead..

Thanks
Hari
Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls
On 28/07/20 7:16 pm, Michael Ellerman wrote:
> Hari Bathini writes:
>> Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
>> to be filled with OPAL base & entry addresses respectively. Setting
>> these registers allows the kernel to perform OPAL calls before the
>> device tree is parsed.
>
> I'm not convinced we want to do this.
>
> If we do it becomes part of the kexec ABI and we have to honour it into
> the future.
>
> And in practice there are no non-development kernels built with OPAL early
> debugging enabled, so it's not clear it actually helps anyone other than
> developers.

Hmmm.. kexec-tools does it since commit d58ad564852c ("kexec/ppc64 Enable early kernel's OPAL calls") for kexec_load syscall. So, we would be breaking kexec ABI either way, I guess.

Let me put this patch at the end of the series in the respin to let you decide whether to have it or not..

Thanks
Hari
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Hi Laurent, Laurent Dufour writes: > Le 28/07/2020 à 19:37, Nathan Lynch a écrit : >> The drmem lmb list can have hundreds of thousands of entries, and >> unfortunately lookups take the form of linear searches. As long as >> this is the case, traversals have the potential to monopolize the CPU >> and provoke lockup reports, workqueue stalls, and the like unless >> they explicitly yield. >> >> Rather than placing cond_resched() calls within various >> for_each_drmem_lmb() loop blocks in the code, put it in the iteration >> expression of the loop macro itself so users can't omit it. > > Is that not too much to call cond_resched() on every LMB? > > Could that be less frequent, every 10, or 100, I don't really know ? Everything done within for_each_drmem_lmb is relatively heavyweight already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens of milliseconds. I don't think cond_resched() is an expensive check in this context.
Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
On Tue, Jul 28, 2020 at 3:53 AM Nicholas Piggin wrote: > > The quirk is a problem with coprocessor where it's supposed to > invalidate the translation after a fault but it doesn't, so we can get a > read-only TLB stuck after something else does a RO->RW upgrade on the > TLB. Something like that IIRC. Coprocessors have their own MMU which > lives in the nest not the core, so you need a global TLB flush to > invalidate that thing. So I assumed, but it does seem confused. Why? Because if there are stale translations on the co-processor, there's no guarantee that one of the CPU's will have them and take a fault. So I'm not seeing why a core CPU doing spurious TLB invalidation would follow from "stale TLB in the Nest". If anything, I think "we have a coprocessor that needs to never have stale TLB entries" would impact the _regular_ TLB invalidates (by update_mmu_cache()) and perhaps make those more aggressive, exactly because the coprocessor may not handle the fault as gracefully. I dunno. I don't know the coprocessor side well enough to judge, I'm just looking at it from a conceptual standpoint. Linus
Re: [PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct
On 28.07.20 18:53, Scott Cheloha wrote: > At memory hot-remove time we can retrieve an LMB's nid from its > corresponding memory_block. There is no need to store the nid > in multiple locations. > > Note that lmb_to_memblock() uses find_memory_block() to get the > corresponding memory_block. As find_memory_block() runs in sub-linear > time this approach is negligibly slower than what we do at present. > > In exchange for this lookup at hot-remove time we no longer need to > call memory_add_physaddr_to_nid() during drmem_init() for each LMB. > On powerpc, memory_add_physaddr_to_nid() is a linear search, so this > spares us an O(n^2) initialization during boot. > > On systems with many LMBs that initialization overhead is palpable and > disruptive. For example, on a box with 249854 LMBs we're seeing > drmem_init() take upwards of 30 seconds to complete: > > [ 53.721639] drmem: initializing drmem v2 > [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! > [swapper/0:1] > [ 80.604377] Modules linked in: > [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4 > [ 80.604397] NIP: c00a4980 LR: c00a4940 CTR: > > [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+) > [ 80.604412] MSR: 82009033 CR: > 44000248 XER: 000d > [ 80.604431] CFAR: c00a4a38 IRQMASK: 0 > [ 80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 > c0003cfede30 > [ 80.604431] GPR04: c0f4095a 002f > 1000 > [ 80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 > c00c0002fdfb2001 > [ 80.604431] GPR12: c0001e8ec200 > [ 80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 > [ 80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 > [ 80.604492] Call Trace: > [ 80.604498] [c0002dbff8493ac0] [c00a4940] > hot_add_scn_to_nid+0x60/0x3e0 (unreliable) > [ 80.604509] [c0002dbff8493b20] [c0087c10] > memory_add_physaddr_to_nid+0x20/0x60 > [ 80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 > [ 80.604530] [c0002dbff8493c10] [c0010154] > do_one_initcall+0x64/0x2c0 
> [ 80.604540] [c0002dbff8493ce0] [c10c4aa0] > kernel_init_freeable+0x2d8/0x3a0 > [ 80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148 > [ 80.604560] [c0002dbff8493e20] [c000b648] > ret_from_kernel_thread+0x5c/0x74 > [ 80.604567] Instruction dump: > [ 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 > 7d094214 > [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac > e949 7fbe5040 > [ 89.047390] drmem: 249854 LMB(s) > > With a patched kernel on the same machine we're no longer seeing the > soft lockup. drmem_init() now completes in negligible time, even when > the LMB count is large. > Yeah, as long as you remove_memory() in memory block granularity, this is guaranteed to work. Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
> The drmem lmb list can have hundreds of thousands of entries, and
> unfortunately lookups take the form of linear searches. As long as
> this is the case, traversals have the potential to monopolize the CPU
> and provoke lockup reports, workqueue stalls, and the like unless
> they explicitly yield.
>
> Rather than placing cond_resched() calls within various
> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
> expression of the loop macro itself so users can't omit it.

Hi Nathan,

Is that not too much to call cond_resched() on every LMB?

Could that be less frequent, every 10, or 100, I don't really know?

Cheers,
Laurent.

> Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm,dynamic-memory data from DT format")
> Signed-off-by: Nathan Lynch
> ---
>  arch/powerpc/include/asm/drmem.h | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
> index 414d209f45bb..36d0ed04bda8 100644
> --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -8,6 +8,8 @@
>  #ifndef _ASM_POWERPC_LMB_H
>  #define _ASM_POWERPC_LMB_H
>
> +#include
> +
>  struct drmem_lmb {
>  	u64 base_addr;
>  	u32 drc_index;
> @@ -26,8 +28,14 @@ struct drmem_lmb_info {
>
>  extern struct drmem_lmb_info *drmem_info;
>
> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
> +{
> +	cond_resched();
> +	return ++lmb;
> +}
> +
> #define for_each_drmem_lmb_in_range(lmb, start, end)		\
> -	for ((lmb) = (start); (lmb) < (end); (lmb)++)
> +	for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
>
> #define for_each_drmem_lmb(lmb)				\
> 	for_each_drmem_lmb_in_range((lmb),		\
[PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and unfortunately lookups take the form of linear searches. As long as this is the case, traversals have the potential to monopolize the CPU and provoke lockup reports, workqueue stalls, and the like unless they explicitly yield.

Rather than placing cond_resched() calls within various for_each_drmem_lmb() loop blocks in the code, put it in the iteration expression of the loop macro itself so users can't omit it.

Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm,dynamic-memory data from DT format")
Signed-off-by: Nathan Lynch
---
 arch/powerpc/include/asm/drmem.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..36d0ed04bda8 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -8,6 +8,8 @@
 #ifndef _ASM_POWERPC_LMB_H
 #define _ASM_POWERPC_LMB_H
 
+#include
+
 struct drmem_lmb {
 	u64 base_addr;
 	u32 drc_index;
@@ -26,8 +28,14 @@ struct drmem_lmb_info {
 
 extern struct drmem_lmb_info *drmem_info;
 
+static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
+{
+	cond_resched();
+	return ++lmb;
+}
+
 #define for_each_drmem_lmb_in_range(lmb, start, end)		\
-	for ((lmb) = (start); (lmb) < (end); (lmb)++)
+	for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
 
 #define for_each_drmem_lmb(lmb)					\
 	for_each_drmem_lmb_in_range((lmb),			\
--
2.25.4
[PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct
At memory hot-remove time we can retrieve an LMB's nid from its corresponding memory_block. There is no need to store the nid in multiple locations. Note that lmb_to_memblock() uses find_memory_block() to get the corresponding memory_block. As find_memory_block() runs in sub-linear time this approach is negligibly slower than what we do at present. In exchange for this lookup at hot-remove time we no longer need to call memory_add_physaddr_to_nid() during drmem_init() for each LMB. On powerpc, memory_add_physaddr_to_nid() is a linear search, so this spares us an O(n^2) initialization during boot. On systems with many LMBs that initialization overhead is palpable and disruptive. For example, on a box with 249854 LMBs we're seeing drmem_init() take upwards of 30 seconds to complete: [ 53.721639] drmem: initializing drmem v2 [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1] [ 80.604377] Modules linked in: [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4 [ 80.604397] NIP: c00a4980 LR: c00a4940 CTR: [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+) [ 80.604412] MSR: 82009033 CR: 44000248 XER: 000d [ 80.604431] CFAR: c00a4a38 IRQMASK: 0 [ 80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 c0003cfede30 [ 80.604431] GPR04: c0f4095a 002f 1000 [ 80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 c00c0002fdfb2001 [ 80.604431] GPR12: c0001e8ec200 [ 80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 [ 80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 [ 80.604492] Call Trace: [ 80.604498] [c0002dbff8493ac0] [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable) [ 80.604509] [c0002dbff8493b20] [c0087c10] memory_add_physaddr_to_nid+0x20/0x60 [ 80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 [ 80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0 [ 80.604540] [c0002dbff8493ce0] [c10c4aa0] kernel_init_freeable+0x2d8/0x3a0 [ 80.604550] [c0002dbff8493db0] [c0010824] 
kernel_init+0x2c/0x148 [ 80.604560] [c0002dbff8493e20] [c000b648] ret_from_kernel_thread+0x5c/0x74 [ 80.604567] Instruction dump: [ 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 7d094214 [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 7fbe5040 [ 89.047390] drmem: 249854 LMB(s) With a patched kernel on the same machine we're no longer seeing the soft lockup. drmem_init() now completes in negligible time, even when the LMB count is large. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/drmem.h | 21 --- arch/powerpc/mm/drmem.c | 6 +- .../platforms/pseries/hotplug-memory.c| 19 ++--- 3 files changed, 13 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 414d209f45bb..34e4e9b257f5 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -13,9 +13,6 @@ struct drmem_lmb { u32 drc_index; u32 aa_index; u32 flags; -#ifdef CONFIG_MEMORY_HOTPLUG - int nid; -#endif }; struct drmem_lmb_info { @@ -104,22 +101,4 @@ static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb) lmb->aa_index = 0x; } -#ifdef CONFIG_MEMORY_HOTPLUG -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ - lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr); -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ - lmb->nid = -1; -} -#else -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ -} -#endif - #endif /* _ASM_POWERPC_LMB_H */ diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c index 59327cefbc6a..873fcfc7b875 100644 --- a/arch/powerpc/mm/drmem.c +++ b/arch/powerpc/mm/drmem.c @@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop) if (!drmem_info->lmbs) return; - for_each_drmem_lmb(lmb) { + for_each_drmem_lmb(lmb) read_drconf_v1_cell(lmb, ); - lmb_set_nid(lmb); - } } static void __init init_drmem_v2_lmbs(const __be32 
*prop) @@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop) lmb->aa_index = dr_cell.aa_index; lmb->flags = dr_cell.flags; - - lmb_set_nid(lmb); } } } diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 5ace2f9a277e..7bf66fdcf916 100644 ---
[PATCH] selftests/powerpc: return skip code for spectre_v2
When running under older versions of qemu, or under newer versions with old machine types, some security features will not be reported to the guest. This leads the guest OS to consider itself vulnerable to spectre_v2. So the spectre_v2 test fails in such cases, when the host is mitigated and mispredictions cannot be detected as expected by the test.

Make it return the skip code instead for this particular case. We don't want to miss the case where the test fails and the system reports as mitigated or not affected. But it is not a problem to miss failures when the system reports as vulnerable.

Signed-off-by: Thadeu Lima de Souza Cascardo
---
 tools/testing/selftests/powerpc/security/spectre_v2.c | 8
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/powerpc/security/spectre_v2.c b/tools/testing/selftests/powerpc/security/spectre_v2.c
index 8c6b982af2a8..d5445bfd63ed 100644
--- a/tools/testing/selftests/powerpc/security/spectre_v2.c
+++ b/tools/testing/selftests/powerpc/security/spectre_v2.c
@@ -183,6 +183,14 @@ int spectre_v2_test(void)
 	if (miss_percent > 15) {
 		printf("Branch misses > 15%% unexpected in this configuration!\n");
 		printf("Possible mis-match between reported & actual mitigation\n");
+		/*
+		 * Such a mismatch may be caused by a guest system
+		 * reporting as vulnerable when the host is mitigated.
+		 * Return skip code to avoid detecting this as an
+		 * error. We are not vulnerable and reporting otherwise,
+		 * so missing such a mismatch is safe.
+		 */
+		if (state == VULNERABLE)
+			return 4;
 		return 1;
 	}
 	break;
--
2.25.1
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Hi,

On 27/07/20 06:32, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.

So there's at least one arm64 platform out there with the same "pairs of cores share L2" thing (Ampere eMAG), and that lives quite happily with the default scheduler topology (SMT/MC/DIE). Each pair of cores gets its own MC domain, and the whole system is covered by DIE.

Now arguably it's not a perfect representation; DIE doesn't have SD_SHARE_PKG_RESOURCES, so the highest level sd_llc can point to is MC. That will impact all callsites using cpus_share_cache(): in the eMAG case, only pairs of cores will be seen as sharing cache, even though *all* cores share the same L3.

I'm trying to paint a picture of what the P9 topology looks like (the one you showcase in your cover letter) to see if there are any similarities; from what I gather in [1], wikichips and your cover letter, with P9 you can have something like this in a single DIE (somewhat unsure about the L3 setup; it looks to be distributed?):

  +---------------------------------------------------------------+
  |                               L3                              |
  +-------+-------+-------+-------+-------+-------+-------+-------+
  |      L2       |      L2       |      L2       |      L2       |
  +-------+-------+-------+-------+-------+-------+-------+-------+
  |  L1   |  L1   |  L1   |  L1   |  L1   |  L1   |  L1   |  L1   |
  +-------+-------+-------+-------+-------+-------+-------+-------+
  |4 CPUs |4 CPUs |4 CPUs |4 CPUs |4 CPUs |4 CPUs |4 CPUs |4 CPUs |
  +-------+-------+-------+-------+-------+-------+-------+-------+

Which would lead to (ignoring the whole SMT CPU numbering shenanigans):

  NUMA    [ ...
  DIE     [                                                       ]
  MC      [            ] [            ] [            ] [           ]
  BIGCORE [            ] [            ] [            ] [           ]
  SMT     [    ] [    ] [    ] [    ] [    ] [    ] [    ] [    ]
           00-03  04-07  08-11  12-15  16-19  20-23  24-27  28-31

This however has MC == BIGCORE; what makes it possible for these two domains to have different spans? If it's not too much to ask, I'd love to have a P9 topology diagram.

[1]: 20200722081822.gg9...@linux.vnet.ibm.com
Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
On 07/28/20 at 05:15pm, Mike Rapoport wrote: > On Tue, Jul 28, 2020 at 07:02:54PM +0800, Baoquan He wrote: > > On 07/28/20 at 08:11am, Mike Rapoport wrote: > > > From: Mike Rapoport > > > > > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo > > > regions to set node ID in memblock.reserved and than traverses > > > memblock.reserved to update reserved_nodemask to include node IDs that > > > were > > > set in the first loop. > > > > > > Remove redundant traversal over memblock.reserved and update > > > reserved_nodemask while iterating over numa_meminfo. > > > > > > Signed-off-by: Mike Rapoport > > > --- > > > arch/x86/mm/numa.c | 26 ++ > > > 1 file changed, 10 insertions(+), 16 deletions(-) > > > > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c > > > index 8ee952038c80..4078abd33938 100644 > > > --- a/arch/x86/mm/numa.c > > > +++ b/arch/x86/mm/numa.c > > > @@ -498,31 +498,25 @@ static void __init > > > numa_clear_kernel_node_hotplug(void) > > >* and use those ranges to set the nid in memblock.reserved. > > >* This will split up the memblock regions along node > > >* boundaries and will set the node IDs as well. > > > + * > > > + * The nid will also be set in reserved_nodemask which is later > > > + * used to clear MEMBLOCK_HOTPLUG flag. > > > + * > > > + * [ Note, when booting with mem=nn[kMG] or in a kdump kernel, > > > + * numa_meminfo might not include all memblock.reserved > > > + * memory ranges, because quirks such as trim_snb_memory() > > > + * reserve specific pages for Sandy Bridge graphics. > > > + * These ranges will remain with nid == MAX_NUMNODES. ] > > >*/ > > > for (i = 0; i < numa_meminfo.nr_blks; i++) { > > > struct numa_memblk *mb = numa_meminfo.blk + i; > > > int ret; > > > > > > ret = memblock_set_node(mb->start, mb->end - mb->start, > > > , mb->nid); > > > + node_set(mb->nid, reserved_nodemask); > > > > Really? This will set all node id into reserved_nodemask. 
> > But in the current code, it's setting the nid into the memblock reserved regions which interleave with numa_meminfo, then getting those nids and setting them in reserved_nodemask. This is quite different from my understanding. Please correct me if I am wrong.

> You are right, I've missed the intersections of numa_meminfo with memblock.reserved.
>
> x86 interaction with memblock is so, hmm, interesting...

Yeah, numa_clear_kernel_node_hotplug() intends to find any memory node which has reserved memory and mark it unmovable. Setting all node ids in reserved_nodemask would break the use case of hot-removing hotpluggable boot memory after system bootup.
Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
On Tue, Jul 28, 2020 at 07:02:54PM +0800, Baoquan He wrote: > On 07/28/20 at 08:11am, Mike Rapoport wrote: > > From: Mike Rapoport > > > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo > > regions to set node ID in memblock.reserved and than traverses > > memblock.reserved to update reserved_nodemask to include node IDs that were > > set in the first loop. > > > > Remove redundant traversal over memblock.reserved and update > > reserved_nodemask while iterating over numa_meminfo. > > > > Signed-off-by: Mike Rapoport > > --- > > arch/x86/mm/numa.c | 26 ++ > > 1 file changed, 10 insertions(+), 16 deletions(-) > > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c > > index 8ee952038c80..4078abd33938 100644 > > --- a/arch/x86/mm/numa.c > > +++ b/arch/x86/mm/numa.c > > @@ -498,31 +498,25 @@ static void __init > > numa_clear_kernel_node_hotplug(void) > > * and use those ranges to set the nid in memblock.reserved. > > * This will split up the memblock regions along node > > * boundaries and will set the node IDs as well. > > +* > > +* The nid will also be set in reserved_nodemask which is later > > +* used to clear MEMBLOCK_HOTPLUG flag. > > +* > > +* [ Note, when booting with mem=nn[kMG] or in a kdump kernel, > > +* numa_meminfo might not include all memblock.reserved > > +* memory ranges, because quirks such as trim_snb_memory() > > +* reserve specific pages for Sandy Bridge graphics. > > +* These ranges will remain with nid == MAX_NUMNODES. ] > > */ > > for (i = 0; i < numa_meminfo.nr_blks; i++) { > > struct numa_memblk *mb = numa_meminfo.blk + i; > > int ret; > > > > ret = memblock_set_node(mb->start, mb->end - mb->start, > > , mb->nid); > > + node_set(mb->nid, reserved_nodemask); > > Really? This will set all node id into reserved_nodemask. But in the > current code, it's setting nid into memblock reserved region which > interleaves with numa_memoinfo, then get those nid and set it in > reserved_nodemask. 
> This is quite different from my understanding. Please correct me if I am wrong.

You are right, I've missed the intersections of numa_meminfo with memblock.reserved.

x86 interaction with memblock is so, hmm, interesting...

> Thanks > Baoquan > > > WARN_ON_ONCE(ret); > > } > > > > - /* > > -* Now go over all reserved memblock regions, to construct a > > -* node mask of all kernel reserved memory areas. > > -* > > -* [ Note, when booting with mem=nn[kMG] or in a kdump kernel, > > -* numa_meminfo might not include all memblock.reserved > > -* memory ranges, because quirks such as trim_snb_memory() > > -* reserve specific pages for Sandy Bridge graphics. ] > > -*/ > > - for_each_memblock(reserved, mb_region) { > > - int nid = memblock_get_region_node(mb_region); > > - > > - if (nid != MAX_NUMNODES) > > - node_set(nid, reserved_nodemask); > > - } > > - > > /* > > * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory > > * belonging to the reserved node mask. > > -- > > 2.26.2 > > > -- Sincerely yours, Mike.
Re: [RESEND PATCH v5 08/11] ppc64/kexec_file: setup backup region for kdump kernel
Hari Bathini writes: > diff --git a/arch/powerpc/kexec/file_load_64.c > b/arch/powerpc/kexec/file_load_64.c > index a5c1442590b2..88408b17a7f6 100644 > --- a/arch/powerpc/kexec/file_load_64.c > +++ b/arch/powerpc/kexec/file_load_64.c > @@ -697,6 +699,69 @@ static int update_usable_mem_fdt(void *fdt, struct > crash_mem *usable_mem) > return ret; > } > > +/** > + * load_backup_segment - Locate a memory hole to place the backup region. > + * @image: Kexec image. > + * @kbuf:Buffer contents and memory parameters. > + * > + * Returns 0 on success, negative errno on error. > + */ > +static int load_backup_segment(struct kimage *image, struct kexec_buf *kbuf) > +{ > + void *buf; > + int ret; > + > + /* Setup a segment for backup region */ > + buf = vzalloc(BACKUP_SRC_SIZE); This worried me initially, because we can't copy from physically discontiguous pages in real mode. But as you explained this buffer is not used for copying. I think if you move the large comment below up here, it would be clearer. > diff --git a/arch/powerpc/purgatory/trampoline_64.S > b/arch/powerpc/purgatory/trampoline_64.S > index 464af8e8a4cb..d4b52961f592 100644 > --- a/arch/powerpc/purgatory/trampoline_64.S > +++ b/arch/powerpc/purgatory/trampoline_64.S > @@ -43,14 +44,38 @@ master: > mr %r17,%r3/* save cpu id to r17 */ > mr %r15,%r4/* save physical address in reg15 */ > > + bl 0f /* Work out where we're running */ > +0: mflr%r18 I know you just moved it, but this should use: bcl 20, 31, $+4 mflr%r18 Which is a special form of branch and link that doesn't unbalance the link stack in the chip. > + /* > + * Copy BACKUP_SRC_SIZE bytes from BACKUP_SRC_START to > + * backup_start 8 bytes at a time. > + * > + * Use r3 = dest, r4 = src, r5 = size, r6 = count > + */ > + ld %r3,(backup_start - 0b)(%r18) > + cmpdi %cr0,%r3,0 I prefer spaces or tabs between arguments, eg: cmpdi %cr0, %r3, 0 > + beq 80f /* skip if there is no backup region */ Local labels will make this clearer I think. 
eg: beq .Lskip_copy > + lis %r5,BACKUP_SRC_SIZE@h > + ori %r5,%r5,BACKUP_SRC_SIZE@l > + cmpdi %cr0,%r5,0 > + beq 80f /* skip if copy size is zero */ > + lis %r4,BACKUP_SRC_START@h > + ori %r4,%r4,BACKUP_SRC_START@l > + li %r6,0 > +70: .Lcopy_loop: > + ldx %r0,%r6,%r4 > + stdx%r0,%r6,%r3 > + addi%r6,%r6,8 > + cmpld %cr0,%r6,%r5 > + blt 70b blt .Lcopy_loop > + .Lskip_copy: > +80: > or %r3,%r3,%r3 /* ok now to high priority, lets boot */ > lis %r6,0x1 > mtctr %r6 /* delay a bit for slaves to catch up */ > bdnz. /* before we overwrite 0-100 again */ cheers
Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls
Hari Bathini writes: > Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9 > to be filled with OPAL base & entry addresses respectively. Setting > these registers allows the kernel to perform OPAL calls before the > device tree is parsed. I'm not convinced we want to do this. If we do it becomes part of the kexec ABI and we have to honour it into the future. And in practice there are no non-development kernels built with OPAL early debugging enabled, so it's not clear it actually helps anyone other than developers. cheers > v4 -> v5: > * New patch. Updated opal_base & opal_entry values in r8 & r9 respectively. > This change was part of the below dropped patch in v4: > - https://lore.kernel.org/patchwork/patch/1275667/ > > > arch/powerpc/kexec/file_load_64.c | 16 > arch/powerpc/purgatory/trampoline_64.S | 15 +++ > 2 files changed, 31 insertions(+) > > diff --git a/arch/powerpc/kexec/file_load_64.c > b/arch/powerpc/kexec/file_load_64.c > index 8df085a22fd7..a5c1442590b2 100644 > --- a/arch/powerpc/kexec/file_load_64.c > +++ b/arch/powerpc/kexec/file_load_64.c > @@ -713,6 +713,8 @@ int setup_purgatory_ppc64(struct kimage *image, const > void *slave_code, > const void *fdt, unsigned long kernel_load_addr, > unsigned long fdt_load_addr) > { > + struct device_node *dn = NULL; > + uint64_t val; > int ret; > > ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr, > @@ -735,9 +737,23 @@ int setup_purgatory_ppc64(struct kimage *image, const > void *slave_code, > goto out; > } > > + /* Setup OPAL base & entry values */ > + dn = of_find_node_by_path("/ibm,opal"); > + if (dn) { > + of_property_read_u64(dn, "opal-base-address", ); > + ret = kexec_purgatory_get_set_symbol(image, "opal_base", , > + sizeof(val), false); > + if (ret) > + goto out; > + > + of_property_read_u64(dn, "opal-entry-address", ); > + ret = kexec_purgatory_get_set_symbol(image, "opal_entry", , > + sizeof(val), false); > + } > out: > if (ret) > pr_err("Failed to setup purgatory 
symbols"); > + of_node_put(dn); > return ret; > } > > diff --git a/arch/powerpc/purgatory/trampoline_64.S > b/arch/powerpc/purgatory/trampoline_64.S > index a5a83c3f53e6..464af8e8a4cb 100644 > --- a/arch/powerpc/purgatory/trampoline_64.S > +++ b/arch/powerpc/purgatory/trampoline_64.S > @@ -61,6 +61,10 @@ master: > li %r4,28 > STWX_BE %r17,%r3,%r4/* Store my cpu as __be32 at byte 28 */ > 1: > + /* Load opal base and entry values in r8 & r9 respectively */ > + ld %r8,(opal_base - 0b)(%r18) > + ld %r9,(opal_entry - 0b)(%r18) > + > /* load the kernel address */ > ld %r4,(kernel - 0b)(%r18) > > @@ -102,6 +106,17 @@ dt_offset: > .8byte 0x0 > .size dt_offset, . - dt_offset > > + .balign 8 > + .globl opal_base > +opal_base: > + .8byte 0x0 > + .size opal_base, . - opal_base > + > + .balign 8 > + .globl opal_entry > +opal_entry: > + .8byte 0x0 > + .size opal_entry, . - opal_entry > > .data > .balign 8
Re: [RESEND PATCH v5 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel
Hari Bathini writes: > diff --git a/arch/powerpc/kexec/file_load_64.c > b/arch/powerpc/kexec/file_load_64.c > index 2df6f4273ddd..8df085a22fd7 100644 > --- a/arch/powerpc/kexec/file_load_64.c > +++ b/arch/powerpc/kexec/file_load_64.c > @@ -17,9 +17,21 @@ > #include > #include > #include > +#include > #include > +#include > +#include > #include > > +struct umem_info { > + uint64_t *buf; /* data buffer for usable-memory property */ > + uint32_t idx; /* current index */ > + uint32_t size; /* size allocated for the data buffer */ Use kernel types please, u64, u32. > + /* usable memory ranges to look up */ > + const struct crash_mem *umrngs; "umrngs". Given it's part of the umem_info struct could it just be "ranges"? > +}; > + > const struct kexec_file_ops * const kexec_file_loaders[] = { > _elf64_ops, > NULL > @@ -74,6 +86,42 @@ static int get_exclude_memory_ranges(struct crash_mem > **mem_ranges) > return ret; > } > > +/** > + * get_usable_memory_ranges - Get usable memory ranges. This list includes > + *regions like crashkernel, opal/rtas & > tce-table, > + *that kdump kernel could use. > + * @mem_ranges: Range list to add the memory ranges to. > + * > + * Returns 0 on success, negative errno on error. > + */ > +static int get_usable_memory_ranges(struct crash_mem **mem_ranges) > +{ > + int ret; > + > + /* > + * prom code doesn't take kindly to missing low memory. So, add I don't know what that's referring to, "prom code" is too vague. > + * [0, crashk_res.end] instead of [crashk_res.start, crashk_res.end] > + * to keep it happy. 
> + */ > + ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1); > + if (ret) > + goto out; > + > + ret = add_rtas_mem_range(mem_ranges); > + if (ret) > + goto out; > + > + ret = add_opal_mem_range(mem_ranges); > + if (ret) > + goto out; > + > + ret = add_tce_mem_ranges(mem_ranges); > +out: > + if (ret) > + pr_err("Failed to setup usable memory ranges\n"); > + return ret; > +} > + > /** > * __locate_mem_hole_top_down - Looks top down for a large enough memory hole > * in the memory regions between buf_min & > buf_max > @@ -273,6 +321,382 @@ static int locate_mem_hole_bottom_up_ppc64(struct > kexec_buf *kbuf, > return ret; > } > > +/** > + * check_realloc_usable_mem - Reallocate buffer if it can't accommodate > entries > + * @um_info: Usable memory buffer and ranges info. > + * @cnt: No. of entries to accommodate. > + * > + * Frees up the old buffer if memory reallocation fails. > + * > + * Returns buffer on success, NULL on error. > + */ > +static uint64_t *check_realloc_usable_mem(struct umem_info *um_info, int cnt) > +{ > + void *tbuf; > + > + if (um_info->size >= > + ((um_info->idx + cnt) * sizeof(*(um_info->buf > + return um_info->buf; This is awkward. AFAICS you only use um_info->size here, so instead why not store the number of u64s you have space for, as num for example. Then the above comparison becomes: if (um_info->num >= (um_info->idx + count)) Then you only have to calculate the size internally here for the realloc. > + > + um_info->size += MEM_RANGE_CHUNK_SZ; new_size = um_info->size + MEM_RANGE_CHUNK_SZ; tbuf = krealloc(um_info->buf, new_size, GFP_KERNEL); > + tbuf = krealloc(um_info->buf, um_info->size, GFP_KERNEL); > + if (!tbuf) { > + um_info->size -= MEM_RANGE_CHUNK_SZ; Then you can drop this. > + return NULL; > + } um_info->size = new_size; > + > + memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ); Just pass __GFP_ZERO to krealloc? 
> + return tbuf; > +} > + > +/** > + * add_usable_mem - Add the usable memory ranges within the given memory > range > + * to the buffer > + * @um_info:Usable memory buffer and ranges info. > + * @base: Base address of memory range to look for. > + * @end:End address of memory range to look for. > + * @cnt:No. of usable memory ranges added to buffer. One caller never uses this AFAICS. Couldn't the other caller just compare the um_info->idx before and after the call, and avoid another pass by reference parameter. > + * > + * Returns 0 on success, negative errno on error. > + */ > +static int add_usable_mem(struct umem_info *um_info, uint64_t base, > + uint64_t end, int *cnt) > +{ > + uint64_t loc_base, loc_end, *buf; > + const struct crash_mem *umrngs; > + int i, add; add should be bool. > + *cnt = 0; > + umrngs = um_info->umrngs; > + for (i = 0; i < umrngs->nr_ranges; i++) { > + add = 0; > + loc_base = umrngs->ranges[i].start; > +
Re: [PATCH v3 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states
Hello Rafael,

On 27/07/20 7:12 pm, Rafael J. Wysocki wrote:
On Tue, Jul 21, 2020 at 2:43 PM Pratik Rajesh Sampat wrote:

Fire directed smp_call_function_single IPIs from a specified source CPU to the specified target CPU to reduce the noise we have to wade through in the trace log.

And what's the purpose of it?

The idea for this comes from the fact that estimating wake-up latencies and residencies for stop states is not an easy task. The purpose is essentially to determine wakeup latencies caused by either an IPI or a timer, and compare them with the advertised wakeup latencies for each stop state. This might help in determining the accuracy of our advertised values and/or whether they need any re-calibration. The module is based on the idea written by Srivatsa Bhat and maintained by Vaidyanathan Srinivasan internally.

Queue HR timer and measure jitter. Wakeup latency measurement for idle states using hrtimer. Echo a value in ns to timer_test_function and watch the trace. An hrtimer will be queued, and when it fires the expected wakeup vs actual wakeup is computed and the delay printed in ns.

Implemented as a module which utilizes debugfs so that it can be integrated with selftests. To include the module, check the option and include as a module: kernel hacking -> Cpuidle latency selftests

[srivatsa.b...@linux.vnet.ibm.com: Initial implementation in cpidle/sysfs] [sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer and fix some of the time calculation] [e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and increase the resolution of IPI wakeup]

Signed-off-by: Pratik Rajesh Sampat
Reviewed-by: Gautham R.
Shenoy --- drivers/cpuidle/Makefile | 1 + drivers/cpuidle/test-cpuidle_latency.c | 150 + lib/Kconfig.debug | 10 ++ 3 files changed, 161 insertions(+) create mode 100644 drivers/cpuidle/test-cpuidle_latency.c diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile index f07800cbb43f..2ae05968078c 100644 --- a/drivers/cpuidle/Makefile +++ b/drivers/cpuidle/Makefile @@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o +obj-$(CONFIG_IDLE_LATENCY_SELFTEST) += test-cpuidle_latency.o ## # ARM SoC drivers diff --git a/drivers/cpuidle/test-cpuidle_latency.c b/drivers/cpuidle/test-cpuidle_latency.c new file mode 100644 index ..61574665e972 --- /dev/null +++ b/drivers/cpuidle/test-cpuidle_latency.c @@ -0,0 +1,150 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Module-based API test facility for cpuidle latency using IPIs and timers I'd like to see a more detailed description of what it does and how it works here. Right, I'll add that. Based on comments from Daniel I have also been working on a user-space only variant of this test as that does seem like a better way to go. The only downside is that the latency will be higher, but as we are taking baseline measurements the diff of that from our observed reading should still remain the same. Just that the test will take longer to run. I'm yet to accurately confirm this. I would appreciate your thoughts on that. 
+ */ + +#include +#include +#include + +/* IPI based wakeup latencies */ +struct latency { + unsigned int src_cpu; + unsigned int dest_cpu; + ktime_t time_start; + ktime_t time_end; + u64 latency_ns; +} ipi_wakeup; + +static void measure_latency(void *info) +{ + struct latency *v; + ktime_t time_diff; + + v = (struct latency *)info; + v->time_end = ktime_get(); + time_diff = ktime_sub(v->time_end, v->time_start); + v->latency_ns = ktime_to_ns(time_diff); +} + +void run_smp_call_function_test(unsigned int cpu) +{ + ipi_wakeup.src_cpu = smp_processor_id(); + ipi_wakeup.dest_cpu = cpu; + ipi_wakeup.time_start = ktime_get(); + smp_call_function_single(cpu, measure_latency, _wakeup, 1); +} + +/* Timer based wakeup latencies */ +struct timer_data { + unsigned int src_cpu; + u64 timeout; + ktime_t time_start; + ktime_t time_end; + struct hrtimer timer; + u64 timeout_diff_ns; +} timer_wakeup; + +static enum hrtimer_restart timer_called(struct hrtimer *hrtimer) +{ + struct timer_data *w; + ktime_t time_diff; + + w = container_of(hrtimer, struct timer_data, timer); + w->time_end = ktime_get(); + + time_diff = ktime_sub(w->time_end, w->time_start); + time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout)); + w->timeout_diff_ns = ktime_to_ns(time_diff); + return HRTIMER_NORESTART; +} + +static void run_timer_test(unsigned int ns) +{ +
[PATCH v4 3/3] powerpc test_emulate_step: add testcases for divde[.] and divdeu[.] instructions
add testcases for divde, divde., divdeu, divdeu. emulated instructions to cover few scenarios, * with same dividend and divisor to have undefine RT for divdeu[.] * with divide by zero to have undefine RT for both divde[.] and divdeu[.] * with negative dividend to cover -|divisor| < r <= 0 if the dividend is negative for divde[.] * normal case with proper dividend and divisor for both divde[.] and divdeu[.] Reviewed-by: Sandipan Das Signed-off-by: Balamuruhan S Acked-by: Naveen N. Rao --- arch/powerpc/lib/test_emulate_step.c | 156 +++ 1 file changed, 156 insertions(+) diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c index d242e9f72e0c..0a201b771477 100644 --- a/arch/powerpc/lib/test_emulate_step.c +++ b/arch/powerpc/lib/test_emulate_step.c @@ -1019,6 +1019,162 @@ static struct compute_test compute_tests[] = { } } }, + { + .mnemonic = "divde", + .subtests = { + { + .descr = "RA = LONG_MIN, RB = LONG_MIN", + .instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)), + .regs = { + .gpr[21] = LONG_MIN, + .gpr[22] = LONG_MIN, + } + }, + { + .descr = "RA = 1L, RB = 0", + .instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)), + .flags = IGNORE_GPR(20), + .regs = { + .gpr[21] = 1L, + .gpr[22] = 0, + } + }, + { + .descr = "RA = LONG_MIN, RB = LONG_MAX", + .instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)), + .regs = { + .gpr[21] = LONG_MIN, + .gpr[22] = LONG_MAX, + } + } + } + }, + { + .mnemonic = "divde.", + .subtests = { + { + .descr = "RA = LONG_MIN, RB = LONG_MIN", + .instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 22)), + .regs = { + .gpr[21] = LONG_MIN, + .gpr[22] = LONG_MIN, + } + }, + { + .descr = "RA = 1L, RB = 0", + .instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 22)), + .flags = IGNORE_GPR(20), + .regs = { + .gpr[21] = 1L, + .gpr[22] = 0, + } + }, + { + .descr = "RA = LONG_MIN, RB = LONG_MAX", + .instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 22)), + .regs = { + .gpr[21] = LONG_MIN, + .gpr[22] = LONG_MAX, + } + } + } + }, + { + .mnemonic = "divdeu", + .subtests = { 
+ { + .descr = "RA = LONG_MIN, RB = LONG_MIN", + .instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)), + .flags = IGNORE_GPR(20), + .regs = { + .gpr[21] = LONG_MIN, + .gpr[22] = LONG_MIN, + } + }, + { + .descr = "RA = 1L, RB = 0", + .instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)), + .flags = IGNORE_GPR(20), + .regs = { + .gpr[21] = 1L, + .gpr[22] = 0, + } + }, + { + .descr = "RA = LONG_MIN, RB = LONG_MAX", + .instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)), + .regs = { + .gpr[21] = LONG_MIN, +
[PATCH v4 2/3] powerpc sstep: add support for divde[.] and divdeu[.] instructions
This patch adds emulation support for divde, divdeu instructions, * Divide Doubleword Extended (divde[.]) * Divide Doubleword Extended Unsigned (divdeu[.]) Reviewed-by: Sandipan Das Signed-off-by: Balamuruhan S Acked-by: Naveen N. Rao --- arch/powerpc/lib/sstep.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index c58ea9e787cb..caee8cc77e19 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -1806,7 +1806,18 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs, op->val = (int) regs->gpr[ra] / (int) regs->gpr[rb]; goto arith_done; - +#ifdef __powerpc64__ + case 425: /* divde[.] */ + asm volatile(PPC_DIVDE(%0, %1, %2) : + "=r" (op->val) : "r" (regs->gpr[ra]), + "r" (regs->gpr[rb])); + goto arith_done; + case 393: /* divdeu[.] */ + asm volatile(PPC_DIVDEU(%0, %1, %2) : + "=r" (op->val) : "r" (regs->gpr[ra]), + "r" (regs->gpr[rb])); + goto arith_done; +#endif case 755: /* darn */ if (!cpu_has_feature(CPU_FTR_ARCH_300)) return -1; -- 2.24.1
[PATCH v4 1/3] powerpc ppc-opcode: add divde and divdeu opcodes
include instruction opcodes for divde and divdeu as macros. Reviewed-by: Sandipan Das Signed-off-by: Balamuruhan S Acked-by: Naveen N. Rao --- arch/powerpc/include/asm/ppc-opcode.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 4c0bdafb6a7b..a6e3700c4566 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -466,6 +466,10 @@ #define PPC_RAW_MULI(d, a, i) (0x1c00 | ___PPC_RT(d) | ___PPC_RA(a) | IMM_L(i)) #define PPC_RAW_DIVWU(d, a, b) (0x7c000396 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b)) #define PPC_RAW_DIVDU(d, a, b) (0x7c000392 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b)) +#define PPC_RAW_DIVDE(t, a, b) (0x7c000352 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b)) +#define PPC_RAW_DIVDE_DOT(t, a, b) (0x7c000352 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b) | 0x1) +#define PPC_RAW_DIVDEU(t, a, b)(0x7c000312 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b)) +#define PPC_RAW_DIVDEU_DOT(t, a, b)(0x7c000312 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b) | 0x1) #define PPC_RAW_AND(d, a, b) (0x7c38 | ___PPC_RA(d) | ___PPC_RS(a) | ___PPC_RB(b)) #define PPC_RAW_ANDI(d, a, i) (0x7000 | ___PPC_RA(d) | ___PPC_RS(a) | IMM_L(i)) #define PPC_RAW_AND_DOT(d, a, b) (0x7c39 | ___PPC_RA(d) | ___PPC_RS(a) | ___PPC_RB(b)) @@ -510,6 +514,8 @@ #define PPC_DARN(t, l) stringify_in_c(.long PPC_RAW_DARN(t, l)) #definePPC_DCBAL(a, b) stringify_in_c(.long PPC_RAW_DCBAL(a, b)) #definePPC_DCBZL(a, b) stringify_in_c(.long PPC_RAW_DCBZL(a, b)) +#definePPC_DIVDE(t, a, b) stringify_in_c(.long PPC_RAW_DIVDE(t, a, b)) +#definePPC_DIVDEU(t, a, b) stringify_in_c(.long PPC_RAW_DIVDEU(t, a, b)) #define PPC_LQARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LQARX(t, a, b, eh)) #define PPC_LDARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LDARX(t, a, b, eh)) #define PPC_LWARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LWARX(t, a, b, eh)) -- 2.24.1
[PATCH v4 0/3] Add support for divde[.] and divdeu[.] instruction emulation
Hi All, This patchset adds support to emulate divde, divde., divdeu and divdeu. instructions and testcases for it. Resend v4: rebased on latest powerpc next branch Changes in v4: - Fix review comments from Naveen, * replace TEST_DIVDEU() instead of wrongly used TEST_DIVDEU_DOT() in divdeu testcase. * Include `acked-by` tag from Naveen for the series. * Rebase it on latest mpe's merge tree. Changes in v3: - * Fix suggestion from Sandipan to remove `PPC_INST_DIVDE_DOT` and `PPC_INST_DIVDEU_DOT` opcode macros defined in ppc-opcode.h, reuse `PPC_INST_DIVDE` and `PPC_INST_DIVDEU` in test_emulate_step.c to derive them respectively. Changes in v2: - * Fix review comments from Paul to make divde_dot and divdeu_dot simple by using divde and divdeu, then goto `arith_done` instead of `compute_done`. * Include `Reviewed-by` tag from Sandipan Das. * Rebase with recent mpe's merge tree. I would request for your review and suggestions for making it better. Boot Log: :: :: :: :: 291494043: (291493996): [0.352649][T1] emulate_step_test: divde : RA = LONG_MIN, RB = LONG_MIN PASS 291517665: (291517580): [0.352695][T1] emulate_step_test: divde : RA = 1L, RB = 0PASS 291541357: (291541234): [0.352742][T1] emulate_step_test: divde : RA = LONG_MIN, RB = LONG_MAX PASS 291565107: (291564946): [0.352788][T1] emulate_step_test: divde. : RA = LONG_MIN, RB = LONG_MIN PASS 291588757: (291588558): [0.352834][T1] emulate_step_test: divde. : RA = 1L, RB = 0PASS 291612477: (291612240): [0.352881][T1] emulate_step_test: divde. 
: RA = LONG_MIN, RB = LONG_MAX PASS 291636201: (291635926): [0.352927][T1] emulate_step_test: divdeu : RA = LONG_MIN, RB = LONG_MIN PASS 291659830: (291659517): [0.352973][T1] emulate_step_test: divdeu : RA = 1L, RB = 0PASS 291683529: (291683178): [0.353019][T1] emulate_step_test: divdeu : RA = LONG_MIN, RB = LONG_MAX PASS 291707248: (291706859): [0.353066][T1] emulate_step_test: divdeu : RA = LONG_MAX - 1, RB = LONG_MAX PASS 291730962: (291730535): [0.353112][T1] emulate_step_test: divdeu : RA = LONG_MIN + 1, RB = LONG_MIN PASS 291754714: (291754249): [0.353158][T1] emulate_step_test: divdeu. : RA = LONG_MIN, RB = LONG_MIN PASS 291778371: (291777868): [0.353205][T1] emulate_step_test: divdeu. : RA = 1L, RB = 0PASS 291802098: (291801557): [0.353251][T1] emulate_step_test: divdeu. : RA = LONG_MIN, RB = LONG_MAX PASS 291825844: (291825265): [0.353297][T1] emulate_step_test: divdeu. : RA = LONG_MAX - 1, RB = LONG_MAX PASS 291849586: (291848969): [0.353344][T1] emulate_step_test: divdeu. : RA = LONG_MIN + 1, RB = LONG_MIN PASS :: :: :: :: 292520225: (292519608): [0.354654][T1] registered taskstats version 1 292584751: (292584134): [0.354780][T1] pstore: Using crash dump compression: deflate 296454422: (296453805): [0.362338][T1] Freeing unused kernel memory: 1408K 296467838: (296467221): [0.362364][T1] This architecture does not have kernel memory protection. 296485387: (296484770): [0.362398][T1] Run /init as init process 297987339: (297986761): [0.365332][ T46] mount (46) used greatest stack depth: 12512 bytes left 298889548: (29992): [0.367094][ T47] mount (47) used greatest stack depth: 11824 bytes left 355356256: (355355821): Welcome to Buildroot 355376898: (355376463): buildroot login: Balamuruhan S (3): powerpc ppc-opcode: add divde and divdeu opcodes powerpc sstep: add support for divde[.] and divdeu[.] instructions powerpc test_emulate_step: add testcases for divde[.] and divdeu[.] 
instructions arch/powerpc/include/asm/ppc-opcode.h | 6 + arch/powerpc/lib/sstep.c | 13 ++- arch/powerpc/lib/test_emulate_step.c | 156 ++ 3 files changed, 174 insertions(+), 1 deletion(-) base-commit: 7a9912e4cf048b607c8fafcfbdca750f1d78 -- 2.24.1
Re: [RESEND PATCH v5 03/11] powerpc/kexec_file: add helper functions for getting memory ranges
Hi Hari,

Some comments inline ...

Hari Bathini writes:
> diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c
> new file mode 100644
> index ..21bea1b78443
> --- /dev/null
> +++ b/arch/powerpc/kexec/ranges.c
> @@ -0,0 +1,417 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * powerpc code to implement the kexec_file_load syscall
> + *
> + * Copyright (C) 2004 Adam Litke (a...@us.ibm.com)
> + * Copyright (C) 2004 IBM Corp.
> + * Copyright (C) 2004,2005 Milton D Miller II, IBM Corporation
> + * Copyright (C) 2005 R Sharada (shar...@in.ibm.com)
> + * Copyright (C) 2006 Mohan Kumar M (mo...@in.ibm.com)
> + * Copyright (C) 2020 IBM Corporation
> + *
> + * Based on kexec-tools' kexec-ppc64.c, fs2dt.c.
> + * Heavily modified for the kernel by
> + * Hari Bathini .

Please just use your name, email addresses bit rot. It's in the commit
log anyway.

> + */
> +
> +#undef DEBUG
  ^
Don't do that in new code please.

> +#define pr_fmt(fmt) "kexec ranges: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * get_max_nr_ranges - Get the max no. of ranges crash_mem structure
> + *                     could hold, given the size allocated for it.
> + * @size: Allocation size of crash_mem structure.
> + *
> + * Returns the maximum no. of ranges.
> + */
> +static inline unsigned int get_max_nr_ranges(size_t size)
> +{
> +	return ((size - sizeof(struct crash_mem)) /
> +		sizeof(struct crash_mem_range));
> +}
> +
> +/**
> + * get_mem_rngs_size - Get the allocated size of mrngs based on
> + *                     max_nr_ranges and chunk size.
> + * @mrngs: Memory ranges.

mrngs is not a great name, what about memory_ranges or ranges?

Ditto everywhere else you use mrngs.

> + *
> + * Returns the maximum size of @mrngs.
> + */
> +static inline size_t get_mem_rngs_size(struct crash_mem *mrngs)
> +{
> +	size_t size;
> +
> +	if (!mrngs)
> +		return 0;
> +
> +	size = (sizeof(struct crash_mem) +
> +		(mrngs->max_nr_ranges * sizeof(struct crash_mem_range)));
> +
> +	/*
> +	 * Memory is allocated in size multiple of MEM_RANGE_CHUNK_SZ.
> +	 * So, align to get the actual length.
> +	 */
> +	return ALIGN(size, MEM_RANGE_CHUNK_SZ);
> +}
> +
> +/**
> + * __add_mem_range - add a memory range to memory ranges list.
> + * @mem_ranges: Range list to add the memory range to.
> + * @base: Base address of the range to add.
> + * @size: Size of the memory range to add.
> + *
> + * (Re)allocates memory, if needed.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int __add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size)
> +{
> +	struct crash_mem *mrngs = *mem_ranges;
> +
> +	if ((mrngs == NULL) || (mrngs->nr_ranges == mrngs->max_nr_ranges)) {

(mrngs == NULL) should just be !mrngs.

> +		mrngs = realloc_mem_ranges(mem_ranges);
> +		if (!mrngs)
> +			return -ENOMEM;
> +	}
> +
> +	mrngs->ranges[mrngs->nr_ranges].start = base;
> +	mrngs->ranges[mrngs->nr_ranges].end = base + size - 1;
> +	pr_debug("Added memory range [%#016llx - %#016llx] at index %d\n",
> +		 base, base + size - 1, mrngs->nr_ranges);
> +	mrngs->nr_ranges++;
> +	return 0;
> +}
> +
> +/**
> + * __merge_memory_ranges - Merges the given memory ranges list.
> + * @mem_ranges: Range list to merge.
> + *
> + * Assumes a sorted range list.
> + *
> + * Returns nothing.
> + */

A lot of this code is annoyingly similar to the memblock code, though
the internals of that are all static these days.

I guess for now we'll just have to add all this. Maybe in future it can
be consolidated.
> +static void __merge_memory_ranges(struct crash_mem *mrngs)
> +{
> +	struct crash_mem_range *rngs;
> +	int i, idx;
> +
> +	if (!mrngs)
> +		return;
> +
> +	idx = 0;
> +	rngs = &mrngs->ranges[0];
> +	for (i = 1; i < mrngs->nr_ranges; i++) {
> +		if (rngs[i].start <= (rngs[i-1].end + 1))
> +			rngs[idx].end = rngs[i].end;
> +		else {
> +			idx++;
> +			if (i == idx)
> +				continue;
> +
> +			rngs[idx] = rngs[i];
> +		}
> +	}
> +	mrngs->nr_ranges = idx + 1;
> +}
> +
> +/**
> + * realloc_mem_ranges - reallocate mem_ranges with size incremented
> + *                      by MEM_RANGE_CHUNK_SZ. Frees up the old memory,
> + *                      if memory allocation fails.
> + * @mem_ranges: Memory ranges to reallocate.
> + *
> + * Returns pointer to reallocated memory on success, NULL otherwise.
> + */
> +struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges)
> +{
> +	struct crash_mem *mrngs = *mem_ranges;
> +	unsigned int nr_ranges;
> +	size_t size;
> +
> +	size =
Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
* Mike Rapoport wrote: > On Tue, Jul 28, 2020 at 12:44:40PM +0200, Ingo Molnar wrote: > > > > * Mike Rapoport wrote: > > > > > From: Mike Rapoport > > > > > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo > > > regions to set node ID in memblock.reserved and than traverses > > > memblock.reserved to update reserved_nodemask to include node IDs that > > > were > > > set in the first loop. > > > > > > Remove redundant traversal over memblock.reserved and update > > > reserved_nodemask while iterating over numa_meminfo. > > > > > > Signed-off-by: Mike Rapoport > > > --- > > > arch/x86/mm/numa.c | 26 ++ > > > 1 file changed, 10 insertions(+), 16 deletions(-) > > > > I suspect you'd like to carry this in the -mm tree? > > Yes. > > > Acked-by: Ingo Molnar > > Thanks! Assuming it is correct and works. :-) Thanks, Ingo
Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state
Excerpts from pet...@infradead.org's message of July 26, 2020 10:11 pm:
> On Sun, Jul 26, 2020 at 02:14:34PM +1000, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 26, 2020 6:26 am:
>
>> > Which is 'funny' when it interleaves like:
>> >
>> >	local_irq_disable();
>> >	...
>> >	local_irq_enable()
>> >	  trace_hardirqs_on();
>> >
>> >	  raw_local_irq_enable();
>> >
>> > Because then it will undo the trace_hardirqs_on() we just did. With the
>> > result that both tracing and lockdep will see a hardirqs-disable without
>> > a matching enable, while the hardware state is enabled.
>>
>> Seems like an arch problem -- why not disable if it was enabled only?
>> I guess the local_irq tracing calls are a mess so maybe they copied
>> those.
>
> Because, as I wrote earlier, then we can miss updating software state.
> So your proposal has:
>
>	raw_local_irq_disable()
>
>	if (!arch_irqs_disabled(regs->flags) // false
>		trace_hardirqs_off();
>
>	// tracing/lockdep still think IRQs are enabled
>	// hardware IRQ state is disabled.

... and then lockdep_nmi_enter can disable IRQs if they were enabled?

The only reason it's done this way as opposed to a much simpler counter
increment/decrement AFAIKS is to avoid some overhead of calling
trace_hardirqs_on/off (which seems a bit dubious but let's go with it).
In that case the lockdep_nmi_enter code is the right spot to clean up
that gap vs NMIs.

I guess there's an argument that arch_nmi_enter could do it. I don't see
the problem with fixing it up here though, this is a slow path so it
doesn't matter if we have some more logic for it.

Thanks,
Nick
Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport
>
> numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> regions to set node ID in memblock.reserved and than traverses
> memblock.reserved to update reserved_nodemask to include node IDs that were
> set in the first loop.
>
> Remove redundant traversal over memblock.reserved and update
> reserved_nodemask while iterating over numa_meminfo.
>
> Signed-off-by: Mike Rapoport
> ---
>  arch/x86/mm/numa.c | 26 ++
>  1 file changed, 10 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 8ee952038c80..4078abd33938 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -498,31 +498,25 @@ static void __init numa_clear_kernel_node_hotplug(void)
>	 * and use those ranges to set the nid in memblock.reserved.
>	 * This will split up the memblock regions along node
>	 * boundaries and will set the node IDs as well.
> +	 *
> +	 * The nid will also be set in reserved_nodemask which is later
> +	 * used to clear MEMBLOCK_HOTPLUG flag.
> +	 *
> +	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> +	 *   numa_meminfo might not include all memblock.reserved
> +	 *   memory ranges, because quirks such as trim_snb_memory()
> +	 *   reserve specific pages for Sandy Bridge graphics.
> +	 *   These ranges will remain with nid == MAX_NUMNODES. ]
>	 */
>	for (i = 0; i < numa_meminfo.nr_blks; i++) {
>		struct numa_memblk *mb = numa_meminfo.blk + i;
>		int ret;
>
>		ret = memblock_set_node(mb->start, mb->end - mb->start,
>					&memblock.reserved, mb->nid);
> +		node_set(mb->nid, reserved_nodemask);

Really? This sets every node ID from numa_meminfo in reserved_nodemask.
But the current code sets the nid on the memblock.reserved regions that
intersect numa_meminfo, then collects only those nids into
reserved_nodemask. To my understanding these are quite different.
Please correct me if I am wrong.
Thanks
Baoquan

>		WARN_ON_ONCE(ret);
>	}
>
> -	/*
> -	 * Now go over all reserved memblock regions, to construct a
> -	 * node mask of all kernel reserved memory areas.
> -	 *
> -	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> -	 *   numa_meminfo might not include all memblock.reserved
> -	 *   memory ranges, because quirks such as trim_snb_memory()
> -	 *   reserve specific pages for Sandy Bridge graphics. ]
> -	 */
> -	for_each_memblock(reserved, mb_region) {
> -		int nid = memblock_get_region_node(mb_region);
> -
> -		if (nid != MAX_NUMNODES)
> -			node_set(nid, reserved_nodemask);
> -	}
> -
>	/*
>	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
>	 * belonging to the reserved node mask.
> --
> 2.26.2
>
>
Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
On Tue, Jul 28, 2020 at 12:44:40PM +0200, Ingo Molnar wrote: > > * Mike Rapoport wrote: > > > From: Mike Rapoport > > > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo > > regions to set node ID in memblock.reserved and than traverses > > memblock.reserved to update reserved_nodemask to include node IDs that were > > set in the first loop. > > > > Remove redundant traversal over memblock.reserved and update > > reserved_nodemask while iterating over numa_meminfo. > > > > Signed-off-by: Mike Rapoport > > --- > > arch/x86/mm/numa.c | 26 ++ > > 1 file changed, 10 insertions(+), 16 deletions(-) > > I suspect you'd like to carry this in the -mm tree? Yes. > Acked-by: Ingo Molnar Thanks! > Thanks, > > Ingo -- Sincerely yours, Mike.
Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Excerpts from Linus Torvalds's message of July 28, 2020 4:37 am:
> [ Adding linux-arch, just to make other architectures aware of this issue too.
>
>   We have a "flush_tlb_fix_spurious_fault()" thing to take care of the
>   "TLB may contain stale entries, we can't take the same fault over and
>   over again" situation.
>
>   On x86, it's a no-op, because x86 doesn't do that. x86 will re-walk
>   the page tables - or possibly just always invalidate the faulting TLB
>   entry - before taking a fault, so there can be no long-term stale
>   TLB's.

[snip]

> It looks like powerpc people at least thought about this, and only
> do it if there is a coprocessor. Which sounds a bit confused, but I
> don't know the rules.

I'm not sure about ppc32 and 64e, I'm almost certain they should do a
local flush if anything, and someone with a good understanding of the
ISAs and CPUs might be able to nop it entirely. I agree global can't
ever really make sense (except as a default because we have no generic
local flush).

powerpc/64s reloads translations after taking a fault, so it's fine
with a nop here. The quirk is a problem with coprocessor where it's
supposed to invalidate the translation after a fault but it doesn't, so
we can get a read-only TLB stuck after something else does a RO->RW
upgrade on the TLB. Something like that IIRC. Coprocessors have their
own MMU which lives in the nest not the core, so you need a global TLB
flush to invalidate that thing.

Thanks,
Nick
Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved
* Mike Rapoport wrote: > From: Mike Rapoport > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo > regions to set node ID in memblock.reserved and than traverses > memblock.reserved to update reserved_nodemask to include node IDs that were > set in the first loop. > > Remove redundant traversal over memblock.reserved and update > reserved_nodemask while iterating over numa_meminfo. > > Signed-off-by: Mike Rapoport > --- > arch/x86/mm/numa.c | 26 ++ > 1 file changed, 10 insertions(+), 16 deletions(-) I suspect you'd like to carry this in the -mm tree? Acked-by: Ingo Molnar Thanks, Ingo
Re: [PATCH 03/15] arm, xtensa: simplify initialization of high memory pages
On Mon, Jul 27, 2020 at 10:12 PM Mike Rapoport wrote: > > From: Mike Rapoport > > The function free_highpages() in both arm and xtensa essentially open-code > for_each_free_mem_range() loop to detect high memory pages that were not > reserved and that should be initialized and passed to the buddy allocator. > > Replace open-coded implementation of for_each_free_mem_range() with usage > of memblock API to simplify the code. > > Signed-off-by: Mike Rapoport > --- > arch/arm/mm/init.c| 48 +++-- > arch/xtensa/mm/init.c | 55 --- > 2 files changed, 18 insertions(+), 85 deletions(-) For the xtensa part: Reviewed-by: Max Filippov Tested-by: Max Filippov -- Thanks. -- Max
[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c
https://bugzilla.kernel.org/show_bug.cgi?id=206203

--- Comment #17 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 290641
  --> https://bugzilla.kernel.org/attachment.cgi?id=290641&action=edit
kmemleak output (kernel 5.8-rc7, PowerMac G4 3,6)

Also happens on my G4 DP.

--
You are receiving this mail because:
You are watching the assignee of the bug.
[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c
https://bugzilla.kernel.org/show_bug.cgi?id=206203

--- Comment #16 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 290639
  --> https://bugzilla.kernel.org/attachment.cgi?id=290639&action=edit
dmesg (kernel 5.8-rc7, PowerMac G4 3,6)

--
You are receiving this mail because:
You are watching the assignee of the bug.
Re: [PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's
On Thu, Jul 23, 2020 at 10:41 PM Nathan Lynch wrote:
>
> Pingfan Liu writes:
> > This patch prepares for the incoming patch which swaps the order of KOBJ_
> > uevent and dt's updating.
> >
> > It has no functional effect, just groups lmb operation and memblock's in
> > order to insert dt updating operation easily, and makes it easier to
> > review.
>
> ...
>
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 5d545b7..1a3ac3b 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
> >  static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >  {
> >	unsigned long block_sz;
> > -	int rc;
> > +	phys_addr_t base_addr;
> > +	int rc, nid;
> >
> >	if (!lmb_is_removable(lmb))
> >		return -EINVAL;
> >
> > @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >	if (rc)
> >		return rc;
> >
> > +	base_addr = lmb->base_addr;
> > +	nid = lmb->nid;
> >	block_sz = pseries_memory_block_size();
> >
> > -	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
> > -
> > -	/* Update memory regions for memory remove */
> > -	memblock_remove(lmb->base_addr, block_sz);
> > -
> >	invalidate_lmb_associativity_index(lmb);
> >	lmb_clear_nid(lmb);
> >	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> >
> > +	__remove_memory(nid, base_addr, block_sz);
> > +
> > +	/* Update memory regions for memory remove */
> > +	memblock_remove(base_addr, block_sz);
> > +
> >	return 0;
> >  }
>
> I don't understand; the commit message should not claim this has no
> functional effect when it changes the order of operations like
> this. Maybe this is an improvement over the current behavior, but it's
> not explained why it would be.

One group of functions, whose names contain lmb, is powerpc-specific
and is used to form the device tree.
The other group, __remove_memory() and memblock_remove(), is integrated
with Linux mm. And [2/2] arranges the dt update just before
__remove_memory().

Thanks,
Pingfan
Re: [PATCH 02/15] dma-contiguous: simplify cma_early_percent_memory()
On Tue, Jul 28, 2020 at 08:11:40AM +0300, Mike Rapoport wrote: > From: Mike Rapoport > > The memory size calculation in cma_early_percent_memory() traverses > memblock.memory rather than simply call memblock_phys_mem_size(). The > comment in that function suggests that at some point there should have been > call to memblock_analyze() before memblock_phys_mem_size() could be used. > As of now, there is no memblock_analyze() at all and > memblock_phys_mem_size() can be used as soon as cold-plug memory is > registerd with memblock. > > Replace loop over memblock.memory with a call to memblock_phys_mem_size(). > > Signed-off-by: Mike Rapoport Looks good: Reviewed-by: Christoph Hellwig