Re: [PATCH] powerpc/64s/hash: Fix hash_preload running with interrupts enabled

2020-07-28 Thread Athira Rajeev



> On 28-Jul-2020, at 6:14 AM, Michael Ellerman  wrote:
> 
> Athira Rajeev  writes:
>>> On 27-Jul-2020, at 6:05 PM, Michael Ellerman  wrote:
>>> 
>>> Athira Rajeev  writes:
> On 27-Jul-2020, at 11:39 AM, Nicholas Piggin  wrote:
> 
> Commit 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from 
> the
> caller") removed the local_irq_disable from hash_preload, but it was
> required for more than just the page table walk: the hash pte busy bit is
> effectively a lock which may be taken in interrupt context, and the local
> update flag test must not be preempted before it's used.
> 
> This solves apparent lockups with perf interrupting __hash_page_64K. If
> get_perf_callchain then also takes a hash fault on the same page while it
> is already locked, it will loop forever taking hash faults, which looks 
> like
> this:
> 
> cpu 0x49e: Vector: 100 (System Reset) at [c0001a4f7d70]
>  pc: c0072dc8: hash_page_mm+0x8/0x800
>  lr: c000c5a4: do_hash_page+0x24/0x38
>  sp: c0002ac1cc69ac70
> msr: 80081033
> current = 0xc0002ac1cc602e00
> paca= 0xc0001de1f280   irqmask: 0x03   irq_happened: 0x01
>  pid   = 20118, comm = pread2_processe
> Linux version 5.8.0-rc6-00345-g1fad14f18bc6
> 49e:mon> t
> [c0002ac1cc69ac70] c000c5a4 do_hash_page+0x24/0x38 (unreliable)
> --- Exception: 300 (Data Access) at c008fa60 
> __copy_tofrom_user_power7+0x20c/0x7ac
> [link register   ] c0335d10 copy_from_user_nofault+0xf0/0x150
> [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
> [c0002ac1cc69afa0] c0109df0 read_user_stack_64+0x70/0xf0
> [c0002ac1cc69afd0] c0109fcc perf_callchain_user_64+0x15c/0x410
> [c0002ac1cc69b060] c0109c00 perf_callchain_user+0x20/0x40
> [c0002ac1cc69b080] c031c6cc get_perf_callchain+0x25c/0x360
> [c0002ac1cc69b120] c0316b50 perf_callchain+0x70/0xa0
> [c0002ac1cc69b140] c0316ddc perf_prepare_sample+0x25c/0x790
> [c0002ac1cc69b1a0] c0317350 perf_event_output_forward+0x40/0xb0
> [c0002ac1cc69b220] c0306138 __perf_event_overflow+0x88/0x1a0
> [c0002ac1cc69b270] c010cf70 record_and_restart+0x230/0x750
> [c0002ac1cc69b620] c010d69c perf_event_interrupt+0x20c/0x510
> [c0002ac1cc69b730] c0027d9c 
> performance_monitor_exception+0x4c/0x60
> [c0002ac1cc69b750] c000b2f8 
> performance_monitor_common_virt+0x1b8/0x1c0
> --- Exception: f00 (Performance Monitor) at c00cb5b0 
> pSeries_lpar_hpte_insert+0x0/0x160
> [link register   ] c00846f0 __hash_page_64K+0x210/0x540
> [c0002ac1cc69ba50]  (unreliable)
> [c0002ac1cc69bb00] c0073ae0 update_mmu_cache+0x390/0x3a0
> [c0002ac1cc69bb70] c037f024 wp_page_copy+0x364/0xce0
> [c0002ac1cc69bc20] c038272c do_wp_page+0xdc/0xa60
> [c0002ac1cc69bc70] c03857bc handle_mm_fault+0xb9c/0x1b60
> [c0002ac1cc69bd50] c006c434 __do_page_fault+0x314/0xc90
> [c0002ac1cc69be20] c000c5c8 handle_page_fault+0x10/0x2c
> --- Exception: 300 (Data Access) at 7fff8c861fe8
> SP (76b19660) is in userspace
> 
> Reported-by: Athira Rajeev 
> Reported-by: Anton Blanchard 
> Reviewed-by: Aneesh Kumar K.V 
> Fixes: 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from 
> the
> caller")
> Signed-off-by: Nicholas Piggin 
 
 
 Hi,
 
 Tested with the patch and it fixes the lockups I was seeing with my test 
 run.
 Thanks for the fix.
 
 Tested-by: Athira Rajeev 
>>> 
>>> Thanks for testing.
>>> 
>>> What test are you running?
>> 
>> Hi Michael
>> 
>> I was running “perf record” and Unixbench tests
>> (https://github.com/kdlucas/byte-unixbench) in parallel, which is where we
>> were hitting the soft lockups.
>> 
>> 1. Perf command run:
>> # perf record -a -g -c 1000 -o  sleep 60
>> 
>> 2. Unixbench tests
>> # Run -q -c  spawn
> 
> Thanks, I can reproduce it with that.

Sure Michael


> 
> cheers



[PATCH] powerpc: Fix MMCRA_BHRB_DISABLE define to work with binutils version < 2.28

2020-07-28 Thread Athira Rajeev
Commit 9908c826d5ed ("powerpc/perf: Add Power10 PMU feature to
DT CPU features") defines MMCRA_BHRB_DISABLE as `0x20UL`.
Binutils versions older than 2.28 do not accept the UL suffix on
constants used in assembly files, so cpu_setup_power.S fails to build:

linux-ppc/arch/powerpc/kernel/cpu_setup_power.S: Assembler messages:
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of 
line, first unrecognized character is `L'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of 
line, first unrecognized character is `L'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', 
expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: operand out of 
range (0x0020 is not between 0x8000 and 
0x)

Fix this by wrapping the value in the `_UL()` macro.
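
For context, this is roughly why _UL() helps, based on the definitions in
include/uapi/linux/const.h (a simplified sketch, not the full header):

/* Simplified from include/uapi/linux/const.h */
#ifdef __ASSEMBLY__
#define _AC(X, Y)	X		/* assembler: drop the suffix */
#else
#define __AC(X, Y)	(X##Y)
#define _AC(X, Y)	__AC(X, Y)	/* C: paste the suffix back on */
#endif

#define _UL(x)		(_AC(x, UL))

So cpu_setup_power.S sees a plain 0x20 that older binutils can assemble,
while C code still gets the 0x20UL constant.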

Fixes: 9908c826d5ed ("powerpc/perf: Add Power10 PMU feature to DT CPU features")
Signed-off-by: Athira Rajeev 
Suggested-by: Michael Ellerman 
---
 arch/powerpc/include/asm/reg.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index ae71027..41419f1 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -12,6 +12,7 @@
 #ifdef __KERNEL__
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -888,7 +889,7 @@
 #define   MMCRA_SLOT   0x0700UL /* SLOT bits (37-39) */
 #define   MMCRA_SLOT_SHIFT 24
 #define   MMCRA_SAMPLE_ENABLE 0x0001UL /* enable sampling */
-#define   MMCRA_BHRB_DISABLE  0x20UL // BHRB disable bit for ISA v3.1
+#define   MMCRA_BHRB_DISABLE  _UL(0x20) // BHRB disable bit for ISA 
v3.1
 #define   POWER6_MMCRA_SDSYNC 0x0800ULL/* SDAR/SIAR synced */
 #define   POWER6_MMCRA_SIHV   0x0400ULL
 #define   POWER6_MMCRA_SIPR   0x0200ULL
-- 
1.8.3.1



[PATCH] powerpc/configs: Add BLK_DEV_NVME to pseries_defconfig

2020-07-28 Thread Anton Blanchard
I've forgotten to manually enable NVME when building pseries kernels
for machines with NVME adapters. Since it's a reasonably common
configuration, enable it by default.

Signed-off-by: Anton Blanchard 
---
 arch/powerpc/configs/pseries_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index dfa4a726333b..358642d6f46d 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -94,6 +94,7 @@ CONFIG_BLK_DEV_NBD=m
 CONFIG_BLK_DEV_RAM=y
 CONFIG_BLK_DEV_RAM_SIZE=65536
 CONFIG_VIRTIO_BLK=m
+CONFIG_BLK_DEV_NVME=y
 CONFIG_BLK_DEV_SD=y
 CONFIG_CHR_DEV_ST=m
 CONFIG_BLK_DEV_SR=y
-- 
2.26.2



Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls

2020-07-28 Thread Michael Ellerman
Hari Bathini  writes:
> On 28/07/20 7:16 pm, Michael Ellerman wrote:
>> Hari Bathini  writes:
>>> Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
>>> to be filled with OPAL base & entry addresses respectively. Setting
>>> these registers allows the kernel to perform OPAL calls before the
>>> device tree is parsed.
>> 
>> I'm not convinced we want to do this.
>> 
>> If we do it becomes part of the kexec ABI and we have to honour it into
>> the future.
>> 
>> And in practice there are no non-development kernels built with OPAL early
>> debugging enabled, so it's not clear it actually helps anyone other than
>> developers.
>> 
>
> Hmmm.. kexec-tools does it since commit d58ad564852c ("kexec/ppc64
> Enable early kernel's OPAL calls") for kexec_load syscall. So, we would
> be breaking kexec ABI either way, I guess.

Ugh, OK.

> Let me put this patch at the end of the series in the respin to let you
> decide whether to have it or not..

Thanks.

cheers


Re: [PATCH 13/15] arch, drivers: replace for_each_membock() with for_each_mem_range()

2020-07-28 Thread Emil Renner Berthing
On Tue, 28 Jul 2020, 07:16 Mike Rapoport,  wrote:

> From: Mike Rapoport 
>
> There are several occurrences of the following pattern:
>
> for_each_memblock(memory, reg) {
> start = __pfn_to_phys(memblock_region_memory_base_pfn(reg));
> end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));
>
> /* do something with start and end */
> }
>
> Using for_each_mem_range() iterator is more appropriate in such cases and
> allows simpler and cleaner code.
>
> Signed-off-by: Mike Rapoport 
> ---
>  arch/arm/kernel/setup.c  | 18 +++
>  arch/arm/mm/mmu.c| 39 
>  arch/arm/mm/pmsa-v7.c| 20 ++--
>  arch/arm/mm/pmsa-v8.c| 17 +--
>  arch/arm/xen/mm.c|  7 +++--
>  arch/arm64/mm/kasan_init.c   |  8 ++---
>  arch/arm64/mm/mmu.c  | 11 ++-
>  arch/c6x/kernel/setup.c  |  9 +++---
>  arch/microblaze/mm/init.c|  9 +++---
>  arch/mips/cavium-octeon/dma-octeon.c | 12 
>  arch/mips/kernel/setup.c | 31 +--
>  arch/openrisc/mm/init.c  |  8 +++--
>  arch/powerpc/kernel/fadump.c | 27 +++-
>  arch/powerpc/mm/book3s64/hash_utils.c| 16 +-
>  arch/powerpc/mm/book3s64/radix_pgtable.c | 11 +++
>  arch/powerpc/mm/kasan/kasan_init_32.c|  8 ++---
>  arch/powerpc/mm/mem.c| 16 ++
>  arch/powerpc/mm/pgtable_32.c |  8 ++---
>  arch/riscv/mm/init.c | 24 ++-
>  arch/riscv/mm/kasan_init.c   | 10 +++---
>  arch/s390/kernel/setup.c | 27 ++--
>  arch/s390/mm/vmem.c  | 16 +-
>  arch/sparc/mm/init_64.c  | 12 +++-
>  drivers/bus/mvebu-mbus.c | 12 
>  drivers/s390/char/zcore.c|  9 +++---
>  25 files changed, 187 insertions(+), 198 deletions(-)
>
> diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
> index d8e18cdd96d3..3f65d0ac9f63 100644
> --- a/arch/arm/kernel/setup.c
> +++ b/arch/arm/kernel/setup.c
> @@ -843,19 +843,25 @@ early_param("mem", early_mem);
>
>  static void __init request_standard_resources(const struct machine_desc
> *mdesc)
>  {
> -   struct memblock_region *region;
> +   phys_addr_t start, end, res_end;
> struct resource *res;
> +   u64 i;
>
> kernel_code.start   = virt_to_phys(_text);
> kernel_code.end = virt_to_phys(__init_begin - 1);
> kernel_data.start   = virt_to_phys(_sdata);
> kernel_data.end = virt_to_phys(_end - 1);
>
> -   for_each_memblock(memory, region) {
> -   phys_addr_t start =
> __pfn_to_phys(memblock_region_memory_base_pfn(region));
> -   phys_addr_t end =
> __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
> +   for_each_mem_range(i, &start, &end) {
> unsigned long boot_alias_start;
>
> +   /*
> +* In memblock, end points to the first byte after the
> +* range while in resourses, end points to the last byte in
> +* the range.
> +*/
> +   res_end = end - 1;
> +
> /*
>  * Some systems have a special memory alias which is only
>  * used for booting.  We need to advertise this region to
> @@ -869,7 +875,7 @@ static void __init request_standard_resources(const
> struct machine_desc *mdesc)
>   __func__, sizeof(*res));
> res->name = "System RAM (boot alias)";
> res->start = boot_alias_start;
> -   res->end = phys_to_idmap(end);
> +   res->end = phys_to_idmap(res_end);
> res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> request_resource(&iomem_resource, res);
> }
> @@ -880,7 +886,7 @@ static void __init request_standard_resources(const
> struct machine_desc *mdesc)
>   sizeof(*res));
> res->name  = "System RAM";
> res->start = start;
> -   res->end = end;
> +   res->end = res_end;
> res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
>
> request_resource(&iomem_resource, res);
> diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
> index 628028bfbb92..a149d9cb4fdb 100644
> --- a/arch/arm/mm/mmu.c
> +++ b/arch/arm/mm/mmu.c
> @@ -1155,9 +1155,8 @@ phys_addr_t arm_lowmem_limit __initdata = 0;
>
>  void __init adjust_lowmem_bounds(void)
>  {
> -   phys_addr_t memblock_limit = 0;
> -   u64 vmalloc_limit;
> -   struct memblock_region *reg;
> +   phys_addr_t block_start, block_end, memblock_limit = 0;
> +   

Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault

2020-07-28 Thread Nicholas Piggin
Excerpts from Linus Torvalds's message of July 29, 2020 5:02 am:
> On Tue, Jul 28, 2020 at 3:53 AM Nicholas Piggin  wrote:
>>
>> The quirk is a problem with coprocessor where it's supposed to
>> invalidate the translation after a fault but it doesn't, so we can get a
>> read-only TLB stuck after something else does a RO->RW upgrade on the
>> TLB. Something like that IIRC.  Coprocessors have their own MMU which
>> lives in the nest not the core, so you need a global TLB flush to
>> invalidate that thing.
> 
> So I assumed, but it does seem confused.
> 
> Why? Because if there are stale translations on the co-processor,
> there's no guarantee that one of the CPU's will have them and take a
> fault.
> 
> So I'm not seeing why a core CPU doing spurious TLB invalidation would
> follow from "stale TLB in the Nest".

If the nest MMU access faults, it sends an interrupt to the CPU and
the driver tries to handle the page fault for it (I think that's how
it works).

> If anything, I think "we have a coprocessor that needs to never have
> stale TLB entries" would impact the _regular_ TLB invalidates (by
> update_mmu_cache()) and perhaps make those more aggressive, exactly
> because the coprocessor may not handle the fault as gracefully.

It could be done that way... Hmm although we do have something similar 
also in radix__ptep_set_access_flags for the relaxing permissions case
so maybe this is required for not-present faults as well? I'm not 
actually sure now.

But it's a bit weird and awkward because it's working around quirks in
the hardware which aren't regular, not because we're _completely_ 
confused (I hope).

Thanks,
Nick


Re: [PATCH] powerpc/powernv/pci: Fix build of pci-ioda.o

2020-07-28 Thread Oliver O'Halloran
On Wed, Jul 29, 2020 at 8:35 AM Gustavo Romero  wrote:
>
> Currently pnv_ioda_setup_bus_dma() is outside of a CONFIG_IOMMU_API guard
> and if CONFIG_IOMMU_API=n the build can fail if the compiler sets
> -Werror=unused-function, because pnv_ioda_setup_bus_dma() is only used in
> functions guarded by a CONFIG_IOMMU_API guard.
>
> That issue can be easily reproduced using the skiroot_defconfig. For other
> configs, like powernv_defconfig, that issue is hidden by the fact that
> if CONFIG_IOMMU_SUPPORT is enabled plus other common IOMMU options, like
> CONFIG_OF_IOMMU, by default CONFIG_IOMMU_API is enabled as well. Hence, for
> powernv_defconfig, it's necessary to set CONFIG_IOMMU_SUPPORT=n to make the
> build fail, because CONFIG_PCI=y and pci-ioda.c is included in the build,
> but since CONFIG_IOMMU_SUPPORT=n the CONFIG_IOMMU_API is disabled, breaking
> the build.
>
> This commit fixes that build issue by moving the pnv_ioda_setup_bus_dma()
> inside a CONFIG_IOMMU_API guard, so when CONFIG_IOMMU_API is disabled that
> function is not defined.

I think a fix for this is already in -next.


[PATCH] powerpc/powernv/pci: Fix build of pci-ioda.o

2020-07-28 Thread Gustavo Romero
Currently pnv_ioda_setup_bus_dma() is outside of a CONFIG_IOMMU_API guard
and if CONFIG_IOMMU_API=n the build can fail if the compiler sets
-Werror=unused-function, because pnv_ioda_setup_bus_dma() is only used in
functions guarded by a CONFIG_IOMMU_API guard.

That issue can be easily reproduced using the skiroot_defconfig. For other
configs, like powernv_defconfig, that issue is hidden by the fact that
if CONFIG_IOMMU_SUPPORT is enabled plus other common IOMMU options, like
CONFIG_OF_IOMMU, by default CONFIG_IOMMU_API is enabled as well. Hence, for
powernv_defconfig, it's necessary to set CONFIG_IOMMU_SUPPORT=n to make the
build fail, because CONFIG_PCI=y and pci-ioda.c is included in the build,
but since CONFIG_IOMMU_SUPPORT=n the CONFIG_IOMMU_API is disabled, breaking
the build.

This commit fixes that build issue by moving the pnv_ioda_setup_bus_dma()
inside a CONFIG_IOMMU_API guard, so when CONFIG_IOMMU_API is disabled that
function is not defined.

Signed-off-by: Gustavo Romero 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..743d840712da 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
-{
-   struct pci_dev *dev;
-
-   list_for_each_entry(dev, &bus->devices, bus_list) {
-   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
-   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
-
-   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
-   }
-}
-
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 bool real_mode)
 {
@@ -2501,6 +2488,19 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
+   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
+
+   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
+   }
+}
+
 unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
__u64 window_size, __u32 levels)
 {
-- 
2.17.1



Re: [RESEND PATCH v5 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel

2020-07-28 Thread Hari Bathini




On 28/07/20 7:14 pm, Michael Ellerman wrote:

Hari Bathini  writes:

diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index 2df6f4273ddd..8df085a22fd7 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -17,9 +17,21 @@
  #include 
  #include 
  #include 
+#include 
  #include 
+#include 
+#include 
  #include 
  
+struct umem_info {

+   uint64_t *buf; /* data buffer for usable-memory property */
+   uint32_t idx;  /* current index */
+   uint32_t size; /* size allocated for the data buffer */


Use kernel types please, u64, u32.


+   /* usable memory ranges to look up */
+   const struct crash_mem *umrngs;


"umrngs".

Given it's part of the umem_info struct could it just be "ranges"?


True. Actually, having crash_mem_range *ranges + u32 nr_ranges and 
populating them seems better. Will do that..



+   return NULL;
+   }


um_info->size = new_size;


+
+   memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);


Just pass __GFP_ZERO to krealloc?


There are patches submitted to stable fixing a few modules that use 
krealloc with __GFP_ZERO. Also, this zeroing is not really needed.

I will drop the memset instead..

Thanks
Hari


Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls

2020-07-28 Thread Hari Bathini




On 28/07/20 7:16 pm, Michael Ellerman wrote:

Hari Bathini  writes:

Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
to be filled with OPAL base & entry addresses respectively. Setting
these registers allows the kernel to perform OPAL calls before the
device tree is parsed.


I'm not convinced we want to do this.

If we do it becomes part of the kexec ABI and we have to honour it into
the future.

And in practice there are no non-development kernels built with OPAL early
debugging enabled, so it's not clear it actually helps anyone other than
developers.



Hmmm.. kexec-tools does it since commit d58ad564852c ("kexec/ppc64
Enable early kernel's OPAL calls") for kexec_load syscall. So, we would
be breaking kexec ABI either way, I guess.

Let me put this patch at the end of the series in the respin to let you
decide whether to have it or not..

Thanks
Hari


Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-28 Thread Nathan Lynch
Hi Laurent,

Laurent Dufour  writes:
> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>> The drmem lmb list can have hundreds of thousands of entries, and
>> unfortunately lookups take the form of linear searches. As long as
>> this is the case, traversals have the potential to monopolize the CPU
>> and provoke lockup reports, workqueue stalls, and the like unless
>> they explicitly yield.
>> 
>> Rather than placing cond_resched() calls within various
>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>> expression of the loop macro itself so users can't omit it.
>
> Is that not too much to call cond_resched() on every LMB?
>
> Could that be less frequent, every 10, or 100, I don't really know ?

Everything done within for_each_drmem_lmb is relatively heavyweight
already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
of milliseconds. I don't think cond_resched() is an expensive check in
this context.
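
For completeness, if yielding on every LMB ever did prove too frequent, the
iterator could be throttled along these lines (a hypothetical sketch only; the
extra "start" parameter and the interval of 16 are illustrative, not part of
the posted patch):

static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb,
					       const struct drmem_lmb *start)
{
	/* Yield only on every 16th LMB rather than on each iteration. */
	if ((++lmb - start) % 16 == 0)
		cond_resched();
	return lmb;
}

The for_each_drmem_lmb_in_range() macro would then need to pass its start
pointer through to the helper.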


Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault

2020-07-28 Thread Linus Torvalds
On Tue, Jul 28, 2020 at 3:53 AM Nicholas Piggin  wrote:
>
> The quirk is a problem with coprocessor where it's supposed to
> invalidate the translation after a fault but it doesn't, so we can get a
> read-only TLB stuck after something else does a RO->RW upgrade on the
> TLB. Something like that IIRC.  Coprocessors have their own MMU which
> lives in the nest not the core, so you need a global TLB flush to
> invalidate that thing.

So I assumed, but it does seem confused.

Why? Because if there are stale translations on the co-processor,
there's no guarantee that one of the CPU's will have them and take a
fault.

So I'm not seeing why a core CPU doing spurious TLB invalidation would
follow from "stale TLB in the Nest".

If anything, I think "we have a coprocessor that needs to never have
stale TLB entries" would impact the _regular_ TLB invalidates (by
update_mmu_cache()) and perhaps make those more aggressive, exactly
because the coprocessor may not handle the fault as gracefully.

I dunno. I don't know the coprocessor side well enough to judge, I'm
just looking at it from a conceptual standpoint.

  Linus


Re: [PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct

2020-07-28 Thread David Hildenbrand
On 28.07.20 18:53, Scott Cheloha wrote:
> At memory hot-remove time we can retrieve an LMB's nid from its
> corresponding memory_block.  There is no need to store the nid
> in multiple locations.
> 
> Note that lmb_to_memblock() uses find_memory_block() to get the
> corresponding memory_block.  As find_memory_block() runs in sub-linear
> time this approach is negligibly slower than what we do at present.
> 
> In exchange for this lookup at hot-remove time we no longer need to
> call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
> On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
> spares us an O(n^2) initialization during boot.
> 
> On systems with many LMBs that initialization overhead is palpable and
> disruptive.  For example, on a box with 249854 LMBs we're seeing
> drmem_init() take upwards of 30 seconds to complete:
> 
> [   53.721639] drmem: initializing drmem v2
> [   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! 
> [swapper/0:1]
> [   80.604377] Modules linked in:
> [   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
> [   80.604397] NIP:  c00a4980 LR: c00a4940 CTR: 
> 
> [   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
> [   80.604412] MSR:  82009033   CR: 
> 44000248  XER: 000d
> [   80.604431] CFAR: c00a4a38 IRQMASK: 0
> [   80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 
> c0003cfede30
> [   80.604431] GPR04:  c0f4095a 002f 
> 1000
> [   80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 
> c00c0002fdfb2001
> [   80.604431] GPR12:  c0001e8ec200
> [   80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0
> [   80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0
> [   80.604492] Call Trace:
> [   80.604498] [c0002dbff8493ac0] [c00a4940] 
> hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
> [   80.604509] [c0002dbff8493b20] [c0087c10] 
> memory_add_physaddr_to_nid+0x20/0x60
> [   80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0
> [   80.604530] [c0002dbff8493c10] [c0010154] 
> do_one_initcall+0x64/0x2c0
> [   80.604540] [c0002dbff8493ce0] [c10c4aa0] 
> kernel_init_freeable+0x2d8/0x3a0
> [   80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148
> [   80.604560] [c0002dbff8493e20] [c000b648] 
> ret_from_kernel_thread+0x5c/0x74
> [   80.604567] Instruction dump:
> [   80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 
> 7d094214
> [   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac 
> e949 7fbe5040
> [   89.047390] drmem: 249854 LMB(s)
> 
> With a patched kernel on the same machine we're no longer seeing the
> soft lockup.  drmem_init() now completes in negligible time, even when
> the LMB count is large.
> 

Yeah, as long as you call remove_memory() at memory block granularity, this
is guaranteed to work.

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-28 Thread Laurent Dufour

Le 28/07/2020 à 19:37, Nathan Lynch a écrit :

The drmem lmb list can have hundreds of thousands of entries, and
unfortunately lookups take the form of linear searches. As long as
this is the case, traversals have the potential to monopolize the CPU
and provoke lockup reports, workqueue stalls, and the like unless
they explicitly yield.

Rather than placing cond_resched() calls within various
for_each_drmem_lmb() loop blocks in the code, put it in the iteration
expression of the loop macro itself so users can't omit it.


Hi Nathan,

Is that not too much to call cond_resched() on every LMB?

Could that be less frequent, every 10, or 100, I don't really know ?

Cheers,
Laurent.


Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT 
format")
Signed-off-by: Nathan Lynch 
---
  arch/powerpc/include/asm/drmem.h | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..36d0ed04bda8 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -8,6 +8,8 @@
  #ifndef _ASM_POWERPC_LMB_H
  #define _ASM_POWERPC_LMB_H
  
+#include <linux/sched.h>

+
  struct drmem_lmb {
u64 base_addr;
u32 drc_index;
@@ -26,8 +28,14 @@ struct drmem_lmb_info {
  
  extern struct drmem_lmb_info *drmem_info;
  
+static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)

+{
+   cond_resched();
+   return ++lmb;
+}
+
  #define for_each_drmem_lmb_in_range(lmb, start, end)  \
-   for ((lmb) = (start); (lmb) < (end); (lmb)++)
+   for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
  
  #define for_each_drmem_lmb(lmb)	\

for_each_drmem_lmb_in_range((lmb),  \





[PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-28 Thread Nathan Lynch
The drmem lmb list can have hundreds of thousands of entries, and
unfortunately lookups take the form of linear searches. As long as
this is the case, traversals have the potential to monopolize the CPU
and provoke lockup reports, workqueue stalls, and the like unless
they explicitly yield.

Rather than placing cond_resched() calls within various
for_each_drmem_lmb() loop blocks in the code, put it in the iteration
expression of the loop macro itself so users can't omit it.

Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT 
format")
Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/drmem.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..36d0ed04bda8 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -8,6 +8,8 @@
 #ifndef _ASM_POWERPC_LMB_H
 #define _ASM_POWERPC_LMB_H
 
+#include <linux/sched.h>
+
 struct drmem_lmb {
u64 base_addr;
u32 drc_index;
@@ -26,8 +28,14 @@ struct drmem_lmb_info {
 
 extern struct drmem_lmb_info *drmem_info;
 
+static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
+{
+   cond_resched();
+   return ++lmb;
+}
+
 #define for_each_drmem_lmb_in_range(lmb, start, end)   \
-   for ((lmb) = (start); (lmb) < (end); (lmb)++)
+   for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb))
 
 #define for_each_drmem_lmb(lmb)\
for_each_drmem_lmb_in_range((lmb),  \
-- 
2.25.4



[PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct

2020-07-28 Thread Scott Cheloha
At memory hot-remove time we can retrieve an LMB's nid from its
corresponding memory_block.  There is no need to store the nid
in multiple locations.

Note that lmb_to_memblock() uses find_memory_block() to get the
corresponding memory_block.  As find_memory_block() runs in sub-linear
time this approach is negligibly slower than what we do at present.
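
As an illustration only (the helper name and exact shape below are assumptions,
not code from this patch), the hot-remove path can recover the node id roughly
like this:

/* Hypothetical sketch: look up the nid at hot-remove time instead of
 * caching it in struct drmem_lmb.
 */
static int lmb_to_nid(struct drmem_lmb *lmb)
{
	struct memory_block *mem = lmb_to_memblock(lmb);
	int nid;

	if (!mem)
		return NUMA_NO_NODE;

	nid = mem->nid;
	put_device(&mem->dev);	/* find_memory_block() took a reference */

	return nid;
}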

In exchange for this lookup at hot-remove time we no longer need to
call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
spares us an O(n^2) initialization during boot.

On systems with many LMBs that initialization overhead is palpable and
disruptive.  For example, on a box with 249854 LMBs we're seeing
drmem_init() take upwards of 30 seconds to complete:

[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c00a4980 LR: c00a4940 CTR: 
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  82009033   CR: 44000248  
XER: 000d
[   80.604431] CFAR: c00a4a38 IRQMASK: 0
[   80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 
c0003cfede30
[   80.604431] GPR04:  c0f4095a 002f 
1000
[   80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 
c00c0002fdfb2001
[   80.604431] GPR12:  c0001e8ec200
[   80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c00a4940] 
hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c0087c10] 
memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c10c4aa0] 
kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c000b648] 
ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 
7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 
7fbe5040
[   89.047390] drmem: 249854 LMB(s)

With a patched kernel on the same machine we're no longer seeing the
soft lockup.  drmem_init() now completes in negligible time, even when
the LMB count is large.

Signed-off-by: Scott Cheloha 
---
 arch/powerpc/include/asm/drmem.h  | 21 ---
 arch/powerpc/mm/drmem.c   |  6 +-
 .../platforms/pseries/hotplug-memory.c| 19 ++---
 3 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..34e4e9b257f5 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -13,9 +13,6 @@ struct drmem_lmb {
u32 drc_index;
u32 aa_index;
u32 flags;
-#ifdef CONFIG_MEMORY_HOTPLUG
-   int nid;
-#endif
 };
 
 struct drmem_lmb_info {
@@ -104,22 +101,4 @@ static inline void 
invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
lmb->aa_index = 0x;
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr);
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = -1;
-}
-#else
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-}
-#endif
-
 #endif /* _ASM_POWERPC_LMB_H */
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327cefbc6a..873fcfc7b875 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
if (!drmem_info->lmbs)
return;
 
-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb(lmb)
read_drconf_v1_cell(lmb, &prop);
-   lmb_set_nid(lmb);
-   }
 }
 
 static void __init init_drmem_v2_lmbs(const __be32 *prop)
@@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
 
lmb->aa_index = dr_cell.aa_index;
lmb->flags = dr_cell.flags;
-
-   lmb_set_nid(lmb);
}
}
 }
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5ace2f9a277e..7bf66fdcf916 100644
--- 

[PATCH] selftests/powerpc: return skip code for spectre_v2

2020-07-28 Thread Thadeu Lima de Souza Cascardo
When running under older versions of qemu of under newer versions with old
machine types, some security features will not be reported to the guest.
This will lead the guest OS to consider itself Vulnerable to spectre_v2.

So the spectre_v2 test fails in such cases: the host is mitigated, so the
branch mispredictions the test expects cannot be detected.

Make it return the skip code instead, for this particular case. We don't
want to miss the case when the test fails and the system reports as
mitigated or not affected. But it is not a problem to miss failures when
the system reports as Vulnerable.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 tools/testing/selftests/powerpc/security/spectre_v2.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/powerpc/security/spectre_v2.c 
b/tools/testing/selftests/powerpc/security/spectre_v2.c
index 8c6b982af2a8..d5445bfd63ed 100644
--- a/tools/testing/selftests/powerpc/security/spectre_v2.c
+++ b/tools/testing/selftests/powerpc/security/spectre_v2.c
@@ -183,6 +183,14 @@ int spectre_v2_test(void)
if (miss_percent > 15) {
printf("Branch misses > 15%% unexpected in this 
configuration!\n");
printf("Possible mis-match between reported & actual 
mitigation\n");
+   /* Such a mismatch may be caused by a guest system
+* reporting as vulnerable when the host is mitigated.
+* Return skip code to avoid detecting this as an
+* error. We are not vulnerable and reporting otherwise,
+* so missing such a mismatch is safe.
+*/
+   if (state == VULNERABLE)
+   return 4;
return 1;
}
break;
-- 
2.25.1



Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

2020-07-28 Thread Valentin Schneider


Hi,

On 27/07/20 06:32, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
>

So there's at least one arm64 platform out there with the same "pairs of
cores share L2" thing (Ampere eMAG), and that lives quite happily with the
default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
domain, and the whole system is covered by DIE.

Now arguably it's not a perfect representation; DIE doesn't have
SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
will impact all callsites using cpus_share_cache(): in the eMAG case, only
pairs of cores will be seen as sharing cache, even though *all* cores share
the same L3.

I'm trying to paint a picture of what the P9 topology looks like (the one
you showcase in your cover letter) to see if there are any similarities;
from what I gather in [1], wikichips and your cover letter, with P9 you can
have something like this in a single DIE (somewhat unsure about L3 setup;
it looks to be distributed?)

 +-+
 |  L3 |
 +---+-+---+-+---+-+---+
 |   L2  | |   L2  | |   L2  | |   L2  |
 +--+-+--+ +--+-+--+ +--+-+--+ +--+-+--+
 |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+
 |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+

Which would lead to (ignoring the whole SMT CPU numbering shenanigans)

NUMA [   ...
DIE  [ ]
MC   [ ] [ ] [ ] [ ]
BIGCORE  [ ] [ ] [ ] [ ]
SMT  [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
 00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  

This however has MC == BIGCORE; what makes it you can have different spans
for these two domains? If it's not too much to ask, I'd love to have a P9
topology diagram.

[1]: 20200722081822.gg9...@linux.vnet.ibm.com


Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved

2020-07-28 Thread Baoquan He
On 07/28/20 at 05:15pm, Mike Rapoport wrote:
> On Tue, Jul 28, 2020 at 07:02:54PM +0800, Baoquan He wrote:
> > On 07/28/20 at 08:11am, Mike Rapoport wrote:
> > > From: Mike Rapoport 
> > > 
> > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > > regions to set node ID in memblock.reserved and than traverses
> > > memblock.reserved to update reserved_nodemask to include node IDs that 
> > > were
> > > set in the first loop.
> > > 
> > > Remove redundant traversal over memblock.reserved and update
> > > reserved_nodemask while iterating over numa_meminfo.
> > > 
> > > Signed-off-by: Mike Rapoport 
> > > ---
> > >  arch/x86/mm/numa.c | 26 ++
> > >  1 file changed, 10 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 8ee952038c80..4078abd33938 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -498,31 +498,25 @@ static void __init 
> > > numa_clear_kernel_node_hotplug(void)
> > >* and use those ranges to set the nid in memblock.reserved.
> > >* This will split up the memblock regions along node
> > >* boundaries and will set the node IDs as well.
> > > +  *
> > > +  * The nid will also be set in reserved_nodemask which is later
> > > +  * used to clear MEMBLOCK_HOTPLUG flag.
> > > +  *
> > > +  * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> > > +  *   numa_meminfo might not include all memblock.reserved
> > > +  *   memory ranges, because quirks such as trim_snb_memory()
> > > +  *   reserve specific pages for Sandy Bridge graphics.
> > > +  *   These ranges will remain with nid == MAX_NUMNODES. ]
> > >*/
> > >   for (i = 0; i < numa_meminfo.nr_blks; i++) {
> > >   struct numa_memblk *mb = numa_meminfo.blk + i;
> > >   int ret;
> > >  
> > >   ret = memblock_set_node(mb->start, mb->end - mb->start, 
> > > , mb->nid);
> > > + node_set(mb->nid, reserved_nodemask);
> > 
> > Really? This will set every node id in reserved_nodemask. But in the
> > current code, the nid is set on the memblock reserved regions that
> > intersect with numa_meminfo, and only those nids are then set in
> > reserved_nodemask. That is quite different, to my understanding. Please
> > correct me if I am wrong.
> 
> You are right, I've missed the intersections of numa_meminfo with
> memblock.reserved.
> 
> x86 interaction with memblock is so, hmm, interesting...

Yeah, numa_clear_kernel_node_hotplug() intends to find every memory node
that has reserved memory and mark it as unmovable. Setting all node ids
in reserved_nodemask would break the use case of hot-removing hotpluggable
boot memory after system bootup.



Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved

2020-07-28 Thread Mike Rapoport
On Tue, Jul 28, 2020 at 07:02:54PM +0800, Baoquan He wrote:
> On 07/28/20 at 08:11am, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > regions to set node ID in memblock.reserved and than traverses
> > memblock.reserved to update reserved_nodemask to include node IDs that were
> > set in the first loop.
> > 
> > Remove redundant traversal over memblock.reserved and update
> > reserved_nodemask while iterating over numa_meminfo.
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >  arch/x86/mm/numa.c | 26 ++
> >  1 file changed, 10 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 8ee952038c80..4078abd33938 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -498,31 +498,25 @@ static void __init 
> > numa_clear_kernel_node_hotplug(void)
> >  * and use those ranges to set the nid in memblock.reserved.
> >  * This will split up the memblock regions along node
> >  * boundaries and will set the node IDs as well.
> > +*
> > +* The nid will also be set in reserved_nodemask which is later
> > +* used to clear MEMBLOCK_HOTPLUG flag.
> > +*
> > +* [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> > +*   numa_meminfo might not include all memblock.reserved
> > +*   memory ranges, because quirks such as trim_snb_memory()
> > +*   reserve specific pages for Sandy Bridge graphics.
> > +*   These ranges will remain with nid == MAX_NUMNODES. ]
> >  */
> > for (i = 0; i < numa_meminfo.nr_blks; i++) {
> > struct numa_memblk *mb = numa_meminfo.blk + i;
> > int ret;
> >  
> > ret = memblock_set_node(mb->start, mb->end - mb->start, 
> > &memblock.reserved, mb->nid);
> > +   node_set(mb->nid, reserved_nodemask);
> 
> Really? This will set every node id in reserved_nodemask. But in the
> current code, the nid is set on the memblock reserved regions that
> intersect with numa_meminfo, and only those nids are then set in
> reserved_nodemask. That is quite different, to my understanding. Please
> correct me if I am wrong.

You are right, I've missed the intersections of numa_meminfo with
memblock.reserved.

x86 interaction with memblock is so, hmm, interesting...
 
> Thanks
> Baoquan
> 
> > WARN_ON_ONCE(ret);
> > }
> >  
> > -   /*
> > -* Now go over all reserved memblock regions, to construct a
> > -* node mask of all kernel reserved memory areas.
> > -*
> > -* [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> > -*   numa_meminfo might not include all memblock.reserved
> > -*   memory ranges, because quirks such as trim_snb_memory()
> > -*   reserve specific pages for Sandy Bridge graphics. ]
> > -*/
> > -   for_each_memblock(reserved, mb_region) {
> > -   int nid = memblock_get_region_node(mb_region);
> > -
> > -   if (nid != MAX_NUMNODES)
> > -   node_set(nid, reserved_nodemask);
> > -   }
> > -
> > /*
> >  * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
> >  * belonging to the reserved node mask.
> > -- 
> > 2.26.2
> > 
> > 
> 

-- 
Sincerely yours,
Mike.


Re: [RESEND PATCH v5 08/11] ppc64/kexec_file: setup backup region for kdump kernel

2020-07-28 Thread Michael Ellerman
Hari Bathini  writes:
> diff --git a/arch/powerpc/kexec/file_load_64.c 
> b/arch/powerpc/kexec/file_load_64.c
> index a5c1442590b2..88408b17a7f6 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -697,6 +699,69 @@ static int update_usable_mem_fdt(void *fdt, struct 
> crash_mem *usable_mem)
>   return ret;
>  }
>  
> +/**
> + * load_backup_segment - Locate a memory hole to place the backup region.
> + * @image:   Kexec image.
> + * @kbuf:Buffer contents and memory parameters.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int load_backup_segment(struct kimage *image, struct kexec_buf *kbuf)
> +{
> + void *buf;
> + int ret;
> +
> + /* Setup a segment for backup region */
> + buf = vzalloc(BACKUP_SRC_SIZE);

This worried me initially, because we can't copy from physically
discontiguous pages in real mode.

But as you explained this buffer is not used for copying.

I think if you move the large comment below up here, it would be
clearer.


> diff --git a/arch/powerpc/purgatory/trampoline_64.S 
> b/arch/powerpc/purgatory/trampoline_64.S
> index 464af8e8a4cb..d4b52961f592 100644
> --- a/arch/powerpc/purgatory/trampoline_64.S
> +++ b/arch/powerpc/purgatory/trampoline_64.S
> @@ -43,14 +44,38 @@ master:
>   mr  %r17,%r3/* save cpu id to r17 */
>   mr  %r15,%r4/* save physical address in reg15 */
>  
> + bl  0f  /* Work out where we're running */
> +0:   mflr%r18

I know you just moved it, but this should use:

bcl 20, 31, $+4
mflr%r18

Which is a special form of branch and link that doesn't unbalance the
link stack in the chip.

> + /*
> +  * Copy BACKUP_SRC_SIZE bytes from BACKUP_SRC_START to
> +  * backup_start 8 bytes at a time.
> +  *
> +  * Use r3 = dest, r4 = src, r5 = size, r6 = count
> +  */
> + ld  %r3,(backup_start - 0b)(%r18)
> + cmpdi   %cr0,%r3,0

I prefer spaces or tabs between arguments, eg:

cmpdi   %cr0, %r3, 0

> + beq 80f /* skip if there is no backup region */

Local labels will make this clearer I think. eg:

beq .Lskip_copy

> + lis %r5,BACKUP_SRC_SIZE@h
> + ori %r5,%r5,BACKUP_SRC_SIZE@l
> + cmpdi   %cr0,%r5,0
> + beq 80f /* skip if copy size is zero */
> + lis %r4,BACKUP_SRC_START@h
> + ori %r4,%r4,BACKUP_SRC_START@l
> + li  %r6,0
> +70:

.Lcopy_loop:

> + ldx %r0,%r6,%r4
> + stdx%r0,%r6,%r3
> + addi%r6,%r6,8
> + cmpld   %cr0,%r6,%r5
> + blt 70b

blt .Lcopy_loop

> +

.Lskip_copy:

> +80:
>   or  %r3,%r3,%r3 /* ok now to high priority, lets boot */
>   lis %r6,0x1
>   mtctr   %r6 /* delay a bit for slaves to catch up */
>   bdnz.   /* before we overwrite 0-100 again */


cheers


Re: [RESEND PATCH v5 07/11] ppc64/kexec_file: enable early kernel's OPAL calls

2020-07-28 Thread Michael Ellerman
Hari Bathini  writes:
> Kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
> to be filled with OPAL base & entry addresses respectively. Setting
> these registers allows the kernel to perform OPAL calls before the
> device tree is parsed.

I'm not convinced we want to do this.

If we do it becomes part of the kexec ABI and we have to honour it into
the future.

And in practice there are no non-development kernels built with OPAL early
debugging enabled, so it's not clear it actually helps anyone other than
developers.

cheers

> v4 -> v5:
> * New patch. Updated opal_base & opal_entry values in r8 & r9 respectively.
>   This change was part of the below dropped patch in v4:
> - https://lore.kernel.org/patchwork/patch/1275667/
>
>
>  arch/powerpc/kexec/file_load_64.c  |   16 
>  arch/powerpc/purgatory/trampoline_64.S |   15 +++
>  2 files changed, 31 insertions(+)
>
> diff --git a/arch/powerpc/kexec/file_load_64.c 
> b/arch/powerpc/kexec/file_load_64.c
> index 8df085a22fd7..a5c1442590b2 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -713,6 +713,8 @@ int setup_purgatory_ppc64(struct kimage *image, const 
> void *slave_code,
> const void *fdt, unsigned long kernel_load_addr,
> unsigned long fdt_load_addr)
>  {
> + struct device_node *dn = NULL;
> + uint64_t val;
>   int ret;
>  
>   ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
> @@ -735,9 +737,23 @@ int setup_purgatory_ppc64(struct kimage *image, const 
> void *slave_code,
>   goto out;
>   }
>  
> + /* Setup OPAL base & entry values */
> + dn = of_find_node_by_path("/ibm,opal");
> + if (dn) {
> + of_property_read_u64(dn, "opal-base-address", );
> + ret = kexec_purgatory_get_set_symbol(image, "opal_base", ,
> +  sizeof(val), false);
> + if (ret)
> + goto out;
> +
> + of_property_read_u64(dn, "opal-entry-address", );
> + ret = kexec_purgatory_get_set_symbol(image, "opal_entry", ,
> +  sizeof(val), false);
> + }
>  out:
>   if (ret)
>   pr_err("Failed to setup purgatory symbols");
> + of_node_put(dn);
>   return ret;
>  }
>  
> diff --git a/arch/powerpc/purgatory/trampoline_64.S 
> b/arch/powerpc/purgatory/trampoline_64.S
> index a5a83c3f53e6..464af8e8a4cb 100644
> --- a/arch/powerpc/purgatory/trampoline_64.S
> +++ b/arch/powerpc/purgatory/trampoline_64.S
> @@ -61,6 +61,10 @@ master:
>   li  %r4,28
>   STWX_BE %r17,%r3,%r4/* Store my cpu as __be32 at byte 28 */
>  1:
> + /* Load opal base and entry values in r8 & r9 respectively */
> + ld  %r8,(opal_base - 0b)(%r18)
> + ld  %r9,(opal_entry - 0b)(%r18)
> +
>   /* load the kernel address */
>   ld  %r4,(kernel - 0b)(%r18)
>  
> @@ -102,6 +106,17 @@ dt_offset:
>   .8byte  0x0
>   .size dt_offset, . - dt_offset
>  
> + .balign 8
> + .globl opal_base
> +opal_base:
> + .8byte  0x0
> + .size opal_base, . - opal_base
> +
> + .balign 8
> + .globl opal_entry
> +opal_entry:
> + .8byte  0x0
> + .size opal_entry, . - opal_entry
>  
>   .data
>   .balign 8


Re: [RESEND PATCH v5 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel

2020-07-28 Thread Michael Ellerman
Hari Bathini  writes:
> diff --git a/arch/powerpc/kexec/file_load_64.c 
> b/arch/powerpc/kexec/file_load_64.c
> index 2df6f4273ddd..8df085a22fd7 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -17,9 +17,21 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> +#include 
> +#include 
>  #include 
>  
> +struct umem_info {
> + uint64_t *buf; /* data buffer for usable-memory property */
> + uint32_t idx;  /* current index */
> + uint32_t size; /* size allocated for the data buffer */

Use kernel types please, u64, u32.

> + /* usable memory ranges to look up */
> + const struct crash_mem *umrngs;

"umrngs".

Given it's part of the umem_info struct could it just be "ranges"?

> +};
> +
>  const struct kexec_file_ops * const kexec_file_loaders[] = {
&kexec_elf64_ops,
>   NULL
> @@ -74,6 +86,42 @@ static int get_exclude_memory_ranges(struct crash_mem 
> **mem_ranges)
>   return ret;
>  }
>  
> +/**
> + * get_usable_memory_ranges - Get usable memory ranges. This list includes
> + *regions like crashkernel, opal/rtas & 
> tce-table,
> + *that kdump kernel could use.
> + * @mem_ranges:   Range list to add the memory ranges to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int get_usable_memory_ranges(struct crash_mem **mem_ranges)
> +{
> + int ret;
> +
> + /*
> +  * prom code doesn't take kindly to missing low memory. So, add

I don't know what that's referring to, "prom code" is too vague.

> +  * [0, crashk_res.end] instead of [crashk_res.start, crashk_res.end]
> +  * to keep it happy.
> +  */
> + ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1);
> + if (ret)
> + goto out;
> +
> + ret = add_rtas_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_opal_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_tce_mem_ranges(mem_ranges);
> +out:
> + if (ret)
> + pr_err("Failed to setup usable memory ranges\n");
> + return ret;
> +}
> +
>  /**
>   * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
>   *  in the memory regions between buf_min & 
> buf_max
> @@ -273,6 +321,382 @@ static int locate_mem_hole_bottom_up_ppc64(struct 
> kexec_buf *kbuf,
>   return ret;
>  }
>  
> +/**
> + * check_realloc_usable_mem - Reallocate buffer if it can't accommodate 
> entries
> + * @um_info:  Usable memory buffer and ranges info.
> + * @cnt:  No. of entries to accommodate.
> + *
> + * Frees up the old buffer if memory reallocation fails.
> + *
> + * Returns buffer on success, NULL on error.
> + */
> +static uint64_t *check_realloc_usable_mem(struct umem_info *um_info, int cnt)
> +{
> + void *tbuf;
> +
> + if (um_info->size >=
> + ((um_info->idx + cnt) * sizeof(*(um_info->buf))))
> + return um_info->buf;

This is awkward.

AFAICS you only use um_info->size here, so instead why not store the
number of u64s you have space for, as num for example.

Then the above comparison becomes:

if (um_info->num >= (um_info->idx + count))

Then you only have to calculate the size internally here for the
realloc.

> +
> + um_info->size += MEM_RANGE_CHUNK_SZ;

new_size = um_info->size + MEM_RANGE_CHUNK_SZ;
tbuf = krealloc(um_info->buf, new_size, GFP_KERNEL);

> + tbuf = krealloc(um_info->buf, um_info->size, GFP_KERNEL);
> + if (!tbuf) {
> + um_info->size -= MEM_RANGE_CHUNK_SZ;

Then you can drop this.

> + return NULL;
> + }

um_info->size = new_size;

> +
> + memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);

Just pass __GFP_ZERO to krealloc?

> + return tbuf;
> +}
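
Pulling these suggestions together, the helper might end up looking something
like the sketch below (the "num" member and the overall shape are assumptions,
not code from the series; whether the newly allocated space still needs
zeroing is discussed elsewhere in the thread):

/* Sketch only: um_info->num is assumed to track how many u64 slots fit in buf. */
static u64 *check_realloc_usable_mem(struct umem_info *um_info, int cnt)
{
	u32 new_size;
	u64 *tbuf;

	if ((um_info->idx + cnt) <= um_info->num)
		return um_info->buf;

	new_size = um_info->size + MEM_RANGE_CHUNK_SZ;
	tbuf = krealloc(um_info->buf, new_size, GFP_KERNEL);
	if (!tbuf)
		return NULL;

	um_info->size = new_size;
	um_info->num = new_size / sizeof(u64);
	um_info->buf = tbuf;

	return tbuf;
}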
> +
> +/**
> + * add_usable_mem - Add the usable memory ranges within the given memory 
> range
> + *  to the buffer
> + * @um_info:Usable memory buffer and ranges info.
> + * @base:   Base address of memory range to look for.
> + * @end:End address of memory range to look for.
> + * @cnt:No. of usable memory ranges added to buffer.

One caller never uses this AFAICS.

Couldn't the other caller just compare the um_info->idx before and after
the call, and avoid another pass by reference parameter.

> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int add_usable_mem(struct umem_info *um_info, uint64_t base,
> +   uint64_t end, int *cnt)
> +{
> + uint64_t loc_base, loc_end, *buf;
> + const struct crash_mem *umrngs;
> + int i, add;

add should be bool.

> + *cnt = 0;
> + umrngs = um_info->umrngs;
> + for (i = 0; i < umrngs->nr_ranges; i++) {
> + add = 0;
> + loc_base = umrngs->ranges[i].start;
> + 

Re: [PATCH v3 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states

2020-07-28 Thread Pratik Sampat

Hello Rafael,


On 27/07/20 7:12 pm, Rafael J. Wysocki wrote:

On Tue, Jul 21, 2020 at 2:43 PM Pratik Rajesh Sampat
 wrote:

Fire directed smp_call_function_single IPIs from a specified source
CPU to the specified target CPU to reduce the noise we have to wade
through in the trace log.

And what's the purpose of it?


The idea for this comes from the fact that estimating wake-up
latencies and residencies for stop states is not an easy task.

The purpose is essentially to determine wakeup latencies, that are
caused by either, an IPI or a timer and compare with the advertised
wakeup latencies for each stop state.

This might help in determining the accuracy of our advertised values
and/or if they need any re-calibration.


The module is based on the idea written by Srivatsa Bhat and maintained
by Vaidyanathan Srinivasan internally.

Queue HR timer and measure jitter. Wakeup latency measurement for idle
states using hrtimer.  Echo a value in ns to timer_test_function and
watch trace. A HRtimer will be queued and when it fires the expected
wakeup vs actual wakeup is computes and delay printed in ns.

Implemented as a module which utilizes debugfs so that it can be
integrated with selftests.

To include the module, check option and include as module
kernel hacking -> Cpuidle latency selftests

[srivatsa.b...@linux.vnet.ibm.com: Initial implementation in
  cpidle/sysfs]

[sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
  and fix some of the time calculation]

[e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and
  increase the resolution of IPI wakeup]

Signed-off-by: Pratik Rajesh Sampat 
Reviewed-by: Gautham R. Shenoy 
---
  drivers/cpuidle/Makefile   |   1 +
  drivers/cpuidle/test-cpuidle_latency.c | 150 +
  lib/Kconfig.debug  |  10 ++
  3 files changed, 161 insertions(+)
  create mode 100644 drivers/cpuidle/test-cpuidle_latency.c

diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index f07800cbb43f..2ae05968078c 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
  obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
  obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
  obj-$(CONFIG_HALTPOLL_CPUIDLE)   += cpuidle-haltpoll.o
+obj-$(CONFIG_IDLE_LATENCY_SELFTEST)  += test-cpuidle_latency.o

  
##
  # ARM SoC drivers
diff --git a/drivers/cpuidle/test-cpuidle_latency.c 
b/drivers/cpuidle/test-cpuidle_latency.c
new file mode 100644
index ..61574665e972
--- /dev/null
+++ b/drivers/cpuidle/test-cpuidle_latency.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Module-based API test facility for cpuidle latency using IPIs and timers

I'd like to see a more detailed description of what it does and how it
works here.


Right, I'll add that.
Based on comments from Daniel I have also been working on a
user-space only variant of this test as that does seem like
a better way to go.

The only downside is that the latency will be higher, but as we are
taking baseline measurements the diff of that from our observed reading
should still remain the same. Just that the test will take longer to run.
I'm yet to accurately confirm this.

I would appreciate your thoughts on that.
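
To make that concrete, here is a rough user-space sketch (purely illustrative,
not part of the series) of a timer-based wakeup measurement. The overshoot
beyond the requested sleep approximates wakeup latency plus scheduling
overhead, and only the relative numbers per idle state matter:

#include <stdio.h>
#include <time.h>

static long long ns_since(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000000LL + (b->tv_nsec - a->tv_nsec);
}

int main(void)
{
	struct timespec start, end;
	struct timespec req = { .tv_sec = 0, .tv_nsec = 100 * 1000 };	/* 100 us */

	clock_gettime(CLOCK_MONOTONIC, &start);
	nanosleep(&req, NULL);		/* CPU is expected to enter an idle state */
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("wakeup overshoot: %lld ns\n", ns_since(&start, &end) - req.tv_nsec);
	return 0;
}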


+ */
+
+#include 
+#include 
+#include 
+
+/* IPI based wakeup latencies */
+struct latency {
+   unsigned int src_cpu;
+   unsigned int dest_cpu;
+   ktime_t time_start;
+   ktime_t time_end;
+   u64 latency_ns;
+} ipi_wakeup;
+
+static void measure_latency(void *info)
+{
+   struct latency *v;
+   ktime_t time_diff;
+
+   v = (struct latency *)info;
+   v->time_end = ktime_get();
+   time_diff = ktime_sub(v->time_end, v->time_start);
+   v->latency_ns = ktime_to_ns(time_diff);
+}
+
+void run_smp_call_function_test(unsigned int cpu)
+{
+   ipi_wakeup.src_cpu = smp_processor_id();
+   ipi_wakeup.dest_cpu = cpu;
+   ipi_wakeup.time_start = ktime_get();
+   smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
+}
+
+/* Timer based wakeup latencies */
+struct timer_data {
+   unsigned int src_cpu;
+   u64 timeout;
+   ktime_t time_start;
+   ktime_t time_end;
+   struct hrtimer timer;
+   u64 timeout_diff_ns;
+} timer_wakeup;
+
+static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
+{
+   struct timer_data *w;
+   ktime_t time_diff;
+
+   w = container_of(hrtimer, struct timer_data, timer);
+   w->time_end = ktime_get();
+
+   time_diff = ktime_sub(w->time_end, w->time_start);
+   time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
+   w->timeout_diff_ns = ktime_to_ns(time_diff);
+   return HRTIMER_NORESTART;
+}
+
+static void run_timer_test(unsigned int ns)
+{
+   
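
For completeness, a minimal sketch of how run_timer_test() could arm the
hrtimer, based on the timer_data structure and the timer_called() callback
above (an assumption about the intent, not the patch's actual body):

static void run_timer_test(unsigned int ns)
{
	/* Record who armed the timer and the requested timeout */
	timer_wakeup.src_cpu = smp_processor_id();
	timer_wakeup.timeout = ns;

	hrtimer_init(&timer_wakeup.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	timer_wakeup.timer.function = timer_called;

	/* time_start is compared against time_end in timer_called() */
	timer_wakeup.time_start = ktime_get();
	hrtimer_start(&timer_wakeup.timer, ns_to_ktime(ns), HRTIMER_MODE_REL);
}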

[PATCH v4 3/3] powerpc test_emulate_step: add testcases for divde[.] and divdeu[.] instructions

2020-07-28 Thread Balamuruhan S
add testcases for divde, divde., divdeu, divdeu. emulated
instructions to cover a few scenarios:
* with the same dividend and divisor to have undefined RT
  for divdeu[.]
* with divide by zero to have undefined RT for both
  divde[.] and divdeu[.]
* with a negative dividend to cover -|divisor| < r <= 0 when
  the dividend is negative for divde[.]
* normal case with proper dividend and divisor for both
  divde[.] and divdeu[.]

Reviewed-by: Sandipan Das 
Signed-off-by: Balamuruhan S 
Acked-by: Naveen N. Rao 
---
 arch/powerpc/lib/test_emulate_step.c | 156 +++
 1 file changed, 156 insertions(+)

diff --git a/arch/powerpc/lib/test_emulate_step.c 
b/arch/powerpc/lib/test_emulate_step.c
index d242e9f72e0c..0a201b771477 100644
--- a/arch/powerpc/lib/test_emulate_step.c
+++ b/arch/powerpc/lib/test_emulate_step.c
@@ -1019,6 +1019,162 @@ static struct compute_test compute_tests[] = {
}
}
},
+   {
+   .mnemonic = "divde",
+   .subtests = {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   }
+   },
+   {
+   .descr = "RA = 1L, RB = 0",
+   .instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)),
+   .flags = IGNORE_GPR(20),
+   .regs = {
+   .gpr[21] = 1L,
+   .gpr[22] = 0,
+   }
+   },
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MAX",
+   .instr = ppc_inst(PPC_RAW_DIVDE(20, 21, 22)),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MAX,
+   }
+   }
+   }
+   },
+   {
+   .mnemonic = "divde.",
+   .subtests = {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 
22)),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   }
+   },
+   {
+   .descr = "RA = 1L, RB = 0",
+   .instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 
22)),
+   .flags = IGNORE_GPR(20),
+   .regs = {
+   .gpr[21] = 1L,
+   .gpr[22] = 0,
+   }
+   },
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MAX",
+   .instr = ppc_inst(PPC_RAW_DIVDE_DOT(20, 21, 
22)),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MAX,
+   }
+   }
+   }
+   },
+   {
+   .mnemonic = "divdeu",
+   .subtests = {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+   .flags = IGNORE_GPR(20),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   }
+   },
+   {
+   .descr = "RA = 1L, RB = 0",
+   .instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+   .flags = IGNORE_GPR(20),
+   .regs = {
+   .gpr[21] = 1L,
+   .gpr[22] = 0,
+   }
+   },
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MAX",
+   .instr = ppc_inst(PPC_RAW_DIVDEU(20, 21, 22)),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+  

[PATCH v4 2/3] powerpc sstep: add support for divde[.] and divdeu[.] instructions

2020-07-28 Thread Balamuruhan S
This patch adds emulation support for the divde and divdeu instructions:
* Divide Doubleword Extended (divde[.])
* Divide Doubleword Extended Unsigned (divdeu[.])
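
For readers unfamiliar with these instructions, a rough reference model of
the semantics as I read them (an illustration, not part of the patch): the
dividend is GPR[RA] extended on the right with 64 zero bits, i.e. RA * 2^64,
divided by GPR[RB]; the result is undefined when RB is zero or when the
quotient does not fit in 64 bits, which is what the IGNORE_GPR cases in the
testcases patch rely on.

/* Illustrative C model of divde/divdeu (relies on the GCC/Clang __int128
 * extension); not part of the patch. */
#include <stdint.h>

static int64_t divde_ref(int64_t ra, int64_t rb)
{
	/* (RA * 2^64) / RB, signed; undefined if rb == 0 or the quotient overflows */
	__int128 dividend = (__int128)ra * ((__int128)1 << 64);
	return (int64_t)(dividend / rb);
}

static uint64_t divdeu_ref(uint64_t ra, uint64_t rb)
{
	/* (RA * 2^64) / RB, unsigned; undefined if rb == 0 or the quotient overflows */
	unsigned __int128 dividend = (unsigned __int128)ra << 64;
	return (uint64_t)(dividend / rb);
}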

Reviewed-by: Sandipan Das 
Signed-off-by: Balamuruhan S 
Acked-by: Naveen N. Rao 
---
 arch/powerpc/lib/sstep.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index c58ea9e787cb..caee8cc77e19 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1806,7 +1806,18 @@ int analyse_instr(struct instruction_op *op, const 
struct pt_regs *regs,
op->val = (int) regs->gpr[ra] /
(int) regs->gpr[rb];
goto arith_done;
-
+#ifdef __powerpc64__
+   case 425:   /* divde[.] */
+   asm volatile(PPC_DIVDE(%0, %1, %2) :
+   "=r" (op->val) : "r" (regs->gpr[ra]),
+   "r" (regs->gpr[rb]));
+   goto arith_done;
+   case 393:   /* divdeu[.] */
+   asm volatile(PPC_DIVDEU(%0, %1, %2) :
+   "=r" (op->val) : "r" (regs->gpr[ra]),
+   "r" (regs->gpr[rb]));
+   goto arith_done;
+#endif
case 755:   /* darn */
if (!cpu_has_feature(CPU_FTR_ARCH_300))
return -1;
-- 
2.24.1



[PATCH v4 1/3] powerpc ppc-opcode: add divde and divdeu opcodes

2020-07-28 Thread Balamuruhan S
include instruction opcodes for divde and divdeu as macros.

Reviewed-by: Sandipan Das 
Signed-off-by: Balamuruhan S 
Acked-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ppc-opcode.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 4c0bdafb6a7b..a6e3700c4566 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -466,6 +466,10 @@
 #define PPC_RAW_MULI(d, a, i)  (0x1c000000 | ___PPC_RT(d) | 
___PPC_RA(a) | IMM_L(i))
 #define PPC_RAW_DIVWU(d, a, b) (0x7c000396 | ___PPC_RT(d) | 
___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_RAW_DIVDU(d, a, b) (0x7c000392 | ___PPC_RT(d) | 
___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_RAW_DIVDE(t, a, b) (0x7c000352 | ___PPC_RT(t) | 
___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_RAW_DIVDE_DOT(t, a, b) (0x7c000352 | ___PPC_RT(t) | 
___PPC_RA(a) | ___PPC_RB(b) | 0x1)
+#define PPC_RAW_DIVDEU(t, a, b)(0x7c000312 | ___PPC_RT(t) | 
___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_RAW_DIVDEU_DOT(t, a, b)(0x7c000312 | ___PPC_RT(t) | 
___PPC_RA(a) | ___PPC_RB(b) | 0x1)
 #define PPC_RAW_AND(d, a, b)   (0x7c000038 | ___PPC_RA(d) | 
___PPC_RS(a) | ___PPC_RB(b))
 #define PPC_RAW_ANDI(d, a, i)  (0x70000000 | ___PPC_RA(d) | 
___PPC_RS(a) | IMM_L(i))
 #define PPC_RAW_AND_DOT(d, a, b)   (0x7c000039 | ___PPC_RA(d) | 
___PPC_RS(a) | ___PPC_RB(b))
@@ -510,6 +514,8 @@
 #define PPC_DARN(t, l) stringify_in_c(.long PPC_RAW_DARN(t, l))
 #define PPC_DCBAL(a, b) stringify_in_c(.long PPC_RAW_DCBAL(a, 
b))
 #define PPC_DCBZL(a, b) stringify_in_c(.long PPC_RAW_DCBZL(a, 
b))
+#define PPC_DIVDE(t, a, b)  stringify_in_c(.long PPC_RAW_DIVDE(t, 
a, b))
+#define PPC_DIVDEU(t, a, b) stringify_in_c(.long PPC_RAW_DIVDEU(t, 
a, b))
 #define PPC_LQARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LQARX(t, a, b, eh))
 #define PPC_LDARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LDARX(t, a, b, eh))
 #define PPC_LWARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LWARX(t, a, b, eh))
-- 
2.24.1



[PATCH v4 0/3] Add support for divde[.] and divdeu[.] instruction emulation

2020-07-28 Thread Balamuruhan S
Hi All,

This patchset adds support to emulate the divde, divde., divdeu and divdeu.
instructions, along with testcases for them.

Resend v4: rebased on latest powerpc next branch

Changes in v4:
-
Fix review comments from Naveen,
* use TEST_DIVDEU() instead of the wrongly used TEST_DIVDEU_DOT() in the
  divdeu testcase.
* Include `acked-by` tag from Naveen for the series.
* Rebase it on latest mpe's merge tree.

Changes in v3:
-
* Fix suggestion from Sandipan to remove `PPC_INST_DIVDE_DOT` and
  `PPC_INST_DIVDEU_DOT` opcode macros defined in ppc-opcode.h, reuse
  `PPC_INST_DIVDE` and `PPC_INST_DIVDEU` in test_emulate_step.c to
  derive them respectively.

Changes in v2:
-
* Fix review comments from Paul to make divde_dot and divdeu_dot simple
  by using divde and divdeu, then goto `arith_done` instead of
  `compute_done`.
* Include `Reviewed-by` tag from Sandipan Das.
* Rebase with recent mpe's merge tree.

I would appreciate your review and suggestions for making it better.

Boot Log:

:: ::
:: ::
291494043: (291493996): [0.352649][T1] emulate_step_test: divde 
 : RA = LONG_MIN, RB = LONG_MIN   PASS
291517665: (291517580): [0.352695][T1] emulate_step_test: divde 
 : RA = 1L, RB = 0PASS
291541357: (291541234): [0.352742][T1] emulate_step_test: divde 
 : RA = LONG_MIN, RB = LONG_MAX   PASS
291565107: (291564946): [0.352788][T1] emulate_step_test: divde.
 : RA = LONG_MIN, RB = LONG_MIN   PASS
291588757: (291588558): [0.352834][T1] emulate_step_test: divde.
 : RA = 1L, RB = 0PASS
291612477: (291612240): [0.352881][T1] emulate_step_test: divde.
 : RA = LONG_MIN, RB = LONG_MAX   PASS
291636201: (291635926): [0.352927][T1] emulate_step_test: divdeu
 : RA = LONG_MIN, RB = LONG_MIN   PASS
291659830: (291659517): [0.352973][T1] emulate_step_test: divdeu
 : RA = 1L, RB = 0PASS
291683529: (291683178): [0.353019][T1] emulate_step_test: divdeu
 : RA = LONG_MIN, RB = LONG_MAX   PASS
291707248: (291706859): [0.353066][T1] emulate_step_test: divdeu
 : RA = LONG_MAX - 1, RB = LONG_MAX   PASS
291730962: (291730535): [0.353112][T1] emulate_step_test: divdeu
 : RA = LONG_MIN + 1, RB = LONG_MIN   PASS
291754714: (291754249): [0.353158][T1] emulate_step_test: divdeu.   
 : RA = LONG_MIN, RB = LONG_MIN   PASS
291778371: (291777868): [0.353205][T1] emulate_step_test: divdeu.   
 : RA = 1L, RB = 0PASS
291802098: (291801557): [0.353251][T1] emulate_step_test: divdeu.   
 : RA = LONG_MIN, RB = LONG_MAX   PASS
291825844: (291825265): [0.353297][T1] emulate_step_test: divdeu.   
 : RA = LONG_MAX - 1, RB = LONG_MAX   PASS
291849586: (291848969): [0.353344][T1] emulate_step_test: divdeu.   
 : RA = LONG_MIN + 1, RB = LONG_MIN   PASS
:: ::
:: ::
292520225: (292519608): [0.354654][T1] registered taskstats version 1
292584751: (292584134): [0.354780][T1] pstore: Using crash dump 
compression: deflate
296454422: (296453805): [0.362338][T1] Freeing unused kernel memory: 
1408K
296467838: (296467221): [0.362364][T1] This architecture does not have 
kernel memory protection.
296485387: (296484770): [0.362398][T1] Run /init as init process
297987339: (297986761): [0.365332][   T46] mount (46) used greatest stack 
depth: 12512 bytes left
298889548: (29992): [0.367094][   T47] mount (47) used greatest stack 
depth: 11824 bytes left

355356256: (355355821): Welcome to Buildroot
355376898: (355376463): buildroot login:

Balamuruhan S (3):
  powerpc ppc-opcode: add divde and divdeu opcodes
  powerpc sstep: add support for divde[.] and divdeu[.] instructions
  powerpc test_emulate_step: add testcases for divde[.] and divdeu[.]
instructions

 arch/powerpc/include/asm/ppc-opcode.h |   6 +
 arch/powerpc/lib/sstep.c  |  13 ++-
 arch/powerpc/lib/test_emulate_step.c  | 156 ++
 3 files changed, 174 insertions(+), 1 deletion(-)


base-commit: 7a9912e4cf048b607c8fafcfbdca750f1d78
-- 
2.24.1



Re: [RESEND PATCH v5 03/11] powerpc/kexec_file: add helper functions for getting memory ranges

2020-07-28 Thread Michael Ellerman
Hi Hari,

Some comments inline ...

Hari Bathini  writes:
> diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c
> new file mode 100644
> index 000000000000..21bea1b78443
> --- /dev/null
> +++ b/arch/powerpc/kexec/ranges.c
> @@ -0,0 +1,417 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * powerpc code to implement the kexec_file_load syscall
> + *
> + * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
> + * Copyright (C) 2004  IBM Corp.
> + * Copyright (C) 2004,2005  Milton D Miller II, IBM Corporation
> + * Copyright (C) 2005  R Sharada (shar...@in.ibm.com)
> + * Copyright (C) 2006  Mohan Kumar M (mo...@in.ibm.com)
> + * Copyright (C) 2020  IBM Corporation
> + *
> + * Based on kexec-tools' kexec-ppc64.c, fs2dt.c.
> + * Heavily modified for the kernel by
> + * Hari Bathini .

Please just use your name; email addresses bit-rot. It's in the commit
log anyway.

> + */
> +
> +#undef DEBUG
^
Don't do that in new code please.

> +#define pr_fmt(fmt) "kexec ranges: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * get_max_nr_ranges - Get the max no. of ranges crash_mem structure
> + * could hold, given the size allocated for it.
> + * @size:  Allocation size of crash_mem structure.
> + *
> + * Returns the maximum no. of ranges.
> + */
> +static inline unsigned int get_max_nr_ranges(size_t size)
> +{
> + return ((size - sizeof(struct crash_mem)) /
> + sizeof(struct crash_mem_range));
> +}
> +
> +/**
> + * get_mem_rngs_size - Get the allocated size of mrngs based on
> + * max_nr_ranges and chunk size.
> + * @mrngs: Memory ranges.

mrngs is not a great name, what about memory_ranges or ranges?

Ditto everywhere else you use mrngs.

> + *
> + * Returns the maximum size of @mrngs.
> + */
> +static inline size_t get_mem_rngs_size(struct crash_mem *mrngs)
> +{
> + size_t size;
> +
> + if (!mrngs)
> + return 0;
> +
> + size = (sizeof(struct crash_mem) +
> + (mrngs->max_nr_ranges * sizeof(struct crash_mem_range)));
> +
> + /*
> +  * Memory is allocated in size multiple of MEM_RANGE_CHUNK_SZ.
> +  * So, align to get the actual length.
> +  */
> + return ALIGN(size, MEM_RANGE_CHUNK_SZ);
> +}
> +
> +/**
> + * __add_mem_range - add a memory range to memory ranges list.
> + * @mem_ranges:  Range list to add the memory range to.
> + * @base:Base address of the range to add.
> + * @size:Size of the memory range to add.
> + *
> + * (Re)allocates memory, if needed.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int __add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size)
> +{
> + struct crash_mem *mrngs = *mem_ranges;
> +
> + if ((mrngs == NULL) || (mrngs->nr_ranges == mrngs->max_nr_ranges)) {

(mrngs == NULL) should just be !mrngs.

> + mrngs = realloc_mem_ranges(mem_ranges);
> + if (!mrngs)
> + return -ENOMEM;
> + }
> +
> + mrngs->ranges[mrngs->nr_ranges].start = base;
> + mrngs->ranges[mrngs->nr_ranges].end = base + size - 1;
> + pr_debug("Added memory range [%#016llx - %#016llx] at index %d\n",
> +  base, base + size - 1, mrngs->nr_ranges);
> + mrngs->nr_ranges++;
> + return 0;
> +}
> +
> +/**
> + * __merge_memory_ranges - Merges the given memory ranges list.
> + * @mem_ranges:Range list to merge.
> + *
> + * Assumes a sorted range list.
> + *
> + * Returns nothing.
> + */

A lot of this code is annoyingly similar to the memblock code, though
the internals of that are all static these days.

I guess for now we'll just have to add all this. Maybe in future it can
be consolidated.

> +static void __merge_memory_ranges(struct crash_mem *mrngs)
> +{
> + struct crash_mem_range *rngs;
> + int i, idx;
> +
> + if (!mrngs)
> + return;
> +
> + idx = 0;
> + rngs = &mrngs->ranges[0];
> + for (i = 1; i < mrngs->nr_ranges; i++) {
> + if (rngs[i].start <= (rngs[i-1].end + 1))
> + rngs[idx].end = rngs[i].end;
> + else {
> + idx++;
> + if (i == idx)
> + continue;
> +
> + rngs[idx] = rngs[i];
> + }
> + }
> + mrngs->nr_ranges = idx + 1;
> +}
> +
> +/**
> + * realloc_mem_ranges - reallocate mem_ranges with size incremented
> + *  by MEM_RANGE_CHUNK_SZ. Frees up the old memory,
> + *  if memory allocation fails.
> + * @mem_ranges: Memory ranges to reallocate.
> + *
> + * Returns pointer to reallocated memory on success, NULL otherwise.
> + */
> +struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges)
> +{
> + struct crash_mem *mrngs = *mem_ranges;
> + unsigned int nr_ranges;
> + size_t size;
> +
> + size = 

Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved

2020-07-28 Thread Ingo Molnar


* Mike Rapoport  wrote:

> On Tue, Jul 28, 2020 at 12:44:40PM +0200, Ingo Molnar wrote:
> > 
> > * Mike Rapoport  wrote:
> > 
> > > From: Mike Rapoport 
> > > 
> > > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > > regions to set node ID in memblock.reserved and then traverses
> > > memblock.reserved to update reserved_nodemask to include node IDs that 
> > > were
> > > set in the first loop.
> > > 
> > > Remove redundant traversal over memblock.reserved and update
> > > reserved_nodemask while iterating over numa_meminfo.
> > > 
> > > Signed-off-by: Mike Rapoport 
> > > ---
> > >  arch/x86/mm/numa.c | 26 ++
> > >  1 file changed, 10 insertions(+), 16 deletions(-)
> > 
> > I suspect you'd like to carry this in the -mm tree?
> 
> Yes.
>  
> > Acked-by: Ingo Molnar 
> 
> Thanks!

Assuming it is correct and works. :-)

Thanks,

Ingo


Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-07-28 Thread Nicholas Piggin
Excerpts from pet...@infradead.org's message of July 26, 2020 10:11 pm:
> On Sun, Jul 26, 2020 at 02:14:34PM +1000, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 26, 2020 6:26 am:
> 
>> > Which is 'funny' when it interleaves like:
>> > 
>> >local_irq_disable();
>> >...
>> >local_irq_enable()
>> >  trace_hardirqs_on();
>> >  
>> >  raw_local_irq_enable();
>> > 
>> > Because then it will undo the trace_hardirqs_on() we just did. With the
>> > result that both tracing and lockdep will see a hardirqs-disable without
>> > a matching enable, while the hardware state is enabled.
>> 
>> Seems like an arch problem -- why not disable if it was enabled only?
>> I guess the local_irq tracing calls are a mess so maybe they copied 
>> those.
> 
> Because, as I wrote earlier, then we can miss updating software state.
> So your proposal has:
> 
>   raw_local_irq_disable()
>   
> if (!arch_irqs_disabled(regs->flags) // false
>   trace_hardirqs_off();
> 
> // tracing/lockdep still think IRQs are enabled
> // hardware IRQ state is disabled.

... and then lockdep_nmi_enter can disable IRQs if they were enabled?

The only reason it's done this way as opposed to a much simpler counter 
increment/decrement AFAIKS is to avoid some overhead of calling 
trace_hardirqs_on/off (which seems a bit dubious but let's go with it).

In that case the lockdep_nmi_enter code is the right spot to clean up 
that gap vs NMIs. I guess there's an argument that arch_nmi_enter could
do it. I don't see a problem with fixing it up here, though: this is a 
slow path, so it doesn't matter if we have some more logic for it.
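
A hypothetical sketch of that fix-up (illustrative only;
irq_state_tracked_enabled() is an assumed stand-in for however lockdep
reads its tracked state, not a real helper):

static inline void lockdep_nmi_irq_fixup(void)
{
	/*
	 * If the software-tracked state still says "enabled" while the
	 * hardware has interrupts disabled (the window between
	 * trace_hardirqs_on() and raw_local_irq_enable() above), record
	 * the disable so lockdep/tracing agree with the hardware again.
	 */
	if (irq_state_tracked_enabled() && irqs_disabled())
		trace_hardirqs_off();
}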

Thanks,
Nick


Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved

2020-07-28 Thread Baoquan He
On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> regions to set node ID in memblock.reserved and then traverses
> memblock.reserved to update reserved_nodemask to include node IDs that were
> set in the first loop.
> 
> Remove redundant traversal over memblock.reserved and update
> reserved_nodemask while iterating over numa_meminfo.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/x86/mm/numa.c | 26 ++
>  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 8ee952038c80..4078abd33938 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -498,31 +498,25 @@ static void __init numa_clear_kernel_node_hotplug(void)
>* and use those ranges to set the nid in memblock.reserved.
>* This will split up the memblock regions along node
>* boundaries and will set the node IDs as well.
> +  *
> +  * The nid will also be set in reserved_nodemask which is later
> +  * used to clear MEMBLOCK_HOTPLUG flag.
> +  *
> +  * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> +  *   numa_meminfo might not include all memblock.reserved
> +  *   memory ranges, because quirks such as trim_snb_memory()
> +  *   reserve specific pages for Sandy Bridge graphics.
> +  *   These ranges will remain with nid == MAX_NUMNODES. ]
>*/
>   for (i = 0; i < numa_meminfo.nr_blks; i++) {
>   struct numa_memblk *mb = numa_meminfo.blk + i;
>   int ret;
>  
>   ret = memblock_set_node(mb->start, mb->end - mb->start, 
> , mb->nid);
> + node_set(mb->nid, reserved_nodemask);

Really? This will set every node ID into reserved_nodemask. But in the
current code, the nid is only set on the memblock reserved regions that
interleave with numa_meminfo, and then those nids are collected into
reserved_nodemask. That is quite different, as far as I understand.
Please correct me if I am wrong.

Thanks
Baoquan

>   WARN_ON_ONCE(ret);
>   }
>  
> - /*
> -  * Now go over all reserved memblock regions, to construct a
> -  * node mask of all kernel reserved memory areas.
> -  *
> -  * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> -  *   numa_meminfo might not include all memblock.reserved
> -  *   memory ranges, because quirks such as trim_snb_memory()
> -  *   reserve specific pages for Sandy Bridge graphics. ]
> -  */
> - for_each_memblock(reserved, mb_region) {
> - int nid = memblock_get_region_node(mb_region);
> -
> - if (nid != MAX_NUMNODES)
> - node_set(nid, reserved_nodemask);
> - }
> -
>   /*
>* Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
>* belonging to the reserved node mask.
> -- 
> 2.26.2
> 
> 



Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved

2020-07-28 Thread Mike Rapoport
On Tue, Jul 28, 2020 at 12:44:40PM +0200, Ingo Molnar wrote:
> 
> * Mike Rapoport  wrote:
> 
> > From: Mike Rapoport 
> > 
> > numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > regions to set node ID in memblock.reserved and then traverses
> > memblock.reserved to update reserved_nodemask to include node IDs that were
> > set in the first loop.
> > 
> > Remove redundant traversal over memblock.reserved and update
> > reserved_nodemask while iterating over numa_meminfo.
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >  arch/x86/mm/numa.c | 26 ++
> >  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> I suspect you'd like to carry this in the -mm tree?

Yes.
 
> Acked-by: Ingo Molnar 

Thanks!

> Thanks,
> 
>   Ingo

-- 
Sincerely yours,
Mike.


Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault

2020-07-28 Thread Nicholas Piggin
Excerpts from Linus Torvalds's message of July 28, 2020 4:37 am:
> [ Adding linux-arch, just to make other architectures aware of this issue too.
> 
>   We have a "flush_tlb_fix_spurious_fault()" thing to take care of the
> "TLB may contain stale entries, we can't take the same fault over and
> over again" situation.
> 
>   On x86, it's a no-op, because x86 doesn't do that. x86 will re-walk
> the page tables - or possibly just always invalidate the faulting TLB
> entry - before taking a fault, so there can be no long-term stale
> TLB's.

[snip]

>   It looks like powerpc people at least thought about this, and only
> do it if there is a coprocessor. Which sounds a bit confused, but I
> don't know the rules.

I'm not sure about ppc32 and 64e, but I'm almost certain they should do
a local flush if anything, and someone with a good understanding of the
ISAs and CPUs might be able to nop it entirely. I agree a global flush
can't ever really make sense (except as a default, because we have no
generic local flush).

powerpc/64s reloads translations after taking a fault, so it's fine with 
a nop here.

The quirk is a problem with the coprocessor, which is supposed to
invalidate its translation after a fault but doesn't, so we can get a
stale read-only TLB entry stuck after something else does a RO->RW
upgrade on the PTE. Something like that, IIRC. Coprocessors have their
own MMU which lives in the nest, not the core, so you need a global TLB
flush to invalidate it.
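
The shape of that hook, roughly (an illustration of what Nick describes;
mm_has_copro() is an assumed helper, not the real field access):

#define flush_tlb_fix_spurious_fault(vma, address)			\
do {									\
	/* a nop unless a nest-MMU coprocessor is attached to this mm */ \
	if (mm_has_copro((vma)->vm_mm))					\
		flush_tlb_page(vma, address);				\
} while (0)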

Thanks,
Nick


Re: [PATCH 14/15] x86/numa: remove redundant iteration over memblock.reserved

2020-07-28 Thread Ingo Molnar


* Mike Rapoport  wrote:

> From: Mike Rapoport 
> 
> numa_clear_kernel_node_hotplug() function first traverses numa_meminfo
> > regions to set node ID in memblock.reserved and then traverses
> memblock.reserved to update reserved_nodemask to include node IDs that were
> set in the first loop.
> 
> Remove redundant traversal over memblock.reserved and update
> reserved_nodemask while iterating over numa_meminfo.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/x86/mm/numa.c | 26 ++
>  1 file changed, 10 insertions(+), 16 deletions(-)

I suspect you'd like to carry this in the -mm tree?

Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH 03/15] arm, xtensa: simplify initialization of high memory pages

2020-07-28 Thread Max Filippov
On Mon, Jul 27, 2020 at 10:12 PM Mike Rapoport  wrote:
>
> From: Mike Rapoport 
>
> The function free_highpages() in both arm and xtensa essentially open-code
> for_each_free_mem_range() loop to detect high memory pages that were not
> reserved and that should be initialized and passed to the buddy allocator.
>
> Replace open-coded implementation of for_each_free_mem_range() with usage
> of memblock API to simplify the code.
>
> Signed-off-by: Mike Rapoport 
> ---
>  arch/arm/mm/init.c| 48 +++--
>  arch/xtensa/mm/init.c | 55 ---
>  2 files changed, 18 insertions(+), 85 deletions(-)

For the xtensa part:
Reviewed-by: Max Filippov 
Tested-by: Max Filippov 

-- 
Thanks.
-- Max


[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c

2020-07-28 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206203

--- Comment #17 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 290641
  --> https://bugzilla.kernel.org/attachment.cgi?id=290641&action=edit
kmemleak output (kernel 5.8-rc7, PowerMac G4 3,6)

Also happens on my G4 DP.


[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c

2020-07-28 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206203

--- Comment #16 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 290639
  --> https://bugzilla.kernel.org/attachment.cgi?id=290639&action=edit
dmesg (kernel 5.8-rc7, PowerMac G4 3,6)


Re: [PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's

2020-07-28 Thread Pingfan Liu
On Thu, Jul 23, 2020 at 10:41 PM Nathan Lynch  wrote:
>
> Pingfan Liu  writes:
> > This patch prepares for the incoming patch which swaps the order of KOBJ_
> > uevent and dt's updating.
> >
> > It has no functional effect, just groups lmb operation and memblock's in
> > order to insert dt updating operation easily, and makes it easier to
> > review.
>
> ...
>
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 5d545b7..1a3ac3b 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
> >  static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >  {
> >   unsigned long block_sz;
> > - int rc;
> > + phys_addr_t base_addr;
> > + int rc, nid;
> >
> >   if (!lmb_is_removable(lmb))
> >   return -EINVAL;
> > @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >   if (rc)
> >   return rc;
> >
> > + base_addr = lmb->base_addr;
> > + nid = lmb->nid;
> >   block_sz = pseries_memory_block_size();
> >
> > - __remove_memory(lmb->nid, lmb->base_addr, block_sz);
> > -
> > - /* Update memory regions for memory remove */
> > - memblock_remove(lmb->base_addr, block_sz);
> > -
> >   invalidate_lmb_associativity_index(lmb);
> >   lmb_clear_nid(lmb);
> >   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> >
> > + __remove_memory(nid, base_addr, block_sz);
> > +
> > + /* Update memory regions for memory remove */
> > + memblock_remove(base_addr, block_sz);
> > +
> >   return 0;
> >  }
>
> I don't understand; the commit message should not claim this has no
> functional effect when it changes the order of operations like
> this. Maybe this is an improvement over the current behavior, but it's
> not explained why it would be.
One group of functions, whose names contain lmb, is powerpc specific
and is used to build the device tree.

The other group, __remove_memory() and memblock_remove(), is integrated
with Linux MM.

And [2/2] arranges the device-tree update just before __remove_memory().

Thanks,
Pingfan


Re: [PATCH 02/15] dma-contiguous: simplify cma_early_percent_memory()

2020-07-28 Thread Christoph Hellwig
On Tue, Jul 28, 2020 at 08:11:40AM +0300, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The memory size calculation in cma_early_percent_memory() traverses
> memblock.memory rather than simply calling memblock_phys_mem_size(). The
> comment in that function suggests that at some point there should have been
> a call to memblock_analyze() before memblock_phys_mem_size() could be used.
> As of now, there is no memblock_analyze() at all and
> memblock_phys_mem_size() can be used as soon as cold-plug memory is
> registered with memblock.
> 
> Replace loop over memblock.memory with a call to memblock_phys_mem_size().
> 
> Signed-off-by: Mike Rapoport 

Looks good:

Reviewed-by: Christoph Hellwig
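
Roughly, the simplification being described boils down to something like
this (a sketch of the idea, not the exact patch):

static phys_addr_t __init cma_early_percent_memory(void)
{
	/* Total cold-plugged memory is available directly from memblock */
	return memblock_phys_mem_size() * CONFIG_CMA_SIZE_PERCENTAGE / 100;
}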