Re: [PATCH v2 9/9] iommu/vt-d: Use pci core's DVSEC functionality
On Thu, Sep 23, 2021 at 10:26:47AM -0700, Ben Widawsky wrote:
>  */
> static int siov_find_pci_dvsec(struct pci_dev *pdev)
> {
> +	return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5);
> }

I think the siov_find_pci_dvsec helper is pretty pointless now and can be folded into its only caller.

And independent of that: this capability really needs a symbolic name. Especially for a vendor like Intel, which might have a few, there should be a list of them somewhere.
Re: [PATCH] powerpc: don't select KFENCE on platform PPC_FSL_BOOK3E
Le 24/09/2021 à 08:39, Liu Shixin a écrit :
> On platform PPC_FSL_BOOK3E, all lowmem is managed by tlbcam. That means
> we didn't really map the kfence pool with page granularity. Therefore,
> if KFENCE is enabled, the system will hit the following panic:

Could you please explain a bit more what the problem is?

KFENCE has been implemented with the same logic as DEBUG_PAGEALLOC, and DEBUG_PAGEALLOC is enabled on FSL_BOOK3E.

In MMU_setup(), __map_without_ltlbs is set to 1 when KFENCE is enabled, and __map_without_ltlbs should disable the use of tlbcam.

So what's wrong really? Does DEBUG_PAGEALLOC work on FSL_BOOK3E?

Thanks
Christophe

> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc01de598
> Oops: Kernel access of bad area, sig: 11 [#1]
> BE PAGE_SIZE=4K SMP NR_CPUS=4 MPC8544 DS
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3+ #298
> NIP: c01de598 LR: c08ae9c4 CTR:
> REGS: c0b4bea0 TRAP: 0300 Not tainted (5.12.0-rc3+)
> MSR: 00021000 CR: 24000228 XER: 2000
> DEAR: ESR:
> GPR00: c08ae9c4 c0b4bf60 c0ad64e0 ef72 00021000 0200
> GPR08: c0ad5000 0004 008fbb30
> GPR16: c000
> GPR24: c08ca004 c08ca004 c0b6a0e0 c0b6 c0b58f00 c085 c08ca000 ef72
> NIP [c01de598] kfence_protect+0x44/0x6c
> LR [c08ae9c4] kfence_init+0xfc/0x2a4
> Call Trace:
> [c0b4bf60] [efffe160] 0xefffe160 (unreliable)
> [c0b4bf70] [c08ae9c4] kfence_init+0xfc/0x2a4
> [c0b4bfb0] [c0894d3c] start_kernel+0x3bc/0x574
> [c0b4bff0] [c470] set_ivor+0x14c/0x188
> Instruction dump:
> 7c0802a6 8109d594 546a653a 90010014 54630026 3920 7d48502e 2c0a
> 41820010 554a0026 5469b53a 7d295214 <8149> 38831000 554a003c 9149
> random: get_random_bytes called from print_oops_end_marker+0x40/0x78 with crng_init=0
> ---[ end trace ]---
>
> Signed-off-by: Liu Shixin
> ---
>  arch/powerpc/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index d46db0bfb998..cffd57bcb5e4 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -185,7 +185,7 @@ config PPC
>  	select HAVE_ARCH_KASAN			if PPC32 && PPC_PAGE_SHIFT <= 14
>  	select HAVE_ARCH_KASAN_VMALLOC		if PPC32 && PPC_PAGE_SHIFT <= 14
>  	select HAVE_ARCH_KGDB
> -	select HAVE_ARCH_KFENCE			if PPC32
> +	select HAVE_ARCH_KFENCE			if PPC32 && !PPC_FSL_BOOK3E
>  	select HAVE_ARCH_MMAP_RND_BITS
>  	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if COMPAT
>  	select HAVE_ARCH_NVRAM_OPS
[PATCH] powerpc: don't select KFENCE on platform PPC_FSL_BOOK3E
On platform PPC_FSL_BOOK3E, all lowmem is managed by tlbcam. That means
we didn't really map the kfence pool with page granularity. Therefore,
if KFENCE is enabled, the system will hit the following panic:

BUG: Kernel NULL pointer dereference on read at 0x
Faulting instruction address: 0xc01de598
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K SMP NR_CPUS=4 MPC8544 DS
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3+ #298
NIP: c01de598 LR: c08ae9c4 CTR:
REGS: c0b4bea0 TRAP: 0300 Not tainted (5.12.0-rc3+)
MSR: 00021000 CR: 24000228 XER: 2000
DEAR: ESR:
GPR00: c08ae9c4 c0b4bf60 c0ad64e0 ef72 00021000 0200
GPR08: c0ad5000 0004 008fbb30
GPR16: c000
GPR24: c08ca004 c08ca004 c0b6a0e0 c0b6 c0b58f00 c085 c08ca000 ef72
NIP [c01de598] kfence_protect+0x44/0x6c
LR [c08ae9c4] kfence_init+0xfc/0x2a4
Call Trace:
[c0b4bf60] [efffe160] 0xefffe160 (unreliable)
[c0b4bf70] [c08ae9c4] kfence_init+0xfc/0x2a4
[c0b4bfb0] [c0894d3c] start_kernel+0x3bc/0x574
[c0b4bff0] [c470] set_ivor+0x14c/0x188
Instruction dump:
7c0802a6 8109d594 546a653a 90010014 54630026 3920 7d48502e 2c0a
41820010 554a0026 5469b53a 7d295214 <8149> 38831000 554a003c 9149
random: get_random_bytes called from print_oops_end_marker+0x40/0x78 with crng_init=0
---[ end trace ]---

Signed-off-by: Liu Shixin
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d46db0bfb998..cffd57bcb5e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -185,7 +185,7 @@ config PPC
 	select HAVE_ARCH_KASAN			if PPC32 && PPC_PAGE_SHIFT <= 14
 	select HAVE_ARCH_KASAN_VMALLOC		if PPC32 && PPC_PAGE_SHIFT <= 14
 	select HAVE_ARCH_KGDB
-	select HAVE_ARCH_KFENCE			if PPC32
+	select HAVE_ARCH_KFENCE			if PPC32 && !PPC_FSL_BOOK3E
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if COMPAT
 	select HAVE_ARCH_NVRAM_OPS
--
2.18.0.huawei.25
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On Thu, Sep 23, 2021 at 03:54:46PM +0200, Christophe Leroy wrote:
> Le 23/09/2021 à 14:01, Mike Rapoport a écrit :
> > On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote:
> > > Le 23/09/2021 à 09:43, Mike Rapoport a écrit :
> > > > From: Mike Rapoport
> > > >
> > > > For ages memblock_free() interface dealt with physical addresses even
> > > > despite the existence of memblock_alloc_xx() functions that return a
> > > > virtual pointer.
> > > >
> > > > Introduce memblock_phys_free() for freeing physical ranges and repurpose
> > > > memblock_free() to free virtual pointers to make the following pairing
> > > > abundantly clear:
> > > >
> > > > 	int memblock_phys_free(phys_addr_t base, phys_addr_t size);
> > > > 	phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size);
> > > >
> > > > 	void *memblock_alloc(phys_addr_t size, phys_addr_t align);
> > > > 	void memblock_free(void *ptr, size_t size);
> > > >
> > > > Replace intermediate memblock_free_ptr() with memblock_free() and drop
> > > > unnecessary aliases memblock_free_early() and memblock_free_early_nid().
> > > >
> > > > Suggested-by: Linus Torvalds
> > > > Signed-off-by: Mike Rapoport
> > > > ---
> > > >
> > > > diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
> > > > index 1a04e5bdf655..37826d8c4f74 100644
> > > > --- a/arch/s390/kernel/smp.c
> > > > +++ b/arch/s390/kernel/smp.c
> > > > @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void)
> > > >  		/* Get the CPU registers */
> > > >  		smp_save_cpu_regs(sa, addr, is_boot_cpu, page);
> > > >  	}
> > > > -	memblock_free(page, PAGE_SIZE);
> > > > +	memblock_phys_free(page, PAGE_SIZE);
> > > >  	diag_amode31_ops.diag308_reset();
> > > >  	pcpu_set_smt(0);
> > > > }
> > > > @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void)
> > > >  	/* Add CPUs present at boot */
> > > >  	__smp_rescan_cpus(info, true);
> > > > -	memblock_free_early((unsigned long)info, sizeof(*info));
> > > > +	memblock_free(info, sizeof(*info));
> > > > }
> > > > /*
> > >
> > > I'm a bit lost. IIUC memblock_free_early() and memblock_free() where
> > > identical.
> >
> > Yes, they were, but all calls to memblock_free_early() were using
> > __pa(vaddr) because they had a virtual address at hand.
>
> I'm still not following. In the above memblock_free_early() was taking
> (unsigned long)info . Was it a bug ?

Not really, because s390 has pa == va:

https://elixir.bootlin.com/linux/latest/source/arch/s390/include/asm/page.h#L169

--
Sincerely yours,
Mike.
Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
Srikar Dronamraju writes:
> * Michael Ellerman [2021-09-23 17:29:32]:
>
>> Nathan Lynch writes:
>> > Srikar Dronamraju writes:
>> >
>> >> * Nathan Lynch [2021-09-22 11:01:12]:
>> >>
>> >>> Srikar Dronamraju writes:
...
>> >> Or can I understand how debug_smp_processor_id() is useful if
>> >> __smp_processor_id() is defined as raw_smp_processor_id()?
>>
>> debug_smp_processor_id() is useful on powerpc, as well as other arches,
>> because it checks that we're in a context where the processor id won't
>> change out from under us.
>>
>> eg. something like this is unsafe:
>>
>> 	int counts[NR_CPUS];
>> 	int tmp, cpu;
>>
>> 	cpu = smp_processor_id();
>> 	tmp = counts[cpu];
>> 	<- preempted here and migrated to another CPU
>> 	counts[cpu] = tmp + 1;
>
> If, let's say, the above call was replaced by raw_smp_processor_id(), how
> would it avoid the preemption / migration to another CPU?
>
> Replacing it with raw_smp_processor_id() may avoid the debug splat, but the
> underlying issue would still remain as is. No?

Correct. Using raw_smp_processor_id() is generally the wrong answer.

For this example the correct fix is to disable preemption around the code, eg:

	int counts[NR_CPUS];
	int tmp, cpu;

	preempt_disable();
	cpu = smp_processor_id();
	tmp = counts[cpu];
	counts[cpu] = tmp + 1;
	preempt_enable();

For the original issue I think it is OK to use raw_smp_processor_id(), because we're already doing a racy check of whether another CPU has been preempted by the hypervisor.
	if (!is_kvm_guest()) {
		int first_cpu = cpu_first_thread_sibling(smp_processor_id());

		if (cpu_first_thread_sibling(cpu) == first_cpu)
			return false;
	}

We could disable preemption around that, eg:

	if (!is_kvm_guest()) {
		int first_cpu;
		bool is_sibling;

		preempt_disable();
		first_cpu = cpu_first_thread_sibling(smp_processor_id());
		is_sibling = (cpu_first_thread_sibling(cpu) == first_cpu);
		preempt_enable();

		// Can be preempted here

		if (is_sibling)
			return false;
	}

But before we return we could be preempted, and then is_sibling is no longer necessarily correct. So that doesn't really gain us anything.

The only way to make that result stable is to disable preemption in the caller, but most callers don't want to AFAICS, because they know they're doing a racy check to begin with.

cheers
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 9/23/21 9:43 AM, Mike Rapoport wrote:
> From: Mike Rapoport
>
> For ages memblock_free() interface dealt with physical addresses even
> despite the existence of memblock_alloc_xx() functions that return a
> virtual pointer.
>
> Introduce memblock_phys_free() for freeing physical ranges and repurpose
> memblock_free() to free virtual pointers to make the following pairing
> abundantly clear:
>
> 	int memblock_phys_free(phys_addr_t base, phys_addr_t size);
> 	phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size);
>
> 	void *memblock_alloc(phys_addr_t size, phys_addr_t align);
> 	void memblock_free(void *ptr, size_t size);
>
> Replace intermediate memblock_free_ptr() with memblock_free() and drop
> unnecessary aliases memblock_free_early() and memblock_free_early_nid().
>
> Suggested-by: Linus Torvalds
> Signed-off-by: Mike Rapoport

arch/arc part:

Reviewed-by: Shahab Vahedi

Thanks,
Shahab
Re: [PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC
On 9/23/2021 1:26 PM, Ben Widawsky wrote:
> Add pci_find_dvsec_capability to locate a Designated Vendor-Specific
> Extended Capability with the specified DVSEC ID.
>
> The Designated Vendor-Specific Extended Capability (DVSEC) allows one or
> more vendor specific capabilities that aren't tied to the vendor ID of
> the PCI component. DVSEC is critical for both the Compute Express Link
> (CXL) driver as well as the driver for OpenCAPI coherent accelerator
> (OCXL).
>
> Cc: David E. Box
> Cc: Jonathan Cameron
> Cc: Bjorn Helgaas
> Cc: Dan Williams
> Cc: linux-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Andrew Donnellan
> Cc: Lu Baolu
> Reviewed-by: Frederic Barrat
> Signed-off-by: Ben Widawsky

Applied the interface for the perf uncore driver as below. The interface works properly.

Tested-by: Kan Liang

From ebb69ba386dca91fb372522b13af9feb84adcbc0 Mon Sep 17 00:00:00 2001
From: Kan Liang
Date: Thu, 23 Sep 2021 13:59:24 -0700
Subject: [PATCH] perf/x86/intel/uncore: Use pci core's DVSEC functionality

Apply the standard interface pci_find_dvsec_capability in the perf uncore
driver and remove unused macros. This reduces the maintenance burden of
the DVSEC query implementation.

Signed-off-by: Kan Liang
---
 arch/x86/events/intel/uncore_discovery.c | 41 +++-
 arch/x86/events/intel/uncore_discovery.h |  6 -
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/arch/x86/events/intel/uncore_discovery.c b/arch/x86/events/intel/uncore_discovery.c
index 3049c64..f8ea092 100644
--- a/arch/x86/events/intel/uncore_discovery.c
+++ b/arch/x86/events/intel/uncore_discovery.c
@@ -21,7 +21,7 @@ static bool has_generic_discovery_table(void)
 		return false;

 	/* A discovery table device has the unique capability ID.
 	 */
-	dvsec = pci_find_next_ext_capability(dev, 0, UNCORE_EXT_CAP_ID_DISCOVERY);
+	dvsec = pci_find_next_ext_capability(dev, 0, PCI_EXT_CAP_ID_DVSEC);
 	pci_dev_put(dev);
 	if (dvsec)
 		return true;
@@ -260,7 +260,7 @@ static int parse_discovery_table(struct pci_dev *dev, int die,

 bool intel_uncore_has_discovery_tables(void)
 {
-	u32 device, val, entry_id, bar_offset;
+	u32 device, val, bar_offset;
 	int die, dvsec = 0, ret = true;
 	struct pci_dev *dev = NULL;
 	bool parsed = false;
@@ -275,27 +275,24 @@ bool intel_uncore_has_discovery_tables(void)
 	 * the discovery table devices.
 	 */
 	while ((dev = pci_get_device(PCI_VENDOR_ID_INTEL, device, dev)) != NULL) {
-		while ((dvsec = pci_find_next_ext_capability(dev, dvsec, UNCORE_EXT_CAP_ID_DISCOVERY))) {
-			pci_read_config_dword(dev, dvsec + UNCORE_DISCOVERY_DVSEC_OFFSET, &val);
-			entry_id = val & UNCORE_DISCOVERY_DVSEC_ID_MASK;
-			if (entry_id != UNCORE_DISCOVERY_DVSEC_ID_PMON)
-				continue;
-
-			pci_read_config_dword(dev, dvsec + UNCORE_DISCOVERY_DVSEC2_OFFSET, &val);
-
-			if (val & ~UNCORE_DISCOVERY_DVSEC2_BIR_MASK) {
-				ret = false;
-				goto err;
-			}
-			bar_offset = UNCORE_DISCOVERY_BIR_BASE +
-				     (val & UNCORE_DISCOVERY_DVSEC2_BIR_MASK) * UNCORE_DISCOVERY_BIR_STEP;
-
-			die = get_device_die_id(dev);
-			if (die < 0)
-				continue;
-
-			parse_discovery_table(dev, die, bar_offset, &parsed);
+		dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_INTEL, UNCORE_DISCOVERY_DVSEC_ID_PMON);
+		if (!dvsec)
+			continue;
+
+		pci_read_config_dword(dev, dvsec + UNCORE_DISCOVERY_DVSEC2_OFFSET, &val);
+
+		if (val & ~UNCORE_DISCOVERY_DVSEC2_BIR_MASK) {
+			ret = false;
+			goto err;
 		}
+		bar_offset = UNCORE_DISCOVERY_BIR_BASE +
+			     (val & UNCORE_DISCOVERY_DVSEC2_BIR_MASK) * UNCORE_DISCOVERY_BIR_STEP;
+
+		die = get_device_die_id(dev);
+		if (die < 0)
+			continue;
+
+		parse_discovery_table(dev, die, bar_offset, &parsed);
 	}

 	/* None of the discovery tables are available */
diff --git a/arch/x86/events/intel/uncore_discovery.h b/arch/x86/events/intel/uncore_discovery.h
index
6d735611..84d56e5 100644
--- a/arch/x86/events/intel/uncore_discovery.h
+++ b/arch/x86/events/intel/uncore_discovery.h
@@ -2,12 +2,6 @@

 /* Generic device ID of a discovery table device */
 #define UNCORE_DISCOVERY_TABLE_DEVICE		0x09a7
-/* Capability ID for a discovery table device */
-#define UNCORE_EXT_CAP_ID_DISCOVERY		0x23
-/* First DVSEC offset */
-#define UNCORE_DISCOVERY_DVSEC_OFFSET		0x8
-/* Mask of the supported discovery entry type */
-#define UNCORE_DISCOVERY_DVSEC_ID_MASK		0x
 /* PMON discovery ent
Re: [PATCH v1] powerpc/64s: Fix unrecoverable MCE crash
On 9/22/21 7:32 AM, Nicholas Piggin wrote:
> The machine check handler is not considered NMI on 64s. The early
> handler is the true NMI handler, and then it schedules the
> machine_check_exception handler to run when interrupts are enabled.
>
> This works fine except the case of an unrecoverable MCE, where the true
> NMI is taken when MSR[RI] is clear, it can not recover to schedule the
> next handler, so it calls machine_check_exception directly so something
> might be done about it.
>
> Calling an async handler from NMI context can result in irq state and
> other things getting corrupted. This can also trigger the BUG at
> arch/powerpc/include/asm/interrupt.h:168.
>
> Fix this by just making the 64s machine_check_exception handler an NMI
> like it is on other subarchs.
>
> Signed-off-by: Nicholas Piggin
> ---

Hi Nick,

If I inject a control memory access error in an LPAR on top of this patch
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210906084303.183921-1-ganes...@linux.ibm.com/
I see the following warning trace:

WARNING: CPU: 130 PID: 7122 at arch/powerpc/include/asm/interrupt.h:319 machine_check_exception+0x310/0x340
Modules linked in:
CPU: 130 PID: 7122 Comm: inj_access_err Kdump: loaded Tainted: G M 5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty #22
NIP: c002f980 LR: c002f7e8 CTR: c0a31860
REGS: c039fe51bb20 TRAP: 0700 Tainted: G M (5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty)
MSR: 80029033 CR: 88000222 XER: 2004
CFAR: c002f844 IRQMASK: 0
GPR00: c002f798 c039fe51bdc0 c20d 0001
GPR04: 4002 4000 19af
GPR08: 0077e5ad c077ee16c700 0080
GPR12: 88000222 c077ee16c700
GPR16:
GPR20:
GPR24: c20fecd8
GPR28: 0001 0001 c039fe51be80
NIP [c002f980] machine_check_exception+0x310/0x340
LR [c002f7e8] machine_check_exception+0x178/0x340
Call Trace:
[c039fe51bdc0] [c002f798] machine_check_exception+0x128/0x340 (unreliable)
[c039fe51be10] [c00086ec] machine_check_common+0x1ac/0x1b0
--- interrupt: 200 at 0x1968
NIP: 1968 LR: 1958 CTR:
REGS: c039fe51be80 TRAP: 0200 Tainted: G M
(5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty)
MSR: 82a0f033 CR: 22000824 XER:
CFAR: 021c DAR: 7fffb00c DSISR: 0208 IRQMASK: 0
GPR00: 22000824 7fffc9647770 10027f00 7fffb00c
GPR04:
GPR08: 7fffb00c 0001
GPR12: 7fffb015a330
GPR16:
GPR20:
GPR24: 185c
GPR28: 7fffc9647d18 0001 19b0 7fffc9647770
NIP [1968] 0x1968
LR [1958] 0x1958
--- interrupt: 200
Re: [PATCH 0/3] memblock: cleanup memblock_free interface
Hi Linus,

On Thu, Sep 23, 2021 at 09:01:46AM -0700, Linus Torvalds wrote:
> On Thu, Sep 23, 2021 at 12:43 AM Mike Rapoport wrote:
>
> You need to be a LOT more careful.
>
> From a trivial check - exactly because I looked at doing it with a
> script, and decided it's not so easy - I found cases like this:
>
>    - memblock_free(__pa(paca_ptrs) + new_ptrs_size,
>    + memblock_free(paca_ptrs + new_ptrs_size,
>
> which is COMPLETELY wrong.

I did use a coccinelle script that's slightly more robust than the sed you've sent, but then I did a manual review, hence the two small patches with fixes.

Indeed I missed this one, so to be on the safe side I'll rename only the obvious cases where coccinelle can be used reliably and leave all the rest as it is now. If somebody cares enough they can update it later.

> And no, making the scripting just replace '__pa(x)' with '(void *)(x)'

These were actually manual, and they are required for variables that are used as virtual addresses but have unsigned long type, like e.g. initrd_start. So it's either __pa(x) or (void *).

--
Sincerely yours,
Mike.
Re: [PATCH v4] lockdown,selinux: fix wrong subject in some SELinux lockdown checks
On Wed, Sep 15, 2021 at 10:59 PM Paul Moore wrote:
> On Mon, Sep 13, 2021 at 5:05 PM Paul Moore wrote:
> > On Mon, Sep 13, 2021 at 10:02 AM Ondrej Mosnacek wrote:
> > >
> > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > > lockdown") added an implementation of the locked_down LSM hook to
> > > SELinux, with the aim to restrict which domains are allowed to perform
> > > operations that would breach lockdown.
> > >
> > > However, in several places the security_locked_down() hook is called in
> > > situations where the current task isn't doing any action that would
> > > directly breach lockdown, leading to SELinux checks that are basically
> > > bogus.
> > >
> > > To fix this, add an explicit struct cred pointer argument to
> > > security_lockdown() and define NULL as a special value to pass instead
> > > of current_cred() in such situations. LSMs that take the subject
> > > credentials into account can then fall back to some default or ignore
> > > such calls altogether. In the SELinux lockdown hook implementation, use
> > > SECINITSID_KERNEL in case the cred argument is NULL.
> > >
> > > Most of the callers are updated to pass current_cred() as the cred
> > > pointer, thus maintaining the same behavior. The following callers are
> > > modified to pass NULL as the cred pointer instead:
> > >
> > > 1. arch/powerpc/xmon/xmon.c
> > >    Seems to be some interactive debugging facility. It appears that
> > >    the lockdown hook is called from interrupt context here, so it
> > >    should be more appropriate to request a global lockdown decision.
> > >
> > > 2. fs/tracefs/inode.c:tracefs_create_file()
> > >    Here the call is used to prevent creating new tracefs entries when
> > >    the kernel is locked down. Assumes that locking down is one-way -
> > >    i.e. if the hook returns non-zero once, it will never return zero
> > >    again, thus no point in creating these files.
> > >    Also, the hook is
> > >    often called by a module's init function when it is loaded by
> > >    userspace, where it doesn't make much sense to do a check against
> > >    the current task's creds, since the task itself doesn't actually
> > >    use the tracing functionality (i.e. doesn't breach lockdown), just
> > >    indirectly makes some new tracepoints available to whoever is
> > >    authorized to use them.
> > >
> > > 3. net/xfrm/xfrm_user.c:copy_to_user_*()
> > >    Here a cryptographic secret is redacted based on the value returned
> > >    from the hook. There are two possible actions that may lead here:
> > >    a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> > >       task context is relevant, since the dumped data is sent back to
> > >       the current task.
> > >    b) When adding/deleting/updating an SA via XFRM_MSG_xxxSA, the
> > >       dumped SA is broadcasted to tasks subscribed to XFRM events -
> > >       here the current task context is not relevant as it doesn't
> > >       represent the tasks that could potentially see the secret.
> > >    It doesn't seem worth it to try to keep using the current task's
> > >    context in the a) case, since the eventual data leak can be
> > >    circumvented anyway via b), plus there is no way for the task to
> > >    indicate that it doesn't care about the actual key value, so the
> > >    check could generate a lot of "false alert" denials with SELinux.
> > >    Thus, let's pass NULL instead of current_cred() here faute de
> > >    mieux.
> > >
> > > Improvements-suggested-by: Casey Schaufler
> > > Improvements-suggested-by: Paul Moore
> > > Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> > > Acked-by: Dan Williams [cxl]
> > > Acked-by: Steffen Klassert [xfrm]
> > > Signed-off-by: Ondrej Mosnacek
> > > ---
> > >
> > > v4:
> > > - rebase on top of TODO
> > > - fix rebase conflicts:
> > >   * drivers/cxl/pci.c
> > >     - trivial: the lockdown reason was corrected in mainline
> > >   * kernel/bpf/helpers.c, kernel/trace/bpf_trace.c
> > >     - trivial: LOCKDOWN_BPF_READ was renamed to LOCKDOWN_BPF_READ_KERNEL
> > >       in mainline
> > >   * kernel/power/hibernate.c
> > >     - trivial: !secretmem_active() was added to the condition in
> > >       hibernation_available()
> > > - cover new security_locked_down() call in kernel/bpf/helpers.c
> > >   (LOCKDOWN_BPF_WRITE_USER in BPF_FUNC_probe_write_user case)
> > >
> > > v3: https://lore.kernel.org/lkml/20210616085118.1141101-1-omosn...@redhat.com/
> > > - add the cred argument to security_locked_down() and adapt all callers
> > > - keep using current_cred() in BPF, as the hook calls have been shifted
> > >   to program load time (commit ff40e51043af ("bpf, lockdown, audit: Fix
> > >   buggy SELinux lockdown permission checks"))
> > > - in SELinux, don't ignore hook calls where cred == NULL, but use
> > >   SECINITSID_KERNEL as the subject instead
> > > - update explanations
Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()
On Thu, Sep 23, 2021 at 12:05:58AM +0300, Kirill A. Shutemov wrote:
> Unless we find other way to guarantee RIP-relative access, we must use
> fixup_pointer() to access any global variables.

Yah, I've asked compiler folks about any guarantees we have wrt rip-relative addresses but it doesn't look good. Worst case, we'd have to do the fixup_pointer() thing.

In the meantime, Tom and I did some more poking at this and here's a diff ontop. The direction being that we'll stick both the AMD and Intel *cc_platform_has() calls into cc_platform.c, for which instrumentation will be disabled so no issues with that. And that will keep all that querying all together in a single file.

---
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index a73712b6ee0e..2d4f5c17d79c 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -51,7 +51,6 @@ void __init mem_encrypt_free_decrypted_mem(void);

 void __init mem_encrypt_init(void);
 void __init sev_es_init_vc_handling(void);
-bool amd_cc_platform_has(enum cc_attr attr);

 #define __bss_decrypted __section(".bss..decrypted")

@@ -74,7 +73,6 @@ static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
 static inline void __init sme_enable(struct boot_params *bp) { }

 static inline void sev_es_init_vc_handling(void) { }
-static inline bool amd_cc_platform_has(enum cc_attr attr) { return false; }

 static inline int __init
 early_set_memory_decrypted(unsigned long vaddr, unsigned long size) { return 0; }
@@ -103,12 +101,6 @@ static inline u64 sme_get_me_mask(void)
 	return sme_me_mask;
 }

-#if defined(CONFIG_CPU_SUP_INTEL) && defined(CONFIG_ARCH_HAS_CC_PLATFORM)
-bool intel_cc_platform_has(enum cc_attr attr);
-#else
-static inline bool intel_cc_platform_has(enum cc_attr attr) { return false; }
-#endif
-
 #endif	/* __ASSEMBLY__ */

 #endif	/* __X86_MEM_ENCRYPT_H__ */
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index da54a1805211..97ede7052f77
100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -13,6 +13,52 @@

 #include

+static bool intel_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_INTEL_TDX_GUEST
+	return false;
+#else
+	return false;
+#endif
+}
+
+/*
+ * SME and SEV are very similar but they are not the same, so there are
+ * times that the kernel will need to distinguish between SME and SEV. The
+ * cc_platform_has() function is used for this. When a distinction isn't
+ * needed, the CC_ATTR_MEM_ENCRYPT attribute can be used.
+ *
+ * The trampoline code is a good example for this requirement. Before
+ * paging is activated, SME will access all memory as decrypted, but SEV
+ * will access all memory as encrypted. So, when APs are being brought
+ * up under SME the trampoline area cannot be encrypted, whereas under SEV
+ * the trampoline area must be encrypted.
+ */
+static bool amd_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	switch (attr) {
+	case CC_ATTR_MEM_ENCRYPT:
+		return sme_me_mask;
+
+	case CC_ATTR_HOST_MEM_ENCRYPT:
+		return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);
+
+	case CC_ATTR_GUEST_MEM_ENCRYPT:
+		return sev_status & MSR_AMD64_SEV_ENABLED;
+
+	case CC_ATTR_GUEST_STATE_ENCRYPT:
+		return sev_status & MSR_AMD64_SEV_ES_ENABLED;
+
+	default:
+		return false;
+	}
+#else
+	return false;
+#endif
+}
+
 bool cc_platform_has(enum cc_attr attr)
 {
 	if (sme_me_mask)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 53756ff12295..8321c43554a1 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -60,13 +60,6 @@ static u64 msr_test_ctrl_cache __ro_after_init;
  */
 static bool cpu_model_supports_sld __ro_after_init;

-#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
-bool intel_cc_platform_has(enum cc_attr attr)
-{
-	return false;
-}
-#endif
-
 /*
  * Processors which have self-snooping capability can handle conflicting
  * memory type across CPUs by snooping its own cache.
However, there exists
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9417d404ea92..23d54b810f08 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -361,38 +361,6 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size)
 	return early_set_memory_enc_dec(vaddr, size, true);
 }

-/*
- * SME and SEV are very similar but they are not the same, so there are
- * times that the kernel will need to distinguish between SME and SEV. The
- * cc_platform_has() function is used for this. When a distinction isn't
- * needed, the CC_ATTR_MEM_ENCRYPT attribute can be used.
- *
- * The trampoline code is a good example for this requirement. Before
- * paging is activated, SME will access all memory as decrypted, but SEV
- * will access all memory a
Re: [PATCH 5/9] xen/x86: make "earlyprintk=xen" work for HVM/PVH DomU
On 07.09.21 12:10, Jan Beulich wrote:
> xenboot_write_console() is dealing with these quite fine so I don't see
> why xenboot_console_setup() would return -ENOENT in this case. Adjust
> documentation accordingly.
>
> Signed-off-by: Jan Beulich

Reviewed-by: Juergen Gross

Juergen
Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
* Michael Ellerman [2021-09-23 17:29:32]:

> Nathan Lynch writes:
> > Srikar Dronamraju writes:
> >
> >> * Nathan Lynch [2021-09-22 11:01:12]:
> >>
> >>> Srikar Dronamraju writes:
> >>> > * Nathan Lynch [2021-09-20 22:12:13]:
> >>> >
> >>> >> vcpu_is_preempted() can be used outside of preempt-disabled critical
> >>> >> sections, yielding warnings such as:
> >>> >>
> >>> >> BUG: using smp_processor_id() in preemptible [] code: systemd-udevd/185
> >>> >> caller is rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
> >>> >> Call Trace:
> >>> >> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
> >>> >> [c00012907b00] [c1371f70] check_preemption_disabled+0x150/0x160
> >>> >> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> [c00012907be0] [c01e1408] rwsem_down_write_slowpath+0x478/0x9a0
> >>> >> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
> >>> >> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
> >>> >> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
> >>> >> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
> >>> >> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
> >>> >>
> >>> >> The result of vcpu_is_preempted() is always subject to invalidation by
> >>> >> events inside and outside of Linux; it's just a best guess at a point in
> >>> >> time. Use raw_smp_processor_id() to avoid such warnings.
> >>> >
> >>> > Typically smp_processor_id() and raw_smp_processor_id() except for the
> >>> > CONFIG_DEBUG_PREEMPT.
> >>>
> >>> Sorry, I don't follow...
> >>
> >> I meant, unless CONFIG_DEBUG_PREEMPT, smp_processor_id() is defined as
> >> raw_processor_id().
> >>
> >>> > In the CONFIG_DEBUG_PREEMPT case, smp_processor_id()
> >>> > is actually debug_smp_processor_id(), which does all the checks.
> >>>
> >>> Yes, OK.
> >>> > I believe these checks in debug_smp_processor_id() are only valid for
> >>> > the x86 case (aka cases where they have __smp_processor_id() defined.)
> >>>
> >>> Hmm, I am under the impression that the checks in
> >>> debug_smp_processor_id() are valid regardless of whether the arch
> >>> overrides __smp_processor_id().
> >>
> >> From include/linux/smp.h
> >>
> >> /*
> >>  * Allow the architecture to differentiate between a stable and unstable read.
> >>  * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> >>  * regular asm read for the stable.
> >>  */
> >> #ifndef __smp_processor_id
> >> #define __smp_processor_id(x) raw_smp_processor_id(x)
> >> #endif
> >>
> >> As far as I see, only x86 has a definition of __smp_processor_id.
> >> So for archs like powerpc, __smp_processor_id() is always
> >> defined as raw_smp_processor_id(). Right?
> >
> > Sure, yes.
> >
> >> I would think debug_smp_processor_id() would be useful if
> >> __smp_processor_id() is different from raw_smp_processor_id(). Do note
> >> debug_smp_processor_id() calls raw_smp_processor_id().
>
> Agree.
>
> > I do not think the utility of debug_smp_processor_id() is related to
> > whether the arch defines __smp_processor_id().
> >
> >> Or can I understand how debug_smp_processor_id() is useful if
> >> __smp_processor_id() is defined as raw_smp_processor_id()?
>
> debug_smp_processor_id() is useful on powerpc, as well as other arches,
> because it checks that we're in a context where the processor id won't
> change out from under us.
>
> eg. something like this is unsafe:
>
> 	int counts[NR_CPUS];
> 	int tmp, cpu;
>
> 	cpu = smp_processor_id();
> 	tmp = counts[cpu];
> 	<- preempted here and migrated to another CPU
> 	counts[cpu] = tmp + 1;

If, let's say, the above call was replaced by raw_smp_processor_id(), how
would it avoid the preemption / migration to another CPU?
Replacing it with raw_smp_processor_id() may avoid the debug splat, but the underlying issue would still remain as is. No? > > > So, for powerpc with DEBUG_PREEMPT unset, a call to smp_processor_id() > > expands to __smp_processor_id() which expands to raw_smp_processor_id(), > > avoiding the preempt safety checks. This is working as intended. > > > > For powerpc with DEBUG_PREEMPT=y, a call to smp_processor_id() expands > > to the out of line call to debug_smp_processor_id(), which calls > > raw_smp_processor_id() and performs the checks, warning if called in an > > inappropriate context, as seen here. Also working as intended. > > > > AFAICT __smp_processor_id() is a performance/codegen-oriented hook, and > > not really related to the debug facility. Please see 9ed7d75b2f09d > > ("x86/percpu: Relax smp_processor_id()"). > > Yeah good find. > > cheers -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 3/9] xen/x86: make "earlyprintk=xen" work better for PVH Dom0
On 07.09.21 12:09, Jan Beulich wrote: The xen_hvm_early_write() path better wouldn't be taken in this case; while port 0xE9 can be used, the hypercall path is quite a bit more efficient. Put that first, as it may also work for DomU-s (see also xen_raw_console_write()). While there also bail from the function when the first domU_write_console() failed - later ones aren't going to succeed. Signed-off-by: Jan Beulich Reviewed-by: Juergen Gross Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: OpenPGP public key OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
* Michael Ellerman [2021-09-23 21:17:25]: > Srikar Dronamraju writes: > > * Michael Ellerman [2021-08-26 23:36:53]: > > > >> Srikar Dronamraju writes: > >> > Scheduler expects unique number of node distances to be available at > >> > boot. > ... > > > >> > Fake the offline node's distance_lookup_table entries so that all > >> > possible node distances are updated. > >> > >> Does this work if we have a single node offline at boot? > >> > > > > It should. > > > >> Say we start with: > >> > >> node distances: > >> node 0 1 > >> 0: 10 20 > >> 1: 20 10 > >> > >> And node 2 is offline at boot. We can only initialise that nodes entries > >> in the distance_lookup_table: > >> > >>while (i--) > >>distance_lookup_table[node][i] = node; > >> > >> By filling them all with 2 that causes node_distance(2, X) to return the > >> maximum distance for all other nodes X, because we won't break out of > >> the loop in __node_distance(): > >> > >>for (i = 0; i < distance_ref_points_depth; i++) { > >>if (distance_lookup_table[a][i] == distance_lookup_table[b][i]) > >>break; > >> > >>/* Double the distance for each NUMA level */ > >>distance *= 2; > >>} > >> > >> If distance_ref_points_depth was 4 we'd return 160. > > > > As you already know, distance 10, 20, .. are defined by Powerpc, form1 > > affinity. PAPR doesn't define actual distances, it only provides us the > > associativity. If there are distance_ref_points_depth is 4, > > (distance_ref_points_depth doesn't take local distance into consideration) > > 10, 20, 40, 80, 160. > > > >> > >> That'd leave us with 3 unique distances at boot, 10, 20, 160. > >> > > > > So if there are unique distances, then the distances as per the current > > code has to be 10, 20, 40, 80.. I dont see a way in which we have a break in > > the series. like having 160 without 80. > > I'm confused what you mean there. > At the outset, if we have a better probable solution, do let me know, I am willing to try that too. 
> If we have a node that's offline at boot then we get 160 for that node, > that's just the result of having no info for it, so we never break out > of the for loop. > > So if we have two nodes, one hop apart, and then an offline node we get > 10, 20, 160. > > Or if you're using depth = 3 then it's 10, 20, 80. > My understanding is as below: device-tree provides the max hops by way of ibm,associativity-reference-points. This is mapped to distance_ref_points_depth in Linux-powerpc. Now Linux-powerpc encodes hops as (disregarding local distance) 20, 40, 80, 160, 320 ... So if the distance_ref_points_depth is 3, then the hops are 20, 40, 80. Do you disagree? > >> But when node 2 comes online it might introduce more than 1 new distance > >> value, eg. it could be that the actual distances are: > >> > >> node distances: > >> node 0 1 2 > >> 0: 10 20 40 > >> 1: 20 10 80 > >> 2: 40 80 10 > >> > >> ie. we now have 4 distances, 10, 20, 40, 80. > >> > >> What am I missing? > > > > As I said above, I am not sure how we can have a break in the series. > > If distance_ref_points_depth is 3, the distances has to be 10,20,40,80 as > > atleast for form1 affinity. > > I agree for depth 3 we have to see 10, 20, 40, 80. But nothing > guarantees we see each value (other than 10). The hop distances are not from the device-tree, the device-tree only gives us the max hops possible. Linux-powerpc is actually hard-coding the distances, with each hop distance being 2x the previous. So we may not see any nodes at a particular hop, but we know maximum hops. And if distance_ref_points_depth is 3, then hops are 20, 40, 80 only. > > We can have two nodes one hop apart, so we have 10 and 20, then a third > node is added 3 hops away, so we get 10, 20, 80. > > The real problem is that the third node could be 3 hops from node 0 > and 2 hops from node 1, and so the addition of the third node causes > two new distance values (40 & 80) to be required. So here the max hops as given by device-tree is 3.
So we know that we are looking for a max distance of 80 by way of distance_ref_points_depth. Even if the 3rd node was at 4 hops, we would already know the max distance of 160, by way of distance_ref_points_depth. However, in the unlikely scenario where the number of possible nodes is less than the distance_ref_points_depth (aka max hops) and there are CPUless/memoryless nodes, we may not have initialized the right distances. > > I think maybe what you're saying is that in practice we don't see setups > like that. But I don't know if I'm happy with a solution that doesn't > work in the general case, and relies on the particular properties of our > current set of systems. > But our current set of systems is having a problem (systems can likely crash on adding a CPU to a node). The only other way I can think of is the previous approach where we ask the scheduler hook which tells ho
[PATCH v2 9/9] iommu/vt-d: Use pci core's DVSEC functionality
Reduce maintenance burden of DVSEC query implementation by using the centralized PCI core implementation. Cc: io...@lists.linux-foundation.org Cc: David Woodhouse Cc: Lu Baolu Signed-off-by: Ben Widawsky --- drivers/iommu/intel/iommu.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index d75f59ae28e6..30c97181f0ae 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5398,20 +5398,7 @@ static int intel_iommu_disable_sva(struct device *dev) */ static int siov_find_pci_dvsec(struct pci_dev *pdev) { - int pos; - u16 vendor, id; - - pos = pci_find_next_ext_capability(pdev, 0, 0x23); - while (pos) { - pci_read_config_word(pdev, pos + 4, &vendor); - pci_read_config_word(pdev, pos + 8, &id); - if (vendor == PCI_VENDOR_ID_INTEL && id == 5) - return pos; - - pos = pci_find_next_ext_capability(pdev, pos, 0x23); - } - - return 0; + return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5); } static bool -- 2.33.0
[PATCH v2 8/9] ocxl: Use pci core's DVSEC functionality
Reduce maintenance burden of DVSEC query implementation by using the centralized PCI core implementation. There are two obvious places to simply drop in the new core implementation. There remains find_dvsec_from_pos() which would benefit from using a core implementation. As that change is less trivial it is reserved for later. Cc: linuxppc-dev@lists.ozlabs.org Cc: Andrew Donnellan Acked-by: Frederic Barrat (v1) Signed-off-by: Ben Widawsky --- arch/powerpc/platforms/powernv/ocxl.c | 3 ++- drivers/misc/ocxl/config.c| 13 + 2 files changed, 3 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c index 9105efcf242a..28b009b46464 100644 --- a/arch/powerpc/platforms/powernv/ocxl.c +++ b/arch/powerpc/platforms/powernv/ocxl.c @@ -107,7 +107,8 @@ static int get_max_afu_index(struct pci_dev *dev, int *afu_idx) int pos; u32 val; - pos = find_dvsec_from_pos(dev, OCXL_DVSEC_FUNC_ID, 0); + pos = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, + OCXL_DVSEC_FUNC_ID); if (!pos) return -ESRCH; diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c index a68738f38252..e401a51596b9 100644 --- a/drivers/misc/ocxl/config.c +++ b/drivers/misc/ocxl/config.c @@ -33,18 +33,7 @@ static int find_dvsec(struct pci_dev *dev, int dvsec_id) { - int vsec = 0; - u16 vendor, id; - - while ((vsec = pci_find_next_ext_capability(dev, vsec, - OCXL_EXT_CAP_ID_DVSEC))) { - pci_read_config_word(dev, vsec + OCXL_DVSEC_VENDOR_OFFSET, - &vendor); - pci_read_config_word(dev, vsec + OCXL_DVSEC_ID_OFFSET, &id); - if (vendor == PCI_VENDOR_ID_IBM && id == dvsec_id) - return vsec; - } - return 0; + return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, dvsec_id); } static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx) -- 2.33.0
[PATCH v2 7/9] cxl/pci: Use pci core's DVSEC functionality
Reduce maintenance burden of DVSEC query implementation by using the centralized PCI core implementation. Signed-off-by: Ben Widawsky --- drivers/cxl/pci.c | 20 +--- 1 file changed, 1 insertion(+), 19 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 5eaf2736f779..79d4d9b16d83 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -340,25 +340,7 @@ static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec) { - int pos; - - pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC); - if (!pos) - return 0; - - while (pos) { - u16 vendor, id; - - pci_read_config_word(pdev, pos + PCI_DVSEC_HEADER1, &vendor); - pci_read_config_word(pdev, pos + PCI_DVSEC_HEADER2, &id); - if (vendor == PCI_DVSEC_VENDOR_ID_CXL && dvsec == id) - return pos; - - pos = pci_find_next_ext_capability(pdev, pos, - PCI_EXT_CAP_ID_DVSEC); - } - - return 0; + return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL, dvsec); } static int cxl_probe_regs(struct cxl_mem *cxlm, struct cxl_register_map *map) -- 2.33.0
[PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC
Add pci_find_dvsec_capability to locate a Designated Vendor-Specific Extended Capability with the specified DVSEC ID. The Designated Vendor-Specific Extended Capability (DVSEC) allows one or more vendor specific capabilities that aren't tied to the vendor ID of the PCI component. DVSEC is critical for both the Compute Express Link (CXL) driver as well as the driver for OpenCAPI coherent accelerator (OCXL). Cc: David E. Box Cc: Jonathan Cameron Cc: Bjorn Helgaas Cc: Dan Williams Cc: linux-...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Andrew Donnellan Cc: Lu Baolu Reviewed-by: Frederic Barrat Signed-off-by: Ben Widawsky --- drivers/pci/pci.c | 32 include/linux/pci.h | 1 + 2 files changed, 33 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index ce2ab62b64cf..94ac86ff28b0 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -732,6 +732,38 @@ u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap) } EXPORT_SYMBOL_GPL(pci_find_vsec_capability); +/** + * pci_find_dvsec_capability - Find DVSEC for vendor + * @dev: PCI device to query + * @vendor: Vendor ID to match for the DVSEC + * @dvsec: Designated Vendor-specific capability ID + * + * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability + * offset in config space; otherwise return 0. 
+ */ +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec) +{ + int pos; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC); + if (!pos) + return 0; + + while (pos) { + u16 v, id; + + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &v); + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &id); + if (vendor == v && dvsec == id) + return pos; + + pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC); + } + + return 0; +} +EXPORT_SYMBOL_GPL(pci_find_dvsec_capability); + /** * pci_find_parent_resource - return resource region of parent bus of given * region diff --git a/include/linux/pci.h b/include/linux/pci.h index cd8aa6fce204..c93ccfa4571b 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1130,6 +1130,7 @@ u16 pci_find_ext_capability(struct pci_dev *dev, int cap); u16 pci_find_next_ext_capability(struct pci_dev *dev, u16 pos, int cap); struct pci_bus *pci_find_next_bus(const struct pci_bus *from); u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap); +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec); u64 pci_get_dsn(struct pci_dev *dev); -- 2.33.0
[PATCH v2 5/9] cxl/pci: Make more use of cxl_register_map
The structure exists to pass around information about register mapping. Using it more extensively cleans up many existing functions. Signed-off-by: Ben Widawsky --- drivers/cxl/cxl.h | 1 + drivers/cxl/pci.c | 36 +--- 2 files changed, 18 insertions(+), 19 deletions(-) diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 7d6b011dd963..3b128ce71564 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -140,6 +140,7 @@ struct cxl_device_reg_map { }; struct cxl_register_map { + void __iomem *base; u64 block_offset; u8 reg_type; u8 barno; diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index bbbacbc94fbf..5eaf2736f779 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -306,35 +306,36 @@ static int cxl_pci_setup_mailbox(struct cxl_mem *cxlm) return 0; } -static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, - u8 bar, u64 offset) +static int cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) { - void __iomem *addr; + int bar = map->barno; struct device *dev = cxlm->dev; struct pci_dev *pdev = to_pci_dev(dev); + resource_size_t offset = map->block_offset; /* Basic sanity check that BAR is big enough */ if (pci_resource_len(pdev, bar) < offset) { dev_err(dev, "BAR%d: %pr: too small (offset: %#llx)\n", bar, &pdev->resource[bar], (unsigned long long)offset); - return IOMEM_ERR_PTR(-ENXIO); + return -ENXIO; } - addr = pci_iomap(pdev, bar, 0); - if (!addr) { + map->base = pci_iomap(pdev, bar, 0); + if (!map->base) { dev_err(dev, "failed to map registers\n"); - return addr; + return PTR_ERR(map->base); } dev_dbg(dev, "Mapped CXL Memory Device resource bar %u @ %#llx\n", bar, offset); - return addr; + return 0; } -static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, void __iomem *base) +static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) { - pci_iounmap(to_pci_dev(cxlm->dev), base); + pci_iounmap(to_pci_dev(cxlm->dev), map->base); + map->base = 0; } static int cxl_pci_dvsec(struct pci_dev 
*pdev, int dvsec) @@ -360,9 +361,9 @@ static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec) return 0; } -static int cxl_probe_regs(struct cxl_mem *cxlm, void __iomem *base, - struct cxl_register_map *map) +static int cxl_probe_regs(struct cxl_mem *cxlm, struct cxl_register_map *map) { + void __iomem *base = map->base + map->block_offset; struct cxl_component_reg_map *comp_map; struct cxl_device_reg_map *dev_map; struct device *dev = cxlm->dev; @@ -487,7 +488,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) for (i = 0; i < ARRAY_SIZE(types); i++) { struct cxl_register_map map; - void __iomem *base; rc = find_register_block(pdev, types[i], &map); if (rc) { @@ -498,14 +498,12 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) break; } - base = cxl_pci_map_regblock(cxlm, map.barno, map.block_offset); - if (!base) { - rc = -ENOMEM; + rc = cxl_pci_map_regblock(cxlm, &map); + if (rc) break; - } - rc = cxl_probe_regs(cxlm, base + map.block_offset, &map); - cxl_pci_unmap_regblock(cxlm, base); + rc = cxl_probe_regs(cxlm, &map); + cxl_pci_unmap_regblock(cxlm, &map); if (rc) break; -- 2.33.0
[PATCH v2 4/9] cxl/pci: Refactor cxl_pci_setup_regs
In preparation for moving parts of register mapping to cxl_core, the cxl_pci driver is refactored to utilize a new helper to find register blocks by type. Previously, cxl_pci scanned through all register blocks and mapped the ones that the driver will use. This logic is inverted so that the driver specifically requests the register blocks from a new helper. Under the hood, the same implementation of scanning through all register locator DVSEC entries exists. There are 2 behavioral changes (#2 is arguable): 1. A dev_err is introduced if cxl_map_regs fails. 2. The previous logic would try to map component registers and device registers multiple times if they were present, and keep the mapping of the last one found (furthest offset in the register locator). While this is disallowed by the spec (CXL 2.0 8.1.9: "Each register block identifier shall only occur once in the Register Locator DVSEC structure"), it was how the driver would respond to the spec violation. The new logic will take the first found register block by type and move on.
Signed-off-by: Ben Widawsky --- Changes since v1: --- drivers/cxl/pci.c | 126 +- 1 file changed, 70 insertions(+), 56 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 7256c236fdb3..bbbacbc94fbf 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -428,6 +428,45 @@ static void cxl_decode_register_block(u32 reg_lo, u32 reg_hi, *reg_type = FIELD_GET(CXL_REGLOC_RBI_MASK, reg_lo); } +static int find_register_block(struct pci_dev *pdev, enum cxl_regloc_type type, + struct cxl_register_map *map) +{ + int regloc, i, rc = -ENODEV; + u32 regloc_size, regblocks; + + regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC_DVSEC_ID); + if (!regloc) + return -ENXIO; + + pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, ®loc_size); + regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size); + + regloc += PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET; + regblocks = (regloc_size - PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET) / 8; + + for (i = 0; i < regblocks; i++, regloc += 8) { + u32 reg_lo, reg_hi; + u8 reg_type, bar; + u64 offset; + + pci_read_config_dword(pdev, regloc, ®_lo); + pci_read_config_dword(pdev, regloc + 4, ®_hi); + + cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset, + ®_type); + + if (reg_type == type) { + map->barno = bar; + map->block_offset = offset; + map->reg_type = reg_type; + rc = 0; + break; + } + } + + return rc; +} + /** * cxl_pci_setup_regs() - Setup necessary MMIO. * @cxlm: The CXL memory device to communicate with. 
@@ -440,69 +479,44 @@ static void cxl_decode_register_block(u32 reg_lo, u32 reg_hi, */ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) { - void __iomem *base; - u32 regloc_size, regblocks; - int regloc, i, n_maps, ret = 0; + int rc, i; struct device *dev = cxlm->dev; struct pci_dev *pdev = to_pci_dev(dev); - struct cxl_register_map *map, maps[CXL_REGLOC_RBI_TYPES]; + const enum cxl_regloc_type types[] = { CXL_REGLOC_RBI_MEMDEV, + CXL_REGLOC_RBI_COMPONENT }; - regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC_DVSEC_ID); - if (!regloc) { - dev_err(dev, "register location dvsec not found\n"); - return -ENXIO; - } + for (i = 0; i < ARRAY_SIZE(types); i++) { + struct cxl_register_map map; + void __iomem *base; - /* Get the size of the Register Locator DVSEC */ - pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, ®loc_size); - regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size); - - regloc += PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET; - regblocks = (regloc_size - PCI_DVSEC_ID_CXL_REGLOC_BLOCK1_OFFSET) / 8; - - for (i = 0, n_maps = 0; i < regblocks; i++, regloc += 8) { - u32 reg_lo, reg_hi; - u8 reg_type; - u64 offset; - u8 bar; - - pci_read_config_dword(pdev, regloc, ®_lo); - pci_read_config_dword(pdev, regloc + 4, ®_hi); - - cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset, - ®_type); - - /* Ignore unknown register block types */ - if (reg_type > CXL_REGLOC_RBI_MEMDEV) - continue; - - base = cxl_pci_map_regblock(cxlm, bar, offset); - if (!base) - return -ENOMEM; - - map = &maps[n_maps]; -
[PATCH v2 3/9] cxl/pci: Remove pci request/release regions
Quoting Dan, "... the request + release regions should probably just be dropped. It's not like any of the register enumeration would collide with someone else who already has the registers mapped. The collision only comes when the registers are mapped for their final usage, and that will have more precision in the request." Recommended-by: Dan Williams Signed-off-by: Ben Widawsky --- drivers/cxl/pci.c | 5 - 1 file changed, 5 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index ccc7c2573ddc..7256c236fdb3 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -453,9 +453,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) return -ENXIO; } - if (pci_request_mem_regions(pdev, pci_name(pdev))) - return -ENODEV; - /* Get the size of the Register Locator DVSEC */ pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, ®loc_size); regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size); @@ -499,8 +496,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) n_maps++; } - pci_release_mem_regions(pdev); - for (i = 0; i < n_maps; i++) { ret = cxl_map_regs(cxlm, &maps[i]); if (ret) -- 2.33.0
[PATCH v2 2/9] cxl/pci: Remove dev_dbg for unknown register blocks
While interesting to driver developers, the dev_dbg message doesn't do much except clutter up logs. This information should be attainable through sysfs, and someday lspci like utilities. This change additionally helps reduce the LOC in a subsequent patch to refactor some of cxl_pci register mapping. Signed-off-by: Ben Widawsky --- drivers/cxl/pci.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 64180f46c895..ccc7c2573ddc 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -475,9 +475,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm) cxl_decode_register_block(reg_lo, reg_hi, &bar, &offset, ®_type); - dev_dbg(dev, "Found register block in bar %u @ 0x%llx of type %u\n", - bar, offset, reg_type); - /* Ignore unknown register block types */ if (reg_type > CXL_REGLOC_RBI_MEMDEV) continue; -- 2.33.0
[PATCH v2 1/9] cxl: Convert "RBI" to enum
In preparation for passing around the Register Block Indicator (RBI) as a parameter, it is desirable to convert the type to an enum so that the interface can use a well-defined, type-checked parameter. As a result of this change, should future versions of the spec add sparsely defined identifiers, it could become a problem when checking for invalid identifiers, since the code currently checks against the max identifier. This is not an issue with the current spec, and the algorithm to obtain the register blocks will change before those possible additions are made. Signed-off-by: Ben Widawsky --- drivers/cxl/pci.h | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/cxl/pci.h b/drivers/cxl/pci.h index 8c1a58813816..7d3e4bf06b45 100644 --- a/drivers/cxl/pci.h +++ b/drivers/cxl/pci.h @@ -20,13 +20,15 @@ #define CXL_REGLOC_BIR_MASK GENMASK(2, 0) /* Register Block Identifier (RBI) */ -#define CXL_REGLOC_RBI_MASK GENMASK(15, 8) -#define CXL_REGLOC_RBI_EMPTY 0 -#define CXL_REGLOC_RBI_COMPONENT 1 -#define CXL_REGLOC_RBI_VIRT 2 -#define CXL_REGLOC_RBI_MEMDEV 3 -#define CXL_REGLOC_RBI_TYPES CXL_REGLOC_RBI_MEMDEV + 1 +enum cxl_regloc_type { + CXL_REGLOC_RBI_EMPTY = 0, + CXL_REGLOC_RBI_COMPONENT, + CXL_REGLOC_RBI_VIRT, + CXL_REGLOC_RBI_MEMDEV, + CXL_REGLOC_RBI_TYPES +}; +#define CXL_REGLOC_RBI_MASK GENMASK(15, 8) #define CXL_REGLOC_ADDR_MASK GENMASK(31, 16) #endif /* __CXL_PCI_H__ */ -- 2.33.0
[PATCH v2 0/9] cxl_pci refactor for reusability
Changes since v1 [3]: - get_max_afu_index() updated to new interface (Dan) - siov_find_pci_dvsec updated to new interface (Dan) - remove unnecessary pci request/release regions with refactor (Dan) - move @base into cxl_register_map (Dan) - Expanded the Cc list to include other users of PCIe DVSEC. (Dan) Three spots are not converted to use the new PCI core DVSEC helper as the changes are non-trivial. - find_dvsec_afu_ctrl (ocxl) - pmt_pci_probe (David) - intel_uncore_has_discovery_tables (Kan) The interdiff is below. A range-diff can be generated against v1[3] if desired. Discussion occurred partially offlist between Dan and myself. To summarize, Dan contends that patch 4, "cxl/pci: Refactor cxl_pci_setup_regs" should either be rewritten, or have a precursor patch so that cxl_pci will still scan all register blocks, but only claim the specified types. While the current diff is somewhat complex, I contend that this is unneeded churn because we can easily test this change in our existing regression testing. It also ultimately results in a simpler function in the cxl_port patch series when cxl_pci stops trying to map the component register block (loop is removed entirely). A tiebreak here would be great :-) --- Original commit message: Provide the ability to obtain CXL register blocks as discrete functionality. This functionality will become useful for other CXL drivers that need access to CXL register blocks. It is also in line with other additions to core which move register mapping functionality. At the introduction of the CXL driver the only user of CXL MMIO was cxl_pci (then known as cxl_mem). As the driver has evolved it is clear that cxl_pci will not be the only entity that needs access to CXL MMIO. This series stops short of moving the generalized functionality into cxl_core for the sake of getting eyes on the important foundational bits sooner rather than later. The ultimate plan is to move much of the code into cxl_core.
Via review of two previous patches [1] & [2] it has been suggested that the bits which are being used for DVSEC enumeration move into PCI core. As CXL core is soon going to require these, let's try to get the ball rolling now on making that happen. --- [1]: https://lore.kernel.org/linux-pci/20210913190131.xiiszmno46qie...@intel.com/ [2]: https://lore.kernel.org/linux-cxl/20210920225638.1729482-1-ben.widaw...@intel.com/ [3]: https://lore.kernel.org/linux-cxl/20210921220459.2437386-1-ben.widaw...@intel.com/ Ben Widawsky (9): cxl: Convert "RBI" to enum cxl/pci: Remove dev_dbg for unknown register blocks cxl/pci: Remove pci request/release regions cxl/pci: Refactor cxl_pci_setup_regs cxl/pci: Make more use of cxl_register_map PCI: Add pci_find_dvsec_capability to find designated VSEC cxl/pci: Use pci core's DVSEC functionality ocxl: Use pci core's DVSEC functionality iommu/vt-d: Use pci core's DVSEC functionality arch/powerpc/platforms/powernv/ocxl.c | 3 +- drivers/cxl/cxl.h | 1 + drivers/cxl/pci.c | 176 -- drivers/cxl/pci.h | 14 +- drivers/iommu/intel/iommu.c | 15 +-- drivers/misc/ocxl/config.c| 13 +- drivers/pci/pci.c | 32 + include/linux/pci.h | 1 + 8 files changed, 127 insertions(+), 128 deletions(-) Interdiff against v1: diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c index 9105efcf242a..28b009b46464 100644 --- a/arch/powerpc/platforms/powernv/ocxl.c +++ b/arch/powerpc/platforms/powernv/ocxl.c @@ -107,7 +107,8 @@ static int get_max_afu_index(struct pci_dev *dev, int *afu_idx) int pos; u32 val; - pos = find_dvsec_from_pos(dev, OCXL_DVSEC_FUNC_ID, 0); + pos = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_IBM, + OCXL_DVSEC_FUNC_ID); if (!pos) return -ESRCH; diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 7d6b011dd963..3b128ce71564 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -140,6 +140,7 @@ struct cxl_device_reg_map { }; struct cxl_register_map { + void __iomem *base; u64 block_offset; u8 reg_type; u8 
barno; diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 040379f727ad..79d4d9b16d83 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -306,9 +306,8 @@ static int cxl_pci_setup_mailbox(struct cxl_mem *cxlm) return 0; } -static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) +static int cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_register_map *map) { - void __iomem *addr; int bar = map->barno; struct device *dev = cxlm->dev; struct pci_dev *pdev = to_pci_dev(dev); @@ -318,24 +317,25 @@ static void __iomem *cxl_pci_map_regblock(struct cxl_mem *cxlm, struct cxl_regis if (pci_resource_len(pde
[Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
https://bugzilla.kernel.org/show_bug.cgi?id=213837 --- Comment #9 from Erhard F. (erhar...@mailbox.org) --- Created attachment 298933 --> https://bugzilla.kernel.org/attachment.cgi?id=298933&action=edit System.map (5.15-rc2 + patch, PowerMac G5 11,2) (In reply to mpe from comment #8) > So it looks like you have actually overran your stack, rather than > something else clobbering your stack. > > Can you attach your System.map for that exact kernel? We might be able > to work out what functions we were in when we overran. > > You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep > the system running a bit longer and give us some other clues. > > cheers Hm, interesting... What I do to trigger this bug is building llvm-12 on the G5 via distcc (on the other side is a 16-core Opteron) and MAKEOPTS="-j10 -l3". As the G5 got 16 GiB RAM building runs in a zstd-compressed ext2 filesystem (/sbin/zram-init -d1 -s2 -azstd -text2 -orelatime -m1777 -Lvar_tmp_dir 49152 /var/tmp). Most of the time the bug is triggered very shortly after the actual building starts via meson. At this time the build directory /var/tmp/portage occupies about 800 MiB. Also sometimes I don't get a proper stack trace via netconsole but this: BUG: unable to handle kernel data access on write at 0xc00037c82040 BUG: unable to handle kernel data access on write at 0xc00037c8 Please find the relevant System.map attached. I'll do another kernel build with CONFIG_THREAD_SHIFT=15 and see if anything changes. Thanks for investigating this! -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug. You are watching someone on the CC list of the bug.
Re: [PATCH 0/2] kvm: fix KVM_MAX_VCPU_ID handling
On 13/09/21 15:57, Juergen Gross wrote: Revert commit 76b4f357d0e7d8f6f00 which was based on wrong reasoning and rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS in order to avoid the same issue in future. Juergen Gross (2): x86/kvm: revert commit 76b4f357d0e7d8f6f00 kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS Documentation/virt/kvm/devices/xics.rst| 2 +- Documentation/virt/kvm/devices/xive.rst| 2 +- arch/mips/kvm/mips.c | 2 +- arch/powerpc/include/asm/kvm_book3s.h | 2 +- arch/powerpc/include/asm/kvm_host.h| 4 ++-- arch/powerpc/kvm/book3s_xive.c | 2 +- arch/powerpc/kvm/powerpc.c | 2 +- arch/x86/include/asm/kvm_host.h| 2 +- arch/x86/kvm/ioapic.c | 2 +- arch/x86/kvm/ioapic.h | 4 ++-- arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 4 ++-- tools/testing/selftests/kvm/kvm_create_max_vcpus.c | 2 +- virt/kvm/kvm_main.c| 2 +- 14 files changed, 17 insertions(+), 17 deletions(-) Queued, thanks. Paolo
Re: [PATCH 0/3] memblock: cleanup memblock_free interface
On Thu, Sep 23, 2021 at 12:43 AM Mike Rapoport wrote: > > The core change is in the third patch that makes memblock_free() a > counterpart of memblock_alloc() and adds memblock_phys_alloc() to be a ^^^ > counterpart of memblock_phys_alloc(). That should be 'memblock_phys_free()' HOWEVER. The real reason I'm replying is that this patch is horribly buggy, and will cause subtle problems that are nasty to debug. You need to be a LOT more careful. From a trivial check - exactly because I looked at doing it with a script, and decided it's not so easy - I found cases like this: - memblock_free(__pa(paca_ptrs) + new_ptrs_size, + memblock_free(paca_ptrs + new_ptrs_size, which is COMPLETELY wrong. Why? Because now that addition is done as _pointer_ addition, not as an integer addition, and the end result is something completely different. paca_ptrs is of type 'struct paca_struct **', so when you add new_ptrs_size to it, it will add it in terms of that many pointers, not that many bytes. You need to use some smarter scripting, or some way to validate it. And no, making the scripting just replace '__pa(x)' with '(void *)(x)' - which _would_ be mindless and get the same result - is not acceptable either, because it avoids one of the big improvements from using the right interface, namely having compiler type checking (and saner code that people understand). So NAK. No broken automated scripting patches. Linus
[PATCH] KVM: PPC: Book3S HV: Use GLOBAL_TOC for kvmppc_h_set_dabr/xdabr()
kvmppc_h_set_dabr(), and kvmppc_h_set_xdabr() which jumps into it, need to use _GLOBAL_TOC to set up the kernel TOC pointer, because kvmppc_h_set_dabr() uses LOAD_REG_ADDR() to load dawr_force_enable. When called from hcall_try_real_mode() we have the kernel TOC in r2, established near the start of kvmppc_interrupt_hv(), so there is no issue. But they can also be called from kvmppc_pseries_do_hcall() which is module code, so the access ends up happening with the kvm-hv module's r2, which will not point at dawr_force_enable and could even cause a fault. With the current code layout and compilers we haven't observed a fault in practice, the load hits somewhere in kvm-hv.ko and silently returns some bogus value. Note that we expect p8/p9 guests to use the DAWR, but SLOF uses h_set_dabr() to test if sc1 works correctly, see SLOF's lib/libhvcall/brokensc1.c. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Ellerman --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 90484425a1e6..30a8a07cff18 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1999,7 +1999,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) .globl hcall_real_table_end hcall_real_table_end: -_GLOBAL(kvmppc_h_set_xdabr) +_GLOBAL_TOC(kvmppc_h_set_xdabr) EXPORT_SYMBOL_GPL(kvmppc_h_set_xdabr) andi. r0, r5, DABRX_USER | DABRX_KERNEL beq 6f @@ -2009,7 +2009,7 @@ EXPORT_SYMBOL_GPL(kvmppc_h_set_xdabr) 6: li r3, H_PARAMETER blr -_GLOBAL(kvmppc_h_set_dabr) +_GLOBAL_TOC(kvmppc_h_set_dabr) EXPORT_SYMBOL_GPL(kvmppc_h_set_dabr) li r5, DABRX_USER | DABRX_KERNEL 3: -- 2.25.1
[Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
https://bugzilla.kernel.org/show_bug.cgi?id=213837 --- Comment #8 from m...@ellerman.id.au --- bugzilla-dae...@bugzilla.kernel.org writes: > https://bugzilla.kernel.org/show_bug.cgi?id=213837 > > --- Comment #7 from Erhard F. (erhar...@mailbox.org) --- > Created attachment 298919 > --> https://bugzilla.kernel.org/attachment.cgi?id=298919&action=edit > dmesg (5.15-rc2 + patch, PowerMac G5 11,2) > > (In reply to mpe from comment #6) >> Can you try this patch, it might help us work out what is corrupting the >> stack. > With your patch applied to recent v5.15-rc2 the output looks like this: > > [...] > stack corrupted? stack end = 0xc00029fdc000 > stack: c00029fdbc00: 5a5a5a5a 5a5a5a5a > ... > Can't make much sense out of it but hopefully you can. ;) Thanks. Obvious isn't it? ;) stack corrupted? stack end = 0xc00029fdc000 stack: c00029fdbc00: 5a5a5a5a 5a5a5a5a stack: c00029fdbc10: 0ddc 7c10 |... stack: c00029fdbc20: 29fc4e41 673d4bb3 5a5a5a5a 5a5a5a5a ).NAg=K. stack: c00029fdbc30: 0ddc 8e10 stack: c00029fdbc40: 41fc4e41 673d41a3 A.NAg=A. stack: c00029fdbc50: 5a5a5a5a 5a5a5a5a stack: c00029fdbc60: 0ddc 8e0c stack: c00029fdbc70: 79fc4e41 673d4dab 5a5a5a5a 5a5a5a5a y.NAg=M. stack: c00029fdbc80: 0ddc 9008 stack: c00029fdbc90: 91fc4e41 673d4573 ..NAg=Es stack: c00029fdbca0: 5a5a5a5a 5a5a5a5a stack: c00029fdbcb0: 0dd7 ac16 stack: c00029fdbcc0: c9fc4e41 673d4203 5a5a5a5a 5a5a5a5a ..NAg=B. stack: c00029fdbcd0: 0ddc 6c04 l... stack: c00029fdbce0: e1fc4e41 673d474b ..NAg=GK stack: c00029fdbcf0: 5a5a5a5a 5a5a5a5a stack: c00029fdbd00: 0ddc 8800 stack: c00029fdbd10: 19fd4e41 673d4143 5a5a5a5a 5a5a5a5a ..NAg=AC stack: c00029fdbd20: 0ddb 6c0e l... stack: c00029fdbd30: 31fd4e41 673d4f43 1.NAg=OC stack: c00029fdbd40: 5a5a5a5a 5a5a5a5a stack: c00029fdbd50: 0ddc 8e08 stack: c00029fdbd60: 69fd4e41 673d407b 5a5a5a5a 5a5a5a5a i.NAg=@{ stack: c00029fdbd70: 0ddc 9208 stack: c00029fdbd80: 81fd4e41 673d4633 ..NAg=F3 stack: c00029fdbd90: 5a5a5a5a 5a5a5a5a stack: c00029fdbda0: 0ddb 4218 B... 
stack: c00029fdbdb0: b9fd4e41 673d42fb 5a5a5a5a 5a5a5a5a ..NAg=B. stack: c00029fdbdc0: 0ddc 7e18 ~... stack: c00029fdbdd0: d1fd4e41 673d4a1b ..NAg=J. stack: c00029fdbde0: 5a5a5a5a 5a5a5a5a stack: c00029fdbdf0: 0ddc 8e04 stack: c00029fdbe00: 09fe4e41 673d4ee3 5a5a5a5a 5a5a5a5a ..NAg=N. stack: c00029fdbe10: 0dd9 721c r... stack: c00029fdbe20: 21fe4e41 673d4fa3 !.NAg=O. That's slab data. It's not clear what the actual data is, but because you booted with slub_debug=FZP we can see the red zones and poison. The 0xcc bytes are SLUB_RED_ACTIVE, and 5a5a5a5a is POISON_INUSE (see poison.h) stack: c00029fdbe30: c000 29fdbeb0 )... But then here we have an obvious pointer (big endian FTW). And it points nearby, just slightly higher in memory, so that looks suspiciously like a stack back chain pointer. There are more similar values if you look further. But we shouldn't be seeing the stack yet, it's meant to start (end) at c00029fdc000 ... stack: c00029fdbe40: 0ddc 9400 stack: c00029fdbe50: 59fe4e41 673d4933 5a5a5a5a 5a5a5a5a Y.NAg=I3 stack: c00029fdbe60: 0dd9 6024 `..$ stack: c00029fdbe70: 71fe4e41 673d416b q.NAg=Ak stack: c00029fdbe80: 5a5a5a5a 5a5a5a5a stack: c00029fdbe90: 0ddc 600c `... stack: c00029fdbea0: c000 29fdbf20 0002 ).. stack: c00029fdbeb0: c000 29fdbf30 0ddc 7e1c )..0~... <---
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 23/09/2021 at 14:01, Mike Rapoport wrote: On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote: On 23/09/2021 at 09:43, Mike Rapoport wrote: From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport --- diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 1a04e5bdf655..37826d8c4f74 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void) /* Get the CPU registers */ smp_save_cpu_regs(sa, addr, is_boot_cpu, page); } - memblock_free(page, PAGE_SIZE); + memblock_phys_free(page, PAGE_SIZE); diag_amode31_ops.diag308_reset(); pcpu_set_smt(0); } @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void) /* Add CPUs present at boot */ __smp_rescan_cpus(info, true); - memblock_free_early((unsigned long)info, sizeof(*info)); + memblock_free(info, sizeof(*info)); } /* I'm a bit lost. IIUC memblock_free_early() and memblock_free() were identical. Yes, they were, but all calls to memblock_free_early() were using __pa(vaddr) because they had a virtual address at hand. I'm still not following. In the above memblock_free_early() was taking (unsigned long)info. Was it a bug? It looks odd to hide bug fixes in such a big patch, should that bug fix go in patch 2?
In the first hunk memblock_free() gets replaced by memblock_phys_free() In the second hunk memblock_free_early() gets replaced by memblock_free() In the first hunk the memory is allocated with memblock_phys_alloc() and we have a physical range to free. In the second hunk the memory is allocated with memblock_alloc() and we are freeing a virtual pointer. I think it would be easier to follow if you could split it in several patches: It was an explicit request from Linus to make it a single commit: but the actual commit can and should be just a single commit that just fixes 'memblock_free()' to have sane interfaces. I don't feel strongly about splitting it (except my laziness really objects), but I don't think doing the conversion in several steps worth the churn. The commit is quite big (55 files changed, approx 100 lines modified). If done in the right order the change should be minimal. It is rather not easy to follow and review when an existing function (namely memblock_free()) disappears and re-appears in the same commit but to do something different. You do: - memblock_free() ==> memblock_phys_free() - memblock_free_ptr() ==> memblock_free() At least you could split it in two patches; the advantage would be that between the first and second patch memblock_free() doesn't exist anymore, so you can check you really don't have any more users. - First patch: Create memblock_phys_free() and change all relevant memblock_free() to memblock_phys_free() - Or change memblock_free() to memblock_phys_free() and make memblock_free() an alias of it.
- Second patch: Make memblock_free_ptr() become memblock_free() and change all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr)) becomes memblock_free(ptr)) and make memblock_free_ptr() an alias of memblock_free() - Third patch: Replace and drop memblock_free_ptr() - Fourth patch: Drop memblock_free_early() and memblock_free_early_nid() (All users should have been upgraded to memblock_phys_free() in patch 1 or memblock_free() in patch 2) Christophe
Re: [PATCH v2 1/2] powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
On 23/09/2021 15:21, Michael Ellerman wrote: > Krzysztof Kozlowski writes: >> g5_phy_disable_cpu1() is used outside of platforms/powermac/feature.c, >> so it should have a declaration to fix W=1 warning: >> >> arch/powerpc/platforms/powermac/feature.c:1533:6: >> error: no previous prototype for ‘g5_phy_disable_cpu1’ >> [-Werror=missing-prototypes] >> >> Signed-off-by: Krzysztof Kozlowski >> >> --- >> >> Changes since v1: >> 1. Drop declaration in powermac/smp.c > > Sorry I missed v1, so I'm going to ask for more changes :} > >> arch/powerpc/include/asm/pmac_feature.h | 4 > > Putting it here exposes it to the whole kernel, but it's only needed > inside arch/powerpc/platforms/powermac. > > The right place for the prototype is arch/powerpc/platforms/powermac/pmac.h, > which is for platform internal prototypes. I'll fix it up. > >> arch/powerpc/platforms/powermac/smp.c | 2 -- >> 2 files changed, 4 insertions(+), 2 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/pmac_feature.h >> b/arch/powerpc/include/asm/pmac_feature.h >> index e08e829261b6..7703e5bf1203 100644 >> --- a/arch/powerpc/include/asm/pmac_feature.h >> +++ b/arch/powerpc/include/asm/pmac_feature.h >> @@ -143,6 +143,10 @@ >> */ >> struct device_node; >> >> +#ifdef CONFIG_PPC64 >> +void g5_phy_disable_cpu1(void); >> +#endif /* CONFIG_PPC64 */ > > The ifdef around the prototype doesn't gain much, and is extra visual > noise, so I'd rather without it. Sure. Best regards, Krzysztof
Re: [PATCH v2 1/2] powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
Krzysztof Kozlowski writes: > g5_phy_disable_cpu1() is used outside of platforms/powermac/feature.c, > so it should have a declaration to fix W=1 warning: > > arch/powerpc/platforms/powermac/feature.c:1533:6: > error: no previous prototype for ‘g5_phy_disable_cpu1’ > [-Werror=missing-prototypes] > > Signed-off-by: Krzysztof Kozlowski > > --- > > Changes since v1: > 1. Drop declaration in powermac/smp.c Sorry I missed v1, so I'm going to ask for more changes :} > arch/powerpc/include/asm/pmac_feature.h | 4 Putting it here exposes it to the whole kernel, but it's only needed inside arch/powerpc/platforms/powermac. The right place for the prototype is arch/powerpc/platforms/powermac/pmac.h, which is for platform internal prototypes. > arch/powerpc/platforms/powermac/smp.c | 2 -- > 2 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/pmac_feature.h > b/arch/powerpc/include/asm/pmac_feature.h > index e08e829261b6..7703e5bf1203 100644 > --- a/arch/powerpc/include/asm/pmac_feature.h > +++ b/arch/powerpc/include/asm/pmac_feature.h > @@ -143,6 +143,10 @@ > */ > struct device_node; > > +#ifdef CONFIG_PPC64 > +void g5_phy_disable_cpu1(void); > +#endif /* CONFIG_PPC64 */ The ifdef around the prototype doesn't gain much, and is extra visual noise, so I'd rather without it. cheers
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 23.09.21 09:43, Mike Rapoport wrote: From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport arch/x86/xen/ parts: Reviewed-by: Juergen Gross Juergen
Re: [PATCH 2/3] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer
On 23.09.21 09:43, Mike Rapoport wrote: From: Mike Rapoport free_p2m_page() wrongly passes a virtual pointer to memblock_free() that treats it as a physical address. Call memblock_free_ptr() instead that gets a virtual address to free the memory. Signed-off-by: Mike Rapoport Reviewed-by: Juergen Gross Juergen
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On Thu, Sep 23, 2021 at 11:47:48AM +0200, Christophe Leroy wrote: > > > Le 23/09/2021 à 09:43, Mike Rapoport a écrit : > > From: Mike Rapoport > > > > For ages memblock_free() interface dealt with physical addresses even > > despite the existence of memblock_alloc_xx() functions that return a > > virtual pointer. > > > > Introduce memblock_phys_free() for freeing physical ranges and repurpose > > memblock_free() to free virtual pointers to make the following pairing > > abundantly clear: > > > > int memblock_phys_free(phys_addr_t base, phys_addr_t size); > > phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); > > > > void *memblock_alloc(phys_addr_t size, phys_addr_t align); > > void memblock_free(void *ptr, size_t size); > > > > Replace intermediate memblock_free_ptr() with memblock_free() and drop > > unnecessary aliases memblock_free_early() and memblock_free_early_nid(). > > > > Suggested-by: Linus Torvalds > > Signed-off-by: Mike Rapoport > > --- > > > diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c > > index 1a04e5bdf655..37826d8c4f74 100644 > > --- a/arch/s390/kernel/smp.c > > +++ b/arch/s390/kernel/smp.c > > @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void) > > /* Get the CPU registers */ > > smp_save_cpu_regs(sa, addr, is_boot_cpu, page); > > } > > - memblock_free(page, PAGE_SIZE); > > + memblock_phys_free(page, PAGE_SIZE); > > diag_amode31_ops.diag308_reset(); > > pcpu_set_smt(0); > > } > > @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void) > > /* Add CPUs present at boot */ > > __smp_rescan_cpus(info, true); > > - memblock_free_early((unsigned long)info, sizeof(*info)); > > + memblock_free(info, sizeof(*info)); > > } > > /* > > I'm a bit lost. IIUC memblock_free_early() and memblock_free() where > identical. Yes, they were, but all calls to memblock_free_early() were using __pa(vaddr) because they had a virtual address at hand. 
> In the first hunk memblock_free() gets replaced by memblock_phys_free() > In the second hunk memblock_free_early() gets replaced by memblock_free() In the first hunk the memory is allocated with memblock_phys_alloc() and we have a physical range to free. In the second hunk the memory is allocated with memblock_alloc() and we are freeing a virtual pointer. > I think it would be easier to follow if you could split it in several > patches: It was an explicit request from Linus to make it a single commit: but the actual commit can and should be just a single commit that just fixes 'memblock_free()' to have sane interfaces. I don't feel strongly about splitting it (except my laziness really objects), but I don't think doing the conversion in several steps worth the churn. > - First patch: Create memblock_phys_free() and change all relevant > memblock_free() to memblock_phys_free() - Or change memblock_free() to > memblock_phys_free() and make memblock_free() an alias of it. > - Second patch: Make memblock_free_ptr() become memblock_free() and change > all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr)) > becomes memblock_free(ptr) and make memblock_free_ptr() an alias of > memblock_free() > - Fourth patch: Replace and drop memblock_free_ptr() > - Fifth patch: Drop memblock_free_early() and memblock_free_early_nid() (All > users should have been upgraded to memblock_free_phys() in patch 1 or > memblock_free() in patch 2) > > Christophe -- Sincerely yours, Mike.
Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
Srikar Dronamraju writes: > * Michael Ellerman [2021-08-26 23:36:53]: > >> Srikar Dronamraju writes: >> > Scheduler expects unique number of node distances to be available at >> > boot. ... > >> > Fake the offline node's distance_lookup_table entries so that all >> > possible node distances are updated. >> >> Does this work if we have a single node offline at boot? >> > > It should. > >> Say we start with: >> >> node distances: >> node 0 1 >> 0: 10 20 >> 1: 20 10 >> >> And node 2 is offline at boot. We can only initialise that node's entries >> in the distance_lookup_table: >> >> while (i--) >> distance_lookup_table[node][i] = node; >> >> By filling them all with 2 that causes node_distance(2, X) to return the >> maximum distance for all other nodes X, because we won't break out of >> the loop in __node_distance(): >> >> for (i = 0; i < distance_ref_points_depth; i++) { >> if (distance_lookup_table[a][i] == distance_lookup_table[b][i]) >> break; >> >> /* Double the distance for each NUMA level */ >> distance *= 2; >> } >> >> If distance_ref_points_depth was 4 we'd return 160. > > As you already know, distance 10, 20, .. are defined by Powerpc, form1 > affinity. PAPR doesn't define actual distances, it only provides us the > associativity. If distance_ref_points_depth is 4, > (distance_ref_points_depth doesn't take local distance into consideration) > 10, 20, 40, 80, 160. > >> >> That'd leave us with 3 unique distances at boot, 10, 20, 160. >> > > So if there are unique distances, then the distances as per the current > code has to be 10, 20, 40, 80.. I don't see a way in which we have a break in > the series, like having 160 without 80. I'm confused what you mean there. If we have a node that's offline at boot then we get 160 for that node, that's just the result of having no info for it, so we never break out of the for loop. So if we have two nodes, one hop apart, and then an offline node we get 10, 20, 160.
Or if you're using depth = 3 then it's 10, 20, 80. >> But when node 2 comes online it might introduce more than 1 new distance >> value, eg. it could be that the actual distances are: >> >> node distances: >> node 0 1 2 >> 0: 10 20 40 >> 1: 20 10 80 >> 2: 40 80 10 >> >> ie. we now have 4 distances, 10, 20, 40, 80. >> >> What am I missing? > > As I said above, I am not sure how we can have a break in the series. > If distance_ref_points_depth is 3, the distances have to be 10,20,40,80, at > least for form1 affinity. I agree for depth 3 we have to see 10, 20, 40, 80. But nothing guarantees we see each value (other than 10). We can have two nodes one hop apart, so we have 10 and 20, then a third node is added 3 hops away, so we get 10, 20, 80. The real problem is that the third node could be 3 hops from node 0 and 2 hops from node 1, and so the addition of the third node causes two new distance values (40 & 80) to be required. I think maybe what you're saying is that in practice we don't see setups like that. But I don't know if I'm happy with a solution that doesn't work in the general case, and relies on the particular properties of our current set of systems. Possibly we just need to detect that case and WARN about it. The only problem is we won't know until the system is already up and running, ie. we can't know at boot that the onlining of the third node will cause 2 new distance values to be added. cheers
Re: [PATCH 3/3] memblock: cleanup memblock_free interface
On 23/09/2021 at 09:43, Mike Rapoport wrote: From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport --- diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 1a04e5bdf655..37826d8c4f74 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void) /* Get the CPU registers */ smp_save_cpu_regs(sa, addr, is_boot_cpu, page); } - memblock_free(page, PAGE_SIZE); + memblock_phys_free(page, PAGE_SIZE); diag_amode31_ops.diag308_reset(); pcpu_set_smt(0); } @@ -880,7 +880,7 @@ void __init smp_detect_cpus(void) /* Add CPUs present at boot */ __smp_rescan_cpus(info, true); - memblock_free_early((unsigned long)info, sizeof(*info)); + memblock_free(info, sizeof(*info)); } /* I'm a bit lost. IIUC memblock_free_early() and memblock_free() were identical. In the first hunk memblock_free() gets replaced by memblock_phys_free() In the second hunk memblock_free_early() gets replaced by memblock_free() I think it would be easier to follow if you could split it in several patches: - First patch: Create memblock_phys_free() and change all relevant memblock_free() to memblock_phys_free() - Or change memblock_free() to memblock_phys_free() and make memblock_free() an alias of it.
- Second patch: Make memblock_free_ptr() become memblock_free() and change all remaining callers to the new semantics (IIUC memblock_free(__pa(ptr)) becomes memblock_free(ptr)) and make memblock_free_ptr() an alias of memblock_free() - Third patch: Replace and drop memblock_free_ptr() - Fourth patch: Drop memblock_free_early() and memblock_free_early_nid() (All users should have been upgraded to memblock_phys_free() in patch 1 or memblock_free() in patch 2) Christophe
[PATCH 3/3] memblock: cleanup memblock_free interface
From: Mike Rapoport For ages memblock_free() interface dealt with physical addresses even despite the existence of memblock_alloc_xx() functions that return a virtual pointer. Introduce memblock_phys_free() for freeing physical ranges and repurpose memblock_free() to free virtual pointers to make the following pairing abundantly clear: int memblock_phys_free(phys_addr_t base, phys_addr_t size); phys_addr_t memblock_phys_alloc(phys_addr_t base, phys_addr_t size); void *memblock_alloc(phys_addr_t size, phys_addr_t align); void memblock_free(void *ptr, size_t size); Replace intermediate memblock_free_ptr() with memblock_free() and drop unnecessary aliases memblock_free_early() and memblock_free_early_nid(). Suggested-by: Linus Torvalds Signed-off-by: Mike Rapoport --- arch/alpha/kernel/core_irongate.c | 2 +- arch/arc/mm/init.c| 2 +- arch/arm/mach-hisi/platmcpm.c | 2 +- arch/arm/mm/init.c| 2 +- arch/arm64/mm/mmu.c | 4 ++-- arch/mips/mm/init.c | 2 +- arch/mips/sgi-ip30/ip30-setup.c | 6 +++--- arch/powerpc/kernel/dt_cpu_ftrs.c | 2 +- arch/powerpc/kernel/paca.c| 4 ++-- arch/powerpc/kernel/setup-common.c| 2 +- arch/powerpc/kernel/setup_64.c| 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 2 +- arch/powerpc/platforms/pseries/svm.c | 4 +--- arch/riscv/kernel/setup.c | 4 ++-- arch/s390/kernel/setup.c | 8 arch/s390/kernel/smp.c| 4 ++-- arch/s390/kernel/uv.c | 2 +- arch/s390/mm/kasan_init.c | 2 +- arch/sh/boards/mach-ap325rxa/setup.c | 2 +- arch/sh/boards/mach-ecovec24/setup.c | 4 ++-- arch/sh/boards/mach-kfr2r09/setup.c | 2 +- arch/sh/boards/mach-migor/setup.c | 2 +- arch/sh/boards/mach-se/7724/setup.c | 4 ++-- arch/sparc/kernel/smp_64.c| 2 +- arch/um/kernel/mem.c | 2 +- arch/x86/kernel/setup.c | 4 ++-- arch/x86/kernel/setup_percpu.c| 2 +- arch/x86/mm/init.c| 2 +- arch/x86/mm/kasan_init_64.c | 4 ++-- arch/x86/mm/numa.c| 2 +- arch/x86/mm/numa_emulation.c | 2 +- arch/x86/xen/mmu_pv.c | 6 +++--- arch/x86/xen/p2m.c| 2 +- arch/x86/xen/setup.c | 6 +++--- drivers/base/arch_numa.c | 
4 ++-- drivers/firmware/efi/memmap.c | 2 +- drivers/macintosh/smu.c | 2 +- drivers/of/kexec.c| 2 +- drivers/of/of_reserved_mem.c | 4 ++-- drivers/s390/char/sclp_early.c| 2 +- drivers/usb/early/xhci-dbc.c | 10 +- drivers/xen/swiotlb-xen.c | 2 +- include/linux/memblock.h | 16 ++-- init/initramfs.c | 2 +- init/main.c | 2 +- kernel/dma/swiotlb.c | 2 +- kernel/printk/printk.c| 4 ++-- lib/bootconfig.c | 2 +- lib/cpumask.c | 2 +- mm/cma.c | 2 +- mm/memblock.c | 20 ++-- mm/memory_hotplug.c | 2 +- mm/percpu.c | 8 mm/sparse.c | 2 +- tools/bootconfig/include/linux/memblock.h | 2 +- 55 files changed, 92 insertions(+), 106 deletions(-) diff --git a/arch/alpha/kernel/core_irongate.c b/arch/alpha/kernel/core_irongate.c index 72af1e72d833..6b8ed12936b6 100644 --- a/arch/alpha/kernel/core_irongate.c +++ b/arch/alpha/kernel/core_irongate.c @@ -233,7 +233,7 @@ albacore_init_arch(void) unsigned long size; size = initrd_end - initrd_start; - memblock_free(__pa(initrd_start), PAGE_ALIGN(size)); + memblock_free((void *)initrd_start, PAGE_ALIGN(size)); if (!move_initrd(pci_mem)) printk("irongate_init_arch: initrd too big " "(%ldK)\ndisabling initrd\n", diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index 699ecf119641..59408f6a02d4 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -173,7 +173,7 @@ static void __init highmem_init(void) #ifdef CONFIG_HIGHMEM unsigned long tmp; - memblock_free(high_mem_start, high_mem_sz); + memblock_phys_free(high_mem_start, high_mem_sz); for (tmp = min_high_pfn; tmp < max_high_pfn; tmp++) free_highmem_page(pfn_to_page(tmp)); #endif diff --git a/arch/arm/mach-hisi/platmcpm.c b/
[PATCH 2/3] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer
From: Mike Rapoport free_p2m_page() wrongly passes a virtual pointer to memblock_free() that treats it as a physical address. Call memblock_free_ptr() instead that gets a virtual address to free the memory. Signed-off-by: Mike Rapoport --- arch/x86/xen/p2m.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c index 5e6e236977c7..141bb9dbd2fb 100644 --- a/arch/x86/xen/p2m.c +++ b/arch/x86/xen/p2m.c @@ -197,7 +197,7 @@ static void * __ref alloc_p2m_page(void) static void __ref free_p2m_page(void *p) { if (unlikely(!slab_is_available())) { - memblock_free((unsigned long)p, PAGE_SIZE); + memblock_free_ptr(p, PAGE_SIZE); return; } -- 2.28.0
[PATCH 1/3] arch_numa: simplify numa_distance allocation
From: Mike Rapoport Memory allocation of numa_distance uses memblock_phys_alloc_range() without actual range limits, converts the returned physical address to virtual and then only uses the virtual address for further initialization. Simplify this by replacing memblock_phys_alloc_range() with memblock_alloc(). Signed-off-by: Mike Rapoport --- drivers/base/arch_numa.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index 00fb4120a5b3..f6d0efd01188 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -275,15 +275,13 @@ void __init numa_free_distance(void) static int __init numa_alloc_distance(void) { size_t size; - u64 phys; int i, j; size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]); - phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0, PFN_PHYS(max_pfn)); - if (WARN_ON(!phys)) + numa_distance = memblock_alloc(size, PAGE_SIZE); + if (WARN_ON(!numa_distance)) return -ENOMEM; - numa_distance = __va(phys); numa_distance_cnt = nr_node_ids; /* fill with the default distances */ -- 2.28.0
[PATCH 0/3] memblock: cleanup memblock_free interface
From: Mike Rapoport

Hi,

Following the discussion on [1] this is the fix for the memblock freeing
APIs mismatch.

The first patch is a cleanup of numa_distance allocation in arch_numa
I've spotted during the conversion.
The second patch is a fix for Xen memory freeing on some of the error
paths.
The core change is in the third patch that makes memblock_free() a
counterpart of memblock_alloc() and adds memblock_phys_free() to be a
counterpart of memblock_phys_alloc().

Since scripts/get_maintainer.pl returned more than 100 addresses I've
trimmed the distribution list only to the relevant lists.

[1] https://lore.kernel.org/all/CAHk-=wj9k4LZTz+svCxLYs5Y1=+ykrbauarh1+ghyg3old8...@mail.gmail.com

Mike Rapoport (3):
  arch_numa: simplify numa_distance allocation
  xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual
    pointer
  memblock: cleanup memblock_free interface

 arch/alpha/kernel/core_irongate.c         |  2 +-
 arch/arc/mm/init.c                        |  2 +-
 arch/arm/mach-hisi/platmcpm.c             |  2 +-
 arch/arm/mm/init.c                        |  2 +-
 arch/arm64/mm/mmu.c                       |  4 ++--
 arch/mips/mm/init.c                       |  2 +-
 arch/mips/sgi-ip30/ip30-setup.c           |  6 +++---
 arch/powerpc/kernel/dt_cpu_ftrs.c         |  2 +-
 arch/powerpc/kernel/paca.c                |  4 ++--
 arch/powerpc/kernel/setup-common.c        |  2 +-
 arch/powerpc/kernel/setup_64.c            |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  2 +-
 arch/powerpc/platforms/pseries/svm.c      |  4 +---
 arch/riscv/kernel/setup.c                 |  4 ++--
 arch/s390/kernel/setup.c                  |  8
 arch/s390/kernel/smp.c                    |  4 ++--
 arch/s390/kernel/uv.c                     |  2 +-
 arch/s390/mm/kasan_init.c                 |  2 +-
 arch/sh/boards/mach-ap325rxa/setup.c      |  2 +-
 arch/sh/boards/mach-ecovec24/setup.c      |  4 ++--
 arch/sh/boards/mach-kfr2r09/setup.c       |  2 +-
 arch/sh/boards/mach-migor/setup.c         |  2 +-
 arch/sh/boards/mach-se/7724/setup.c       |  4 ++--
 arch/sparc/kernel/smp_64.c                |  2 +-
 arch/um/kernel/mem.c                      |  2 +-
 arch/x86/kernel/setup.c                   |  4 ++--
 arch/x86/kernel/setup_percpu.c            |  2 +-
 arch/x86/mm/init.c                        |  2 +-
 arch/x86/mm/kasan_init_64.c               |  4 ++--
 arch/x86/mm/numa.c                        |  2 +-
 arch/x86/mm/numa_emulation.c              |  2 +-
 arch/x86/xen/mmu_pv.c                     |  6 +++---
 arch/x86/xen/p2m.c                        |  2 +-
 arch/x86/xen/setup.c                      |  6 +++---
 drivers/base/arch_numa.c                  | 10 --
 drivers/firmware/efi/memmap.c             |  2 +-
 drivers/macintosh/smu.c                   |  2 +-
 drivers/of/kexec.c                        |  2 +-
 drivers/of/of_reserved_mem.c              |  4 ++--
 drivers/s390/char/sclp_early.c            |  2 +-
 drivers/usb/early/xhci-dbc.c              | 10 +-
 drivers/xen/swiotlb-xen.c                 |  2 +-
 include/linux/memblock.h                  | 16 ++--
 init/initramfs.c                          |  2 +-
 init/main.c                               |  2 +-
 kernel/dma/swiotlb.c                      |  2 +-
 kernel/printk/printk.c                    |  4 ++--
 lib/bootconfig.c                          |  2 +-
 lib/cpumask.c                             |  2 +-
 mm/cma.c                                  |  2 +-
 mm/memblock.c                             | 20 ++--
 mm/memory_hotplug.c                       |  2 +-
 mm/percpu.c                               |  8
 mm/sparse.c                               |  2 +-
 tools/bootconfig/include/linux/memblock.h |  2 +-
 55 files changed, 94 insertions(+), 110 deletions(-)

base-commit: e4e737bb5c170df6135a127739a9e6148ee3da82
-- 
2.28.0
Re: [PATCH AUTOSEL 5.10 01/26] ibmvnic: check failover_pending in login response
Hi!

Something went wrong with this series. I only see the first 7 patches. I
thought it might be a local problem, but I only see 7 patches on lore...

https://lore.kernel.org/lkml/20210923033839.1421034-1-sas...@kernel.org/

Best regards,
							Pavel
-- 
http://www.livejournal.com/~pavelmachek
Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
Nathan Lynch writes:
> Srikar Dronamraju writes:
>
>> * Nathan Lynch [2021-09-22 11:01:12]:
>>
>>> Srikar Dronamraju writes:
>>> > * Nathan Lynch [2021-09-20 22:12:13]:
>>> >
>>> >> vcpu_is_preempted() can be used outside of preempt-disabled critical
>>> >> sections, yielding warnings such as:
>>> >>
>>> >> BUG: using smp_processor_id() in preemptible [] code: systemd-udevd/185
>>> >> caller is rwsem_spin_on_owner+0x1cc/0x2d0
>>> >> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
>>> >> Call Trace:
>>> >> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
>>> >> [c00012907b00] [c1371f70] check_preemption_disabled+0x150/0x160
>>> >> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
>>> >> [c00012907be0] [c01e1408] rwsem_down_write_slowpath+0x478/0x9a0
>>> >> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
>>> >> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
>>> >> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
>>> >> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
>>> >> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
>>> >>
>>> >> The result of vcpu_is_preempted() is always subject to invalidation by
>>> >> events inside and outside of Linux; it's just a best guess at a point in
>>> >> time. Use raw_smp_processor_id() to avoid such warnings.
>>> >
>>> > Typically smp_processor_id() and raw_smp_processor_id() except for the
>>> > CONFIG_DEBUG_PREEMPT.
>>>
>>> Sorry, I don't follow...
>>
>> I meant, unless CONFIG_DEBUG_PREEMPT, smp_processor_id() is defined as
>> raw_smp_processor_id().
>>
>>> > In the CONFIG_DEBUG_PREEMPT case, smp_processor_id() is actually
>>> > debug_smp_processor_id(), which does all the checks.
>>>
>>> Yes, OK.
>>>
>>> > I believe these checks in debug_smp_processor_id() are only valid for
>>> > the x86 case (aka cases where they have __smp_processor_id() defined.)
>>>
>>> Hmm, I am under the impression that the checks in
>>> debug_smp_processor_id() are valid regardless of whether the arch
>>> overrides __smp_processor_id().
>>
>> From include/linux/smp.h
>>
>> /*
>>  * Allow the architecture to differentiate between a stable and unstable read.
>>  * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
>>  * regular asm read for the stable.
>>  */
>> #ifndef __smp_processor_id
>> #define __smp_processor_id(x) raw_smp_processor_id(x)
>> #endif
>>
>> As far as I see, only x86 has a definition of __smp_processor_id.
>> So for archs like Powerpc, __smp_processor_id() is always
>> defined as raw_smp_processor_id(). Right?
>
> Sure, yes.
>
>> I would think debug_smp_processor_id() would be useful if
>> __smp_processor_id() is different from raw_smp_processor_id(). Do note
>> debug_smp_processor_id() calls raw_smp_processor_id(). Agree.
>
> I do not think the utility of debug_smp_processor_id() is related to
> whether the arch defines __smp_processor_id().
>
>> Or can I understand how debug_smp_processor_id() is useful if
>> __smp_processor_id() is defined as raw_smp_processor_id()?

debug_smp_processor_id() is useful on powerpc, as well as other arches,
because it checks that we're in a context where the processor id won't
change out from under us. eg. something like this is unsafe:

	int counts[NR_CPUS];
	int tmp, cpu;

	cpu = smp_processor_id();
	tmp = counts[cpu];	<- preempted here and migrated to another CPU
	counts[cpu] = tmp + 1;

> So, for powerpc with DEBUG_PREEMPT unset, a call to smp_processor_id()
> expands to __smp_processor_id() which expands to raw_smp_processor_id(),
> avoiding the preempt safety checks. This is working as intended.
>
> For powerpc with DEBUG_PREEMPT=y, a call to smp_processor_id() expands
> to the out of line call to debug_smp_processor_id(), which calls
> raw_smp_processor_id() and performs the checks, warning if called in an
> inappropriate context, as seen here.
Also working as intended.

> AFAICT __smp_processor_id() is a performance/codegen-oriented hook, and
> not really related to the debug facility. Please see 9ed7d75b2f09d
> ("x86/percpu: Relax smp_processor_id()").

Yeah good find.

cheers