Re: [PATCH 0/4] ASoC: fsl_asrc: allow selecting arbitrary clocks

2020-07-22 Thread Nicolin Chen
On Fri, Jul 17, 2020 at 01:16:42PM +0200, Arnaud Ferraris wrote:
> Hi Nic,
> 
> Le 02/07/2020 à 20:42, Nicolin Chen a écrit :
> > Hi Arnaud,
> > 
> > On Thu, Jul 02, 2020 at 04:22:31PM +0200, Arnaud Ferraris wrote:
> >> The current ASRC driver hardcodes the input and output clocks used for
> >> sample rate conversions. In order to allow greater flexibility and to
> >> cover more use cases, it would be preferable to select the clocks using
> >> device-tree properties.
> > 
> > We recently merged a new change that auto-selects internal
> > clocks based on sample rates as the first option -- ideal ratio
> > mode is the fallback mode now. Please refer to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200702&id=d0250cf4f2abfbea64ed247230f08f5ae23979f0
> 
> While working on fixing the automatic clock selection (see my v3), I
> came across another potential issue, which would be better explained
> with an example:
>   - Input has sample rate 8kHz and uses clock SSI1 with rate 512kHz
>   - Output has sample rate 16kHz and uses clock SSI2 with rate 1024kHz
> 
> Let's say my v3 patch is merged, then the selected input clock will be
> SSI1, while the selected output clock will be SSI2. In that case, it's
> all good, as the driver will calculate the dividers right.
> 
> Now, suppose a similar board has the input wired to SSI2 and output to
> SSI1, meaning we're now in the following case:
>   - Input has sample rate 8kHz and uses clock SSI2 with rate 512kHz
>   - Output has sample rate 16kHz and uses clock SSI1 with rate 1024kHz
> (the same result is achieved during capture with the initial example
> setup, as input and output properties are then swapped)
> 
> In that case, the selected clocks will still be SSI1 for input (just
> because it appears first in the clock table), and SSI2 for output,
> meaning the calculated dividers will be:
>   - input: 512 / 16 => 32 (should be 64)
>   - output: 1024 / 8 => 128 (should be 64 here too)

I don't get the 32, 128 and 64 parts. Would you please elaborate
a bit? What you said sounds to me like the driver calculates wrong
dividers?


Re: [v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs

2020-07-22 Thread kajoljain



On 7/21/20 11:32 AM, kajoljain wrote:
> 
> 
> On 7/17/20 8:08 PM, Athira Rajeev wrote:
>> From: Anju T Sudhakar 
>>
>> Add support for perf extended register capability in powerpc.
>> The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the
>> PMU which support extended registers. The generic code define the mask
>> of extended registers as 0 for non supported architectures.
>>
>> Patch adds extended regs support for power9 platform by
>> exposing MMCR0, MMCR1 and MMCR2 registers.
>>
>> REG_RESERVED mask needs update to include extended regs.
>> `PERF_REG_EXTENDED_MASK`, contains mask value of the supported registers,
>> is defined at runtime in the kernel based on platform since the supported
>> registers may differ from one processor version to another and hence the
>> MASK value.
>>
>> with patch
>> --
>>
>> available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
>> r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
>> r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
>> trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2
>>
>> PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
>> ... intr regs: mask 0x ABI 64-bit
>>  r0    0xc012b77c
>>  r1    0xc03fe5e03930
>>  r2    0xc1b0e000
>>  r3    0xc03fdcddf800
>>  r4    0xc03fc788
>>  r5    0x9c422724be
>>  r6    0xc03fe5e03908
>>  r7    0xff63bddc8706
>>  r8    0x9e4
>>  r9    0x0
>>  r10   0x1
>>  r11   0x0
>>  r12   0xc01299c0
>>  r13   0xc03c4800
>>  r14   0x0
>>  r15   0x7fffdd8b8b00
>>  r16   0x0
>>  r17   0x7fffdd8be6b8
>>  r18   0x7e7076607730
>>  r19   0x2f
>>  r20   0xc0001fc26c68
>>  r21   0xc0002041e4227e00
>>  r22   0xc0002018fb60
>>  r23   0x1
>>  r24   0xc03ffec4d900
>>  r25   0x8000
>>  r26   0x0
>>  r27   0x1
>>  r28   0x1
>>  r29   0xc1be1260
>>  r30   0x6008010
>>  r31   0xc03ffebb7218
>>  nip   0xc012b910
>>  msr   0x90009033
>>  orig_r3 0xc012b86c
>>  ctr   0xc01299c0
>>  link  0xc012b77c
>>  xer   0x0
>>  ccr   0x2800
>>  softe 0x1
>>  trap  0xf00
>>  dar   0x0
>>  dsisr 0x800
>>  sier  0x0
>>  mmcra 0x800
>>  mmcr0 0x82008090
>>  mmcr1 0x1e00
>>  mmcr2 0x0
>>  ... thread: perf:4784
>>
>> Signed-off-by: Anju T Sudhakar 
>> [Defined PERF_REG_EXTENDED_MASK at run time to add support for different 
>> platforms ]
>> Signed-off-by: Athira Rajeev 
>> Reviewed-by: Madhavan Srinivasan 
>> ---
> 
> Patch looks good to me.
> 
> Reviewed-by: Kajol Jain 

Hi Arnaldo and Jiri,
 Please let me know if you have any comments on these patches. Can you
pull/ack these patches if they seem fine to you?

Thanks,
Kajol Jain

> 
> Thanks,
> Kajol Jain
> 
>>  arch/powerpc/include/asm/perf_event_server.h |  8 +++
>>  arch/powerpc/include/uapi/asm/perf_regs.h| 14 +++-
>>  arch/powerpc/perf/core-book3s.c  |  1 +
>>  arch/powerpc/perf/perf_regs.c| 34 
>> +---
>>  arch/powerpc/perf/power9-pmu.c   |  6 +
>>  5 files changed, 59 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/perf_event_server.h 
>> b/arch/powerpc/include/asm/perf_event_server.h
>> index 832450a..bf85d1a 100644
>> --- a/arch/powerpc/include/asm/perf_event_server.h
>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>> @@ -15,6 +15,9 @@
>>  #define MAX_EVENT_ALTERNATIVES  8
>>  #define MAX_LIMITED_HWCOUNTERS  2
>>  
>> +extern u64 PERF_REG_EXTENDED_MASK;
>> +#define PERF_REG_EXTENDED_MASK  PERF_REG_EXTENDED_MASK
>> +
>>  struct perf_event;
>>  
>>  struct mmcr_regs {
>> @@ -62,6 +65,11 @@ struct power_pmu {
>>  int *blacklist_ev;
>>  /* BHRB entries in the PMU */
>>  int bhrb_nr;
>> +/*
>> + * set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
>> + * the pmu supports extended perf regs capability
>> + */
>> +int capabilities;
>>  };
>>  
>>  /*
>> diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
>> b/arch/powerpc/include/uapi/asm/perf_regs.h
>> index f599064..225c64c 100644
>> --- a/arch/powerpc/include/uapi/asm/perf_regs.h
>> +++ b/arch/powerpc/include/uapi/asm/perf_regs.h
>> @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
>>  PERF_REG_POWERPC_DSISR,
>>  PERF_REG_POWERPC_SIER,
>>  PERF_REG_POWERPC_MMCRA,
>> -PERF_REG_POWERPC_MAX,
>> +/* Extended registers */
>> +PERF_REG_POWERPC_MMCR0,
>> +PERF_REG_POWERPC_MMCR1,
>> +PERF_REG_POWERPC_MMCR2,
>> +/* Max regs without the extended regs */
>> +PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>>  };
>> +
>> +#define PERF_REG_PMU_MASK   ((1ULL << PERF_REG_POWERPC_MAX) - 1)
>> +
>> +/* PERF_REG_EXTENDED_MASK value for 

RE: [PATCH devicetree 3/4] powerpc: dts: t1040rdb: put SGMII PHY under label

2020-07-22 Thread Madalin Bucur (OSS)
> -Original Message-
> From: Vladimir Oltean 
> Sent: Wednesday, July 22, 2020 8:24 PM
> To: robh...@kernel.org; shawn...@kernel.org; m...@ellerman.id.au;
> devicet...@vger.kernel.org
> Cc: b...@kernel.crashing.org; pau...@samba.org; linuxppc-
> d...@lists.ozlabs.org; linux-ker...@vger.kernel.org;
> net...@vger.kernel.org; Madalin Bucur (OSS) ;
> Radu-andrei Bulie ; fido_...@inbox.ru
> Subject: [PATCH devicetree 3/4] powerpc: dts: t1040rdb: put SGMII PHY
> under  label
> 
> We're going to add 8 more PHYs in a future patch. It is easier to follow
> the hardware description if we don't need to fish for the path of the
> MDIO controllers inside the SoC and just use the labels.
> 

Please align to the existing structure; it may be easier to add something
without paying attention to that, but it's better to keep things organized.
This structure is used across all the device trees of the platforms using
DPAA, let's not start diverging now.

> Signed-off-by: Vladimir Oltean 
> ---
>  arch/powerpc/boot/dts/fsl/t1040rdb.dts | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/boot/dts/fsl/t1040rdb.dts
> b/arch/powerpc/boot/dts/fsl/t1040rdb.dts
> index 65ff34c49025..40d7126dbe90 100644
> --- a/arch/powerpc/boot/dts/fsl/t1040rdb.dts
> +++ b/arch/powerpc/boot/dts/fsl/t1040rdb.dts
> @@ -59,12 +59,6 @@ ethernet@e4000 {
>   phy-handle = <&phy_sgmii_2>;
>   phy-connection-type = "sgmii";
>   };
> -
> - mdio@fc000 {
> - phy_sgmii_2: ethernet-phy@3 {
> - reg = <0x03>;
> - };
> - };
>   };
>   };
> 
> @@ -76,3 +70,9 @@ cpld@3,0 {
>  };
> 
>  #include "t1040si-post.dtsi"
> +
> + {
> + phy_sgmii_2: ethernet-phy@3 {
> + reg = <0x3>;
> + };
> +};
> --
> 2.25.1



Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-22 Thread Alex Ghiti




Le 7/21/20 à 7:36 PM, Palmer Dabbelt a écrit :

On Tue, 21 Jul 2020 16:11:02 PDT (-0700), b...@kernel.crashing.org wrote:

On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:

> > I guess I don't understand why this is necessary at all.
> > Specifically: why
> > can't we just relocate the kernel within the linear map?  That would
> > let the
> > bootloader put the kernel wherever it wants, modulo the physical
> > memory size we
> > support.  We'd need to handle the regions that are coupled to the
> > kernel's
> > execution address, but we could just put them in an explicit memory
> > region
> > which is what we should probably be doing anyway.
>
> Virtual relocation in the linear mapping requires to move the kernel
> physically too. Zong implemented this physical move in its KASLR RFC
> patchset, which is cumbersome since finding an available physical spot
> is harder than just selecting a virtual range in the vmalloc range.
>
> In addition, having the kernel mapping in the linear mapping prevents
> the use of hugepage for the linear mapping resulting in performance loss
> (at least for the GB that encompasses the kernel).
>
> Why do you find this "ugly" ? The vmalloc region is just a bunch of
> available virtual addresses to whatever purpose we want, and as noted by
> Zong, arm64 uses the same scheme.


I don't get it :-)

At least on powerpc we move the kernel in the linear mapping and it
works fine with huge pages, what is your problem there ? You rely on
punching small-page size holes in there ?


That was my original suggestion, and I'm not actually sure it's invalid.
It would mean that both the kernel's physical and virtual addresses are
set by the bootloader, which may or may not be workable if we want to
have an sv48+sv39 kernel.  My initial approach to sv48+sv39 kernels
would be to just throw away the sv39 memory on sv48 kernels, which would
preserve the linear map but mean that there is no single physical
address that's accessible for both.  That would require some
coordination between the bootloader and the kernel as to where it should
be loaded, but maybe there's a better way to design the linear map.
Right now we have a bunch of unwritten rules about where things need to
be loaded, which is a recipe for disaster.

We could copy the kernel around, but I'm not sure I really like that
idea.  We do zero the BSS right now, so it's not like we entirely rely
on the bootloader to set up the kernel image, but with the hart race
boot scheme we have right now we'd at least need to leave a stub sitting
around.  Maybe we just throw away SBI v0.1, though, that's why we called
it all legacy in the first place.


My bigger worry is that anything that involves running the kernel at
arbitrary virtual addresses means we need a PIC kernel, which means
every global symbol needs an indirection.  That's probably not so bad
for shared libraries, but the kernel has a lot of global symbols.  PLT
references probably aren't so scary, as we have an incoherent
instruction cache so the virtual function predictor isn't that hard to
build, but making all global data accesses GOT-relative seems like a
disaster for performance.  This fixed-VA thing really just exists so we
don't have to be full-on PIC.

In theory I think we could just get away with pretending that medany is
PIC, which I believe works as long as the data and text offset stays
constant, you don't have any symbols between 2GiB and -2GiB (as those
may stay fixed, even in medany), and you deal with GP accordingly (which
should work itself out in the current startup code).  We rely on this
for some of the early boot code (and will soon for kexec), but that's a
very controlled code base and we've already had some issues.  I'd be
much more comfortable adding an explicit semi-PIC code model, as I tend
to miss something when doing these sorts of things and then we could at
least add it to the GCC test runs and guarantee it actually works.  Not
really sure I want to deal with that, though.  It would, however, be the
only way to get random virtual addresses during kernel execution.


At least in the old days, there were a number of assumptions that
the kernel text/data/bss resides in the linear mapping.


Ya, it terrified me as well.  Alex says arm64 puts the kernel in the
vmalloc region, so assuming that's the case it must be possible.  I
didn't get that from reading the arm64 port (I guess it's no secret that
pretty much all I do is copy their code)


See https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/mmu.c#L615.




If you change that you need to ensure that it's still physically
contiguous and you'll have to tweak __va and __pa, which might induce
extra overhead.


I'm operating under the assumption that we don't want to add an
additional load to virt2phys conversions.  arm64 bends over backwards to
avoid the load, and I'm assuming they have a reason for doing so.  Of
course, if we're PIC then maybe 

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-22 Thread Alex Ghiti

Hi Palmer,

Le 7/21/20 à 3:05 PM, Palmer Dabbelt a écrit :

On Tue, 21 Jul 2020 11:36:10 PDT (-0700), a...@ghiti.fr wrote:

Let's try to make progress here: I add linux-mm in CC to get feedback on
this patch as it blocks sv48 support too.


Sorry for being slow here.  I haven't replied because I hadn't really fleshed


No problem :)

out the design yet, but just so everyone's on the same page my problems
with this are:

* We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
* On 64-bit systems the VA space around the kernel is precious because it's the
  only place we can place text (modules, BPF, whatever).  If we start putting
  the kernel in the vmalloc space then we either have to pre-allocate a bunch
  of space around it (essentially making it a fixed mapping anyway) or it
  becomes likely that we won't be able to find space for modules as they're
  loaded into running systems.


Let's note that we already have this issue for BPF and modules right now.
But keeping the kernel at the end of the vmalloc region quite mitigates
this problem: if we exhaust the vmalloc region on 64-bit and then start
allocating here, I think the whole system will have other problems.


* Relying on a relocatable kernel for sv48 support introduces a fairly
  large performance hit.


I understand the performance penalty but I struggle to call it "fairly
large": can we benchmark this somehow ?




Roughly, my proposal would be to:

* Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
  anywhere and we only have one VA width, so we're not really solving any
  problems with these changes.


Ok that's possible although a lot of ifdefs will get involved :)

* Statically allocate a 2GiB portion of the VA space for all our text,
  as its own region.  We'd link/relocate the kernel here instead of
  around PAGE_OFFSET, which would decouple the kernel from the physical
  memory layout of the system.  This would have the side effect of
  sorting out a bunch of bootloader headaches that we currently have.


This amounts to doing the same as this patch but instead of using the
vmalloc region, we'd use our own, right ? I believe we'd then lose the
vmalloc facilities to allocate modules around this zone.



* Sort out how to maintain a linear map as the canonical hole moves
  around between the VA widths without adding a bunch of overhead to the
  virt2phys and friends.  This is probably going to be the trickiest
  part, but I think if we just change the page table code to essentially
  lie about VAs when an sv39 system runs an sv48+sv39 kernel we could
  make it work -- there'd be some logical complexity involved, but it
  would remain fast.


I have to think about that.



This doesn't solve the problem of virtually relocatable kernels, but it
does let us decouple that from the sv48 stuff.  It also lets us stop
relying on a fixed physical address the kernel is loaded into, which is
another thing I don't like.



Agreed on this one.


I know this may be a more complicated approach, but there aren't any
sv48 systems around right now so I just don't see the rush to support
them, particularly when there's a cost to what already exists (for those
who haven't been watching, so far all the sv48 patch sets have imposed a
significant performance penalty on all systems).



Alex



Alex

Le 7/9/20 à 7:11 AM, Alex Ghiti a écrit :

Hi Palmer,

Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :

On Sun, 07 Jun 2020 00:59:46 PDT (-0700), a...@ghiti.fr wrote:

This is a preparatory patch for relocatable kernel.

The kernel used to be linked at PAGE_OFFSET address and used to be
loaded physically at the beginning of the main memory. Therefore, we
could use the linear mapping for the kernel mapping.

But the relocated kernel base address will be different from PAGE_OFFSET
and since in the linear mapping, two different virtual addresses cannot
point to the same physical address, the kernel mapping needs to lie
outside the linear mapping.


I know it's been a while, but I keep opening this up to review it and
just can't get over how ugly it is to put the kernel's linear map in the
vmalloc region.

I guess I don't understand why this is necessary at all.  Specifically:
why can't we just relocate the kernel within the linear map?  That would
let the bootloader put the kernel wherever it wants, modulo the physical
memory size we support.  We'd need to handle the regions that are
coupled to the kernel's execution address, but we could just put them in
an explicit memory region which is what we should probably be doing
anyway.


Virtual relocation in the linear mapping requires to move the kernel
physically too. Zong implemented this physical move in its KASLR RFC
patchset, which is cumbersome since finding an available physical spot
is harder than just selecting a virtual range in the vmalloc range.

In addition, having the kernel mapping in the linear mapping prevents
the use of hugepage for 

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-22 Thread Alex Ghiti

Hi Benjamin,

Le 7/21/20 à 7:11 PM, Benjamin Herrenschmidt a écrit :

On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:

I guess I don't understand why this is necessary at all.  Specifically:
why can't we just relocate the kernel within the linear map?  That would
let the bootloader put the kernel wherever it wants, modulo the physical
memory size we support.  We'd need to handle the regions that are
coupled to the kernel's execution address, but we could just put them in
an explicit memory region which is what we should probably be doing
anyway.


Virtual relocation in the linear mapping requires to move the kernel
physically too. Zong implemented this physical move in its KASLR RFC
patchset, which is cumbersome since finding an available physical spot
is harder than just selecting a virtual range in the vmalloc range.

In addition, having the kernel mapping in the linear mapping prevents
the use of hugepage for the linear mapping resulting in performance loss
(at least for the GB that encompasses the kernel).

Why do you find this "ugly" ? The vmalloc region is just a bunch of
available virtual addresses to whatever purpose we want, and as noted by
Zong, arm64 uses the same scheme.


I don't get it :-)

At least on powerpc we move the kernel in the linear mapping and it
works fine with huge pages, what is your problem there ? You rely on
punching small-page size holes in there ?



ARCH_HAS_STRICT_KERNEL_RWX prevents the use of a hugepage for the kernel
mapping in the direct mapping as it sets different permissions to
different parts of the kernel (data, text, etc.).




At least in the old days, there were a number of assumptions that
the kernel text/data/bss resides in the linear mapping.

If you change that you need to ensure that it's still physically
contiguous and you'll have to tweak __va and __pa, which might induce
extra overhead.



Yes that's done in this patch and indeed there is an overhead to those
functions.



Cheers,
Ben.
  



Thanks,

Alex


Re: [v4 2/5] KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs

2020-07-22 Thread Bharata B Rao
On Fri, Jul 17, 2020 at 01:00:24AM -0700, Ram Pai wrote:
> During the life of SVM, its GFNs transition through normal, secure and
> shared states. Since the kernel does not track GFNs that are shared, it
> is not possible to disambiguate a shared GFN from a GFN whose PFN has
> not yet been migrated to a secure-PFN. Also it is not possible to
> disambiguate a secure-GFN from a GFN whose PFN has been paged out from
> the ultravisor.
> 
> The ability to identify the state of a GFN is needed to skip migration
> of its PFN to secure-PFN during ESM transition.
> 
> The code is re-organized to track the states of a GFN as explained
> below.
> 
> 
>  1. States of a GFN
> ---
>  The GFN can be in one of the following states.
> 
>  (a) Secure - The GFN is secure. The GFN is associated with
>   a Secure VM, the contents of the GFN is not accessible
>   to the Hypervisor.  This GFN can be backed by a secure-PFN,
>   or can be backed by a normal-PFN with contents encrypted.
>   The former is true when the GFN is paged-in into the
>   ultravisor. The latter is true when the GFN is paged-out
>   of the ultravisor.
> 
>  (b) Shared - The GFN is shared. The GFN is associated with a
>   a secure VM. The contents of the GFN is accessible to
>   Hypervisor. This GFN is backed by a normal-PFN and its
>   content is un-encrypted.
> 
>  (c) Normal - The GFN is normal. The GFN is associated with
>   a normal VM. The contents of the GFN is accessible to
>   the Hypervisor. Its content is never encrypted.
> 
>  2. States of a VM.
> ---
> 
>  (a) Normal VM:  A VM whose contents are always accessible to
>   the hypervisor.  All its GFNs are normal-GFNs.
> 
>  (b) Secure VM: A VM whose contents are not accessible to the
>   hypervisor without the VM's consent.  Its GFNs are
>   either Shared-GFN or Secure-GFNs.
> 
>  (c) Transient VM: A Normal VM that is transitioning to secure VM.
>   The transition starts on successful return of
>   H_SVM_INIT_START, and ends on successful return
>   of H_SVM_INIT_DONE. This transient VM, can have GFNs
>   in any of the three states; i.e Secure-GFN, Shared-GFN,
>   and Normal-GFN. The VM never executes in this state
>   in supervisor-mode.
> 
>  3. Memory slot State.
> --
>   The state of a memory slot mirrors the state of the
>   VM the memory slot is associated with.
> 
>  4. VM State transition.
> 
> 
>   A VM always starts in Normal Mode.
> 
>   H_SVM_INIT_START moves the VM into transient state. During this
>   time the Ultravisor may request some of its GFNs to be shared or
>   secured. So its GFNs can be in one of the three GFN states.
> 
>   H_SVM_INIT_DONE moves the VM entirely from transient state to
>   secure-state. At this point any left-over normal-GFNs are
>   transitioned to Secure-GFN.
> 
>   H_SVM_INIT_ABORT moves the transient VM back to normal VM.
>   All its GFNs are moved to Normal-GFNs.
> 
>   UV_TERMINATE transitions the secure-VM back to normal-VM. All
>   the secure-GFNs and shared-GFNs are transitioned to normal-GFN.
>   Note: The contents of the normal-GFN is undefined at this point.
> 
>  5. GFN state implementation:
> -
> 
>  Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
>  when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
>  set, and contains the value of the secure-PFN.
>  It is associated with a normal-PFN; also called mem_pfn, when
>  the GFN is pagedout. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
>  The value of the normal-PFN is not tracked.
> 
>  Shared GFN is associated with a normal-PFN. Its pfn[] has
>  KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
>  is not tracked.
> 
>  Normal GFN is associated with normal-PFN. Its pfn[] has
>  no flag set. The value of the normal-PFN is not tracked.
> 
>  6. Life cycle of a GFN
> 
>  --
>  || Share  |  Unshare | SVM   |H_SVM_INIT_DONE|
>  ||operation   |operation | abort/|   |
>  |||  | terminate |   |
>  -
>  |||  |   |   |
>  | Secure | Shared | Secure   |Normal |Secure |
>  |||  |   |   |
>  | Shared | Shared | Secure   |Normal |Shared |
>  |||  |   |   |
>  | Normal | Shared | Secure   |Normal |Secure |
>  --
> 
>  7. Life cycle of a VM
> 
>  
>  | 

Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-22 Thread Bharata B Rao
On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:
> When a secure memslot is dropped, all the pages backed in the secure device
> (aka really backed by secure memory by the Ultravisor) should be paged out
> to a normal page. Previously, this was achieved by triggering the page
> fault mechanism which is calling kvmppc_svm_page_out() on each pages.
> 
> This can't work when hot unplugging a memory slot because the memory slot
> is flagged as invalid and gfn_to_pfn() is then not trying to access the
> page, so the page fault mechanism is not triggered.
> 
> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> simpler to directly calling it instead of triggering such a mechanism. This
> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> memslot.
> 
> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> the call to __kvmppc_svm_page_out() is made.
> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
> addition, the mmap_sem is held in read mode during that time, not in write
> mode since the virtual memory layout is not impacted, and
> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
> 
> Cc: Ram Pai 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
>  1 file changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 5a4b02d3f651..ba5c7c77cc3a 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -624,35 +624,55 @@ static inline int kvmppc_svm_page_out(struct 
> vm_area_struct *vma,
>   * fault on them, do fault time migration to replace the device PTEs in
>   * QEMU page table with normal PTEs from newly allocated pages.
>   */
> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>struct kvm *kvm, bool skip_page_out)
>  {
>   int i;
>   struct kvmppc_uvmem_page_pvt *pvt;
> - unsigned long pfn, uvmem_pfn;
> - unsigned long gfn = free->base_gfn;
> + struct page *uvmem_page;
> + struct vm_area_struct *vma = NULL;
> + unsigned long uvmem_pfn, gfn;
> + unsigned long addr, end;
> +
> + mmap_read_lock(kvm->mm);
> +
> + addr = slot->userspace_addr;

We typically use gfn_to_hva() for that, but that won't work for a
memslot that is already marked INVALID which is the case here.
I think it is ok to access slot->userspace_addr here of an INVALID
memslot, but just thought of explicitly bringing this up.

> + end = addr + (slot->npages * PAGE_SIZE);
>  
> - for (i = free->npages; i; --i, ++gfn) {
> - struct page *uvmem_page;
> + gfn = slot->base_gfn;
> + for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
> +
> + /* Fetch the VMA if addr is not in the latest fetched one */
> + if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
> + vma = find_vma_intersection(kvm->mm, addr, end);
> + if (!vma ||
> + vma->vm_start > addr || vma->vm_end < end) {
> + pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
> + break;
> + }
> + }

In Ram's series, kvmppc_memslot_page_merge() also walks the VMAs spanning
the memslot, but it uses a different logic for the same. Why can't these
two cases use the same method to walk the VMAs? Is there anything subtly
different between the two cases?

Regards,
Bharata.


Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-07-22 Thread Pingfan Liu
On Wed, Jul 22, 2020 at 12:57 PM Michael Ellerman  wrote:
>
> Pingfan Liu  writes:
> > A bug is observed on pseries by taking the following steps on rhel:
> ^
> RHEL
>
> I assume it happens on mainline too?
Yes, it does.
>
[...]
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 1a3ac3b..def8cb3f 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -372,6 +372,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >   invalidate_lmb_associativity_index(lmb);
> >   lmb_clear_nid(lmb);
> >   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > + drmem_update_dt();
>
> No error checking?
Hmm, this needs a more careful design. Please see the comment at the end.
>
> >   __remove_memory(nid, base_addr, block_sz);
> >
> > @@ -607,6 +608,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >
> >   lmb_set_nid(lmb);
> >   lmb->flags |= DRCONF_MEM_ASSIGNED;
> > + drmem_update_dt();
>
> And here ..
> >
> >   block_sz = memory_block_size_bytes();
> >
> > @@ -625,6 +627,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >   invalidate_lmb_associativity_index(lmb);
> >   lmb_clear_nid(lmb);
> >   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > + drmem_update_dt();
>
>
> And here ..
>
> >   __remove_memory(nid, base_addr, block_sz);
> >   }
> > @@ -877,9 +880,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
> >   break;
> >   }
> >
> > - if (!rc)
> > - rc = drmem_update_dt();
> > -
> >   unlock_device_hotplug();
> >   return rc;
>
> Whereas previously we did check it.

drmem_update_dt() fails only if allocating memory fails. And in the failed
case, even the original code does not roll back the effect of
__add_memory()/__remove_memory().

And I plan to do the following in V4: if drmem_update_dt() fails in
dlpar_add_lmb(), then bail out immediately.

Thanks,
Pingfan


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-22 Thread Jarkko Sakkinen
On Thu, Jul 16, 2020 at 06:49:09PM +0200, Christophe Leroy wrote:
> Jarkko Sakkinen  a écrit :
> 
> > Rename module_alloc() to text_alloc() and module_memfree() to
> > text_memfree(), and move them to kernel/text.c, which is unconditionally
> > compiled to the kernel proper. This allows kprobes, ftrace and bpf to
> > allocate space for executable code without requiring to compile the modules
> > support (CONFIG_MODULES=y) in.
> 
> You are not changing enough in powerpc to have this work.
> On powerpc 32 bits (6xx), when STRICT_KERNEL_RWX is selected, the vmalloc
> space is set to NX (no exec) at segment level (ie by 256Mbytes zone) unless
> CONFIG_MODULES is selected.
> 
> Christophe

This has been reduced down to:

https://lore.kernel.org/lkml/20200717030422.679972-1-jarkko.sakki...@linux.intel.com/

I.e. not intruding PPC anymore :-)

/Jarkko


Re: [v3 11/15] powerpc/perf: BHRB control to disable BHRB logic when not used

2020-07-22 Thread Jordan Niethe
On Thu, Jul 23, 2020 at 11:26 AM Jordan Niethe  wrote:
>
> On Sat, Jul 18, 2020 at 1:26 AM Athira Rajeev
>  wrote:
> >
> > PowerISA v3.1 has a few updates for the Branch History Rolling Buffer (BHRB).
> >
> > BHRB disable is controlled via Monitor Mode Control Register A (MMCRA)
> > bit, namely "BHRB Recording Disable (BHRBRD)". This field controls
> > whether BHRB entries are written when BHRB recording is enabled by other
> > bits. This patch implements support for this BHRB disable bit.
> >
> > Setting this bit to 0b1 disables the BHRB, and setting it to 0b0
> > enables it. This preserves backward compatibility (for older OSes),
> > since the bit is cleared and hardware writes to the BHRB by default.
> >
> > This patch sets MMCRA (BHRBRD) at boot for power10 (thereby the core
> > will run faster) and enables this feature only at runtime, i.e. on
> > explicit request from the user. It also saves/restores MMCRA in the
> > restore path of state-loss idle states to make sure BHRB stays
> > disabled if it was not enabled on request at runtime.
> >
> > Signed-off-by: Athira Rajeev 
> > ---
> >  arch/powerpc/perf/core-book3s.c   | 20 
> >  arch/powerpc/perf/isa207-common.c | 12 
> >  arch/powerpc/platforms/powernv/idle.c | 22 --
> >  3 files changed, 48 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/perf/core-book3s.c 
> > b/arch/powerpc/perf/core-book3s.c
> > index bd125fe..31c0535 100644
> > --- a/arch/powerpc/perf/core-book3s.c
> > +++ b/arch/powerpc/perf/core-book3s.c
> > @@ -1218,7 +1218,7 @@ static void write_mmcr0(struct cpu_hw_events *cpuhw, 
> > unsigned long mmcr0)
> >  static void power_pmu_disable(struct pmu *pmu)
> >  {
> > struct cpu_hw_events *cpuhw;
> > -   unsigned long flags, mmcr0, val;
> > +   unsigned long flags, mmcr0, val, mmcra;
> >
> > if (!ppmu)
> > return;
> > @@ -1251,12 +1251,24 @@ static void power_pmu_disable(struct pmu *pmu)
> > mb();
> > isync();
> >
> > +   val = mmcra = cpuhw->mmcr.mmcra;
> > +
> > /*
> >  * Disable instruction sampling if it was enabled
> >  */
> > -   if (cpuhw->mmcr.mmcra & MMCRA_SAMPLE_ENABLE) {
> > -   mtspr(SPRN_MMCRA,
> > - cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
> > +   if (cpuhw->mmcr.mmcra & MMCRA_SAMPLE_ENABLE)
> > +   val &= ~MMCRA_SAMPLE_ENABLE;
> > +
> > +   /* Disable BHRB via mmcra (BHRBRD) for p10 */
> > +   if (ppmu->flags & PPMU_ARCH_310S)
> > +   val |= MMCRA_BHRB_DISABLE;
> > +
> > +   /*
> > +* Write SPRN_MMCRA if mmcra has either disabled
> > +* instruction sampling or BHRB.
> > +*/
> > +   if (val != mmcra) {
> > +   mtspr(SPRN_MMCRA, mmcra);
> > mb();
> > isync();
> > }
> > diff --git a/arch/powerpc/perf/isa207-common.c 
> > b/arch/powerpc/perf/isa207-common.c
> > index 77643f3..964437a 100644
> > --- a/arch/powerpc/perf/isa207-common.c
> > +++ b/arch/powerpc/perf/isa207-common.c
> > @@ -404,6 +404,13 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
> >
> > mmcra = mmcr1 = mmcr2 = mmcr3 = 0;
> >
> > +   /*
> > +* Disable bhrb unless explicitly requested
> > +* by setting MMCRA (BHRBRD) bit.
> > +*/
> > +   if (cpu_has_feature(CPU_FTR_ARCH_31))
> > +   mmcra |= MMCRA_BHRB_DISABLE;
> > +
> > /* Second pass: assign PMCs, set all MMCR1 fields */
> > for (i = 0; i < n_ev; ++i) {
> > pmc = (event[i] >> EVENT_PMC_SHIFT) & EVENT_PMC_MASK;
> > @@ -479,6 +486,11 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
> > mmcra |= val << MMCRA_IFM_SHIFT;
> > }
> >
> > +   /* set MMCRA (BHRBRD) to 0 if there is user request for 
> > BHRB */
> > +   if (cpu_has_feature(CPU_FTR_ARCH_31) &&
> > +   (has_branch_stack(pevents[i]) || (event[i] 
> > & EVENT_WANTS_BHRB)))
> > +   mmcra &= ~MMCRA_BHRB_DISABLE;
> > +
> > if (pevents[i]->attr.exclude_user)
> > mmcr2 |= MMCR2_FCP(pmc);
> >
> > diff --git a/arch/powerpc/platforms/powernv/idle.c 
> > b/arch/powerpc/platforms/powernv/idle.c
> > index 2dd4673..1c9d0a9 100644
> > --- a/arch/powerpc/platforms/powernv/idle.c
> > +++ b/arch/powerpc/platforms/powernv/idle.c
> > @@ -611,6 +611,7 @@ static unsigned long power9_idle_stop(unsigned long 
> > psscr, bool mmu_on)
> > unsigned long srr1;
> > unsigned long pls;
> > unsigned long mmcr0 = 0;
> > +   unsigned long mmcra = 0;
> > struct p9_sprs sprs = {}; /* 

Re: [PATCH v4 5/7] powerpc/iommu: Move iommu_table cleaning routine to iommu_table_clean

2020-07-22 Thread Leonardo Bras
On Tue, 2020-07-21 at 19:52 -0500, Brian King wrote:
> > 
> > As of today, there seems to be nothing like that happening in the
> > driver I am testing. 
> > I spoke to Brian King on slack, and he mentioned that at the point DDW
> > is created there should be no allocations in place.
> 
> I think there are a couple of scenarios here. One is where there is a DMA
> allocation prior to a call to set the DMA mask. Second scenario is if the
> driver makes multiple calls to set the DMA mask. I would argue that a properly
> written driver should tell the IOMMU subsystem what DMA mask it supports prior
> to allocating DMA memory. Documentation/core-api/dma-api-howto.rst should
> describe what is legal and what is not.
> 
> It might be reasonable to declare it's not allowed to allocate DMA memory
> and then later change the DMA mask, and clearly call this out in the
> documentation if it's not already.
> 
> -Brian

Thank you for the feedback Brian!

That makes sense to me. I will try to have this in mind for the next
patchset. 

Best regards,



Re: [PATCH v4 5/7] powerpc/iommu: Move iommu_table cleaning routine to iommu_table_clean

2020-07-22 Thread Leonardo Bras
On Wed, 2020-07-22 at 11:28 +1000, Alexey Kardashevskiy wrote:
> 
> On 22/07/2020 08:13, Leonardo Bras wrote:
> > On Tue, 2020-07-21 at 14:59 +1000, Alexey Kardashevskiy wrote:
> > > On 16/07/2020 17:16, Leonardo Bras wrote:
> > > > Move the part of iommu_table_free() that does struct iommu_table 
> > > > cleaning
> > > > into iommu_table_clean, so we can invoke it separately.
> > > > 
> > > > This new function is useful for cleaning struct iommu_table before
> > > > initializing it again with a new DMA window, without having it freed and
> > > > allocated again.
> > > > 
> > > > Signed-off-by: Leonardo Bras 
> > > > ---
> > > >  arch/powerpc/kernel/iommu.c | 30 ++
> > > >  1 file changed, 18 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > > > index 9704f3f76e63..c3242253a4e7 100644
> > > > --- a/arch/powerpc/kernel/iommu.c
> > > > +++ b/arch/powerpc/kernel/iommu.c
> > > > @@ -735,21 +735,10 @@ struct iommu_table *iommu_init_table(struct 
> > > > iommu_table *tbl, int nid,
> > > > return tbl;
> > > >  }
> > > >  
> > > > -static void iommu_table_free(struct kref *kref)
> > > > +static void iommu_table_clean(struct iommu_table *tbl)
> > > 
> > > iommu_table_free() + iommu_init_table() + set_iommu_table_base() should
> > > work too, why new helper?
> > 
> > iommu_table_free() also frees the tbl, which would need allocate it
> > again (new address) and to fill it up again, unnecessarily. 
> 
> It is a new table in fact, everything is new there. You are only saving
> kfree+kzalloc which does not seem a huge win.
> 
> Also, iommu_table_update() simply assumes 64bit window by passing
> res_start=res_end=0 to iommu_init_table() which is not horribly robust
> either. Yeah, I know, iommu_init_table() is always called with zeroes in
> pseries but this is somewhat ok as those tables are from the device tree
> and those windows don't overlap with 32bit MMIO but under KVM they will
> (well, if we hack QEMU to advertise a single window).
> 
> I suggest removing iommu_pseries_table_update() from 6/7 and do
> iommu_table_free() + iommu_init_table() + set_iommu_table_base() with a
> WARN_ON(pdev->dev.archdata.dma_offset>=SZ_4G), may be even do this all
> in enable_ddw() where we know for sure if it is 1:1 mapping or just a
> big window.

Sure, I have yet to understand the full impact of this change, but I
will implement this and give it a try.

> 
> Out of curiosity - what page sizes does pHyp advertise in "query"?

64kB (page shift 0x10)

> 
> 
> > I think it's a better approach to only change what is needed.
> > 
> > > There is also iommu_table_clear() which does a different thing so you
> > > need a better name.
> > 
> > I agree.
> > I had not noticed this other function before sending the patchset. What
> > would be a better name though? __iommu_table_free()? 
> > 
> > > Second, iommu_table_free
> > > use and it would be ok as we would only see this when hot-unplugging a
> > > PE because we always kept the default window.
> > > Btw you must be seeing these warnings now every time you create DDW with
> > > these patches as at least the first page is reserved, do not you?
> > 
> > It does not print a warning.
> > I noticed other warnings,
> 
> And what are these?

tce_freemulti_pSeriesLP: plpar_tce_stuff failed
[...]

It's regarding the change in pagesize. 
Some places have the tceshift hardcoded as 12, tce_freemulti_pSeriesLP
is one of them, and that is causing some errors.

I wrote a patch fixing this, and I will include it in the next series.

> 
> > but not this one from iommu_table_free():
> > /* verify that table contains no entries */
> > if (!bitmap_empty(tbl->it_map, tbl->it_size))
> > pr_warn("%s: Unexpected TCEs\n", __func__);
> > 
> > Before that, iommu_table_release_pages(tbl) is supposed to clear the 
> > bitmap, so this only tests for a tce that is created in this short period.
> 
> iommu_table_release_pages() only clears reserved pages - page 0 (just a
> protection against NULL DMA pointers) and 32bit MMIO (these should not
> be set for 64bit window). The "%s: Unexpected TCEs\n" is what checks for
> actual mapped TCEs.
> 

Oh, I haven't noticed that. Thanks for pointing!

> > > Since we are replacing a table for a device which is still in the
> > > system, we should not try messing with its DMA if it already has
> > > mappings so the warning should become an error preventing DDW. It is
> > > rather hard to trigger in practice but I could hack a driver to ask for
> > > 32bit DMA mask first, map few pages and then ask for 64bit DMA mask, it
> > > is not illegal, I think. So this needs a new helper - "bool
> > > iommu_table_in_use(tbl)" - to use in enable_ddw(). Or I am overthinking
> > > this?... Thanks,
> > 
> > As of today, there seems to be nothing like that happening in the
> > driver I am testing. 
> > I spoke to Brian King on slack, and he mentioned that at the point DDW
> > is 

Re: [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray

2020-07-22 Thread Anton Blanchard
Hi Scott,

I'm hitting this issue and Rick just pointed my at your patch. Any
chance we could get it upstream?

Thanks,
Anton

> On PowerPC, memory_add_physaddr_to_nid() uses a linear search to find
> an LMB matching the given address.  This scales very poorly when there
> are many LMBs.  The poor scaling cripples drmem_init() during boot:
> lmb_set_nid(), which calls memory_add_physaddr_to_nid(), is called for
> each LMB.
> 
> If we index each LMB in an xarray by its base address we can achieve
> O(log n) search during memory_add_physaddr_to_nid(), which scales much
> better.
> 
> For example, in the lab we have a 64TB P9 machine with 256MB LMBs.
> So, during drmem_init() we instantiate 249854 LMBs.  On a vanilla
> kernel it completes drmem_init() in ~35 seconds with a soft lockup
> trace.  On the patched kernel it completes drmem_init() in ~0.5
> seconds.
> 
> Before:
> [   53.721639] drmem: initializing drmem v2
> [   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s!
> [swapper/0:1] [   80.604377] Modules linked in:
> [   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+
> #4 [   80.604397] NIP:  c00a4980 LR: c00a4940 CTR:
>  [   80.604407] REGS: c0002dbff8493830 TRAP: 0901
> Not tainted  (5.6.0-rc2+) [   80.604412] MSR:  82009033
>   CR: 44000248  XER: 000d [
> 80.604431] CFAR: c00a4a38 IRQMASK: 0 [   80.604431] GPR00:
> c00a4940 c0002dbff8493ac0 c1904400 c0003cfede30 [
>   80.604431] GPR04:  c0f4095a
> 002f 1000 [   80.604431] GPR08:
> cbf7ecdb7fb8 cbf7ecc2d3c8 0008 c00c0002fdfb2001 [
>   80.604431] GPR12:  c0001e8ec200 [   80.604477]
> NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 [   80.604486]
> LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 [   80.604492]
> Call Trace: [   80.604498] [c0002dbff8493ac0] [c00a4940]
> hot_add_scn_to_nid+0x60/0x3e0 (unreliable) [   80.604509]
> [c0002dbff8493b20] [c0087c10]
> memory_add_physaddr_to_nid+0x20/0x60 [   80.604521]
> [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 [
> 80.604530] [c0002dbff8493c10] [c0010154]
> do_one_initcall+0x64/0x2c0 [   80.604540] [c0002dbff8493ce0]
> [c10c4aa0] kernel_init_freeable+0x2d8/0x3a0 [   80.604550]
> [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148 [
> 80.604560] [c0002dbff8493e20] [c000b648]
> ret_from_kernel_thread+0x5c/0x74 [   80.604567] Instruction dump: [
> 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018
> 3908ffe8 7d094214 [   80.604586] 7fa94040 419d00dc e9490010 714a0088
> <2faa0008> 409e00ac e949 7fbe5040 [   89.047390] drmem: 249854
> LMB(s)
> 
> After:
> [   53.424702] drmem: initializing drmem v2
> [   53.898813] drmem: 249854 LMB(s)
> 
> lmb_set_nid() is called from dlpar_lmb_add() so this patch will also
> improve memory hot-add speeds on big machines.
> 
> Signed-off-by: Scott Cheloha 
> ---
>  arch/powerpc/include/asm/drmem.h |  1 +
>  arch/powerpc/mm/drmem.c  | 24 
>  arch/powerpc/mm/numa.c   | 29 ++---
>  3 files changed, 35 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/drmem.h
> b/arch/powerpc/include/asm/drmem.h index 3d76e1c388c2..90a5a9ad872b
> 100644 --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -88,6 +88,7 @@ static inline bool drmem_lmb_reserved(struct
> drmem_lmb *lmb) return lmb->flags & DRMEM_LMB_RESERVED;
>  }
>  
> +struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr);
>  u64 drmem_lmb_memory_max(void);
>  void __init walk_drmem_lmbs(struct device_node *dn,
>   void (*func)(struct drmem_lmb *, const
> __be32 **)); diff --git a/arch/powerpc/mm/drmem.c
> b/arch/powerpc/mm/drmem.c index 44bfbdae920c..62cbe79e3860 100644
> --- a/arch/powerpc/mm/drmem.c
> +++ b/arch/powerpc/mm/drmem.c
> @@ -11,12 +11,31 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> +static DEFINE_XARRAY(drmem_lmb_xa_base_addr);
>  static struct drmem_lmb_info __drmem_info;
>  struct drmem_lmb_info *drmem_info = &__drmem_info;
>  
> +static int drmem_cache_lmb_for_lookup(struct drmem_lmb *lmb)
> +{
> + void *ret;
> +
> + ret = xa_store(&drmem_lmb_xa_base_addr, lmb->base_addr, lmb,
> +GFP_KERNEL);
> + if (xa_is_err(ret))
> + return xa_err(ret);
> +
> + return 0;
> +}
> +
> +struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr)
> +{
> + return xa_load(&drmem_lmb_xa_base_addr, base_addr);
> +}
> +
>  u64 drmem_lmb_memory_max(void)
>  {
>   struct drmem_lmb *last_lmb;
> @@ -364,6 +383,8 @@ static void __init init_drmem_v1_lmbs(const
> __be32 *prop) 
>   for_each_drmem_lmb(lmb) {
> + read_drconf_v1_cell(lmb, &prop);
> + if (drmem_cache_lmb_for_lookup(lmb) != 0)
> + 

Re: OF: Can't handle multiple dma-ranges with different offsets

2020-07-22 Thread Chris Packham

On 22/07/20 4:19 pm, Chris Packham wrote:
> Hi,
>
> I've just fired up linux kernel v5.7 on a p2040 based system and I'm 
> getting the following new warning
>
> OF: Can't handle multiple dma-ranges with different offsets on 
> node(/pcie@ffe202000)
> OF: Can't handle multiple dma-ranges with different offsets on 
> node(/pcie@ffe202000)
>
> The warning itself was added in commit 9d55bebd9816 ("of/address: 
> Support multiple 'dma-ranges' entries") but I gather it's pointing out 
> something about the dts. My boards dts is based heavily on 
> p2041rdb.dts and the relevant pci2 section is identical (reproduced 
> below for reference).
>
>     pci2: pcie@ffe202000 {
>         reg = <0xf 0xfe202000 0 0x1000>;
>         ranges = <0x0200 0 0xe000 0xc 0x4000 0 0x2000
>               0x0100 0 0x 0xf 0xf802 0 0x0001>;
>         pcie@0 {
>             ranges = <0x0200 0 0xe000
>                   0x0200 0 0xe000
>                   0 0x2000
>
>                   0x0100 0 0x
>                   0x0100 0 0x
>                   0 0x0001>;
>         };
>     };
>
> I haven't noticed any ill effect (aside from the scary message). I'm 
> not sure if there's something missing in the dts or in the code that 
> checks the ranges. Any guidance would be appreciated.

I've also just checked the T2080RDB on v5.7.9 which shows a similar issue

OF: Can't handle multiple dma-ranges with different offsets on 
node(/pcie@ffe25)
OF: Can't handle multiple dma-ranges with different offsets on 
node(/pcie@ffe25)
pcieport :00:00.0: Invalid size 0xf9 for dma-range
pcieport :00:00.0: AER: enabled with IRQ 21
OF: Can't handle multiple dma-ranges with different offsets on 
node(/pcie@ffe27)
OF: Can't handle multiple dma-ranges with different offsets on 
node(/pcie@ffe27)
pcieport 0001:00:00.0: Invalid size 0xf9 for dma-range
pcieport 0001:00:00.0: AER: enabled with IRQ 23




[PATCH devicetree 4/4] powerpc: dts: t1040rdb: add ports for Seville Ethernet switch

2020-07-22 Thread Vladimir Oltean
Define the network interface names for the switch ports and hook them up
to the 2 QSGMII PHYs that are onboard.

A conscious decision was taken to go along with the numbers that are
written on the front panel of the board, not with the hardware numbers
of the switch chip ports. The two numbering schemes are shifted by 4.

Signed-off-by: Vladimir Oltean 
---
 arch/powerpc/boot/dts/fsl/t1040rdb.dts | 111 +
 1 file changed, 111 insertions(+)

diff --git a/arch/powerpc/boot/dts/fsl/t1040rdb.dts 
b/arch/powerpc/boot/dts/fsl/t1040rdb.dts
index 40d7126dbe90..28ee06a1706d 100644
--- a/arch/powerpc/boot/dts/fsl/t1040rdb.dts
+++ b/arch/powerpc/boot/dts/fsl/t1040rdb.dts
@@ -75,4 +75,115 @@ &mdio0 {
phy_sgmii_2: ethernet-phy@3 {
reg = <0x3>;
};
+
+   /* VSC8514 QSGMII PHY */
+   phy_qsgmii_0: ethernet-phy@4 {
+   reg = <0x4>;
+   };
+
+   phy_qsgmii_1: ethernet-phy@5 {
+   reg = <0x5>;
+   };
+
+   phy_qsgmii_2: ethernet-phy@6 {
+   reg = <0x6>;
+   };
+
+   phy_qsgmii_3: ethernet-phy@7 {
+   reg = <0x7>;
+   };
+
+   /* VSC8514 QSGMII PHY */
+   phy_qsgmii_4: ethernet-phy@8 {
+   reg = <0x8>;
+   };
+
+   phy_qsgmii_5: ethernet-phy@9 {
+   reg = <0x9>;
+   };
+
+   phy_qsgmii_6: ethernet-phy@a {
+   reg = <0xa>;
+   };
+
+   phy_qsgmii_7: ethernet-phy@b {
+   reg = <0xb>;
+   };
+};
+
+&seville_port0 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_0>;
+   phy-mode = "qsgmii";
+   /* ETH4 written on chassis */
+   label = "swp4";
+   status = "okay";
+};
+
+&seville_port1 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_1>;
+   phy-mode = "qsgmii";
+   /* ETH5 written on chassis */
+   label = "swp5";
+   status = "okay";
+};
+
+&seville_port2 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_2>;
+   phy-mode = "qsgmii";
+   /* ETH6 written on chassis */
+   label = "swp6";
+   status = "okay";
+};
+
+&seville_port3 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_3>;
+   phy-mode = "qsgmii";
+   /* ETH7 written on chassis */
+   label = "swp7";
+   status = "okay";
+};
+
+&seville_port4 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_4>;
+   phy-mode = "qsgmii";
+   /* ETH8 written on chassis */
+   label = "swp8";
+   status = "okay";
+};
+
+&seville_port5 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_5>;
+   phy-mode = "qsgmii";
+   /* ETH9 written on chassis */
+   label = "swp9";
+   status = "okay";
+};
+
+&seville_port6 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_6>;
+   phy-mode = "qsgmii";
+   /* ETH10 written on chassis */
+   label = "swp10";
+   status = "okay";
+};
+
+&seville_port7 {
+   managed = "in-band-status";
+   phy-handle = <&phy_qsgmii_7>;
+   phy-mode = "qsgmii";
+   /* ETH11 written on chassis */
+   label = "swp11";
+   status = "okay";
+};
+
+&seville_port8 {
+   ethernet = <>;
+   status = "okay";
 };
-- 
2.25.1



[PATCH devicetree 2/4] powerpc: dts: t1040: label the 2 MDIO controllers

2020-07-22 Thread Vladimir Oltean
In preparation of referencing the MDIO nodes from board DTS files (so
that we can add PHY nodes easier), add labels to mdio0 and mdio1.

Signed-off-by: Vladimir Oltean 
---
 arch/powerpc/boot/dts/fsl/t1040si-post.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
index 4af856dcc6a3..e1b138b3c714 100644
--- a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
@@ -620,11 +620,11 @@ enet3: ethernet@e6000 {
enet4: ethernet@e8000 {
};
 
-   mdio@fc000 {
+   mdio0: mdio@fc000 {
interrupts = <100 1 0 0>;
};
 
-   mdio@fd000 {
+   mdio1: mdio@fd000 {
status = "disabled";
};
};
-- 
2.25.1



[PATCH devicetree 3/4] powerpc: dts: t1040rdb: put SGMII PHY under &mdio0 label

2020-07-22 Thread Vladimir Oltean
We're going to add 8 more PHYs in a future patch. It is easier to follow
the hardware description if we don't need to fish for the path of the
MDIO controllers inside the SoC and just use the labels.

Signed-off-by: Vladimir Oltean 
---
 arch/powerpc/boot/dts/fsl/t1040rdb.dts | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t1040rdb.dts 
b/arch/powerpc/boot/dts/fsl/t1040rdb.dts
index 65ff34c49025..40d7126dbe90 100644
--- a/arch/powerpc/boot/dts/fsl/t1040rdb.dts
+++ b/arch/powerpc/boot/dts/fsl/t1040rdb.dts
@@ -59,12 +59,6 @@ ethernet@e4000 {
phy-handle = <&phy_sgmii_2>;
phy-connection-type = "sgmii";
};
-
-   mdio@fc000 {
-   phy_sgmii_2: ethernet-phy@3 {
-   reg = <0x03>;
-   };
-   };
};
};
 
@@ -76,3 +70,9 @@ cpld@3,0 {
 };
 
 #include "t1040si-post.dtsi"
+
+&mdio0 {
+   phy_sgmii_2: ethernet-phy@3 {
+   reg = <0x3>;
+   };
+};
-- 
2.25.1



[PATCH devicetree 0/4] Add Seville Ethernet switch to T1040RDB

2020-07-22 Thread Vladimir Oltean
Seville is a DSA switch that is embedded inside the T1040 SoC, and
supported by the mscc_seville DSA driver. The driver has been accepted
this release cycle and is currently available in net-next (and
therefore, in linux-next).

This series adds this switch to the SoC's dtsi files and to the T1040RDB
board file.

Vladimir Oltean (4):
  powerpc: dts: t1040: add bindings for Seville Ethernet switch
  powerpc: dts: t1040: label the 2 MDIO controllers
  powerpc: dts: t1040rdb: put SGMII PHY under &mdio0 label
  powerpc: dts: t1040rdb: add ports for Seville Ethernet switch

 arch/powerpc/boot/dts/fsl/t1040rdb.dts  | 123 +++-
 arch/powerpc/boot/dts/fsl/t1040si-post.dtsi |  79 -
 2 files changed, 194 insertions(+), 8 deletions(-)

-- 
2.25.1



[PATCH devicetree 1/4] powerpc: dts: t1040: add bindings for Seville Ethernet switch

2020-07-22 Thread Vladimir Oltean
Add the description of the embedded L2 switch inside the SoC dtsi file
for NXP T1040.

Signed-off-by: Vladimir Oltean 
---
 arch/powerpc/boot/dts/fsl/t1040si-post.dtsi | 75 +
 1 file changed, 75 insertions(+)

diff --git a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
index 315d0557eefc..4af856dcc6a3 100644
--- a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
@@ -628,6 +628,81 @@ mdio@fd000 {
status = "disabled";
};
};
+
+   seville_switch: ethernet-switch@80 {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   compatible = "mscc,vsc9953-switch";
+   little-endian;
+   reg = <0x80 0x29>;
+
+   ports {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   seville_port0: port@0 {
+   reg = <0>;
+   status = "disabled";
+   };
+
+   seville_port1: port@1 {
+   reg = <1>;
+   status = "disabled";
+   };
+
+   seville_port2: port@2 {
+   reg = <2>;
+   status = "disabled";
+   };
+
+   seville_port3: port@3 {
+   reg = <3>;
+   status = "disabled";
+   };
+
+   seville_port4: port@4 {
+   reg = <4>;
+   status = "disabled";
+   };
+
+   seville_port5: port@5 {
+   reg = <5>;
+   status = "disabled";
+   };
+
+   seville_port6: port@6 {
+   reg = <6>;
+   status = "disabled";
+   };
+
+   seville_port7: port@7 {
+   reg = <7>;
+   status = "disabled";
+   };
+
+   seville_port8: port@8 {
+   reg = <8>;
+   phy-mode = "internal";
+   status = "disabled";
+
+   fixed-link {
+   speed = <2500>;
+   full-duplex;
+   };
+   };
+
+   seville_port9: port@9 {
+   reg = <9>;
+   phy-mode = "internal";
+   status = "disabled";
+
+   fixed-link {
+   speed = <2500>;
+   full-duplex;
+   };
+   };
+   };
+   };
 };
 
  {
-- 
2.25.1



Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-22 Thread Atish Patra
On Wed, Jul 22, 2020 at 1:23 PM Arnd Bergmann  wrote:
>
> On Wed, Jul 22, 2020 at 9:52 PM Palmer Dabbelt  wrote:
> > On Wed, 22 Jul 2020 02:43:50 PDT (-0700), Arnd Bergmann wrote:
> > > On Tue, Jul 21, 2020 at 9:06 PM Palmer Dabbelt  wrote:
> > > The eventual goal is to have a split of 3840MB for either user or linear 
> > > map
> > > plus 256MB for vmalloc, including the kernel. Switching between linear
> > > and user has a noticeable runtime overhead, but it relaxes both the limits
> > > for user memory and lowmem, and it provides a somewhat stronger
> > > address space isolation.
> >
> > Ya, I think we decided not to do that, at least for now.  I guess the right
> > answer there will depend on what 32-bit systems look like, and since we 
> > don't
> > have any I'm inclined to just stick to the fast option.
>
> Makes sense. Actually on 32-bit Arm we see fewer large-memory
> configurations in new machines than we had in the past before 64-bit
> machines were widely available at low cost, so I expect not to see a
> lot of new hardware with more than 1GB of DDR3 (two 256Mbit x16 chips)
> for cost reasons, and rv32 is likely going to be similar, so you may never
> really see a need for highmem or the above hack to increase the
> size of the linear mapping.
>
> I just noticed that rv32 allows 2GB of lowmem rather than just the usual
> 768MB or 1GB, at the expense of addressable user memory. This seems
> like an unusual choice, but I also don't see any reason to change this
> or make it more flexible unless actual users appear.
>

I am a bit confused here. As per my understanding, RV32 supports 1GB
of lowmem only
as the page offset is set to 0xC000. The config option
MAXPHYSMEM_2GB is misleading
as RV32 actually allows 1GB of physical memory only. Any memory blocks beyond
DRAM + 1GB are removed in setup_bootmem. IMHO, The current config
should clarify that.

Moreover, we should add 2G split under a separate configuration if we
want to support that.

>Arnd
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv



-- 
Regards,
Atish


Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-22 Thread Arnd Bergmann
On Wed, Jul 22, 2020 at 9:52 PM Palmer Dabbelt  wrote:
> On Wed, 22 Jul 2020 02:43:50 PDT (-0700), Arnd Bergmann wrote:
> > On Tue, Jul 21, 2020 at 9:06 PM Palmer Dabbelt  wrote:
> > The eventual goal is to have a split of 3840MB for either user or linear map
> > plus 256MB for vmalloc, including the kernel. Switching between linear
> > and user has a noticeable runtime overhead, but it relaxes both the limits
> > for user memory and lowmem, and it provides a somewhat stronger
> > address space isolation.
>
> Ya, I think we decided not to do that, at least for now.  I guess the right
> answer there will depend on what 32-bit systems look like, and since we don't
> have any I'm inclined to just stick to the fast option.

Makes sense. Actually on 32-bit Arm we see fewer large-memory
configurations in new machines than we had in the past, before 64-bit
machines were widely available at low cost, so I expect not to see a
lot of new hardware with more than 1GB of DDR3 (two 256Mbit x16 chips)
for cost reasons. rv32 is likely going to be similar, so you may never
really see a need for highmem or the above hack to increase the
size of the linear mapping.

I just noticed that rv32 allows 2GB of lowmem rather than just the usual
768MB or 1GB, at the expense of addressable user memory. This seems
like an unusual choice, but I also don't see any reason to change this
or make it more flexible unless actual users appear.

   Arnd


Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-22 Thread Palmer Dabbelt

On Wed, 22 Jul 2020 02:43:50 PDT (-0700), Arnd Bergmann wrote:

On Tue, Jul 21, 2020 at 9:06 PM Palmer Dabbelt  wrote:


On Tue, 21 Jul 2020 11:36:10 PDT (-0700), a...@ghiti.fr wrote:
> Let's try to make progress here: I add linux-mm in CC to get feedback on
> this patch as it blocks sv48 support too.

Sorry for being slow here.  I haven't replied because I hadn't really fleshed
out the design yet, but just so everyone's on the same page my problems with
this are:

* We waste vmalloc space on 32-bit systems, where there isn't a lot of it.


There is actually an ongoing work to make 32-bit Arm kernels move
vmlinux into the vmalloc space, as part of the move to avoid highmem.

Overall, a 32-bit system would waste about 0.1% of its virtual address space
by having the kernel be located in both the linear map and the vmalloc area.
It's not zero, but not that bad either. With the typical split of 3072MB user,
768MB linear and 256MB vmalloc, it's also around 1.5% of the available
vmalloc area (assuming a 4MB vmlinux in a typical 32-bit kernel), but the
boundaries can be changed arbitrarily if needed.


OK, I guess maybe it's not so bad.  Our 32-bit defconfig is 10MiB, but I
wouldn't really put much weight behind that number as it's just a 64-bit
defconfig built for 32-bit.  We don't have any 32-bit hardware anyway, so if
this becomes an issue later I guess we can just deal with it then.


The eventual goal is to have a split of 3840MB for either user or linear map
plus 256MB for vmalloc, including the kernel. Switching between linear
and user has a noticeable runtime overhead, but it relaxes both the limits
for user memory and lowmem, and it provides somewhat stronger
address space isolation.


Ya, I think we decided not to do that, at least for now.  I guess the right
answer there will depend on what 32-bit systems look like, and since we don't
have any I'm inclined to just stick to the fast option.


Another potential idea would be to completely randomize the physical
addresses underneath the kernel by using a random permutation of the
pages in the kernel image. This adds even more overhead (virt_to_phys
may need to call vmalloc_to_page or similar) and may cause problems
with DMA into kernel .data across page boundaries,


* Sort out how to maintain a linear map as the canonical hole moves around
  between the VA widths without adding a bunch of overhead to virt2phys and
  friends.  This is probably going to be the trickiest part, but I think if we
  just change the page table code to essentially lie about VAs when an sv39
  system runs an sv48+sv39 kernel we could make it work -- there'd be some
  logical complexity involved, but it would remain fast.


I assume you can't use the trick that x86 has where all kernel addresses
are at the top of the 64-bit address space and user addresses are at the
bottom, regardless of the size of the page tables?


They have the load in their mapping functions; as far as I can tell that's
required to do this sort of thing.  We do as well, to handle some of the
implicit boot stuff for now, but I was assuming that we'd want to get rid of
that for performance reasons.  That said, maybe it just doesn't matter?


Re: [PATCH v4 07/12] ppc64/kexec_file: add support to relocate purgatory

2020-07-22 Thread Hari Bathini



On 22/07/20 9:55 am, Michael Ellerman wrote:
> Hari Bathini  writes:
>> Right now purgatory implementation is only minimal. But if purgatory
>> code is to be enhanced to copy memory to the backup region and verify
>> sha256 digest, relocations may have to be applied to the purgatory.
>> So, add support to relocate purgatory in kexec_file_load system call
>> by setting up TOC pointer and applying RELA relocations as needed.
>>
>> Reported-by: kernel test robot 
>> [lkp: In v1, 'struct mem_sym' was declared in parameter list]
>> Signed-off-by: Hari Bathini 
>> ---
>>
>> * Michael, can you share your opinion on the below:
>> - https://lore.kernel.org/patchwork/patch/1272027/
>> - My intention in cover note.
> 
> It seems like a lot of complexity for little benefit.
> 
> AFAICS your final purgatory_64.c is only 36 lines, and all it does is a
> single (open coded) memcpy().
> 
> It seems like we could write that in not many more lines of assembler
> and avoid all this code.

Hi Michael,

I am not sure if you would agree with me on this, but I am looking at the
purgatory code as a work in progress. As mentioned in the cover note, I intend
to add log messaging and sha256 verification to the purgatory, and also change
it to a position-independent executable after moving the common purgatory code
(sha256 verification) to arch-independent code.

When I initially took this up, I wanted to add all the above changes too, but
cut down on it, in the interest of time, to first get kdump (kexec -s -p)
working in the v5.9 merge window.

But as the logic in patches 07/12 & 08/12 has been tested in the kexec-tools
code many times, and there are unlikely to be any changes to them except for
the __kexec_do_relocs() function (afaics) when -PIE is used, I submitted them.
With patch 09/12, I tried for a change that uses relocations while remaining
minimal for now.

Would you prefer it to be absolutely minimal by dropping patches 7 & 8 for
now and writing the backup data copy code in assembler?

Thanks
Hari


Re: [PATCH v3 0/2] Selftest for cpuidle latency measurement

2020-07-22 Thread Pratik Sampat

Hello Daniel,

On 21/07/20 8:27 pm, Daniel Lezcano wrote:

On 21/07/2020 14:42, Pratik Rajesh Sampat wrote:

v2: https://lkml.org/lkml/2020/7/17/369
Changelog v2-->v3
Based on comments from Gautham R. Shenoy adding the following in the
selftest,
1. Grepping modules to determine if already loaded
2. Wrapper to enable/disable states
3. Preventing any operation/test on offlined CPUs
---

The patch series introduces a mechanism to measure wakeup latency for
IPI and timer based interrupts.
The motivation behind this series is to find significant deviations
from the advertised latency and residency values.

Why do you want to measure for the timer and the IPI? Whatever the
source of the wakeup, the exit latency remains the same, no?

Is all this kernel-ish code really needed ?

What about using a highres periodic timer and making it expire every e.g.
50ms x 2400, so it is 120 seconds, and measuring the deviation. Repeat the
operation for each idle state.

And in order to make it as accurate as possible, set the program
affinity on a CPU and isolate that CPU by preventing other processes from
being scheduled on it, and migrate the interrupts to the other CPUs.

That will be all userspace code, no?



The kernel module may not be needed now that you mention it.
IPI latencies could be measured using pipes and threads, using
pthread_attr_setaffinity_np to control the experiment, as you
suggested. This should internally fire a smp_call_function_single.

The original idea was to essentially measure it as closely as possible
in the kernel without involving the kernel-->userspace overhead.
However, the user-space approach may not be too much of a problem, as
we are collecting a baseline, and the delta of the latency is what we
would be concerned about anyway!

With respect to measuring both timers and IPI latencies: In principle
yes, the exit latency should remain the same but if there is a
deviation in reality we may want to measure it.

I'll implement this experiment in the userspace and get back with the
numbers to confirm.

Thanks for your comments!
Best,
Pratik

Re: [PATCH] spi: ppc4xx: Convert to use GPIO descriptors

2020-07-22 Thread Mark Brown
On Tue, 14 Jul 2020 09:22:26 +0200, Linus Walleij wrote:
> This converts the PPC4xx SPI driver to use GPIO descriptors.
> 
> The driver is already just picking some GPIOs from the device
> tree, so the conversion is pretty straightforward. However,
> this driver is looking for a pure "gpios" property rather
> than the standard binding "cs-gpios", so we need to add a new
> exception to the gpiolib OF parser to allow this for this
> driver's compatibles.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/1] spi: ppc4xx: Convert to use GPIO descriptors
  commit: 4726773292bfdb2917a0b4d369ddccd5e2f30991

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


RE: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-22 Thread Michael Ellerman
Ram Pai  writes:
> On Wed, Jul 22, 2020 at 12:06:06PM +1000, Michael Ellerman wrote:
>> Ram Pai  writes:
>> > An instruction accessing an mmio address generates an HDSI fault. This
>> > fault is appropriately handled by the Hypervisor. However in the case of
>> > secureVMs, the fault is delivered to the ultravisor.
>> >
>> > Unfortunately the Ultravisor has no correct way to fetch the faulting
>> > instruction. The PEF architecture does not allow the Ultravisor to enable
>> > MMU translation. Walking the two-level page table to read the instruction
>> > can race with other vcpus modifying the SVM's process scoped page table.
>> 
>> You're trying to read the guest's kernel text IIUC, that mapping should
>> be stable. Possibly permissions on it could change over time, but the
>> virtual -> real mapping should not.
>
> Actually the code does not capture the address of the instruction in the
> sprg0 register. It captures the instruction itself. So should the mapping
> matter?

Sorry that was talking about reading the instruction by doing the page
walk, not with this patch applied.

cheers


Re: [v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs

2020-07-22 Thread Athira Rajeev


> On 22-Jul-2020, at 4:19 PM, Jordan Niethe  wrote:
> 
> On Wed, Jul 22, 2020 at 5:55 PM Athira Rajeev wrote:
>> 
>> 
>> 
>> On 22-Jul-2020, at 10:11 AM, Jordan Niethe  wrote:
>> 
>> On Sat, Jul 18, 2020 at 1:13 AM Athira Rajeev
>>  wrote:
>> 
>> 
>> From: Madhavan Srinivasan 
>> 
>> Add power10 feature function to dt_cpu_ftrs.c along
>> with a power10 specific init() to initialize pmu sprs,
>> sets the oprofile_cpu_type and cpu_features. This will
>> enable performance monitoring unit(PMU) for Power10
>> in CPU features with "performance-monitor-power10".
>> 
>> For PowerISA v3.1, BHRB disable is controlled via Monitor Mode
>> Control Register A (MMCRA) bit, namely "BHRB Recording Disable
>> (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
>> feature at boot for power10.
>> 
>> Signed-off-by: Madhavan Srinivasan 
>> ---
>> arch/powerpc/include/asm/reg.h|  3 +++
>> arch/powerpc/kernel/cpu_setup_power.S |  8 
>> arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++
>> 3 files changed, 37 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 21a1b2d..900ada1 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -1068,6 +1068,9 @@
>> #define MMCR0_PMC2_LOADMISSTIME0x5
>> #endif
>> 
>> +/* BHRB disable bit for PowerISA v3.10 */
>> +#define MMCRA_BHRB_DISABLE 0x0020
>> 
>> Shouldn't this go under SPRN_MMCRA with the other MMCRA_*.
>> 
>> 
>> 
>> Hi Jordan
>> 
>> Ok, the definition of MMCRA is under #ifdef for 64-bit. If I move the
>> definition of MMCRA_BHRB_DISABLE along with the other SPRs, I also
>> need to define this for 32-bit to satisfy core-book3s to compile, as below:
>> 
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 900ada10762c..7e271657b412 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -888,6 +888,8 @@
>> #define   MMCRA_SLOT   0x0700UL /* SLOT bits (37-39) */
>> #define   MMCRA_SLOT_SHIFT 24
>> #define   MMCRA_SAMPLE_ENABLE 0x0001UL /* enable sampling */
>> +/* BHRB disable bit for PowerISA v3.10 */
>> +#define   MMCRA_BHRB_DISABLE  0x0020
>> #define   POWER6_MMCRA_SDSYNC 0x0800ULL/* SDAR/SIAR synced */
>> #define   POWER6_MMCRA_SIHV   0x0400ULL
>> #define   POWER6_MMCRA_SIPR   0x0200ULL
>> @@ -1068,9 +1070,6 @@
>> #define MMCR0_PMC2_LOADMISSTIME0x5
>> #endif
>> 
>> 
>> 
>> -/* BHRB disable bit for PowerISA v3.10 */
>> -#define MMCRA_BHRB_DISABLE 0x0020
>> -
>> /*
>>  * SPRG usage:
>>  *
>> diff --git a/arch/powerpc/perf/core-book3s.c 
>> b/arch/powerpc/perf/core-book3s.c
>> index 36baae666387..88068f20827c 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -94,6 +94,7 @@ static unsigned int freeze_events_kernel = MMCR0_FCS;
>> #define SPRN_SIER2 0
>> #define SPRN_SIER3 0
>> #define MMCRA_SAMPLE_ENABLE0
>> +#define MMCRA_BHRB_DISABLE 0
>> 
>> 
>> 
>> static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
>> {
>> 
>> 
>> 
>> +
>> /*
>> * SPRG usage:
>> *
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
>> b/arch/powerpc/kernel/cpu_setup_power.S
>> index efdcfa7..b8e0d1e 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>> @@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
>> _GLOBAL(__setup_cpu_power10)
>>   mflrr11
>>   bl  __init_FSCR_power10
>> +   bl  __init_PMU_ISA31
>> 
>> So we set MMCRA here but then aren't we still going to call __init_PMU
>> which will overwrite that?
>> Would this setting MMCRA also need to be handled in __restore_cpu_power10?
>> 
>> 
>> Thanks for this nice catch!  When I rebased the code in the initial phase, we
>> didn't have the power10 part filled in.
>> It was a miss from my side in adding the PMU init functions, and thanks for
>> pointing this out.
>> The patch below will call the __init_PMU functions in setup and restore.
>> Please check if this looks good.
>> 
>> --
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
>> b/arch/powerpc/kernel/cpu_setup_power.S
>> index efdcfa714106..e672a6c5fd7c 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>> @@ -94,6 +94,9 @@ _GLOBAL(__restore_cpu_power8)
>> _GLOBAL(__setup_cpu_power10)
>>  mflr r11
>>  bl __init_FSCR_power10
>> + bl __init_PMU
>> + bl __init_PMU_ISA31
>> + bl __init_PMU_HV
>>  b 1f
>> 
>> _GLOBAL(__setup_cpu_power9)
> 
> Won't you also need to change where the label 1 is:
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -100,8 +100,8 @@ _GLOBAL(__setup_cpu_power10)
> _GLOBAL(__setup_cpu_power9)
>mflrr11
>bl  __init_FSCR
> -1: bl  __init_PMU
> -   bl  

Re: [PATCH v3] powerpc/pseries: Avoid using addr_to_pfn in realmode

2020-07-22 Thread Ganesh



On 7/21/20 3:38 PM, Nicholas Piggin wrote:

Excerpts from Ganesh Goudar's message of July 20, 2020 6:03 pm:

When a UE or memory error exception is encountered, the MCE handler
tries to find the pfn using addr_to_pfn(), which takes an effective
address as an argument; later the pfn is used to poison the page where
the memory error occurred. A recent rework in this area made addr_to_pfn()
run in realmode, which can be fatal as it may try to access
memory outside the RMO region.

To fix this, have separate functions for realmode and virtual mode
handling, and let addr_to_pfn() run in virtual mode.

You didn't really explain what you moved around. You added some
helper functions, but what does it actually do differently now? Can you
explain that in the changelog?


Sure, I'll rephrase the changelog. Here I have moved everything that we can and
must do in virtual mode into a new helper function which runs in virtual mode,
like filling the mce error info, using addr_to_pfn() and calling
save_mce_event().



Thanks,
Nick


Without this fix following kernel crash is seen on hitting UE.

[  485.128036] Oops: Kernel access of bad area, sig: 11 [#1]
[  485.128040] LE SMP NR_CPUS=2048 NUMA pSeries
[  485.128047] Modules linked in:
[  485.128067] CPU: 15 PID: 6536 Comm: insmod Kdump: loaded Tainted: G OE 5.7.0 
#22
[  485.128074] NIP:  c009b24c LR: c00398d8 CTR: c0cd57c0
[  485.128078] REGS: c3f1f970 TRAP: 0300   Tainted: G OE (5.7.0)
[  485.128082] MSR:  80001003   CR: 28008284  XER: 0001
[  485.128088] CFAR: c009b190 DAR: c001fab0 DSISR: 4000 
IRQMASK: 1
[  485.128088] GPR00: 0001 c3f1fbf0 c1634300 
b0fa0100
[  485.128088] GPR04: d222  fab0 
0022
[  485.128088] GPR08: c001fab0  c001fab0 
c3f1fc14
[  485.128088] GPR12: 0008 c3ff5880 d218 

[  485.128088] GPR16: ff20 fff1 fff2 
d21a1100
[  485.128088] GPR20: d220 c0015c893c50 c0d49b28 
c0015c893c50
[  485.128088] GPR24: d21a0d08 c14e5da8 d21a0818 
000a
[  485.128088] GPR28: 0008 000a c17e2970 
000a
[  485.128125] NIP [c009b24c] __find_linux_pte+0x11c/0x310
[  485.128130] LR [c00398d8] addr_to_pfn+0x138/0x170
[  485.128133] Call Trace:
[  485.128135] Instruction dump:
[  485.128138] 3929 7d4a3378 7c883c36 7d2907b4 794a1564 7d294038 794af082 
3900
[  485.128144] 79291f24 790af00e 78e70020 7d095214 <7c69502a> 2fa3 419e011c 
70690040
[  485.128152] ---[ end trace d34b27e29ae0e340 ]---

Signed-off-by: Ganesh Goudar 
---
V2: Leave bare metal code and save_mce_event as is.

V3: Have separate functions for realmode and virtual mode handling.
---
  arch/powerpc/platforms/pseries/ras.c | 119 ---
  1 file changed, 70 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index f3736fcd98fc..32fe3fad86b8 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -522,18 +522,55 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
  }
  
+static int mce_handle_err_realmode(int disposition, u8 error_type)
+{
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (disposition == RTAS_DISP_NOT_RECOVERED) {
+   switch (error_type) {
+   case MC_ERROR_TYPE_SLB:
+   case MC_ERROR_TYPE_ERAT:
+   /*
+* Store the old slb content in paca before flushing.
+* Print this when we go to virtual mode.
+* There are chances that we may hit MCE again if there
+* is a parity error on the SLB entry we are trying to
+* read for saving. Hence limit the slb saving to a single
+* level of recursion.
+*/
+   if (local_paca->in_mce == 1)
+   slb_save_contents(local_paca->mce_faulty_slbs);
+   flush_and_reload_slb();
+   disposition = RTAS_DISP_FULLY_RECOVERED;
+   break;
+   default:
+   break;
+   }
+   } else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
+   /* Platform corrected itself but could be degraded */
+   pr_err("MCE: limited recovery, system may be degraded\n");
+   disposition = RTAS_DISP_FULLY_RECOVERED;
+   }
+#endif
+   return disposition;
+}
  
-static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
+static int mce_handle_err_virtmode(struct pt_regs *regs,
+  struct 

Re: [v3 04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs

2020-07-22 Thread Athira Rajeev


> On 22-Jul-2020, at 9:48 AM, Jordan Niethe  wrote:
> 
> On Sat, Jul 18, 2020 at 1:02 AM Athira Rajeev wrote:
>> 
>> From: Madhavan Srinivasan 
>> 
>> PowerISA v3.1 includes new performance monitoring unit(PMU)
>> special purpose registers (SPRs). They are
>> 
>> Monitor Mode Control Register 3 (MMCR3)
>> Sampled Instruction Event Register 2 (SIER2)
>> Sampled Instruction Event Register 3 (SIER3)
>> 
>> MMCR3 is added for further sampling related configuration
>> control. SIER2/SIER3 are added to provide additional
>> information about the sampled instruction.
>> 
>> Patch adds new PPMU flag called "PPMU_ARCH_310S" to support
>> handling of these new SPRs, updates the struct thread_struct
>> to include these new SPRs, include MMCR3 in struct mmcr_regs.
>> This is needed to support programming of MMCR3 SPR during
>> event_[enable/disable]. Patch also adds the sysfs support
>> for the MMCR3 SPR along with SPRN_ macros for these new pmu sprs.
>> 
>> Signed-off-by: Madhavan Srinivasan 
>> ---
>> arch/powerpc/include/asm/perf_event_server.h |  2 ++
>> arch/powerpc/include/asm/processor.h |  4 
>> arch/powerpc/include/asm/reg.h   |  6 ++
>> arch/powerpc/kernel/sysfs.c  |  8 
>> arch/powerpc/perf/core-book3s.c  | 29 
>> 
>> 5 files changed, 49 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/perf_event_server.h 
>> b/arch/powerpc/include/asm/perf_event_server.h
>> index 14b8dc1..832450a 100644
>> --- a/arch/powerpc/include/asm/perf_event_server.h
>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>> @@ -22,6 +22,7 @@ struct mmcr_regs {
>>unsigned long mmcr1;
>>unsigned long mmcr2;
>>unsigned long mmcra;
>> +   unsigned long mmcr3;
>> };
>> /*
>>  * This struct provides the constants and functions needed to
>> @@ -75,6 +76,7 @@ struct power_pmu {
>> #define PPMU_HAS_SIER  0x0040 /* Has SIER */
>> #define PPMU_ARCH_207S 0x0080 /* PMC is architecture v2.07S */
>> #define PPMU_NO_SIAR   0x0100 /* Do not use SIAR */
>> +#define PPMU_ARCH_310S 0x0200 /* Has MMCR3, SIER2 and SIER3 */
> We elsewhere have CPU_FTR_ARCH_31, so should this be PPMU_ARCH_31S to
> be consistent.
>> 

Ok,
This change will need to be done in all places which are currently using
PPMU_ARCH_310S.

>> /*
>>  * Values for flags to get_alternatives()
>> diff --git a/arch/powerpc/include/asm/processor.h 
>> b/arch/powerpc/include/asm/processor.h
>> index 52a6783..a466e94 100644
>> --- a/arch/powerpc/include/asm/processor.h
>> +++ b/arch/powerpc/include/asm/processor.h
>> @@ -272,6 +272,10 @@ struct thread_struct {
>>unsignedmmcr0;
>> 
>>unsignedused_ebb;
>> +   unsigned long   mmcr3;
>> +   unsigned long   sier2;
>> +   unsigned long   sier3;
>> +
>> #endif
>> };
>> 
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 88e6c78..21a1b2d 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -876,7 +876,9 @@
>> #define   MMCR0_FCHV   0x0001UL /* freeze conditions in hypervisor mode 
>> */
>> #define SPRN_MMCR1 798
>> #define SPRN_MMCR2 785
>> +#define SPRN_MMCR3 754
>> #define SPRN_UMMCR2769
>> +#define SPRN_UMMCR3738
>> #define SPRN_MMCRA 0x312
>> #define   MMCRA_SDSYNC 0x8000UL /* SDAR synced with SIAR */
>> #define   MMCRA_SDAR_DCACHE_MISS 0x4000UL
>> @@ -918,6 +920,10 @@
>> #define   SIER_SIHV0x100   /* Sampled MSR_HV */
>> #define   SIER_SIAR_VALID  0x040   /* SIAR contents valid */
>> #define   SIER_SDAR_VALID  0x020   /* SDAR contents valid */
>> +#define SPRN_SIER2 752
>> +#define SPRN_SIER3 753
>> +#define SPRN_USIER2736
>> +#define SPRN_USIER3737
>> #define SPRN_SIAR  796
>> #define SPRN_SDAR  797
>> #define SPRN_TACR  888
>> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
>> index 571b325..46b4ebc 100644
>> --- a/arch/powerpc/kernel/sysfs.c
>> +++ b/arch/powerpc/kernel/sysfs.c
>> @@ -622,8 +622,10 @@ void ppc_enable_pmcs(void)
>> SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
>> 
>> SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
>> +SYSFS_PMCSETUP(mmcr3, SPRN_MMCR3);
>> 
>> static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
>> +static DEVICE_ATTR(mmcr3, 0600, show_mmcr3, store_mmcr3);
>> #endif /* HAS_PPC_PMC56 */
>> 
>> 
>> @@ -886,6 +888,9 @@ static int register_cpu_online(unsigned int cpu)
>> #ifdef CONFIG_PMU_SYSFS
>>if (cpu_has_feature(CPU_FTR_MMCRA))
>>device_create_file(s, &dev_attr_mmcra);
>> +
>> +   if (cpu_has_feature(CPU_FTR_ARCH_31))
>> +   device_create_file(s, &dev_attr_mmcr3);
>> #endif /* CONFIG_PMU_SYSFS */
>> 
>>if (cpu_has_feature(CPU_FTR_PURR)) {
>> @@ -980,6 +985,9 @@ static int unregister_cpu_online(unsigned int cpu)
>> #ifdef 

Re: [v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs

2020-07-22 Thread Athira Rajeev


> On 22-Jul-2020, at 10:11 AM, Jordan Niethe  wrote:
> 
> On Sat, Jul 18, 2020 at 1:13 AM Athira Rajeev wrote:
>> 
>> From: Madhavan Srinivasan 
>> 
>> Add power10 feature function to dt_cpu_ftrs.c along
>> with a power10 specific init() to initialize pmu sprs,
>> sets the oprofile_cpu_type and cpu_features. This will
>> enable performance monitoring unit(PMU) for Power10
>> in CPU features with "performance-monitor-power10".
>> 
>> For PowerISA v3.1, BHRB disable is controlled via Monitor Mode
>> Control Register A (MMCRA) bit, namely "BHRB Recording Disable
>> (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
>> feature at boot for power10.
>> 
>> Signed-off-by: Madhavan Srinivasan 
>> ---
>> arch/powerpc/include/asm/reg.h|  3 +++
>> arch/powerpc/kernel/cpu_setup_power.S |  8 
>> arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++
>> 3 files changed, 37 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 21a1b2d..900ada1 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -1068,6 +1068,9 @@
>> #define MMCR0_PMC2_LOADMISSTIME0x5
>> #endif
>> 
>> +/* BHRB disable bit for PowerISA v3.10 */
>> +#define MMCRA_BHRB_DISABLE 0x0020
> Shouldn't this go under SPRN_MMCRA with the other MMCRA_*.


Hi Jordan

Ok, the definition of MMCRA is under #ifdef for 64-bit. If I move the definition
of MMCRA_BHRB_DISABLE along with the other SPRs, I also
need to define this for 32-bit to satisfy core-book3s to compile, as below:

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 900ada10762c..7e271657b412 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -888,6 +888,8 @@
 #define   MMCRA_SLOT   0x0700UL /* SLOT bits (37-39) */
 #define   MMCRA_SLOT_SHIFT 24
 #define   MMCRA_SAMPLE_ENABLE 0x0001UL /* enable sampling */
+/* BHRB disable bit for PowerISA v3.10 */
+#define   MMCRA_BHRB_DISABLE  0x0020
 #define   POWER6_MMCRA_SDSYNC 0x0800ULL/* SDAR/SIAR synced */
 #define   POWER6_MMCRA_SIHV   0x0400ULL
 #define   POWER6_MMCRA_SIPR   0x0200ULL
@@ -1068,9 +1070,6 @@
 #define MMCR0_PMC2_LOADMISSTIME0x5
 #endif
 
-/* BHRB disable bit for PowerISA v3.10 */
-#define MMCRA_BHRB_DISABLE 0x0020
-
 /*
  * SPRG usage:
  *
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 36baae666387..88068f20827c 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -94,6 +94,7 @@ static unsigned int freeze_events_kernel = MMCR0_FCS;
 #define SPRN_SIER2 0
 #define SPRN_SIER3 0
 #define MMCRA_SAMPLE_ENABLE0
+#define MMCRA_BHRB_DISABLE 0
 
 static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
 {



>> +
>> /*
>>  * SPRG usage:
>>  *
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
>> b/arch/powerpc/kernel/cpu_setup_power.S
>> index efdcfa7..b8e0d1e 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>> @@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
>> _GLOBAL(__setup_cpu_power10)
>>mflrr11
>>bl  __init_FSCR_power10
>> +   bl  __init_PMU_ISA31
> So we set MMCRA here but then aren't we still going to call __init_PMU
> which will overwrite that?
> Would this setting MMCRA also need to be handled in __restore_cpu_power10?

Thanks for this nice catch!  When I rebased the code in the initial phase, we
didn't have the power10 part filled in.
It was a miss from my side in adding the PMU init functions, and thanks for
pointing this out.
The patch below will call the __init_PMU functions in setup and restore. Please
check if this looks good.

--
diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index efdcfa714106..e672a6c5fd7c 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -94,6 +94,9 @@ _GLOBAL(__restore_cpu_power8)
 _GLOBAL(__setup_cpu_power10)
mflrr11
bl  __init_FSCR_power10
+   bl  __init_PMU
+   bl  __init_PMU_ISA31
+   bl  __init_PMU_HV
b   1f
 
 _GLOBAL(__setup_cpu_power9)
@@ -124,6 +127,9 @@ _GLOBAL(__setup_cpu_power9)
 _GLOBAL(__restore_cpu_power10)
mflrr11
bl  __init_FSCR_power10
+   bl  __init_PMU
+   bl  __init_PMU_ISA31
+   bl  __init_PMU_HV
b   1f
 
 _GLOBAL(__restore_cpu_power9)
@@ -233,3 +239,10 @@ __init_PMU_ISA207:
li  r5,0
mtspr   SPRN_MMCRS,r5
blr
+
+__init_PMU_ISA31:
+   li  r5,0
+   mtspr   SPRN_MMCR3,r5
+   LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
+   mtspr   SPRN_MMCRA,r5
+   blr

--

>>b   1f
>> 
>> _GLOBAL(__setup_cpu_power9)
>> @@ -233,3 +234,10 

Re: [v3 04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs

2020-07-22 Thread Michael Ellerman
Jordan Niethe  writes:
> On Sat, Jul 18, 2020 at 1:02 AM Athira Rajeev  
> wrote:
>> From: Madhavan Srinivasan 
>>
>> PowerISA v3.1 includes new performance monitoring unit(PMU)
>> special purpose registers (SPRs). They are
...
>>
>> diff --git a/arch/powerpc/include/asm/perf_event_server.h 
>> b/arch/powerpc/include/asm/perf_event_server.h
>> index 14b8dc1..832450a 100644
>> --- a/arch/powerpc/include/asm/perf_event_server.h
>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>> @@ -75,6 +76,7 @@ struct power_pmu {
>>  #define PPMU_HAS_SIER  0x0040 /* Has SIER */
>>  #define PPMU_ARCH_207S 0x0080 /* PMC is architecture v2.07S */
>>  #define PPMU_NO_SIAR   0x0100 /* Do not use SIAR */
>> +#define PPMU_ARCH_310S 0x0200 /* Has MMCR3, SIER2 and SIER3 */

> We elsewhere have CPU_FTR_ARCH_31, so should this be PPMU_ARCH_31S to
> be consistent.

The "S" is no longer needed as there's no Book S vs Book E distinction
in ISA v3.1.

So I changed it to PPMU_ARCH_31.

>> diff --git a/arch/powerpc/perf/core-book3s.c 
>> b/arch/powerpc/perf/core-book3s.c
>> index f4d07b5..ca32fc0 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -581,6 +589,11 @@ static void ebb_switch_out(unsigned long mmcr0)
>> current->thread.sdar  = mfspr(SPRN_SDAR);
>> current->thread.mmcr0 = mmcr0 & MMCR0_USER_MASK;
>> current->thread.mmcr2 = mfspr(SPRN_MMCR2) & MMCR2_USER_MASK;
>> +   if (ppmu->flags & PPMU_ARCH_310S) {
>> +   current->thread.mmcr3 = mfspr(SPRN_MMCR3);

> Like MMCR0_USER_MASK and MMCR2_USER_MASK do we need a MMCR3_USER_MASK
> here, or is there no need?

mmcr0 and mmcr2 are visible via ptrace, so masking them here means we
don't expose any bits to userspace via ptrace that aren't also visible
by reading the register.

So at least while mmcr3 is not exposed via ptrace it's safe to not mask
it, if there are even any sensitive bits in it.

cheers


Re: [v3 04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs

2020-07-22 Thread Jordan Niethe
On Wed, Jul 22, 2020 at 6:07 PM Athira Rajeev
 wrote:
>
>
>
> On 22-Jul-2020, at 9:48 AM, Jordan Niethe  wrote:
>
> On Sat, Jul 18, 2020 at 1:02 AM Athira Rajeev
>  wrote:
>
>
> From: Madhavan Srinivasan 
>
> PowerISA v3.1 includes new performance monitoring unit(PMU)
> special purpose registers (SPRs). They are
>
> Monitor Mode Control Register 3 (MMCR3)
> Sampled Instruction Event Register 2 (SIER2)
> Sampled Instruction Event Register 3 (SIER3)
>
> MMCR3 is added for further sampling related configuration
> control. SIER2/SIER3 are added to provide additional
> information about the sampled instruction.
>
> The patch adds a new PPMU flag called "PPMU_ARCH_310S" to support
> handling of these new SPRs, updates struct thread_struct
> to include them, and includes MMCR3 in struct mmcr_regs.
> This is needed to support programming of the MMCR3 SPR during
> event_[enable/disable]. The patch also adds sysfs support
> for the MMCR3 SPR along with SPRN_ macros for the new PMU SPRs.
>
> Signed-off-by: Madhavan Srinivasan 
> ---
> arch/powerpc/include/asm/perf_event_server.h |  2 ++
> arch/powerpc/include/asm/processor.h |  4 
> arch/powerpc/include/asm/reg.h   |  6 ++
> arch/powerpc/kernel/sysfs.c  |  8 
> arch/powerpc/perf/core-book3s.c  | 29 
> 5 files changed, 49 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/perf_event_server.h 
> b/arch/powerpc/include/asm/perf_event_server.h
> index 14b8dc1..832450a 100644
> --- a/arch/powerpc/include/asm/perf_event_server.h
> +++ b/arch/powerpc/include/asm/perf_event_server.h
> @@ -22,6 +22,7 @@ struct mmcr_regs {
>unsigned long mmcr1;
>unsigned long mmcr2;
>unsigned long mmcra;
> +   unsigned long mmcr3;
> };
> /*
>  * This struct provides the constants and functions needed to
> @@ -75,6 +76,7 @@ struct power_pmu {
> #define PPMU_HAS_SIER  0x0040 /* Has SIER */
> #define PPMU_ARCH_207S 0x0080 /* PMC is architecture v2.07S */
> #define PPMU_NO_SIAR   0x0100 /* Do not use SIAR */
> +#define PPMU_ARCH_310S 0x0200 /* Has MMCR3, SIER2 and SIER3 */
>
> We elsewhere have CPU_FTR_ARCH_31, so should this be PPMU_ARCH_31S to
> be consistent.
>
>
>
> Ok,
> This change will need to be done in all places which are currently using 
> PPMU_ARCH_310S
>
> /*
>  * Values for flags to get_alternatives()
> diff --git a/arch/powerpc/include/asm/processor.h 
> b/arch/powerpc/include/asm/processor.h
> index 52a6783..a466e94 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -272,6 +272,10 @@ struct thread_struct {
>unsignedmmcr0;
>
>unsignedused_ebb;
> +   unsigned long   mmcr3;
> +   unsigned long   sier2;
> +   unsigned long   sier3;
> +
> #endif
> };
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 88e6c78..21a1b2d 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -876,7 +876,9 @@
> #define   MMCR0_FCHV   0x0001UL /* freeze conditions in hypervisor mode */
> #define SPRN_MMCR1 798
> #define SPRN_MMCR2 785
> +#define SPRN_MMCR3 754
> #define SPRN_UMMCR2769
> #define SPRN_UMMCR3738
> #define SPRN_MMCRA 0x312
> #define   MMCRA_SDSYNC 0x8000UL /* SDAR synced with SIAR */
> #define   MMCRA_SDAR_DCACHE_MISS 0x4000UL
> @@ -918,6 +920,10 @@
> #define   SIER_SIHV0x100   /* Sampled MSR_HV */
> #define   SIER_SIAR_VALID  0x040   /* SIAR contents valid */
> #define   SIER_SDAR_VALID  0x020   /* SDAR contents valid */
> +#define SPRN_SIER2 752
> +#define SPRN_SIER3 753
> +#define SPRN_USIER2736
> +#define SPRN_USIER3737
> #define SPRN_SIAR  796
> #define SPRN_SDAR  797
> #define SPRN_TACR  888
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 571b325..46b4ebc 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -622,8 +622,10 @@ void ppc_enable_pmcs(void)
> SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
>
> SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
> +SYSFS_PMCSETUP(mmcr3, SPRN_MMCR3);
>
> static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
> +static DEVICE_ATTR(mmcr3, 0600, show_mmcr3, store_mmcr3);
> #endif /* HAS_PPC_PMC56 */
>
>
> @@ -886,6 +888,9 @@ static int register_cpu_online(unsigned int cpu)
> #ifdef CONFIG_PMU_SYSFS
>if (cpu_has_feature(CPU_FTR_MMCRA))
>device_create_file(s, &dev_attr_mmcra);
> +
> +   if (cpu_has_feature(CPU_FTR_ARCH_31))
> +   device_create_file(s, &dev_attr_mmcr3);
> #endif /* CONFIG_PMU_SYSFS */
>
>if (cpu_has_feature(CPU_FTR_PURR)) {
> @@ -980,6 +985,9 @@ static int unregister_cpu_online(unsigned int cpu)
> #ifdef CONFIG_PMU_SYSFS
>if (cpu_has_feature(CPU_FTR_MMCRA))
>device_remove_file(s, &dev_attr_mmcra);
> +

Re: [v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs

2020-07-22 Thread Jordan Niethe
On Wed, Jul 22, 2020 at 5:55 PM Athira Rajeev
 wrote:
>
>
>
> On 22-Jul-2020, at 10:11 AM, Jordan Niethe  wrote:
>
> On Sat, Jul 18, 2020 at 1:13 AM Athira Rajeev
>  wrote:
>
>
> From: Madhavan Srinivasan 
>
> Add a power10 feature function to dt_cpu_ftrs.c along
> with a power10-specific init() to initialize the PMU SPRs,
> and set the oprofile_cpu_type and cpu_features. This will
> enable performance monitoring unit(PMU) for Power10
> in CPU features with "performance-monitor-power10".
>
> For PowerISA v3.1, BHRB disable is controlled via Monitor Mode
> Control Register A (MMCRA) bit, namely "BHRB Recording Disable
> (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
> feature at boot for power10.
>
> Signed-off-by: Madhavan Srinivasan 
> ---
> arch/powerpc/include/asm/reg.h|  3 +++
> arch/powerpc/kernel/cpu_setup_power.S |  8 
> arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++
> 3 files changed, 37 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 21a1b2d..900ada1 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -1068,6 +1068,9 @@
> #define MMCR0_PMC2_LOADMISSTIME0x5
> #endif
>
> +/* BHRB disable bit for PowerISA v3.10 */
> +#define MMCRA_BHRB_DISABLE 0x0020
>
> Shouldn't this go under SPRN_MMCRA with the other MMCRA_*.
>
>
>
> Hi Jordan
>
> Ok, the definition of MMCRA is under an #ifdef for 64-bit. If I move the
> definition of MMCRA_BHRB_DISABLE along with the other SPRs, I also
> need to define it for 32-bit so that core-book3s compiles, as below:
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 900ada10762c..7e271657b412 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -888,6 +888,8 @@
>  #define   MMCRA_SLOT   0x0700UL /* SLOT bits (37-39) */
>  #define   MMCRA_SLOT_SHIFT 24
>  #define   MMCRA_SAMPLE_ENABLE 0x0001UL /* enable sampling */
> +/* BHRB disable bit for PowerISA v3.10 */
> +#define   MMCRA_BHRB_DISABLE  0x0020
>  #define   POWER6_MMCRA_SDSYNC 0x0800ULL/* SDAR/SIAR synced */
>  #define   POWER6_MMCRA_SIHV   0x0400ULL
>  #define   POWER6_MMCRA_SIPR   0x0200ULL
> @@ -1068,9 +1070,6 @@
>  #define MMCR0_PMC2_LOADMISSTIME0x5
>  #endif
>
>
>
> -/* BHRB disable bit for PowerISA v3.10 */
> -#define MMCRA_BHRB_DISABLE 0x0020
> -
>  /*
>   * SPRG usage:
>   *
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 36baae666387..88068f20827c 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -94,6 +94,7 @@ static unsigned int freeze_events_kernel = MMCR0_FCS;
>  #define SPRN_SIER2 0
>  #define SPRN_SIER3 0
>  #define MMCRA_SAMPLE_ENABLE0
> +#define MMCRA_BHRB_DISABLE 0
>
>
>
>  static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
>  {
>
>
>
> +
> /*
>  * SPRG usage:
>  *
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
> b/arch/powerpc/kernel/cpu_setup_power.S
> index efdcfa7..b8e0d1e 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
> _GLOBAL(__setup_cpu_power10)
>mflrr11
>bl  __init_FSCR_power10
> +   bl  __init_PMU_ISA31
>
> So we set MMCRA here but then aren't we still going to call __init_PMU
> which will overwrite that?
> Would this setting MMCRA also need to be handled in __restore_cpu_power10?
>
>
> Thanks for this nice catch! When I rebased the code in the initial phase, we
> didn't have the power10 part filled in.
> It was a miss from my side in adding the PMU init functions, and thanks for
> pointing this out.
> The patch below will call the __init_PMU functions in setup and restore. Please
> check if this looks good.
>
> --
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
> b/arch/powerpc/kernel/cpu_setup_power.S
> index efdcfa714106..e672a6c5fd7c 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -94,6 +94,9 @@ _GLOBAL(__restore_cpu_power8)
>  _GLOBAL(__setup_cpu_power10)
>   mflr r11
>   bl __init_FSCR_power10
> + bl __init_PMU
> + bl __init_PMU_ISA31
> + bl __init_PMU_HV
>   b 1f
>
>  _GLOBAL(__setup_cpu_power9)

Won't you also need to change where the label 1 is:
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -100,8 +100,8 @@ _GLOBAL(__setup_cpu_power10)
 _GLOBAL(__setup_cpu_power9)
mflrr11
bl  __init_FSCR
-1: bl  __init_PMU
-   bl  __init_hvmode_206
+   bl  __init_PMU
+1: bl  __init_hvmode_206
mtlrr11
beqlr
li  r0,0

> @@ -124,6 +127,9 @@ _GLOBAL(__setup_cpu_power9)
>  _GLOBAL(__restore_cpu_power10)
>   mflr r11
>   bl __init_FSCR_power10
> + bl 

Re: [PATCH] powerpc/64s: Fix irq tracing corruption in interrupt/syscall return caused by perf interrupts

2020-07-22 Thread Alexey Kardashevskiy



On 22/07/2020 17:34, Nicholas Piggin wrote:
> Alexey reports lockdep_assert_irqs_enabled() warnings when stress testing 
> perf, e.g.,
> 
> WARNING: CPU: 0 PID: 1556 at kernel/softirq.c:169 
> __local_bh_enable_ip+0x258/0x270
> CPU: 0 PID: 1556 Comm: syz-executor
> NIP:  c01ec888 LR: c01ec884 CTR: c0ef0610
> REGS: c00022d4f8a0 TRAP: 0700   Not tainted  (5.8.0-rc3-x)
> MSR:  80029033   CR: 28008844  XER: 2004
> CFAR: c01dc1d0 IRQMASK: 0
> 
> The interesting thing is MSR[EE] and IRQMASK show interrupts are enabled,
> suggesting the current->hardirqs_enabled irq tracing state is going out of 
> sync
> with the actual interrupt enable state.
> 
> The cause is a window in interrupt/syscall return where irq tracing state is 
> being
> adjusted for an irqs-enabled return while MSR[EE] is still enabled. A perf
> interrupt hits and ends up calling trace_hardirqs_off() when restoring
> interrupt flags to a disabled state.
> 
> Fix this by disabling perf interrupts as well while adjusting irq tracing 
> state.
> 
> Add a debug check that catches the condition sooner.
> 
> Fixes: 68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic 
> in C")
> Reported-by: Alexey Kardashevskiy 
> Signed-off-by: Nicholas Piggin 
> ---
> 
> I can reproduce similar symptoms and this patch fixes my test case,
> still trying to confirm Alexey's test case or whether there's another
> similar bug causing it.


This does not fix my testcase. I applied this on top of 4fa640dc5230
("Merge tag 'vfio-v5.8-rc7' of git://github.com/awilliam/linux-vfio into
master")  without any of my testing code, just to be clear. Sorry...


> 
>  arch/powerpc/kernel/syscall_64.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/syscall_64.c 
> b/arch/powerpc/kernel/syscall_64.c
> index 79edba3ab312..6c6f88eff915 100644
> --- a/arch/powerpc/kernel/syscall_64.c
> +++ b/arch/powerpc/kernel/syscall_64.c
> @@ -107,8 +107,13 @@ notrace long system_call_exception(long r3, long r4, 
> long r5,
>   */
>  static notrace inline bool prep_irq_for_enabled_exit(void)
>  {
> - /* This must be done with RI=1 because tracing may touch vmaps */
> - trace_hardirqs_on();
> + if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) {
> + /* Prevent perf interrupts hitting and messing up the 
> trace_hardirqs state */
> + irq_soft_mask_set(IRQS_ALL_DISABLED);
> +
> + /* This must be done with RI=1 because tracing may touch vmaps 
> */
> + trace_hardirqs_on();
> + }
>  
>   /* This pattern matches prep_irq_for_idle */
>   __hard_EE_RI_disable();
> @@ -123,6 +128,8 @@ static notrace inline bool prep_irq_for_enabled_exit(void)
>   local_paca->irq_happened = 0;
>   irq_soft_mask_set(IRQS_ENABLED);
>  
> + lockdep_assert_irqs_enabled();
> +
>   return true;
>  }
>  
> 

-- 
Alexey


Re: [PATCH 09/20] Documentation: i2c: eliminate duplicated word

2020-07-22 Thread Wolfram Sang
On Tue, Jul 07, 2020 at 11:04:03AM -0700, Randy Dunlap wrote:
> Drop doubled word "new".
> 
> Signed-off-by: Randy Dunlap 

For the record:

Acked-by: Wolfram Sang 





Re: [v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs

2020-07-22 Thread Michael Ellerman
Athira Rajeev  writes:
>> On 22-Jul-2020, at 10:11 AM, Jordan Niethe  wrote:
>> 
>> On Sat, Jul 18, 2020 at 1:13 AM Athira Rajeev
>>  wrote:
>>> 
>>> From: Madhavan Srinivasan 
>>> 
>>> Add a power10 feature function to dt_cpu_ftrs.c along
>>> with a power10-specific init() to initialize the PMU SPRs,
>>> and set the oprofile_cpu_type and cpu_features. This will
>>> enable performance monitoring unit(PMU) for Power10
>>> in CPU features with "performance-monitor-power10".
>>> 
>>> For PowerISA v3.1, BHRB disable is controlled via Monitor Mode
>>> Control Register A (MMCRA) bit, namely "BHRB Recording Disable
>>> (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
>>> feature at boot for power10.
>>> 
>>> Signed-off-by: Madhavan Srinivasan 
>>> ---
>>> arch/powerpc/include/asm/reg.h|  3 +++
>>> arch/powerpc/kernel/cpu_setup_power.S |  8 
>>> arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++
>>> 3 files changed, 37 insertions(+)
>>> 
>>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>>> index 21a1b2d..900ada1 100644
>>> --- a/arch/powerpc/include/asm/reg.h
>>> +++ b/arch/powerpc/include/asm/reg.h
>>> @@ -1068,6 +1068,9 @@
>>> #define MMCR0_PMC2_LOADMISSTIME0x5
>>> #endif
>>> 
>>> +/* BHRB disable bit for PowerISA v3.10 */
>>> +#define MMCRA_BHRB_DISABLE 0x0020
>> Shouldn't this go under SPRN_MMCRA with the other MMCRA_*.
>
>
> Hi Jordan
>
> Ok, the definition of MMCRA is under an #ifdef for 64-bit. If I move the
> definition of MMCRA_BHRB_DISABLE along with the other SPRs, I also
> need to define it for 32-bit so that core-book3s compiles, as below:
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 900ada10762c..7e271657b412 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -888,6 +888,8 @@
>  #define   MMCRA_SLOT   0x0700UL /* SLOT bits (37-39) */
>  #define   MMCRA_SLOT_SHIFT 24
>  #define   MMCRA_SAMPLE_ENABLE 0x0001UL /* enable sampling */
> +/* BHRB disable bit for PowerISA v3.10 */
> +#define   MMCRA_BHRB_DISABLE  0x0020

I changed it to:

#define   MMCRA_BHRB_DISABLE  0x20UL // BHRB disable bit for ISA v3.1

>>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
>>> b/arch/powerpc/kernel/cpu_setup_power.S
>>> index efdcfa7..b8e0d1e 100644
>>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>>> @@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
>>> _GLOBAL(__setup_cpu_power10)
>>>mflrr11
>>>bl  __init_FSCR_power10
>>> +   bl  __init_PMU_ISA31
>> So we set MMCRA here but then aren't we still going to call __init_PMU
>> which will overwrite that?
>> Would this setting MMCRA also need to be handled in __restore_cpu_power10?
>
> Thanks for this nice catch! When I rebased the code in the initial phase, we
> didn't have the power10 part filled in.
> It was a miss from my side in adding the PMU init functions, and thanks for
> pointing this out.
> The patch below will call the __init_PMU functions in setup and restore. Please
> check if this looks good.

Actually those changes should be in a separate patch.

This one is wiring up DT CPU features, the cpu setup routines are not
used by DT CPU features.

So please send a new patch I can insert into the series that adds the
cpu_setup_power.S changes.

cheers

> --
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
> b/arch/powerpc/kernel/cpu_setup_power.S
> index efdcfa714106..e672a6c5fd7c 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -94,6 +94,9 @@ _GLOBAL(__restore_cpu_power8)
>  _GLOBAL(__setup_cpu_power10)
>   mflrr11
>   bl  __init_FSCR_power10
> + bl  __init_PMU
> + bl  __init_PMU_ISA31
> + bl  __init_PMU_HV
>   b   1f
>  
>  _GLOBAL(__setup_cpu_power9)
> @@ -124,6 +127,9 @@ _GLOBAL(__setup_cpu_power9)
>  _GLOBAL(__restore_cpu_power10)
>   mflrr11
>   bl  __init_FSCR_power10
> + bl  __init_PMU
> + bl  __init_PMU_ISA31
> + bl  __init_PMU_HV
>   b   1f
>  
>  _GLOBAL(__restore_cpu_power9)
> @@ -233,3 +239,10 @@ __init_PMU_ISA207:
>   li  r5,0
>   mtspr   SPRN_MMCRS,r5
>   blr
> +
> +__init_PMU_ISA31:
> + li  r5,0
> + mtspr   SPRN_MMCR3,r5
> + LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
> + mtspr   SPRN_MMCRA,r5
> + blr
>


Re: [PATCH 15/15] powerpc/powernv/sriov: Make single PE mode a per-BAR setting

2020-07-22 Thread Alexey Kardashevskiy



On 22/07/2020 15:39, Oliver O'Halloran wrote:
> On Wed, Jul 15, 2020 at 6:00 PM Alexey Kardashevskiy  wrote:
>>
>*
> -  * Generally, one M64 BAR maps one IOV BAR. To avoid 
> conflict
> -  * with other devices, IOV BAR size is expanded to be
> -  * (total_pe * VF_BAR_size).  When VF_BAR_size is half of 
> M64
> -  * segment size , the expanded size would equal to half of 
> the
> -  * whole M64 space size, which will exhaust the M64 Space 
> and
> -  * limit the system flexibility.  This is a design decision 
> to
> -  * set the boundary to quarter of the M64 segment size.
> +  * The 1/4 limit is arbitrary and can be tweaked.
>*/
> - if (total_vf_bar_sz > gate) {
> - mul = roundup_pow_of_two(total_vfs);
> - dev_info(&pdev->dev,
> - "VF BAR Total IOV size %llx > %llx, roundup 
> to %d VFs\n",
> - total_vf_bar_sz, gate, mul);
> - iov->m64_single_mode = true;
> - break;
> - }
> - }
> + if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
> + /*
> +  * On PHB3, the minimum size alignment of M64 BAR in
> +  * single mode is 32MB. If this VF BAR is smaller 
> than
> +  * 32MB, but still too large for a segmented window
> +  * then we can't map it and need to disable SR-IOV 
> for
> +  * this device.


 Why not use single PE mode for such BAR? Better than nothing.
>>>
>>> Suppose you could, but I figured VFs were mainly interesting since you
>>> could give each VF to a separate guest. If there's multiple VFs under
>>> the same single PE BAR then they'd have to be assigned to the same
>>
>> True. But with one PE per VF we can still have 15 (or 14?) isolated VFs
>> which is not hundreds but better than 0.
> 
> We can only use single PE BARs if the per-VF size is >= 32MB due to
> the alignment requirements on P8. If the per-VF size is smaller then
> we're stuck with multiple VFs inside the same BAR which is bad due to
> the PAPR requirements mentioned below. Sure we could look at doing
> something else, but considering this matches the current behaviour
> it's a bit hard to care...
>
>>> guest in order to retain the freeze/unfreeze behaviour that PAPR
>>> requires. I guess that's how it used to work, but it seems better just
>>> to disable them rather than having VFs which sort of work.
>>
>> Well, realistically the segment size should be 8MB to make this matter
>> (or the whole window 2GB) which does not seem to happen so it does not
>> matter.
> 
> I'm not sure what you mean.

I mean: how can we possibly hit this case? What m64_segsize would the
platform have to have to trigger it? The whole check seems useless, but whatever.



-- 
Alexey


Re: [v4 5/5] KVM: PPC: Book3S HV: migrate hot plugged memory

2020-07-22 Thread Bharata B Rao
On Fri, Jul 17, 2020 at 01:00:27AM -0700, Ram Pai wrote:
> From: Laurent Dufour 
> 
> When a memory slot is hot plugged to a SVM, PFNs associated with the
> GFNs in that slot must be migrated to secure-PFNs, aka device-PFNs.
> 
> Call kvmppc_uv_migrate_mem_slot() to accomplish this.
> Disable page-merge for all pages in the memory slot.
> 
> Signed-off-by: Ram Pai 
> [rearranged the code, and modified the commit log]
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/include/asm/kvm_book3s_uvmem.h | 10 ++
>  arch/powerpc/kvm/book3s_hv.c| 10 ++
>  arch/powerpc/kvm/book3s_hv_uvmem.c  | 22 ++
>  3 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
> b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> index f229ab5..6f7da00 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> @@ -25,6 +25,9 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot 
> *free,
>struct kvm *kvm, bool skip_page_out);
>  int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
>   const struct kvm_memory_slot *memslot);
> +void kvmppc_memslot_create(struct kvm *kvm, const struct kvm_memory_slot 
> *new);
> +void kvmppc_memslot_delete(struct kvm *kvm, const struct kvm_memory_slot 
> *old);

The names look a bit generic, but these functions are specific
to secure guests. Maybe rename them to kvmppc_uvmem_memslot_[create/delete]?

> +
>  #else
>  static inline int kvmppc_uvmem_init(void)
>  {
> @@ -84,5 +87,12 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
> unsigned long gfn)
>  static inline void
>  kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>   struct kvm *kvm, bool skip_page_out) { }
> +
> +static inline void  kvmppc_memslot_create(struct kvm *kvm,
> + const struct kvm_memory_slot *new) { }
> +
> +static inline void  kvmppc_memslot_delete(struct kvm *kvm,
> + const struct kvm_memory_slot *old) { }
> +
>  #endif /* CONFIG_PPC_UV */
>  #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index d331b46..bf3be3b 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4515,16 +4515,10 @@ static void 
> kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
>  
>   switch (change) {
>   case KVM_MR_CREATE:
> - if (kvmppc_uvmem_slot_init(kvm, new))
> - return;
> - uv_register_mem_slot(kvm->arch.lpid,
> -  new->base_gfn << PAGE_SHIFT,
> -  new->npages * PAGE_SIZE,
> -  0, new->id);
> + kvmppc_memslot_create(kvm, new);
>   break;
>   case KVM_MR_DELETE:
> - uv_unregister_mem_slot(kvm->arch.lpid, old->id);
> - kvmppc_uvmem_slot_free(kvm, old);
> + kvmppc_memslot_delete(kvm, old);
>   break;
>   default:
>   /* TODO: Handle KVM_MR_MOVE */
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index a206984..a2b4d25 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -1089,6 +1089,28 @@ int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned 
> long gfn)
>   return (ret == U_SUCCESS) ? RESUME_GUEST : -EFAULT;
>  }
>  
> +void kvmppc_memslot_create(struct kvm *kvm, const struct kvm_memory_slot 
> *new)
> +{
> + if (kvmppc_uvmem_slot_init(kvm, new))
> + return;
> +
> + if (kvmppc_memslot_page_merge(kvm, new, false))
> + return;
> +
> + if (uv_register_mem_slot(kvm->arch.lpid, new->base_gfn << PAGE_SHIFT,
> + new->npages * PAGE_SIZE, 0, new->id))
> + return;
> +
> + kvmppc_uv_migrate_mem_slot(kvm, new);

Quite a few things can return failure here including
kvmppc_uv_migrate_mem_slot() and we are ignoring all of those.
I am wondering if this should be called from prepare_memory_region callback
instead of commit_memory_region. In the prepare phase, we have a way
to back out in case of error. Can you check if moving this call to
prepare callback is feasible?

In the other case in 1/5, the code issues ksm unmerge request on error,
but not here.

Also check if the code for 1st three calls can be shared with similar
code in 1/5.

Regards,
Bharata.


Re: [PATCH 05/15] powerpc/powernv/sriov: Move SR-IOV into a separate file

2020-07-22 Thread Alexey Kardashevskiy



On 22/07/2020 15:01, Oliver O'Halloran wrote:
> On Tue, Jul 14, 2020 at 7:16 PM Alexey Kardashevskiy  wrote:
>>
>> On 10/07/2020 15:23, Oliver O'Halloran wrote:
>>> + align = pci_iov_resource_size(pdev, resno);
>>> +
>>> + /*
>>> +  * iov can be null if we have an SR-IOV device with IOV BAR that can't
>>> +  * be placed in the m64 space (i.e. The BAR is 32bit or non-prefetch).
>>> +  * In that case we don't allow VFs to be enabled so just return the
>>> +  * default alignment.
>>> +  */
>>> + if (!iov)
>>> + return align;
>>
>>
>> This is the new chunk. What would happen before? Non-prefetch BAR would
>> still go to m64 space?
> 
> I don't think there's any real change. Currently if the setup in
> pnv_pci_ioda_fixup_iov_resources() fails then pdn->vfs_expanded will
> be zero. The !iov check here fills the same role, but it's more
> explicit. vfs_expanded has some other behaviour too so we can't get
> rid of it entirely (yet).

The check is fine, you have to have one as @iov can be NULL (unlike
pci_dn). The comment is what bothered me. It would make more sense
somewhere in pnv_pci_ioda_fixup_iov_resources(), near the
dev_warn(&pdev->dev, "Don't support SR-IOV with ...") call, as right now it
suggests there is only one reason for the failed IOV configuration while
there are two.


-- 
Alexey


Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-22 Thread Arnd Bergmann
On Tue, Jul 21, 2020 at 9:06 PM Palmer Dabbelt  wrote:
>
> On Tue, 21 Jul 2020 11:36:10 PDT (-0700), a...@ghiti.fr wrote:
> > Let's try to make progress here: I add linux-mm in CC to get feedback on
> > this patch as it blocks sv48 support too.
>
> Sorry for being slow here.  I haven't replied because I hadn't really fleshed
> out the design yet, but just so everyone's on the same page my problems with
> this are:
>
> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.

There is actually an ongoing work to make 32-bit Arm kernels move
vmlinux into the vmalloc space, as part of the move to avoid highmem.

Overall, a 32-bit system would waste about 0.1% of its virtual address space
by having the kernel be located in both the linear map and the vmalloc area.
It's not zero, but not that bad either. With the typical split of 3072 MB user,
768MB linear and 256MB vmalloc, it's also around 1.5% of the available
vmalloc area (assuming a 4MB vmlinux in a typical 32-bit kernel), but the
boundaries can be changed arbitrarily if needed.

The eventual goal is to have a split of 3840MB for either user or linear map
plus 256MB for vmalloc, including the kernel. Switching between linear
and user has a noticeable runtime overhead, but it relaxes both the limits
for user memory and lowmem, and it provides a somewhat stronger
address space isolation.

Another potential idea would be to completely randomize the physical
addresses underneath the kernel by using a random permutation of the
pages in the kernel image. This adds even more overhead (virt_to_phys
may need to call vmalloc_to_page or similar) and may cause problems
with DMA into kernel .data across page boundaries.

> * Sort out how to maintain a linear map as the canonical hole moves around
>   between the VA widths without adding a bunch of overhead to the virt2phys 
> and
>   friends.  This is probably going to be the trickiest part, but I think if we
>   just change the page table code to essentially lie about VAs when an sv39
>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>   logical complexity involved, but it would remain fast.

I assume you can't use the trick that x86 has where all kernel addresses
are at the top of the 64-bit address space and user addresses are at the
bottom, regardless of the size of the page tables?

  Arnd


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread peterz
On Wed, Jul 22, 2020 at 10:48:54AM +0200, Peter Zijlstra wrote:
> But reading your explanation, it looks like the Linux topology setup
> could actually unravel the fused-faux-SMT8 into two independent SMT4
> parts, negating that firmware option.

Ah, it looks like that's exactly what you're doing. Nice!


Re: [v4 1/5] KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START

2020-07-22 Thread Bharata B Rao
On Fri, Jul 17, 2020 at 01:00:23AM -0700, Ram Pai wrote:
> Page-merging of pages in memory-slots associated with a Secure VM,
> is disabled in H_SVM_PAGE_IN handler.
> 
> This operation should have been done much earlier, the moment the VM
> is initiated for secure transition. Delaying this operation increases
> the probability of those pages acquiring new references, making it
> impossible to migrate those pages.
> 
> Disable page-merging in H_SVM_INIT_START handling.
> 
> Signed-off-by: Ram Pai 

Reviewed-by: Bharata B Rao 

with a few observations below...

> ---
>  Documentation/powerpc/ultravisor.rst |  1 +
>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 98 
> +++-
>  2 files changed, 76 insertions(+), 23 deletions(-)
> 
> diff --git a/Documentation/powerpc/ultravisor.rst 
> b/Documentation/powerpc/ultravisor.rst
> index df136c8..a1c8c37 100644
> --- a/Documentation/powerpc/ultravisor.rst
> +++ b/Documentation/powerpc/ultravisor.rst
> @@ -895,6 +895,7 @@ Return values
>  One of the following values:
>  
>   * H_SUCCESS  on success.
> +* H_STATEif the VM is not in a position to switch to secure.
>  
>  Description
>  ~~~
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index e6f76bc..0baa293 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -211,6 +211,65 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
> struct kvm *kvm,
>   return false;
>  }
>  
> +static int kvmppc_memslot_page_merge(struct kvm *kvm,
> + struct kvm_memory_slot *memslot, bool merge)
> +{
> + unsigned long gfn = memslot->base_gfn;
> + unsigned long end, start = gfn_to_hva(kvm, gfn);
> + int ret = 0;
> + struct vm_area_struct *vma;
> + int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
> +
> + if (kvm_is_error_hva(start))
> + return H_STATE;
> +
> + end = start + (memslot->npages << PAGE_SHIFT);
> +
> + mmap_write_lock(kvm->mm);
> + do {
> + vma = find_vma_intersection(kvm->mm, start, end);
> + if (!vma) {
> + ret = H_STATE;
> + break;
> + }
> + ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> +   merge_flag, &vma->vm_flags);
> + if (ret) {
> + ret = H_STATE;
> + break;
> + }
> + start = vma->vm_end + 1;

This should be start = vma->vm_end I believe.

> + } while (end > vma->vm_end);
> +
> + mmap_write_unlock(kvm->mm);
> + return ret;
> +}
> +
> +static int __kvmppc_page_merge(struct kvm *kvm, bool merge)
> +{
> + struct kvm_memslots *slots;
> + struct kvm_memory_slot *memslot;
> + int ret = 0;
> +
> + slots = kvm_memslots(kvm);
> + kvm_for_each_memslot(memslot, slots) {
> + ret = kvmppc_memslot_page_merge(kvm, memslot, merge);
> + if (ret)
> + break;
> + }
> + return ret;
> +}

You walk through all the slots here to issue kvm_madvise, but...

> +
> +static inline int kvmppc_disable_page_merge(struct kvm *kvm)
> +{
> + return __kvmppc_page_merge(kvm, false);
> +}
> +
> +static inline int kvmppc_enable_page_merge(struct kvm *kvm)
> +{
> + return __kvmppc_page_merge(kvm, true);
> +}
> +
>  unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  {
>   struct kvm_memslots *slots;
> @@ -232,11 +291,18 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>   return H_AUTHORITY;
>  
>   srcu_idx = srcu_read_lock(&kvm->srcu);
> +
> + /* disable page-merging for all memslot */
> + ret = kvmppc_disable_page_merge(kvm);
> + if (ret)
> + goto out;
> +
> + /* register the memslot */
>   slots = kvm_memslots(kvm);
>   kvm_for_each_memslot(memslot, slots) {

... you are walking thro' the same set of slots here anyway. I think
it makes sense to issue merge advices from here itself. That will
help you to share code with kvmppc_memslot_create() in 5/5.

All the below 3 calls are common to both the code paths, I think
they can be carved out into a separate function if you prefer.

kvmppc_uvmem_slot_init
kvmppc_memslot_page_merge
uv_register_mem_slot

Regards,
Bharata.


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread Peter Zijlstra
On Wed, Jul 22, 2020 at 01:48:22PM +0530, Srikar Dronamraju wrote:
> * pet...@infradead.org  [2020-07-22 09:46:24]:
> 
> > On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > > same as SMT domain. However we could generalize this domain such that it
> > > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> > 
> > What's the whole smallcore vs bigcore thing?
> > 
> > Would it make sense to have an actual overview of the various topology
> > layers somewhere? Because I'm getting lost and can't really make sense
> > of the code due to that.
> 
> To quote with an example: using (Power 9)
> 
> Power 9 is an SMT 8 core by design. However, this 8-thread core can run as 2
> independent cores, with threads 0,2,4 and 6 acting as one core and threads
> 1,3,5,7 acting as another core.
> 
> The firmware can decide to showcase them as 2 independent small cores or as
> one big core. However, the LLC (i.e. the last level of cache) is shared between
> all the threads of the SMT 8 core. On future Power chips the LLC might change:
> it may be expanded to be shared with another SMT 8 core, or it could be made
> specific to an SMT 4 core.
> 
> The SMT 8 core is also known as a fused core or big core.
> The smaller SMT 4 core is known as a small core.
> 
> So on Power9 bare metal, the firmware would show up SMT4 cores,
> and we have a CACHE level at SMT 8. The CACHE level would be very similar
> to the MC domain on x86.
> 
> Hope this is clear.

Ooh, that thing. I thought P9 was in actual fact an SMT4 hardware part,
but to be compatible with P8 it has this fused option where two cores
act like a single SMT8 part.

The interleaving enumeration is done due to P7 asymmetric cores,
resulting in schedulers having the preference to use the lower threads.

Combined that results in a P9-fused configuration using two independent
full cores when there's only 2 runnable threads.

Which is a subtly different story from yours.

But reading your explanation, it looks like the Linux topology setup
could actually unravel the fused-faux-SMT8 into two independent SMT4
parts, negating that firmware option.

Anyway, a few comments just around there might be helpful.


Thanks!


Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes

2020-07-22 Thread David Gibson
On Wed, Jul 22, 2020 at 11:35:06AM +0530, Bharata B Rao wrote:
> On Tue, Jul 21, 2020 at 10:25:58PM +1000, Michael Ellerman wrote:
> > Bharata B Rao  writes:
> > > On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote:
> > >> Nathan Lynch  writes:
> > >> > "Aneesh Kumar K.V"  writes:
> > >> >> This is the next version of the fixes for memory unplug on radix.
> > >> >> The issues and the fix are described in the actual patches.
> > >> >
> > >> > I guess this isn't actually causing problems at runtime right now, but I
> > >> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and
> > >> > arch_remove_memory(), which ought to be mmu-agnostic:
> > >> >
> > >> > int __ref arch_add_memory(int nid, u64 start, u64 size,
> > >> >  struct mhp_params *params)
> > >> > {
> > >> >unsigned long start_pfn = start >> PAGE_SHIFT;
> > >> >unsigned long nr_pages = size >> PAGE_SHIFT;
> > >> >int rc;
> > >> >
> > >> >resize_hpt_for_hotplug(memblock_phys_mem_size());
> > >> >
> > >> >start = (unsigned long)__va(start);
> > >> >rc = create_section_mapping(start, start + size, nid,
> > >> >params->pgprot);
> > >> > ...
> > >> 
> > >> Hmm well spotted.
> > >> 
> > >> That does return early if the ops are not setup:
> > >> 
> > >> int resize_hpt_for_hotplug(unsigned long new_mem_size)
> > >> {
> > >>  unsigned target_hpt_shift;
> > >> 
> > >>  if (!mmu_hash_ops.resize_hpt)
> > >>  return 0;
> > >> 
> > >> 
> > >> And:
> > >> 
> > >> void __init hpte_init_pseries(void)
> > >> {
> > >>  ...
> > >>  if (firmware_has_feature(FW_FEATURE_HPT_RESIZE))
> > >>  mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt;
> > >> 
> > >> And that comes in via ibm,hypertas-functions:
> > >> 
> > >>  {FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"},
> > >> 
> > >> 
> > >> But firmware is not necessarily going to add/remove that call based on
> > >> whether we're using hash/radix.
> > >
> > > Correct but hpte_init_pseries() will not be called for radix guests.
> > 
> > Yeah, duh. You'd think the function name would have been a sufficient
> > clue for me :)
> > 
> > >> So I think a follow-up patch is needed to make this more robust.
> > >> 
> > >> Aneesh/Bharata what platform did you test this series on? I'm curious
> > >> how this didn't break.
> > >
> > > I have tested memory hotplug/unplug for radix guest on zz platform and
> > > sanity-tested this for hash guest on P8.
> > >
> > > As noted above, mmu_hash_ops.resize_hpt will not be set for radix
> > > guest and hence we won't see any breakage.
> > 
> > OK.
> > 
> > That's probably fine as it is then. Or maybe just a comment in
> > resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if
> > we're using radix.
> 
> Or we could move these calls to hpt-only routines like below?
> 
> David - Do you remember if there was any particular reason to have
> these two hpt-resize calls within powerpc-generic memory hotplug code?

I don't remember, sorry.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread Srikar Dronamraju
* pet...@infradead.org  [2020-07-22 09:46:24]:

> On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> 
> What's the whole smallcore vs bigcore thing?
> 
> Would it make sense to have an actual overview of the various topology
> layers somewhere? Because I'm getting lost and can't really make sense
> of the code due to that.

To quote with an example: using (Power 9)

Power 9 is an SMT 8 core by design. However, this 8-thread core can run as 2
independent cores, with threads 0,2,4 and 6 acting as one core and threads
1,3,5,7 acting as another core.

The firmware can decide to showcase them as 2 independent small cores or as
one big core. However, the LLC (i.e. the last level of cache) is shared between
all the threads of the SMT 8 core. On future Power chips the LLC might change:
it may be expanded to be shared with another SMT 8 core, or it could be made
specific to an SMT 4 core.

The SMT 8 core is also known as a fused core or big core.
The smaller SMT 4 core is known as a small core.

So on Power9 bare metal, the firmware would show up SMT4 cores,
and we have a CACHE level at SMT 8. The CACHE level would be very similar
to the MC domain on x86.

Hope this is clear.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 01/10] powerpc/smp: Cache node for reuse

2020-07-22 Thread Srikar Dronamraju
* Michael Ellerman  [2020-07-22 17:41:41]:

> Srikar Dronamraju  writes:
> > While cpu_to_node is inline function with access to per_cpu variable.
> > However when using repeatedly, it may be cleaner to cache it in a local
> > variable.
> 
> It's not clear what "cleaner" is supposed to mean. Are you saying it
> makes the source clearer, or the generated code?
> 
> I'm not sure it will make any difference to the latter.

I meant the source code, I am okay dropping the hunks that try to cache
cpu_to_node.

> 
> > Also fix a build error in a some weird config.
> "error: 'numa_cpu_lookup_table' undeclared"
> 
> Separate patch please.

Okay, will do.

> 
> > No functional change
> 
> The ifdef change means that's not true.

Okay

> > @@ -854,20 +854,24 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > cpu_callin_map[boot_cpuid] = 1;
> >  
> > for_each_possible_cpu(cpu) {
> > +   int node = cpu_to_node(cpu);
> > +
> 
> Does cpu_to_node() even work here?

Except in the case where NUMA is not enabled (when cpu_to_node would return
-1), it should work here since NUMA initialization would have happened by
now. cpu_to_node(cpu) should work once numa_setup_cpu() /
map_cpu_to_node() get called, and those are called before this.

> 
> Doesn't look like it to me.
> 
> More fallout from 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")?
> 
> > }
> >  
> > /* Init the cpumasks so the boot CPU is related to itself */

-- 
Thanks and Regards
Srikar Dronamraju


RE: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-22 Thread Ram Pai
On Wed, Jul 22, 2020 at 12:06:06PM +1000, Michael Ellerman wrote:
> Ram Pai  writes:
> > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> > fault is delivered to the ultravisor.
> >
> > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > translation. Walking the two level page table to read the instruction can race
> > with other vcpus modifying the SVM's process scoped page table.
> 
> You're trying to read the guest's kernel text IIUC, that mapping should
> be stable. Possibly permissions on it could change over time, but the
> virtual -> real mapping should not.

Actually the code does not capture the address of the instruction in the
sprg0 register. It captures the instruction itself. So should the mapping
matter?

> 
> > This problem can be correctly solved with some help from the kernel.
> >
> > Capture the faulting instruction in SPRG0 register, before executing the
> > faulting instruction. This enables the ultravisor to easily procure the
> > faulting instruction and emulate it.
> 
> This is not something I'm going to merge. Sorry.

Ok. Will consider other approaches.

RP


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread peterz
On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.

What's the whole smallcore vs bigcore thing?

Would it make sense to have an actual overview of the various topology
layers somewhere? Because I'm getting lost and can't really make sense
of the code due to that.


RE: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-22 Thread Ram Pai
On Wed, Jul 22, 2020 at 12:42:05AM -0700, Ram Pai wrote:
> On Wed, Jul 22, 2020 at 03:02:32PM +1000, Paul Mackerras wrote:
> > On Thu, Jul 16, 2020 at 01:32:13AM -0700, Ram Pai wrote:
> > > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> > > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> > > fault is delivered to the ultravisor.
> > > 
> > > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > > translation. Walking the two level page table to read the instruction can race
> > > with other vcpus modifying the SVM's process scoped page table.
> > > 
> > > This problem can be correctly solved with some help from the kernel.
> > > 
> > > Capture the faulting instruction in SPRG0 register, before executing the
> > > faulting instruction. This enables the ultravisor to easily procure the
> > > faulting instruction and emulate it.
> > 
> > Just a comment on the approach of putting the instruction in SPRG0:
> > these I/O accessors can be used in interrupt routines, which means
> > that if these accessors are ever used with interrupts enabled, there
> > is the possibility of an external interrupt occurring between the
> > instruction that sets SPRG0 and the load/store instruction that
> > faults.  If the handler for that interrupt itself does an I/O access,
> > it will overwrite SPRG0, corrupting the value set by the interrupted
> > code.
> 
> Actually my proposed code restores the value of SPRG0 before returning back to
> the interrupted instruction. So here is the sequence. I think it works.
> 
>  (1) store sprg0 in register Rx (lets say sprg0 had 0xc. Rx now contains 0xc)
> 
>  (2) save faulting instruction address in sprg0 (lets say the value is 0xa.
>   sprg0 will contain 0xa).

Small correction. sprg0 does not store the address of the faulting
instruction. It stores the instruction itself. Regardless, the code
should work, I think.

RP


Re: Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-22 Thread Ram Pai
On Wed, Jul 22, 2020 at 03:02:32PM +1000, Paul Mackerras wrote:
> On Thu, Jul 16, 2020 at 01:32:13AM -0700, Ram Pai wrote:
> > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> > fault is delivered to the ultravisor.
> > 
> > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > translation. Walking the two level page table to read the instruction can race
> > with other vcpus modifying the SVM's process scoped page table.
> > 
> > This problem can be correctly solved with some help from the kernel.
> > 
> > Capture the faulting instruction in SPRG0 register, before executing the
> > faulting instruction. This enables the ultravisor to easily procure the
> > faulting instruction and emulate it.
> 
> Just a comment on the approach of putting the instruction in SPRG0:
> these I/O accessors can be used in interrupt routines, which means
> that if these accessors are ever used with interrupts enabled, there
> is the possibility of an external interrupt occurring between the
> instruction that sets SPRG0 and the load/store instruction that
> faults.  If the handler for that interrupt itself does an I/O access,
> it will overwrite SPRG0, corrupting the value set by the interrupted
> code.

Actually my proposed code restores the value of SPRG0 before returning back to
the interrupted instruction. So here is the sequence. I think it works.

 (1) store sprg0 in register Rx (lets say sprg0 had 0xc. Rx now contains 0xc)

 (2) save faulting instruction address in sprg0 (lets say the value is 0xa.
sprg0 will contain 0xa).

 (3)<- interrupt arrives

 (4) store sprg0 in register Ry ( Ry now contains 0xa )

 (5) save faulting instruction address in sprg0 (lets say the value is 0xb).

 (6) 0xb: execute faulting instruction

 (7) restore Ry into sprg0 ( sprg0 now contains 0xa )

 (8)<-- return from interrupt

 (9)  0xa: execute faulting instruction

 (10) restore Rx into sprg0 (sprg0 now contains 0xc)
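The ten steps above can be modeled in plain C (illustrative only, not the actual accessor code): a global variable stands in for SPRG0, and a recursive call plays the role of the nested interrupt handler.

```c
static unsigned long sprg0 = 0xc;  /* pretend SPRG0 held 0xc beforehand */

/* Model of the proposed accessor: save SPRG0 into a "register", publish
 * the faulting instruction, optionally take an interrupt that performs
 * the same save/publish/restore dance, then restore the saved value.
 * Returns what the ultravisor would observe in SPRG0 at the fault. */
static unsigned long mmio_access(unsigned long insn, int take_irq)
{
	unsigned long rx = sprg0;          /* steps (1)/(4): mfsprg Rx      */
	unsigned long seen;

	sprg0 = insn;                      /* steps (2)/(5): mtsprg insn    */
	if (take_irq)
		mmio_access(0xb, 0);       /* steps (3)-(8): nested handler */
	seen = sprg0;                      /* steps (6)/(9): value at fault */
	sprg0 = rx;                        /* steps (7)/(10): restore       */
	return seen;
}
```

Running the sequence with an interrupt arriving mid-access shows the outer access still observes its own instruction (0xa) and SPRG0 ends up back at its original contents (0xc), matching the argument above.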

> 
> The choices to fix that would seem to be (a) disable interrupts around
> all I/O accesses, (b) have the accessor save and restore SPRG0, or (c)
> solve the problem another way, such as by doing a H_LOGICAL_CI_LOAD
> or H_LOGICAL_CI_STORE hypercall.

Ok. Will explore (c).

Thanks,
RP

-- 
Ram Pai


Re: [PATCH v2 01/10] powerpc/smp: Cache node for reuse

2020-07-22 Thread Michael Ellerman
Srikar Dronamraju  writes:
> While cpu_to_node is inline function with access to per_cpu variable.
> However when using repeatedly, it may be cleaner to cache it in a local
> variable.

It's not clear what "cleaner" is supposed to mean. Are you saying it
makes the source clearer, or the generated code?

I'm not sure it will make any difference to the latter.

> Also fix a build error in a some weird config.
> "error: _numa_cpu_lookup_table_ undeclared"

Separate patch please.

> No functional change

The ifdef change means that's not true.

> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Srikar Dronamraju 
> ---
>  arch/powerpc/kernel/smp.c | 18 +++---
>  1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 73199470c265..680c0edcc59d 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -843,7 +843,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  
>   DBG("smp_prepare_cpus\n");
>  
> - /* 
> + /*
>* setup_cpu may need to be called on the boot cpu. We havent
>* spun any cpus up but lets be paranoid.
>*/
> @@ -854,20 +854,24 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   cpu_callin_map[boot_cpuid] = 1;
>  
>   for_each_possible_cpu(cpu) {
> + int node = cpu_to_node(cpu);
> +

Does cpu_to_node() even work here?

Doesn't look like it to me.

More fallout from 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")?

>   zalloc_cpumask_var_node(&per_cpu(cpu_sibling_map, cpu),
> - GFP_KERNEL, cpu_to_node(cpu));
> + GFP_KERNEL, node);
>   zalloc_cpumask_var_node(&per_cpu(cpu_l2_cache_map, cpu),
> - GFP_KERNEL, cpu_to_node(cpu));
> + GFP_KERNEL, node);
>   zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
> - GFP_KERNEL, cpu_to_node(cpu));
> + GFP_KERNEL, node);
> +#ifdef CONFIG_NEED_MULTIPLE_NODES
>   /*
>* numa_node_id() works after this.
>*/
>   if (cpu_present(cpu)) {
> - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
> - set_cpu_numa_mem(cpu,
> - local_memory_node(numa_cpu_lookup_table[cpu]));
> + node = numa_cpu_lookup_table[cpu];
> + set_cpu_numa_node(cpu, node);
> + set_cpu_numa_mem(cpu, local_memory_node(node));
>   }
> +#endif
>   }
>  
>   /* Init the cpumasks so the boot CPU is related to itself */
> -- 
> 2.17.1

cheers


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-07-22 12:26:40]:

> Hello Srikar,
> 
> On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> > 
> > While setting up the "CACHE" domain, check if shared_cache is already
> > set.
> > @@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
> > /* Update topology CPU masks */
> > add_cpu_to_masks(cpu);
> > 
> > -   if (has_big_cores)
> > -   sibling_mask = cpu_smallcore_mask;
> > /*
> >  * Check for any shared caches. Note that this must be done on a
> >  * per-core basis because one core in the pair might be disabled.
> >  */
> > -   if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> > -   shared_caches = true;
> > +   if (!shared_caches) {
> > +   struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> > +   struct cpumask *mask = cpu_l2_cache_mask(cpu);
> > +
> > +   if (has_big_cores)
> > +   sibling_mask = cpu_smallcore_mask;
> > +
> > +   if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> > +   shared_caches = true;
> 
> At the risk of repeating my comment to the v1 version of the patch, we
> have shared caches only l2_cache_mask(cpu) is a strict superset of
> sibling_mask(cpu).
> 
> "cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))" does not
> capture this.

Why would it not? We are setting shared_caches if and only if l2_cache_mask
is a strict superset of sibling/smallcore mask.

> Could we please use
> 
>   if (!cpumask_equal(sibling_mask(cpu), mask) &&
> cpumask_subset(sibling_mask(cpu), mask) {
>   }
> 

Scheduler mandates that two cpumasks for the same CPU would either have to
be equal or one of them has to be a strict superset of the other. If not the
scheduler would mark our domains as broken. That being the case,
cpumask_weight will correctly capture what we are looking for. That said
your condition is also correct but slightly heavier and doesn't provide us
with any more information or correctness.
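A small standalone check (plain bitmasks here, not the kernel cpumask API) illustrates why the two formulations agree whenever the masks obey the scheduler's equal-or-strict-superset invariant:

```c
#include <stdbool.h>

static int weight(unsigned int m)          /* stand-in for cpumask_weight() */
{
	return __builtin_popcount(m);
}

/* Gautham's formulation: masks differ and sibling is a subset of l2 */
static bool strict_superset(unsigned int l2, unsigned int sib)
{
	return l2 != sib && (sib & ~l2) == 0;
}

/* Srikar's formulation: simply compare weights */
static bool heavier(unsigned int l2, unsigned int sib)
{
	return weight(l2) > weight(sib);
}
```

For any sib that is a subset of l2 (the invariant), `heavier()` is true exactly when `strict_superset()` is true; the weight test only diverges for overlapping, non-nested masks, which the scheduler forbids.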

> 
> Otherwise the patch looks good to me.
> 
> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH] powerpc/64s: Fix irq tracing corruption in interrupt/syscall return caused by perf interrupts

2020-07-22 Thread Nicholas Piggin
Alexey reports lockdep_assert_irqs_enabled() warnings when stress testing perf, e.g.,

WARNING: CPU: 0 PID: 1556 at kernel/softirq.c:169 __local_bh_enable_ip+0x258/0x270
CPU: 0 PID: 1556 Comm: syz-executor
NIP:  c01ec888 LR: c01ec884 CTR: c0ef0610
REGS: c00022d4f8a0 TRAP: 0700   Not tainted  (5.8.0-rc3-x)
MSR:  80029033   CR: 28008844  XER: 2004
CFAR: c01dc1d0 IRQMASK: 0

The interesting thing is that MSR[EE] and IRQMASK show interrupts are enabled,
suggesting the current->hardirqs_enabled irq tracing state is going out of sync
with the actual interrupt enable state.

The cause is a window in interrupt/syscall return where irq tracing state is being
adjusted for an irqs-enabled return while MSR[EE] is still enabled. A perf
interrupt hits and ends up calling trace_hardirqs_off() when restoring
interrupt flags to a disabled state.

Fix this by disabling perf interrupts as well while adjusting irq tracing state.

Add a debug check that catches the condition sooner.

Fixes: 68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
Reported-by: Alexey Kardashevskiy 
Signed-off-by: Nicholas Piggin 
---

I can reproduce similar symptoms and this patch fixes my test case,
still trying to confirm Alexey's test case or whether there's another
similar bug causing it.

 arch/powerpc/kernel/syscall_64.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
index 79edba3ab312..6c6f88eff915 100644
--- a/arch/powerpc/kernel/syscall_64.c
+++ b/arch/powerpc/kernel/syscall_64.c
@@ -107,8 +107,13 @@ notrace long system_call_exception(long r3, long r4, long r5,
  */
 static notrace inline bool prep_irq_for_enabled_exit(void)
 {
-   /* This must be done with RI=1 because tracing may touch vmaps */
-   trace_hardirqs_on();
+   if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) {
+   /* Prevent perf interrupts hitting and messing up the trace_hardirqs state */
+   irq_soft_mask_set(IRQS_ALL_DISABLED);
+
+   /* This must be done with RI=1 because tracing may touch vmaps */
+   trace_hardirqs_on();
+   }
 
/* This pattern matches prep_irq_for_idle */
__hard_EE_RI_disable();
@@ -123,6 +128,8 @@ static notrace inline bool prep_irq_for_enabled_exit(void)
local_paca->irq_happened = 0;
irq_soft_mask_set(IRQS_ENABLED);
 
+   lockdep_assert_irqs_enabled();
+
return true;
 }
 
-- 
2.23.0



Re: [PATCH v2 09/10] Powerpc/smp: Create coregroup domain

2020-07-22 Thread Srikar Dronamraju
> > @@ -1386,6 +1421,9 @@ int setup_profiling_timer(unsigned int multiplier)
> > 
> >  static void fixup_topology(void)
> >  {
> > +   if (!has_coregroup_support())
> > +   powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> > +
> 
> Shouldn't we move this condition after doing the fixup for shared
> caches ? Because if we have shared_caches, but not core_group, then we
> want the coregroup domain to degenerate correctly.
> 

Currently we aren't consolidating, and hence the order doesn't matter for
degeneration. However even if in future, if we want to consolidate the
domains before calling set_sched_topology(), this order would be ideal.

> 
> > if (shared_caches) {
> > pr_info("Using shared cache scheduler topology\n");
> > powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-22 Thread Laurent Dufour

Le 21/07/2020 à 23:37, Ram Pai a écrit :

On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:

When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to directly calling it instead of triggering such a mechanism. This

 ^^ call directly instead of triggering..


way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is help in read mode during that time, not in write

  ^^ held


mode since the virual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai 


Reviewed-by: Ram Pai 


Thanks for reviewing this series.

Regarding the wordsmithing, Paul, could you manage that when pulling the series?

Thanks,
Laurent.


Re: [PATCH v2 10/10] powerpc/smp: Implement cpu_to_coregroup_id

2020-07-22 Thread Gautham R Shenoy
On Tue, Jul 21, 2020 at 05:08:14PM +0530, Srikar Dronamraju wrote:
> Lookup the coregroup id from the associativity array.
> 
> If unable to detect the coregroup id, fallback on the core id.
> This way, ensure sched_domain degenerates and an extra sched domain is
> not created.
> 
> Ideally this function should have been implemented in
> arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
> don't need to find the primary domain again.
> 
> If the device-tree mentions more than one coregroup, then kernel
> implements only the last or the smallest coregroup, which currently
> corresponds to the penultimate domain in the device-tree.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

Looks good to me.

Reviewed-by : Gautham R. Shenoy 


> ---
> Changelog v1 -> v2:
> powerpc/smp: Implement cpu_to_coregroup_id
>   Move coregroup_enabled before getting associativity (Gautham)
> 
>  arch/powerpc/mm/numa.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index ef8aa580da21..ae57b68beaee 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
> 
>  int cpu_to_coregroup_id(int cpu)
>  {
> + __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> + int index;
> +
> + if (cpu < 0 || cpu > nr_cpu_ids)
> + return -1;
> +
> + if (!coregroup_enabled)
> + goto out;
> +
> + if (!firmware_has_feature(FW_FEATURE_VPHN))
> + goto out;
> +
> + if (vphn_get_associativity(cpu, associativity))
> + goto out;
> +
> + index = of_read_number(associativity, 1);
> + if (index > min_common_depth + 1)
> + return of_read_number(&associativity[index - 1], 1);
> +
> +out:
>   return cpu_to_core_id(cpu);
>  }
> 
> -- 
> 2.17.1
> 


Re: [PATCH v2 09/10] Powerpc/smp: Create coregroup domain

2020-07-22 Thread Gautham R Shenoy
Hi Srikar,

On Tue, Jul 21, 2020 at 05:08:13PM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

A query below.

> ---
> Changelog v1 -> v2:
> Powerpc/smp: Create coregroup domain
>   Moved coregroup topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/include/asm/topology.h | 10 
>  arch/powerpc/kernel/smp.c   | 38 +
>  arch/powerpc/mm/numa.c  |  5 
>  3 files changed, 53 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..6609174918ab 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h

[..snip..]

> @@ -91,6 +92,7 @@ enum {
>   smt_idx,
>  #endif
>   bigcore_idx,
> + mc_idx,
>   die_idx,
>  };
> 


[..snip..]

> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>   { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> + { cpu_mc_mask, SD_INIT_NAME(MC) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };


[..snip..]

> @@ -1386,6 +1421,9 @@ int setup_profiling_timer(unsigned int multiplier)
> 
>  static void fixup_topology(void)
>  {
> + if (!has_coregroup_support())
> + powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> +

Shouldn't we move this condition after doing the fixup for shared
caches ? Because if we have shared_caches, but not core_group, then we
want the coregroup domain to degenerate correctly.


>   if (shared_caches) {
>   pr_info("Using shared cache scheduler topology\n");
>   powerpc_topology[bigcore_idx].mask = shared_cache_mask;


--
Thanks and regards
gautham.


Re: [PATCH v2 04/10] powerpc/smp: Enable small core scheduling sooner

2020-07-22 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-07-22 11:29:25]:

> Hello Srikar,
> 
> On Tue, Jul 21, 2020 at 05:08:08PM +0530, Srikar Dronamraju wrote:
> > Enable small core scheduling as soon as we detect that we are in a
> > system that supports thread group. Doing so would avoid a redundant
> > check.
> 
> The patch looks good to me. However, I think the commit message still
> reflect the v1 code where we were moving the functionality from
> smp_cpus_done() to init_big_cores().
> 
> In this we are moving it to a helper function to collate all fixups to
> topology.
> 

Right.
Thanks for pointing , will correct this.


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-07-22 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-07-22 11:51:14]:

> Hi Srikar,
> 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 72f16dc0cb26..57468877499a 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
> > if (!l2_cache)
> > return false;
> > 
> > +   cpumask_set_cpu(cpu, mask_fn(cpu));
> 
> 
> Ok, we need to do this because "cpu" is not yet set in the
> cpu_online_mask. Prior to your patch the "cpu" was getting set in
> cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in
> the patch.
> 

Right.

> 
> > for_each_cpu(i, cpu_online_mask) {
> > /*
> >  * when updating the marks the current CPU has not been marked
> > @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
> >  * add it to it's own thread sibling mask.
> >  */
> > cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> > +   cpumask_set_cpu(cpu, cpu_core_mask(cpu));

Note: Above, we are explicitly setting the cpu_core_mask.

> > 
> > for (i = first_thread; i < first_thread + threads_per_core; i++)
> > if (cpu_online(i))
> > set_cpus_related(i, cpu, cpu_sibling_mask);
> > 
> > add_cpu_to_smallcore_masks(cpu);
> > -   /*
> > -* Copy the thread sibling mask into the cache sibling mask
> > -* and mark any CPUs that share an L2 with this CPU.
> > -*/
> > -   for_each_cpu(i, cpu_sibling_mask(cpu))
> > -   set_cpus_related(cpu, i, cpu_l2_cache_mask);
> > update_mask_by_l2(cpu, cpu_l2_cache_mask);
> > 
> > -   /*
> > -* Copy the cache sibling mask into core sibling mask and mark
> > -* any CPUs on the same chip as this CPU.
> > -*/
> > -   for_each_cpu(i, cpu_l2_cache_mask(cpu))
> > -   set_cpus_related(cpu, i, cpu_core_mask);
> > +   if (pkg_id == -1) {
> 
> I suppose this "if" condition is an optimization, since if pkg_id != -1,
> we anyway set these CPUs in the cpu_core_mask below.
> 
> However...

This is not just an optimization.
The removed hunk would only work if cpu_l2_cache_mask is a superset of
cpu_sibling_mask (the previous assumption that we want to break).
If cpu_sibling_mask is bigger than cpu_l2_cache_mask and pkg_id is -1,
then setting only the cpu_l2_cache_mask bits in cpu_core_mask would result
in a broken topology.
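
The failure mode discussed above can be sketched with toy bitmasks. This is a minimal userspace illustration, not kernel code: the mask values, the uint64_t representation, and the helper name build_core_mask are all made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model: one bit per CPU. Suppose the sibling (thread) mask is
 * BIGGER than the L2 mask, which is the case the patch has to handle.
 * Always copying the L2 mask into the core mask would drop the sibling
 * CPUs outside the L2 domain when pkg_id is -1.
 */
static uint64_t build_core_mask(uint64_t sibling_mask, uint64_t l2_mask,
                                int shared_caches)
{
    /* Patch logic: only use the L2 mask when caches are shared,
     * i.e. when the L2 mask is known to cover the sibling mask. */
    return shared_caches ? l2_mask : sibling_mask;
}
```

With eight SMT threads of which only four share an L2, copying only the L2 mask loses half the core; picking the covering mask keeps the topology intact.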

> 
> > +   struct cpumask *(*mask)(int) = cpu_sibling_mask;
> > +
> > +   /*
> > +* Copy the sibling mask into core sibling mask and
> > +* mark any CPUs on the same chip as this CPU.
> > +*/
> > +   if (shared_caches)
> > +   mask = cpu_l2_cache_mask;
> > +
> > +   for_each_cpu(i, mask(cpu))
> > +   set_cpus_related(cpu, i, cpu_core_mask);
> > 
> > -   if (pkg_id == -1)
> > return;
> > +   }
> 
> 
> ... since "cpu" is not yet set in the cpu_online_mask, do we not miss setting
> "cpu" in the cpu_core_mask(cpu) in the for-loop below ?
> 
> 

As noted above, we now set the cpu in cpu_core_mask explicitly before the
loop, so we don't miss the cpu and the behaviour is no different from
before.

> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH v2 16/16] powerpc/powernv/sriov: Remove vfs_expanded

2020-07-22 Thread Oliver O'Halloran
Previously iov->vfs_expanded was used for two purposes.

1) To work out how much we need to multiply the per-VF BAR size by to
   figure out the total space required for the IOV BAR.

2) To indicate that IOV is not usable with this device (vfs_expanded == 0).

We don't really need the field for either purpose. The multiplier in 1) is
always the number of PEs supported by the PHB, and for 2) the IOV data
field will be NULL if we can't use IOV with the device.
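
The alignment calculation the patch lands on can be sketched in isolation. This is an illustrative stand-in with made-up numbers, not the kernel function: iov_bar_alignment is a hypothetical name and unsigned long stands in for resource_size_t.

```c
#include <assert.h>

/*
 * Sketch: the total IOV BAR alignment is the per-VF BAR size scaled by
 * the PHB's total PE count, so there is nothing to cache in
 * vfs_expanded. (Hypothetical helper for illustration only.)
 */
static unsigned long iov_bar_alignment(unsigned long per_vf_size,
                                       int total_pe_num)
{
    return per_vf_size * (unsigned long)total_pe_num;
}
```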

Signed-off-by: Oliver O'Halloran 
---
v2: New
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 9 +++--
 arch/powerpc/platforms/powernv/pci.h   | 3 ---
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 76215d01405b..5742215b4093 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -213,8 +213,6 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
iov->need_shift = true;
}
 
-   iov->vfs_expanded = mul;
-
return;
 
 disable_iov:
@@ -256,6 +254,7 @@ void pnv_pci_ioda_fixup_iov(struct pci_dev *pdev)
 resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
  int resno)
 {
+   struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
struct pnv_iov_data *iov = pnv_iov_get(pdev);
resource_size_t align;
 
@@ -267,8 +266,6 @@ resource_size_t pnv_pci_iov_resource_alignment(struct 
pci_dev *pdev,
 */
if (!iov)
return align;
-   if (!iov->vfs_expanded)
-   return align;
 
align = pci_iov_resource_size(pdev, resno);
 
@@ -290,7 +287,7 @@ resource_size_t pnv_pci_iov_resource_alignment(struct 
pci_dev *pdev,
 * If the M64 BAR is in Single PE mode, return the VF BAR size or
 * M64 segment size if IOV BAR size is less.
 */
-   return iov->vfs_expanded * align;
+   return phb->ioda.total_pe_num * align;
 }
 
 static int pnv_pci_vf_release_m64(struct pci_dev *pdev, u16 num_vfs)
@@ -708,7 +705,7 @@ static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 
num_vfs)
return -ENXIO;
}
 
-   if (!iov->vfs_expanded) {
+   if (!iov) {
dev_info(&pdev->dev, "don't support this SRIOV device"
" with non 64bit-prefetchable IOV BAR\n");
return -ENOSPC;
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index 902e928c7c22..c8cc152bdf52 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -234,9 +234,6 @@ void pnv_ioda_free_pe(struct pnv_ioda_pe *pe);
  * and this structure is used to keep track of it all.
  */
 struct pnv_iov_data {
-   /* number of VFs IOV BAR expanded. FIXME: rename this to something less 
bad */
-   u16 vfs_expanded;
-
/* number of VFs enabled */
u16 num_vfs;
 
-- 
2.26.2



[PATCH v2 15/16] powerpc/powernv/sriov: Make single PE mode a per-BAR setting

2020-07-22 Thread Oliver O'Halloran
Using single PE BARs to map an SR-IOV BAR is really a choice about what
strategy to use when mapping a BAR. It doesn't make much sense for this to
be a global setting since a device might have one large BAR which needs to
be mapped with single PE windows and another smaller BAR that can be mapped
with a regular segmented window. Make the segmented vs single decision a
per-BAR setting and clean up the logic that decides which mode to use.

Signed-off-by: Oliver O'Halloran 
---
v2: Dropped unused total_vfs variables in pnv_pci_ioda_fixup_iov_resources()
Dropped bar_no from pnv_pci_iov_resource_alignment()
Minor re-wording of comments.
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 131 ++---
 arch/powerpc/platforms/powernv/pci.h   |  11 +-
 2 files changed, 73 insertions(+), 69 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index ce8ad6851d73..76215d01405b 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -146,21 +146,17 @@
 static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 {
struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
-   const resource_size_t gate = phb->ioda.m64_segsize >> 2;
struct resource *res;
int i;
-   resource_size_t size, total_vf_bar_sz;
+   resource_size_t vf_bar_sz;
struct pnv_iov_data *iov;
-   int mul, total_vfs;
+   int mul;
 
iov = kzalloc(sizeof(*iov), GFP_KERNEL);
if (!iov)
goto disable_iov;
pdev->dev.archdata.iov_data = iov;
-
-   total_vfs = pci_sriov_get_totalvfs(pdev);
mul = phb->ioda.total_pe_num;
-   total_vf_bar_sz = 0;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = >resource[i + PCI_IOV_RESOURCES];
@@ -173,50 +169,50 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
goto disable_iov;
}
 
-   total_vf_bar_sz += pci_iov_resource_size(pdev,
-   i + PCI_IOV_RESOURCES);
+   vf_bar_sz = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
 
/*
-* If bigger than quarter of M64 segment size, just round up
-* power of two.
+* Generally, one segmented M64 BAR maps one IOV BAR. However,
+* if a VF BAR is too large we end up wasting a lot of space.
+* If each VF needs more than 1/4 of the default m64 segment
+* then each VF BAR should be mapped in single-PE mode to reduce
+* the amount of space required. This does however limit the
+* number of VFs we can support.
 *
-* Generally, one M64 BAR maps one IOV BAR. To avoid conflict
-* with other devices, IOV BAR size is expanded to be
-* (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
-* segment size , the expanded size would equal to half of the
-* whole M64 space size, which will exhaust the M64 Space and
-* limit the system flexibility.  This is a design decision to
-* set the boundary to quarter of the M64 segment size.
+* The 1/4 limit is arbitrary and can be tweaked.
 */
-   if (total_vf_bar_sz > gate) {
-   mul = roundup_pow_of_two(total_vfs);
-   dev_info(>dev,
-   "VF BAR Total IOV size %llx > %llx, roundup to 
%d VFs\n",
-   total_vf_bar_sz, gate, mul);
-   iov->m64_single_mode = true;
-   break;
-   }
-   }
+   if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
+   /*
+* On PHB3, the minimum size alignment of M64 BAR in
+* single mode is 32MB. If this VF BAR is smaller than
+* 32MB, but still too large for a segmented window
+* then we can't map it and need to disable SR-IOV for
+* this device.
+*/
+   if (vf_bar_sz < SZ_32M) {
+   pci_err(pdev, "VF BAR%d: %pR can't be mapped in 
single PE mode\n",
+   i, res);
+   goto disable_iov;
+   }
 
-   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-   res = >resource[i + PCI_IOV_RESOURCES];
-   if (!res->flags || res->parent)
+   iov->m64_single_mode[i] = true;
continue;
+   }
 
-   size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
/*
-* On PHB3, the minimum size 

[PATCH v2 14/16] powerpc/powernv/sriov: Refactor M64 BAR setup

2020-07-22 Thread Oliver O'Halloran
Split up the logic so that we have one branch that handles setting up a
segmented window and another that handles setting up single PE windows for
each VF.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: no changes
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 57 ++
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index f9aa5773dc6d..ce8ad6851d73 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -438,52 +438,49 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
struct resource   *res;
inti, j;
int64_trc;
-   inttotal_vfs;
resource_size_tsize, start;
-   intm64_bars;
+   intbase_pe_num;
 
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
-   total_vfs = pci_sriov_get_totalvfs(pdev);
-
-   if (iov->m64_single_mode)
-   m64_bars = num_vfs;
-   else
-   m64_bars = 1;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = >resource[i + PCI_IOV_RESOURCES];
if (!res->flags || !res->parent)
continue;
 
-   for (j = 0; j < m64_bars; j++) {
+   /* don't need single mode? map everything in one go! */
+   if (!iov->m64_single_mode) {
win = pnv_pci_alloc_m64_bar(phb, iov);
if (win < 0)
goto m64_failed;
 
-   if (iov->m64_single_mode) {
-   int pe_num = iov->vf_pe_arr[j].pe_number;
-
-   size = pci_iov_resource_size(pdev,
-   PCI_IOV_RESOURCES + i);
-   start = res->start + size * j;
-   rc = pnv_ioda_map_m64_single(phb, win,
-pe_num,
-start,
-size);
-   } else {
-   size = resource_size(res);
-   start = res->start;
-
-   rc = pnv_ioda_map_m64_segmented(phb, win, start,
-   size);
-   }
+   size = resource_size(res);
+   start = res->start;
 
-   if (rc != OPAL_SUCCESS) {
-   dev_err(>dev, "Failed to map M64 window 
#%d: %lld\n",
-   win, rc);
+   rc = pnv_ioda_map_m64_segmented(phb, win, start, size);
+   if (rc)
+   goto m64_failed;
+
+   continue;
+   }
+
+   /* otherwise map each VF with single PE BARs */
+   size = pci_iov_resource_size(pdev, PCI_IOV_RESOURCES + i);
+   base_pe_num = iov->vf_pe_arr[0].pe_number;
+
+   for (j = 0; j < num_vfs; j++) {
+   win = pnv_pci_alloc_m64_bar(phb, iov);
+   if (win < 0)
+   goto m64_failed;
+
+   start = res->start + size * j;
+   rc = pnv_ioda_map_m64_single(phb, win,
+base_pe_num + j,
+start,
+size);
+   if (rc)
goto m64_failed;
-   }
}
}
return 0;
-- 
2.26.2



[PATCH v2 13/16] powerpc/powernv/sriov: Move M64 BAR allocation into a helper

2020-07-22 Thread Oliver O'Halloran
I want to refactor the loop this code is currently inside of. Hoist it on
out.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: no change
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 31 ++
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index b60d8a054a61..f9aa5773dc6d 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -413,6 +413,23 @@ static int64_t pnv_ioda_map_m64_single(struct pnv_phb *phb,
return rc;
 }
 
+static int pnv_pci_alloc_m64_bar(struct pnv_phb *phb, struct pnv_iov_data *iov)
+{
+   int win;
+
+   do {
+   win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+   phb->ioda.m64_bar_idx + 1, 0);
+
+   if (win >= phb->ioda.m64_bar_idx + 1)
+   return -1;
+   } while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+   set_bit(win, iov->used_m64_bar_mask);
+
+   return win;
+}
+
 static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
 {
struct pnv_iov_data   *iov;
@@ -440,17 +457,9 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 
num_vfs)
continue;
 
for (j = 0; j < m64_bars; j++) {
-
-   /* allocate a window ID for this BAR */
-   do {
-   win = 
find_next_zero_bit(>ioda.m64_bar_alloc,
-   phb->ioda.m64_bar_idx + 1, 0);
-
-   if (win >= phb->ioda.m64_bar_idx + 1)
-   goto m64_failed;
-   } while (test_and_set_bit(win, 
>ioda.m64_bar_alloc));
-   set_bit(win, iov->used_m64_bar_mask);
-
+   win = pnv_pci_alloc_m64_bar(phb, iov);
+   if (win < 0)
+   goto m64_failed;
 
if (iov->m64_single_mode) {
int pe_num = iov->vf_pe_arr[j].pe_number;
-- 
2.26.2



[PATCH v2 12/16] powerpc/powernv/sriov: De-indent setup and teardown

2020-07-22 Thread Oliver O'Halloran
Remove the IODA2 PHB checks. We already assume IODA2 in several places so
there's not much point in wrapping most of the setup and teardown process
in an if block.

Signed-off-by: Oliver O'Halloran 
---
v2: Added a note that iov->vf_pe_arr is a pointer into the PHB's PE array
rather than something we allocate.
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 86 --
 arch/powerpc/platforms/powernv/pci.h   |  5 +-
 2 files changed, 50 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 5981323cd9a6..b60d8a054a61 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -607,16 +607,18 @@ static void pnv_pci_sriov_disable(struct pci_dev *pdev)
num_vfs = iov->num_vfs;
base_pe = iov->vf_pe_arr[0].pe_number;
 
+   if (WARN_ON(!iov))
+   return;
+
/* Release VF PEs */
pnv_ioda_release_vf_PE(pdev);
 
-   if (phb->type == PNV_PHB_IODA2) {
-   if (!iov->m64_single_mode)
-   pnv_pci_vf_resource_shift(pdev, -base_pe);
+   /* Un-shift the IOV BAR resources */
+   if (!iov->m64_single_mode)
+   pnv_pci_vf_resource_shift(pdev, -base_pe);
 
-   /* Release M64 windows */
-   pnv_pci_vf_release_m64(pdev, num_vfs);
-   }
+   /* Release M64 windows */
+   pnv_pci_vf_release_m64(pdev, num_vfs);
 }
 
 static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
@@ -690,41 +692,51 @@ static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 
num_vfs)
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
 
-   if (phb->type == PNV_PHB_IODA2) {
-   if (!iov->vfs_expanded) {
-   dev_info(>dev, "don't support this SRIOV device"
-   " with non 64bit-prefetchable IOV BAR\n");
-   return -ENOSPC;
-   }
+   /*
+* There are calls to IODA2 PE setup code littered throughout. We could
+* probably fix that, but we'd still have problems due to the
+* restriction inherent on IODA1 PHBs.
+*
+* NB: We class IODA3 as IODA2 since they're very similar.
+*/
+   if (phb->type != PNV_PHB_IODA2) {
+   pci_err(pdev, "SR-IOV is not supported on this PHB\n");
+   return -ENXIO;
+   }
 
-   /* allocate a contigious block of PEs for our VFs */
-   base_pe = pnv_ioda_alloc_pe(phb, num_vfs);
-   if (!base_pe) {
-   pci_err(pdev, "Unable to allocate PEs for %d VFs\n", 
num_vfs);
-   return -EBUSY;
-   }
+   if (!iov->vfs_expanded) {
+   dev_info(&pdev->dev, "don't support this SRIOV device"
+   " with non 64bit-prefetchable IOV BAR\n");
+   return -ENOSPC;
+   }
 
-   iov->vf_pe_arr = base_pe;
-   iov->num_vfs = num_vfs;
+   /* allocate a contigious block of PEs for our VFs */
+   base_pe = pnv_ioda_alloc_pe(phb, num_vfs);
+   if (!base_pe) {
+   pci_err(pdev, "Unable to allocate PEs for %d VFs\n", num_vfs);
+   return -EBUSY;
+   }
 
-   /* Assign M64 window accordingly */
-   ret = pnv_pci_vf_assign_m64(pdev, num_vfs);
-   if (ret) {
-   dev_info(>dev, "Not enough M64 window 
resources\n");
-   goto m64_failed;
-   }
+   iov->vf_pe_arr = base_pe;
+   iov->num_vfs = num_vfs;
 
-   /*
-* When using one M64 BAR to map one IOV BAR, we need to shift
-* the IOV BAR according to the PE# allocated to the VFs.
-* Otherwise, the PE# for the VF will conflict with others.
-*/
-   if (!iov->m64_single_mode) {
-   ret = pnv_pci_vf_resource_shift(pdev,
-   base_pe->pe_number);
-   if (ret)
-   goto shift_failed;
-   }
+   /* Assign M64 window accordingly */
+   ret = pnv_pci_vf_assign_m64(pdev, num_vfs);
+   if (ret) {
+   dev_info(&pdev->dev, "Not enough M64 window resources\n");
+   goto m64_failed;
+   }
+
+   /*
+* When using one M64 BAR to map one IOV BAR, we need to shift
+* the IOV BAR according to the PE# allocated to the VFs.
+* Otherwise, the PE# for the VF will conflict with others.
+*/
+   if (!iov->m64_single_mode) {
+   ret = pnv_pci_vf_resource_shift(pdev,
+   base_pe->pe_number);
+   if (ret)
+   goto shift_failed;
}
 
/* Setup VF PEs */
diff --git 

[PATCH v2 11/16] powerpc/powernv/sriov: Drop iov->pe_num_map[]

2020-07-22 Thread Oliver O'Halloran
Currently the iov->pe_num_map[] does one of two things depending on
whether single PE mode is being used or not. When it is, this contains an
array which maps a vf_index to the corresponding PE number. When single PE
mode is not being used this contains a scalar which is the base PE for the
set of enabled VFs (the PE for VFn is base + n).

The array was necessary because when calling pnv_ioda_alloc_pe() there was
no guarantee that the allocated PEs would be contiguous. We can now
allocate contiguous blocks of PEs, so this is no longer an issue. This
allows us to drop the if (single_mode) {} .. else {} blocks scattered
through the SR-IOV code, which is a nice clean-up.

This also fixes a bug in pnv_pci_sriov_disable(), which used the non-atomic
bitmap_clear() to manipulate the PE allocation map. Other users of the map
assume it will be accessed with atomic ops.
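
The base + n relationship can be shown with a stand-in struct. toy_pe and the PE numbers are made up for the example; this only illustrates why the pe_num_map[] lookup table becomes unnecessary once the PE block is contiguous.

```c
#include <assert.h>

/* Stand-in for struct pnv_ioda_pe; only the field the example needs. */
struct toy_pe { int pe_number; };

/*
 * With a contiguous PE block, vf_pe_arr can point at the first PE of
 * the block and the PE for VF n is simply base + n. (Illustrative
 * helper, not the kernel code.)
 */
static int vf_pe_number(const struct toy_pe *vf_pe_arr, int vf_index)
{
    return vf_pe_arr[vf_index].pe_number;
}
```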

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: Added a note to the commit message about bitmap_clear()
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 109 +
 arch/powerpc/platforms/powernv/pci.h   |   7 +-
 2 files changed, 28 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index d90e11218add..5981323cd9a6 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -453,11 +453,13 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
 
 
if (iov->m64_single_mode) {
+   int pe_num = iov->vf_pe_arr[j].pe_number;
+
size = pci_iov_resource_size(pdev,
PCI_IOV_RESOURCES + i);
start = res->start + size * j;
rc = pnv_ioda_map_m64_single(phb, win,
-iov->pe_num_map[j],
+pe_num,
 start,
 size);
} else {
@@ -596,38 +598,24 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, 
int offset)
 
 static void pnv_pci_sriov_disable(struct pci_dev *pdev)
 {
+   u16num_vfs, base_pe;
struct pnv_phb*phb;
-   struct pnv_ioda_pe*pe;
struct pnv_iov_data   *iov;
-   u16num_vfs, i;
 
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
num_vfs = iov->num_vfs;
+   base_pe = iov->vf_pe_arr[0].pe_number;
 
/* Release VF PEs */
pnv_ioda_release_vf_PE(pdev);
 
if (phb->type == PNV_PHB_IODA2) {
if (!iov->m64_single_mode)
-   pnv_pci_vf_resource_shift(pdev, -*iov->pe_num_map);
+   pnv_pci_vf_resource_shift(pdev, -base_pe);
 
/* Release M64 windows */
pnv_pci_vf_release_m64(pdev, num_vfs);
-
-   /* Release PE numbers */
-   if (iov->m64_single_mode) {
-   for (i = 0; i < num_vfs; i++) {
-   if (iov->pe_num_map[i] == IODA_INVALID_PE)
-   continue;
-
-   pe = >ioda.pe_array[iov->pe_num_map[i]];
-   pnv_ioda_free_pe(pe);
-   }
-   } else
-   bitmap_clear(phb->ioda.pe_alloc, *iov->pe_num_map, 
num_vfs);
-   /* Releasing pe_num_map */
-   kfree(iov->pe_num_map);
}
 }
 
@@ -653,13 +641,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 
num_vfs)
int vf_bus = pci_iov_virtfn_bus(pdev, vf_index);
struct pci_dn *vf_pdn;
 
-   if (iov->m64_single_mode)
-   pe_num = iov->pe_num_map[vf_index];
-   else
-   pe_num = *iov->pe_num_map + vf_index;
-
-   pe = >ioda.pe_array[pe_num];
-   pe->pe_number = pe_num;
+   pe = &iov->vf_pe_arr[vf_index];
pe->phb = phb;
pe->flags = PNV_IODA_PE_VF;
pe->pbus = NULL;
@@ -667,6 +649,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 
num_vfs)
pe->mve_number = -1;
pe->rid = (vf_bus << 8) | vf_devfn;
 
+   pe_num = pe->pe_number;
pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%x\n",
pci_domain_nr(pdev->bus), pdev->bus->number,
PCI_SLOT(vf_devfn), PCI_FUNC(vf_devfn), pe_num);
@@ -698,9 +681,9 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 
num_vfs)
 
 static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
 

[PATCH v2 10/16] powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()

2020-07-22 Thread Oliver O'Halloran
Rework the PE allocation logic to allow allocating blocks of PEs rather
than individually. We'll use this to allocate contiguous blocks of PEs for
the SR-IOV VFs.

This patch also adds code to pnv_ioda_alloc_pe() and pnv_ioda_reserve_pe() to
use the existing, but unused, phb->pe_alloc_mutex. Currently these functions
use atomic bit ops to release a currently allocated PE number. However,
pnv_ioda_alloc_pe() wants exclusive access to the bitmap while scanning for
a hole large enough to accommodate the allocation size.
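
The backwards run-scan can be sketched in userspace C. This is a simplified model with a single 64-bit word standing in for the PE bitmap and no locking; alloc_pe_run and the sizes are made up for the example.

```c
#include <assert.h>

/*
 * Simplified model of the backwards run-scan (one word stands in for
 * phb->ioda.pe_alloc, no mutex): find `count` adjacent clear bits,
 * claim them, and return the lowest PE number of the run, or -1.
 */
static int alloc_pe_run(unsigned long *bitmap, int total_pe, int count)
{
    int run = 0, pe;

    /* scan backwards for a run of `count` cleared bits */
    for (pe = total_pe - 1; pe >= 0; pe--) {
        if (*bitmap & (1UL << pe)) {
            run = 0;        /* PE in use, restart the run */
            continue;
        }
        if (++run == count)
            break;          /* `pe` is now the lowest PE of the run */
    }
    if (run != count)
        return -1;

    for (int i = pe; i < pe + count; i++)
        *bitmap |= 1UL << i;    /* mark the whole run allocated */

    return pe;
}
```

Scanning from the top keeps the allocations away from the low PE numbers, matching the direction the existing kernel loop already used.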

Signed-off-by: Oliver O'Halloran 
---
v2: Add some details about the pe_alloc mutex and why we're using it.
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 41 ++-
 arch/powerpc/platforms/powernv/pci.h  |  2 +-
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2d36a9ebf0e9..c9c25fb0783c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -145,23 +145,45 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int 
pe_no)
return;
}
 
+   mutex_lock(&phb->ioda.pe_alloc_mutex);
if (test_and_set_bit(pe_no, phb->ioda.pe_alloc))
pr_debug("%s: PE %x was reserved on PHB#%x\n",
 __func__, pe_no, phb->hose->global_number);
+   mutex_unlock(&phb->ioda.pe_alloc_mutex);
 
pnv_ioda_init_pe(phb, pe_no);
 }
 
-struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
+struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count)
 {
-   long pe;
+   struct pnv_ioda_pe *ret = NULL;
+   int run = 0, pe, i;
 
+   mutex_lock(&phb->ioda.pe_alloc_mutex);
+
+   /* scan backwards for a run of @count cleared bits */
for (pe = phb->ioda.total_pe_num - 1; pe >= 0; pe--) {
-   if (!test_and_set_bit(pe, phb->ioda.pe_alloc))
-   return pnv_ioda_init_pe(phb, pe);
+   if (test_bit(pe, phb->ioda.pe_alloc)) {
+   run = 0;
+   continue;
+   }
+
+   run++;
+   if (run == count)
+   break;
}
+   if (run != count)
+   goto out;
 
-   return NULL;
+   for (i = pe; i < pe + count; i++) {
+   set_bit(i, phb->ioda.pe_alloc);
+   pnv_ioda_init_pe(phb, i);
+   }
+   ret = &phb->ioda.pe_array[pe];
+
+out:
+   mutex_unlock(&phb->ioda.pe_alloc_mutex);
+   return ret;
 }
 
 void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
@@ -173,7 +195,10 @@ void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
WARN_ON(pe->npucomp); /* NPUs for nvlink are not supposed to be freed */
kfree(pe->npucomp);
memset(pe, 0, sizeof(struct pnv_ioda_pe));
+
+   mutex_lock(&phb->ioda.pe_alloc_mutex);
clear_bit(pe_num, phb->ioda.pe_alloc);
+   mutex_unlock(&phb->ioda.pe_alloc_mutex);
 }
 
 /* The default M64 BAR is shared by all PEs */
@@ -976,7 +1001,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct 
pci_dev *dev)
if (pdn->pe_number != IODA_INVALID_PE)
return NULL;
 
-   pe = pnv_ioda_alloc_pe(phb);
+   pe = pnv_ioda_alloc_pe(phb, 1);
if (!pe) {
pr_warn("%s: Not enough PE# available, disabling device\n",
pci_name(dev));
@@ -1047,7 +1072,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct 
pci_bus *bus, bool all)
 
/* The PE number isn't pinned by M64 */
if (!pe)
-   pe = pnv_ioda_alloc_pe(phb);
+   pe = pnv_ioda_alloc_pe(phb, 1);
 
if (!pe) {
pr_warn("%s: Not enough PE# available for PCI bus %04x:%02x\n",
@@ -3065,7 +3090,7 @@ static void __init pnv_pci_init_ioda_phb(struct 
device_node *np,
pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_idx);
} else {
/* otherwise just allocate one */
-   root_pe = pnv_ioda_alloc_pe(phb);
+   root_pe = pnv_ioda_alloc_pe(phb, 1);
phb->ioda.root_pe_idx = root_pe->pe_number;
}
 
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index 23fc5e391c7f..06431a452130 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -224,7 +224,7 @@ int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct 
pnv_ioda_pe *pe);
 void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe);
 void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe);
 
-struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb);
+struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count);
 void pnv_ioda_free_pe(struct pnv_ioda_pe *pe);
 
 #ifdef CONFIG_PCI_IOV
-- 
2.26.2



[PATCH v2 09/16] powerpc/powernv/sriov: Factor out M64 BAR setup

2020-07-22 Thread Oliver O'Halloran
The sequence required to use the single PE BAR mode is kinda janky and
requires a little explanation. The API was designed with P7-IOC style
windows where the setup process is something like:

1. Configure the window start / end address
2. Enable the window
3. Map the segments of each window to the PE

For Single PE BARs the process is:

1. Set the PE for segment zero on a disabled window
2. Set the range
3. Enable the window

Move the OPAL calls into their own helper functions where the quirks can be
contained.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: renamed "accordion" window to "segmented"
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 129 -
 1 file changed, 100 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index b48952e59ce0..d90e11218add 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -320,6 +320,99 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
u16 num_vfs)
return 0;
 }
 
+
+/*
+ * PHB3 and beyond support segmented windows. The window's address range
+ * is subdivided into phb->ioda.total_pe_num segments and there's a 1-1
+ * mapping between PEs and segments.
+ */
+static int64_t pnv_ioda_map_m64_segmented(struct pnv_phb *phb,
+ int window_id,
+ resource_size_t start,
+ resource_size_t size)
+{
+   int64_t rc;
+
+   rc = opal_pci_set_phb_mem_window(phb->opal_id,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+start,
+0, /* unused */
+size);
+   if (rc)
+   goto out;
+
+   rc = opal_pci_phb_mmio_enable(phb->opal_id,
+ OPAL_M64_WINDOW_TYPE,
+ window_id,
+ OPAL_ENABLE_M64_SPLIT);
+out:
+   if (rc)
+   pr_err("Failed to map M64 window #%d: %lld\n", window_id, rc);
+
+   return rc;
+}
+
+static int64_t pnv_ioda_map_m64_single(struct pnv_phb *phb,
+  int pe_num,
+  int window_id,
+  resource_size_t start,
+  resource_size_t size)
+{
+   int64_t rc;
+
+   /*
+* The API for setting up m64 mmio windows seems to have been designed
+* with P7-IOC in mind. For that chip each M64 BAR (window) had a fixed
+* split of 8 equally sized segments each of which could individually
+* assigned to a PE.
+*
+* The problem with this is that the API doesn't have any way to
+* communicate the number of segments we want on a BAR. This wasn't
+* a problem for p7-ioc since you didn't have a choice, but the
+* single PE windows added in PHB3 don't map cleanly to this API.
+*
+* As a result we've got this slightly awkward process where we
+* call opal_pci_map_pe_mmio_window() to put the window in single
+* PE mode, and set the PE for the window before setting the address
+* bounds. We need to do it this way because the single PE windows
+* on PHB3 have different alignment requirements.
+*/
+   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+pe_num,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+0);
+   if (rc)
+   goto out;
+
+   /*
+* NB: In single PE mode the window needs to be aligned to 32MB
+*/
+   rc = opal_pci_set_phb_mem_window(phb->opal_id,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+start,
+0, /* ignored by FW, m64 is 1-1 */
+size);
+   if (rc)
+   goto out;
+
+   /*
+* Now actually enable it. We specified the BAR should be in "non-split"
+* mode so FW will validate that the BAR is in single PE mode.
+*/
+   rc = opal_pci_phb_mmio_enable(phb->opal_id,
+ OPAL_M64_WINDOW_TYPE,
+ window_id,
+ OPAL_ENABLE_M64_NON_SPLIT);
+out:
+   if (rc)
+   pr_err("Error mapping single PE BAR\n");
+
+   return rc;
+}
+
 static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
 {
struct pnv_iov_data   *iov;

[PATCH v2 08/16] powerpc/powernv/sriov: Simplify used window tracking

2020-07-22 Thread Oliver O'Halloran
No need for the multi-dimensional arrays, just use a bitmap.
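
The bitmap scheme can be sketched in userspace C. This is a toy stand-in, not the kernel code: a 64-bit word replaces the PHB-wide m64_bar_alloc bitmap and the per-device used_m64_bar_mask, and alloc_window/release_windows are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Claim the lowest free window ID in the PHB-wide map and record it in
 * the per-device mask, mirroring the one-bit-per-window idea. */
static int alloc_window(uint64_t *phb_alloc, uint64_t *used_mask, int max_bars)
{
    for (int win = 0; win < max_bars; win++) {
        if (!(*phb_alloc & (1ULL << win))) {
            *phb_alloc |= 1ULL << win;  /* claim in the PHB-wide map */
            *used_mask |= 1ULL << win;  /* remember for this device */
            return win;
        }
    }
    return -1;
}

/* Release walks the set bits of the per-device mask instead of a
 * [vf][bar] array; real code would also disable each window via OPAL. */
static void release_windows(uint64_t *phb_alloc, uint64_t *used_mask,
                            int max_bars)
{
    for (int win = 0; win < max_bars; win++) {
        if (*used_mask & (1ULL << win))
            *phb_alloc &= ~(1ULL << win);
    }
    *used_mask = 0;
}
```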

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: Fixed license to GPL-2.0-or-later

Added MAX_M64_BARS for the size of the M64 allocation bitmap rather than
open coding 64.
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 50 +++---
 arch/powerpc/platforms/powernv/pci.h   |  8 +++-
 2 files changed, 22 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 216ceeff69b0..b48952e59ce0 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: GPL-2.0
+// SPDX-License-Identifier: GPL-2.0-or-later
 
 #include 
 #include 
@@ -303,28 +303,20 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
u16 num_vfs)
 {
struct pnv_iov_data   *iov;
struct pnv_phb*phb;
-   inti, j;
-   intm64_bars;
+   int window_id;
 
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
 
-   if (iov->m64_single_mode)
-   m64_bars = num_vfs;
-   else
-   m64_bars = 1;
+   for_each_set_bit(window_id, iov->used_m64_bar_mask, MAX_M64_BARS) {
+   opal_pci_phb_mmio_enable(phb->opal_id,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+0);
 
-   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-   for (j = 0; j < m64_bars; j++) {
-   if (iov->m64_map[j][i] == IODA_INVALID_M64)
-   continue;
-   opal_pci_phb_mmio_enable(phb->opal_id,
-   OPAL_M64_WINDOW_TYPE, iov->m64_map[j][i], 0);
-   clear_bit(iov->m64_map[j][i], >ioda.m64_bar_alloc);
-   iov->m64_map[j][i] = IODA_INVALID_M64;
-   }
+   clear_bit(window_id, &phb->ioda.m64_bar_alloc);
+   }
 
-   kfree(iov->m64_map);
return 0;
 }
 
@@ -350,23 +342,14 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
else
m64_bars = 1;
 
-   iov->m64_map = kmalloc_array(m64_bars,
-sizeof(*iov->m64_map),
-GFP_KERNEL);
-   if (!iov->m64_map)
-   return -ENOMEM;
-   /* Initialize the m64_map to IODA_INVALID_M64 */
-   for (i = 0; i < m64_bars ; i++)
-   for (j = 0; j < PCI_SRIOV_NUM_BARS; j++)
-   iov->m64_map[i][j] = IODA_INVALID_M64;
-
-
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = >resource[i + PCI_IOV_RESOURCES];
if (!res->flags || !res->parent)
continue;
 
for (j = 0; j < m64_bars; j++) {
+
+   /* allocate a window ID for this BAR */
do {
win = 
find_next_zero_bit(>ioda.m64_bar_alloc,
phb->ioda.m64_bar_idx + 1, 0);
@@ -374,8 +357,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 
num_vfs)
if (win >= phb->ioda.m64_bar_idx + 1)
goto m64_failed;
} while (test_and_set_bit(win, 
>ioda.m64_bar_alloc));
-
-   iov->m64_map[j][i] = win;
+   set_bit(win, iov->used_m64_bar_mask);
 
if (iov->m64_single_mode) {
size = pci_iov_resource_size(pdev,
@@ -391,12 +373,12 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
pe_num = iov->pe_num_map[j];
rc = opal_pci_map_pe_mmio_window(phb->opal_id,
pe_num, OPAL_M64_WINDOW_TYPE,
-   iov->m64_map[j][i], 0);
+   win, 0);
}
 
rc = opal_pci_set_phb_mem_window(phb->opal_id,
 OPAL_M64_WINDOW_TYPE,
-iov->m64_map[j][i],
+win,
 start,
 0, /* unused */
 size);
@@ -410,10 +392,10 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
 
if (iov->m64_single_mode)
rc = opal_pci_phb_mmio_enable(phb->opal_id,
-OPAL_M64_WINDOW_TYPE, 

[PATCH v2 07/16] powerpc/powernv/sriov: Rename truncate_iov

2020-07-22 Thread Oliver O'Halloran
The truncate_iov path prevents SR-IOV from being used by making the SR-IOV
BAR resources unallocatable. Rename it to reflect what it actually does.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: no changes
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index f4c74ab1284d..216ceeff69b0 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -155,7 +155,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
 
iov = kzalloc(sizeof(*iov), GFP_KERNEL);
if (!iov)
-   goto truncate_iov;
+   goto disable_iov;
pdev->dev.archdata.iov_data = iov;
 
total_vfs = pci_sriov_get_totalvfs(pdev);
@@ -170,7 +170,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
dev_warn(&pdev->dev, "Don't support SR-IOV with"
" non M64 VF BAR%d: %pR. \n",
 i, res);
-   goto truncate_iov;
+   goto disable_iov;
}
 
total_vf_bar_sz += pci_iov_resource_size(pdev,
@@ -209,7 +209,8 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
 * mode is 32MB.
 */
if (iov->m64_single_mode && (size < SZ_32M))
-   goto truncate_iov;
+   goto disable_iov;
+
dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
res->end = res->start + size * mul - 1;
dev_dbg(&pdev->dev, "   %pR\n", res);
@@ -220,8 +221,8 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
 
return;
 
-truncate_iov:
-   /* To save MMIO space, IOV BAR is truncated. */
+disable_iov:
+   /* Save ourselves some MMIO space by disabling the unusable BARs */
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &pdev->resource[i + PCI_IOV_RESOURCES];
res->flags = 0;
-- 
2.26.2



[PATCH v2 06/16] powerpc/powernv/sriov: Explain how SR-IOV works on PowerNV

2020-07-22 Thread Oliver O'Halloran
SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
how anyone is supposed to know how it works except through a lot of
suffering. Write up some docs about the overall story to help out
the next sucker^Wperson who needs to tinker with it.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: no changes
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 130 +
 1 file changed, 130 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 080ea39f5a83..f4c74ab1284d 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -12,6 +12,136 @@
 /* for pci_dev_is_added() */
 #include "../../../../drivers/pci/pci.h"
 
+/*
+ * The majority of the complexity in supporting SR-IOV on PowerNV comes from
+ * the need to put the MMIO space for each VF into a separate PE. Internally
+ * the PHB maps MMIO addresses to a specific PE using the "Memory BAR Table".
+ * The MBT historically only applied to the 64bit MMIO window of the PHB
+ * so it's common to see it referred to as the "M64BT".
+ *
+ * An MBT entry stores the mapped range as a <base>, <mask> pair. This forces
+ * the address range that we want to map to be power-of-two sized and aligned.
+ * For conventional PCI devices this isn't really an issue since PCI device BARs
+ * have the same requirement.
+ *
+ * For a SR-IOV BAR things are a little more awkward since size and alignment
+ * are not coupled. The alignment is set based on the per-VF BAR size, but
+ * the total BAR area is: number-of-vfs * per-vf-size. The number of VFs
+ * isn't necessarily a power of two, so neither is the total size. To fix that
+ * we need to finesse (read: hack) the Linux BAR allocator so that it will
+ * allocate the SR-IOV BARs in a way that lets us map them using the MBT.
+ *
+ * The changes to size and alignment that we need to do depend on the "mode"
+ * of MBT entry that we use. We only support SR-IOV on PHB3 (IODA2) and above,
+ * so as a baseline we can assume that we have the following BAR modes
+ * available:
+ *
+ *   NB: $PE_COUNT is the number of PEs that the PHB supports.
+ *
+ * a) A segmented BAR that splits the mapped range into $PE_COUNT equally sized
+ *segments. The n'th segment is mapped to the n'th PE.
+ * b) An un-segmented BAR that maps the whole address range to a specific PE.
+ *
+ *
+ * We prefer to use mode a) since it only requires one MBT entry per SR-IOV BAR.
+ * For comparison b) requires one entry per-VF per-BAR, or:
+ * (num-vfs * num-sriov-bars) in total. To use a) we need the size of each segment
+ * to equal the size of the per-VF BAR area. So:
+ *
+ * new_size = per-vf-size * number-of-PEs
+ *
+ * The alignment for the SR-IOV BAR also needs to be changed from per-vf-size
+ * to "new_size", calculated above. Implementing this is a convoluted process
+ * which requires several hooks in the PCI core:
+ *
+ * 1. In pcibios_add_device() we call pnv_pci_ioda_fixup_iov().
+ *
+ *At this point the device has been probed and the device's BARs are sized,
+ *but no resource allocations have been done. The SR-IOV BARs are sized
+ *based on the maximum number of VFs supported by the device and we need
+ *to increase that to new_size.
+ *
+ * 2. Later, when Linux actually assigns resources it tries to make the resource
+ *allocations for each PCI bus as compact as possible. As a part of that it
+ *sorts the BARs on a bus by their required alignment, which is calculated
+ *using pci_resource_alignment().
+ *
+ *For IOV resources this goes:
+ *pci_resource_alignment()
+ *pci_sriov_resource_alignment()
+ *pcibios_sriov_resource_alignment()
+ *pnv_pci_iov_resource_alignment()
+ *
+ *Our hook overrides the default alignment, equal to the per-vf-size, with
+ *new_size computed above.
+ *
+ * 3. When userspace enables VFs for a device:
+ *
+ *sriov_enable()
+ *   pcibios_sriov_enable()
+ *   pnv_pcibios_sriov_enable()
+ *
+ *This is where we actually allocate PE numbers for each VF and setup the
+    *MBT mapping for each SR-IOV BAR. In steps 1) and 2) we set up an "arena"
+ *where each MBT segment is equal in size to the VF BAR so we can shift
+ *around the actual SR-IOV BAR location within this arena. We need this
+ *ability because the PE space is shared by all devices on the same PHB.
+    *When using mode a) described above, segment 0 maps to PE#0, which might
+    *already be in use by another device on the PHB.
+ *
+    *As a result we need to allocate a contiguous range of PE numbers, then shift
+ *the address programmed into the SR-IOV BAR of the PF so that the address
+ *of VF0 matches up with the segment corresponding to the first allocated
+ *PE number. This is handled in pnv_pci_vf_resource_shift().
+ *
+ *Once all that is done we return to the 

[PATCH v2 05/16] powerpc/powernv/sriov: Move SR-IOV into a separate file

2020-07-22 Thread Oliver O'Halloran
pci-ioda.c is getting a bit unwieldy due to the amount of stuff jammed in
there. The SR-IOV support can be extracted easily enough and is mostly
standalone, so move it into a separate file.

This patch also moves the PowerNV SR-IOV specific fields out of pci_dn and
into a platform specific structure. I'm not sure how they ended
up in there in the first place, but leaking platform specifics into common
code has proven to be a terrible idea so far so let's stop doing that.

Signed-off-by: Oliver O'Halloran 
---
v2: no changes
---
 arch/powerpc/include/asm/device.h  |   3 +
 arch/powerpc/platforms/powernv/Makefile|   1 +
 arch/powerpc/platforms/powernv/pci-ioda.c  | 673 +
 arch/powerpc/platforms/powernv/pci-sriov.c | 642 
 arch/powerpc/platforms/powernv/pci.h   |  74 +++
 5 files changed, 738 insertions(+), 655 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-sriov.c

diff --git a/arch/powerpc/include/asm/device.h 
b/arch/powerpc/include/asm/device.h
index 266542769e4b..4d8934db7ef5 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -49,6 +49,9 @@ struct dev_archdata {
 #ifdef CONFIG_CXL_BASE
struct cxl_context  *cxl_ctx;
 #endif
+#ifdef CONFIG_PCI_IOV
+   void *iov_data;
+#endif
 };
 
 struct pdev_archdata {
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index fe3f0fb5aeca..2eb6ae150d1f 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_FA_DUMP) += opal-fadump.o
 obj-$(CONFIG_PRESERVE_FA_DUMP) += opal-fadump.o
 obj-$(CONFIG_OPAL_CORE)+= opal-core.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
+obj-$(CONFIG_PCI_IOV)   += pci-sriov.o
 obj-$(CONFIG_CXL_BASE) += pci-cxl.o
 obj-$(CONFIG_EEH)  += eeh-powernv.o
 obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8fb17676d914..2d36a9ebf0e9 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -115,26 +115,6 @@ static int __init pci_reset_phbs_setup(char *str)
 
 early_param("ppc_pci_reset_phbs", pci_reset_phbs_setup);
 
-static inline bool pnv_pci_is_m64(struct pnv_phb *phb, struct resource *r)
-{
-   /*
-* WARNING: We cannot rely on the resource flags. The Linux PCI
-* allocation code sometimes decides to put a 64-bit prefetchable
-* BAR in the 32-bit window, so we have to compare the addresses.
-*
-* For simplicity we only test resource start.
-*/
-   return (r->start >= phb->ioda.m64_base &&
-   r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
-}
-
-static inline bool pnv_pci_is_m64_flags(unsigned long resource_flags)
-{
-   unsigned long flags = (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);
-
-   return (resource_flags & flags) == flags;
-}
-
 static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
 {
s64 rc;
@@ -172,7 +152,7 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int 
pe_no)
pnv_ioda_init_pe(phb, pe_no);
 }
 
-static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
+struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
long pe;
 
@@ -184,7 +164,7 @@ static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb 
*phb)
return NULL;
 }
 
-static void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
+void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
 {
struct pnv_phb *phb = pe->phb;
unsigned int pe_num = pe->pe_number;
@@ -816,7 +796,7 @@ static void pnv_ioda_unset_peltv(struct pnv_phb *phb,
pe_warn(pe, "OPAL error %lld remove self from PELTV\n", rc);
 }
 
-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
struct pci_dev *parent;
uint8_t bcomp, dcomp, fcomp;
@@ -887,7 +867,7 @@ static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, 
struct pnv_ioda_pe *pe)
return 0;
 }
 
-static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
struct pci_dev *parent;
uint8_t bcomp, dcomp, fcomp;
@@ -982,91 +962,6 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, 
struct pnv_ioda_pe *pe)
return 0;
 }
 
-#ifdef CONFIG_PCI_IOV
-static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
-{
-   struct pci_dn *pdn = pci_get_pdn(dev);
-   int i;
-   struct resource *res, res2;
-   resource_size_t size;
-   u16 num_vfs;
-
-   if (!dev->is_physfn)
-   return -EINVAL;
-
-   /*
-* "offset" is in VFs.  The M64 windows are sized so that when 

[PATCH v2 04/16] powerpc/powernv/pci: Initialise M64 for IODA1 as a 1-1 window

2020-07-22 Thread Oliver O'Halloran
We pre-configure the m64 window for IODA1 as a 1-1 segment-PE mapping,
similar to PHB3. Currently the actual mapping of segments occurs in
pnv_ioda_pick_m64_pe(), but we can move it into pnv_ioda1_init_m64() and
drop the IODA1 specific code paths in the PE setup / teardown.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: no changes
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 55 +++
 1 file changed, 25 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index bb9c1cc60c33..8fb17676d914 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -311,6 +311,28 @@ static int pnv_ioda1_init_m64(struct pnv_phb *phb)
}
}
 
+   for (index = 0; index < phb->ioda.total_pe_num; index++) {
+   int64_t rc;
+
+   /*
+* P7IOC supports M64DT, which helps mapping M64 segment
+* to one particular PE#. However, PHB3 has fixed mapping
+* between M64 segment and PE#. In order to have same logic
+* for P7IOC and PHB3, we enforce fixed mapping between M64
+* segment and PE# on P7IOC.
+*/
+   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+   index, OPAL_M64_WINDOW_TYPE,
+   index / PNV_IODA1_M64_SEGS,
+   index % PNV_IODA1_M64_SEGS);
+   if (rc != OPAL_SUCCESS) {
+   pr_warn("%s: Error %lld mapping M64 for PHB#%x-PE#%x\n",
+   __func__, rc, phb->hose->global_number,
+   index);
+   goto fail;
+   }
+   }
+
/*
 * Exclude the segments for reserved and root bus PE, which
 * are first or last two PEs.
@@ -402,26 +424,6 @@ static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct 
pci_bus *bus, bool all)
pe->master = master_pe;
list_add_tail(>list, _pe->slaves);
}
-
-   /*
-* P7IOC supports M64DT, which helps mapping M64 segment
-* to one particular PE#. However, PHB3 has fixed mapping
-* between M64 segment and PE#. In order to have same logic
-* for P7IOC and PHB3, we enforce fixed mapping between M64
-* segment and PE# on P7IOC.
-*/
-   if (phb->type == PNV_PHB_IODA1) {
-   int64_t rc;
-
-   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-   pe->pe_number, OPAL_M64_WINDOW_TYPE,
-   pe->pe_number / PNV_IODA1_M64_SEGS,
-   pe->pe_number % PNV_IODA1_M64_SEGS);
-   if (rc != OPAL_SUCCESS)
-   pr_warn("%s: Error %lld mapping M64 for PHB#%x-PE#%x\n",
-   __func__, rc, phb->hose->global_number,
-   pe->pe_number);
-   }
}
 
kfree(pe_alloc);
@@ -3354,14 +3356,8 @@ static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
if (map[idx] != pe->pe_number)
continue;
 
-   if (win == OPAL_M64_WINDOW_TYPE)
-   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-   phb->ioda.reserved_pe_idx, win,
-   idx / PNV_IODA1_M64_SEGS,
-   idx % PNV_IODA1_M64_SEGS);
-   else
-   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-   phb->ioda.reserved_pe_idx, win, 0, idx);
+   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+   phb->ioda.reserved_pe_idx, win, 0, idx);
 
if (rc != OPAL_SUCCESS)
pe_warn(pe, "Error %lld unmapping (%d) segment#%d\n",
@@ -3380,8 +3376,7 @@ static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe 
*pe)
 phb->ioda.io_segmap);
pnv_ioda_free_pe_seg(pe, OPAL_M32_WINDOW_TYPE,
 phb->ioda.m32_segmap);
-   pnv_ioda_free_pe_seg(pe, OPAL_M64_WINDOW_TYPE,
-phb->ioda.m64_segmap);
+   /* M64 is pre-configured by pnv_ioda1_init_m64() */
} else if (phb->type == PNV_PHB_IODA2) {
pnv_ioda_free_pe_seg(pe, OPAL_M32_WINDOW_TYPE,
 phb->ioda.m32_segmap);
-- 
2.26.2



[PATCH v2 03/16] powerpc/powernv/pci: Add explicit tracking of the DMA setup state

2020-07-22 Thread Oliver O'Halloran
There's an optimisation in the PE setup which skips performing DMA
setup for a PE if we only have bridges in a PE. The assumption being
that only "real" devices will DMA to system memory, which is probably
fair. However, if we start off with only bridge devices in a PE and then
add a non-bridge device, the new device won't be able to use DMA because
we never configured it.

Fix this (admittedly pretty weird) edge case by tracking whether we've done
the DMA setup for the PE or not. If a non-bridge device is added to the PE
(via rescan or hotplug, or whatever) we can set up DMA on demand.

This also means the only remaining user of the old "DMA Weight" code is
the IODA1 DMA setup code that it was originally added for, which is good.

Cc: Alexey Kardashevskiy 
Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: no changes
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 48 ++-
 arch/powerpc/platforms/powernv/pci.h  |  7 
 2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index bfb40607aa0e..bb9c1cc60c33 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -141,6 +141,7 @@ static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb 
*phb, int pe_no)
 
phb->ioda.pe_array[pe_no].phb = phb;
phb->ioda.pe_array[pe_no].pe_number = pe_no;
+   phb->ioda.pe_array[pe_no].dma_setup_done = false;
 
/*
 * Clear the PE frozen state as it might be put into frozen state
@@ -1685,6 +1686,12 @@ static int pnv_pcibios_sriov_enable(struct pci_dev 
*pdev, u16 num_vfs)
 }
 #endif /* CONFIG_PCI_IOV */
 
+static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
+  struct pnv_ioda_pe *pe);
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+  struct pnv_ioda_pe *pe);
+
 static void pnv_pci_ioda_dma_dev_setup(struct pci_dev *pdev)
 {
struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
@@ -1713,6 +1720,24 @@ static void pnv_pci_ioda_dma_dev_setup(struct pci_dev 
*pdev)
pci_info(pdev, "Added to existing PE#%x\n", pe->pe_number);
}
 
+   /*
+* We assume that bridges *probably* don't need to do any DMA so we can
+* skip allocating a TCE table, etc unless we get a non-bridge device.
+*/
+   if (!pe->dma_setup_done && !pci_is_bridge(pdev)) {
+   switch (phb->type) {
+   case PNV_PHB_IODA1:
+   pnv_pci_ioda1_setup_dma_pe(phb, pe);
+   break;
+   case PNV_PHB_IODA2:
+   pnv_pci_ioda2_setup_dma_pe(phb, pe);
+   break;
+   default:
+   pr_warn("%s: No DMA for PHB#%x (type %d)\n",
+   __func__, phb->hose->global_number, phb->type);
+   }
+   }
+
if (pdn)
pdn->pe_number = pe->pe_number;
pe->device_count++;
@@ -,6 +2247,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb 
*phb,
pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
iommu_init_table(tbl, phb->hose->node, 0, 0);
 
+   pe->dma_setup_done = true;
return;
  fail:
/* XXX Failure: Try to fallback to 64-bit only ? */
@@ -2536,9 +2562,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
 {
int64_t rc;
 
-   if (!pnv_pci_ioda_pe_dma_weight(pe))
-   return;
-
/* TVE #1 is selected by PCI address bit 59 */
pe->tce_bypass_base = 1ull << 59;
 
@@ -2563,6 +2586,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
iommu_register_group(&pe->table_group, phb->hose->global_number,
 pe->pe_number);
 #endif
+   pe->dma_setup_done = true;
 }
 
 int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq)
@@ -3136,7 +3160,6 @@ static void pnv_pci_fixup_bridge_resources(struct pci_bus 
*bus,
 
 static void pnv_pci_configure_bus(struct pci_bus *bus)
 {
-   struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
struct pci_dev *bridge = bus->self;
struct pnv_ioda_pe *pe;
bool all = (bridge && pci_pcie_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE);
@@ -3160,17 +3183,6 @@ static void pnv_pci_configure_bus(struct pci_bus *bus)
return;
 
pnv_ioda_setup_pe_seg(pe);
-   switch (phb->type) {
-   case PNV_PHB_IODA1:
-   pnv_pci_ioda1_setup_dma_pe(phb, pe);
-   break;
-   case PNV_PHB_IODA2:
-   pnv_pci_ioda2_setup_dma_pe(phb, pe);
-   break;
-   default:
-   pr_warn("%s: No DMA for PHB#%x (type %d)\n",
-   __func__, phb->hose->global_number, phb->type);
-   }
 }
 
 static resource_size_t 

[PATCH v2 02/16] powerpc/powernv/pci: Always tear down DMA windows on PE release

2020-07-22 Thread Oliver O'Halloran
Currently we have these two functions:

pnv_pci_ioda2_release_dma_pe(), and
pnv_pci_ioda2_release_pe_dma()

The first is used when tearing down VF PEs and the other is used for normal
devices. There's very little difference between the two though. The latter
(non-VF) will skip a call to pnv_pci_ioda2_unset_window() unless
CONFIG_IOMMU_API=y is set. There's no real point in doing this so fold the
two together.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
---
v2: no change
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++
 1 file changed, 3 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 687919db0347..bfb40607aa0e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,26 +1422,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
return -EBUSY;
 }
 
-static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
-   int num);
-
-static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct 
pnv_ioda_pe *pe)
-{
-   struct iommu_table*tbl;
-   int64_t   rc;
-
-   tbl = pe->table_group.tables[0];
-   rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-   if (rc)
-   pe_warn(pe, "OPAL error %lld release DMA window\n", rc);
-
-   pnv_pci_ioda2_set_bypass(pe, false);
-   if (pe->table_group.group) {
-   iommu_group_put(pe->table_group.group);
-   BUG_ON(pe->table_group.group);
-   }
-   iommu_tce_table_put(tbl);
-}
+static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe);
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
 {
@@ -1455,11 +1436,12 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
if (!pdev->is_physfn)
return;
 
+   /* FIXME: Use pnv_ioda_release_pe()? */
list_for_each_entry_safe(pe, pe_n, >ioda.pe_list, list) {
if (pe->parent_dev != pdev)
continue;
 
-   pnv_pci_ioda2_release_dma_pe(pdev, pe);
+   pnv_pci_ioda2_release_pe_dma(pe);
 
/* Remove from list */
mutex_lock(>ioda.pe_list_mutex);
@@ -2429,7 +2411,6 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
return 0;
 }
 
-#if defined(CONFIG_IOMMU_API) || defined(CONFIG_PCI_IOV)
 static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
int num)
 {
@@ -2453,7 +2434,6 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group,
 
return ret;
 }
-#endif
 
 #ifdef CONFIG_IOMMU_API
 unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
@@ -3334,18 +3314,14 @@ static void pnv_pci_ioda2_release_pe_dma(struct 
pnv_ioda_pe *pe)
 {
struct iommu_table *tbl = pe->table_group.tables[0];
unsigned int weight = pnv_pci_ioda_pe_dma_weight(pe);
-#ifdef CONFIG_IOMMU_API
int64_t rc;
-#endif
 
if (!weight)
return;
 
-#ifdef CONFIG_IOMMU_API
rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
if (rc)
pe_warn(pe, "OPAL error %lld release DMA window\n", rc);
-#endif
 
pnv_pci_ioda2_set_bypass(pe, false);
if (pe->table_group.group) {
-- 
2.26.2



[PATCH v2 01/16] powernv/pci: Add pci_bus_to_pnvhb() helper

2020-07-22 Thread Oliver O'Halloran
Add a helper to go from a pci_bus structure to the pnv_phb that hosts that
bus. There's a lot of instances of the following pattern:

struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pnv_phb *phb = hose->private_data;

Without any other uses of the pci_controller inside the function. This is
hard to read since it requires you to memorise the contents of the
private data fields and is kind of error prone since it involves blindly
assigning a void pointer. Add a helper to make it more concise and
explicit.

Signed-off-by: Oliver O'Halloran 
---
v2: no change
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 88 +++
 arch/powerpc/platforms/powernv/pci.c  | 14 ++--
 arch/powerpc/platforms/powernv/pci.h  | 10 +++
 3 files changed, 38 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 31c3e6d58c41..687919db0347 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -252,8 +252,7 @@ static int pnv_ioda2_init_m64(struct pnv_phb *phb)
 static void pnv_ioda_reserve_dev_m64_pe(struct pci_dev *pdev,
 unsigned long *pe_bitmap)
 {
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
struct resource *r;
resource_size_t base, sgsz, start, end;
int segno, i;
@@ -351,8 +350,7 @@ static void pnv_ioda_reserve_m64_pe(struct pci_bus *bus,
 
 static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pci_bus *bus, bool all)
 {
-   struct pci_controller *hose = pci_bus_to_host(bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
struct pnv_ioda_pe *master_pe, *pe;
unsigned long size, *pe_alloc;
int i;
@@ -673,8 +671,7 @@ struct pnv_ioda_pe *pnv_pci_bdfn_to_pe(struct pnv_phb *phb, 
u16 bdfn)
 
 struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
 {
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
struct pci_dn *pdn = pci_get_pdn(dev);
 
if (!pdn)
@@ -1069,8 +1066,7 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, 
int offset)
 
 static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 {
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
struct pci_dn *pdn = pci_get_pdn(dev);
struct pnv_ioda_pe *pe;
 
@@ -1129,8 +1125,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct 
pci_dev *dev)
  */
 static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, bool all)
 {
-   struct pci_controller *hose = pci_bus_to_host(bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
struct pnv_ioda_pe *pe = NULL;
unsigned int pe_num;
 
@@ -1196,8 +1191,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_npu_PE(struct 
pci_dev *npu_pdev)
struct pnv_ioda_pe *pe;
struct pci_dev *gpu_pdev;
struct pci_dn *npu_pdn;
-   struct pci_controller *hose = pci_bus_to_host(npu_pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(npu_pdev->bus);
 
/*
 * Intentionally leak a reference on the npu device (for
@@ -1300,16 +1294,12 @@ static void pnv_pci_ioda_setup_nvlink(void)
 #ifdef CONFIG_PCI_IOV
 static int pnv_pci_vf_release_m64(struct pci_dev *pdev, u16 num_vfs)
 {
-   struct pci_bus*bus;
-   struct pci_controller *hose;
struct pnv_phb*phb;
struct pci_dn *pdn;
inti, j;
intm64_bars;
 
-   bus = pdev->bus;
-   hose = pci_bus_to_host(bus);
-   phb = hose->private_data;
+   phb = pci_bus_to_pnvhb(pdev->bus);
pdn = pci_get_pdn(pdev);
 
if (pdn->m64_single_mode)
@@ -1333,8 +1323,6 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
u16 num_vfs)
 
 static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
 {
-   struct pci_bus*bus;
-   struct pci_controller *hose;
struct pnv_phb*phb;
struct pci_dn *pdn;
unsigned int   win;
@@ -1346,9 +1334,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
intpe_num;
intm64_bars;
 
-   bus = pdev->bus;
-   hose = pci_bus_to_host(bus);
-   phb = hose->private_data;
+   phb = pci_bus_to_pnvhb(pdev->bus);
pdn = pci_get_pdn(pdev);
total_vfs = pci_sriov_get_totalvfs(pdev);
 
@@ 

Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread Gautham R Shenoy
Hello Srikar,

On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.
> 
> While setting up the "CACHE" domain, check if shared_cache is already
> set.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
> powerpc/smp: Generalize 2nd sched domain
>   Moved shared_cache topology fixup to fixup_topology (Gautham)
>

Just one comment below.

>  arch/powerpc/kernel/smp.c | 49 ---
>  1 file changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 57468877499a..933ebdf97432 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
> 
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> + smt_idx,
> +#endif
> + bigcore_idx,
> + die_idx,
> +};
> +


[..snip..]

> @@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
>   /* Update topology CPU masks */
>   add_cpu_to_masks(cpu);
> 
> - if (has_big_cores)
> - sibling_mask = cpu_smallcore_mask;
>   /*
>* Check for any shared caches. Note that this must be done on a
>* per-core basis because one core in the pair might be disabled.
>*/
> - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> - shared_caches = true;
> + if (!shared_caches) {
> + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + sibling_mask = cpu_smallcore_mask;
> +
> + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> + shared_caches = true;

At the risk of repeating my comment to the v1 version of the patch, we
have shared caches only if l2_cache_mask(cpu) is a strict superset of
sibling_mask(cpu).

"cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))" does not
capture this.

Could we please use

  if (!cpumask_equal(sibling_mask(cpu), mask) &&
  cpumask_subset(sibling_mask(cpu), mask)) {
  }

?


> + }
> 
>   set_numa_node(numa_cpu_lookup_table[cpu]);
>   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1374,10 +1386,19 @@ int setup_profiling_timer(unsigned int multiplier)
> 
>  static void fixup_topology(void)
>  {
> + if (shared_caches) {
> + pr_info("Using shared cache scheduler topology\n");
> + powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> +#ifdef CONFIG_SCHED_DEBUG
> + powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> + powerpc_topology[bigcore_idx].sd_flags = 
> powerpc_shared_cache_flags;
> + }
> +
>  #ifdef CONFIG_SCHED_SMT
>   if (has_big_cores) {
>   pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> + powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>   }
>  #endif


Otherwise the patch looks good to me.

--
Thanks and Regards
gautham.


Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-07-22 Thread Gautham R Shenoy
Hi Srikar,

On Tue, Jul 21, 2020 at 05:08:09PM +0530, Srikar Dronamraju wrote:
> Current code assumes that cpumask of cpus sharing a l2-cache mask will
> always be a superset of cpu_sibling_mask.
> 
> Lets stop that assumption. cpu_l2_cache_mask is a superset of
> cpu_sibling_mask if and only if shared_caches is set.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
> powerpc/smp: Dont assume l2-cache to be superset of sibling
>   Set cpumask after verifying l2-cache. (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 72f16dc0cb26..57468877499a 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
>   if (!l2_cache)
>   return false;
> 
> + cpumask_set_cpu(cpu, mask_fn(cpu));


Ok, we need to do this because "cpu" is not yet set in the
cpu_online_mask. Prior to your patch the "cpu" was getting set in
cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in
the patch.


>   for_each_cpu(i, cpu_online_mask) {
>   /*
>* when updating the marks the current CPU has not been marked
> @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
>* add it to it's own thread sibling mask.
>*/
>   cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> + cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> 
>   for (i = first_thread; i < first_thread + threads_per_core; i++)
>   if (cpu_online(i))
>   set_cpus_related(i, cpu, cpu_sibling_mask);
> 
>   add_cpu_to_smallcore_masks(cpu);
> - /*
> -  * Copy the thread sibling mask into the cache sibling mask
> -  * and mark any CPUs that share an L2 with this CPU.
> -  */
> - for_each_cpu(i, cpu_sibling_mask(cpu))
> - set_cpus_related(cpu, i, cpu_l2_cache_mask);
>   update_mask_by_l2(cpu, cpu_l2_cache_mask);
> 
> - /*
> -  * Copy the cache sibling mask into core sibling mask and mark
> -  * any CPUs on the same chip as this CPU.
> -  */
> - for_each_cpu(i, cpu_l2_cache_mask(cpu))
> - set_cpus_related(cpu, i, cpu_core_mask);
> + if (pkg_id == -1) {

I suppose this "if" condition is an optimization, since if pkg_id != -1,
we anyway set these CPUs in the cpu_core_mask below.

However...

> + struct cpumask *(*mask)(int) = cpu_sibling_mask;
> +
> + /*
> +  * Copy the sibling mask into core sibling mask and
> +  * mark any CPUs on the same chip as this CPU.
> +  */
> + if (shared_caches)
> + mask = cpu_l2_cache_mask;
> +
> + for_each_cpu(i, mask(cpu))
> + set_cpus_related(cpu, i, cpu_core_mask);
> 
> - if (pkg_id == -1)
>   return;
> + }


... since "cpu" is not yet set in the cpu_online_mask, do we not miss setting
"cpu" in the cpu_core_mask(cpu) in the for-loop below ?


> 
>   for_each_cpu(i, cpu_online_mask)
>   if (get_physical_package_id(i) == pkg_id)


Before this patch it was unconditionally getting set in
cpu_core_mask(cpu) because of the fact that it was set in
cpu_l2_cache_mask(cpu) and we were unconditionally setting all the
CPUs in cpu_l2_cache_mask(cpu) in cpu_core_mask(cpu).

What am I missing ?

> -- 
> 2.17.1
>

--
Thanks and Regards
gautham.


Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes

2020-07-22 Thread Bharata B Rao
On Tue, Jul 21, 2020 at 10:25:58PM +1000, Michael Ellerman wrote:
> Bharata B Rao  writes:
> > On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote:
> >> Nathan Lynch  writes:
> >> > "Aneesh Kumar K.V"  writes:
> >> >> This is the next version of the fixes for memory unplug on radix.
> >> >> The issues and the fix are described in the actual patches.
> >> >
> >> > I guess this isn't actually causing problems at runtime right now, but I
> >> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and
> >> > arch_remove_memory(), which ought to be mmu-agnostic:
> >> >
> >> > int __ref arch_add_memory(int nid, u64 start, u64 size,
> >> >struct mhp_params *params)
> >> > {
> >> >  unsigned long start_pfn = start >> PAGE_SHIFT;
> >> >  unsigned long nr_pages = size >> PAGE_SHIFT;
> >> >  int rc;
> >> >
> >> >  resize_hpt_for_hotplug(memblock_phys_mem_size());
> >> >
> >> >  start = (unsigned long)__va(start);
> >> >  rc = create_section_mapping(start, start + size, nid,
> >> >  params->pgprot);
> >> > ...
> >> 
> >> Hmm well spotted.
> >> 
> >> That does return early if the ops are not setup:
> >> 
> >> int resize_hpt_for_hotplug(unsigned long new_mem_size)
> >> {
> >>unsigned target_hpt_shift;
> >> 
> >>if (!mmu_hash_ops.resize_hpt)
> >>return 0;
> >> 
> >> 
> >> And:
> >> 
> >> void __init hpte_init_pseries(void)
> >> {
> >>...
> >>if (firmware_has_feature(FW_FEATURE_HPT_RESIZE))
> >>mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt;
> >> 
> >> And that comes in via ibm,hypertas-functions:
> >> 
> >>{FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"},
> >> 
> >> 
> >> But firmware is not necessarily going to add/remove that call based on
> >> whether we're using hash/radix.
> >
> > Correct but hpte_init_pseries() will not be called for radix guests.
> 
> Yeah, duh. You'd think the function name would have been a sufficient
> clue for me :)
> 
> >> So I think a follow-up patch is needed to make this more robust.
> >> 
> >> Aneesh/Bharata what platform did you test this series on? I'm curious
> >> how this didn't break.
> >
> > I have tested memory hotplug/unplug for radix guest on zz platform and
> > sanity-tested this for hash guest on P8.
> >
> > As noted above, mmu_hash_ops.resize_hpt will not be set for radix
> > guest and hence we won't see any breakage.
> 
> OK.
> 
> That's probably fine as it is then. Or maybe just a comment in
> resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if
> we're using radix.

Or we could move these calls to hpt-only routines like below?

David - Do you remember if there was any particular reason to have
these two hpt-resize calls within powerpc-generic memory hotplug code?

diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
index c89b32443cff..1e6fa371cc38 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -17,12 +17,6 @@ extern int create_section_mapping(unsigned long start, unsigned long end,
  int nid, pgprot_t prot);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 
-#ifdef CONFIG_PPC_BOOK3S_64
-extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
-#else
-static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { return 0; }
-#endif
-
 #ifdef CONFIG_NUMA
 extern int hot_add_scn_to_nid(unsigned long scn_addr);
 #else
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index eec6f4e5e481..5daf53ec7600 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -787,7 +787,7 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int resize_hpt_for_hotplug(unsigned long new_mem_size)
+static int resize_hpt_for_hotplug(unsigned long new_mem_size)
 {
unsigned target_hpt_shift;
 
@@ -821,6 +821,8 @@ int hash__create_section_mapping(unsigned long start, unsigned long end,
return -1;
}
 
+   resize_hpt_for_hotplug(memblock_phys_mem_size());
+
rc = htab_bolt_mapping(start, end, __pa(start),
   pgprot_val(prot), mmu_linear_psize,
   mmu_kernel_ssize);
@@ -838,6 +840,10 @@ int hash__remove_section_mapping(unsigned long start, unsigned long end)
int rc = htab_remove_mapping(start, end, mmu_linear_psize,
 mmu_kernel_ssize);
WARN_ON(rc < 0);
+
+   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
+   pr_warn("Hash collision while resizing HPT\n");
+
return rc;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c2c11eb8dcfc..9dafc636588f 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -127,8 +127,6 @@ int __ref 

Re: [v3 02/15] KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR

2020-07-22 Thread Madhavan Srinivasan




On 7/22/20 10:24 AM, Paul Mackerras wrote:

On Wed, Jul 22, 2020 at 07:39:26AM +0530, Athira Rajeev wrote:



On 21-Jul-2020, at 9:24 AM, Paul Mackerras  wrote:

On Fri, Jul 17, 2020 at 10:38:14AM -0400, Athira Rajeev wrote:

Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
in a flat array in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs
Split this to give mmcra and mmcrs its own entries in vcpu and
use a flat array for mmcr0 to mmcr2. This patch implements this
cleanup to make code easier to read.

Changing the way KVM stores these values internally is fine, but
changing the user ABI is not.  This part:


diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 264e266..e55d847 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -510,8 +510,8 @@ struct kvm_ppc_cpu_char {

#define KVM_REG_PPC_MMCR0   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
#define KVM_REG_PPC_MMCR1   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
-#define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
-#define KVM_REG_PPC_MMCR2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
+#define KVM_REG_PPC_MMCR2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+#define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)

means that existing userspace programs that used to work would now be
broken.  That is not acceptable (breaking the user ABI is only ever
acceptable with a very compelling reason).  So NAK to this part of the
patch.

Hi Paul

Thanks for checking the patch. I understood your point on user ABI breakage 
that this particular change can cause.
I will retain original KVM_REG_PPC_MMCRA and KVM_REG_PPC_MMCR2 order in `kvm.h`
And with that, additionally I will need below change ( on top of current patch 
) for my clean up updates for kvm cpu MMCR to work,
Because now mmcra and mmcrs will have their own entries in vcpu and are no
longer part of the mmcr[] array.
Please suggest if this looks good

Yes, that looks fine.

By the way, is the new MMCRS register here at all related to the MMCRS

Hi Paul,

We have only split the current array (mmcr[]), separating it into mmcra
and mmcrs.

The only new SPR added is MMCR3 (for Power10).

Maddy


that there used to be on POWER8, but which wasn't present (as far as I
know) on POWER9?

Paul.




Re: [v3 02/15] KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR

2020-07-22 Thread Athira Rajeev


> On 22-Jul-2020, at 10:07 AM, Michael Ellerman  wrote:
> 
> Athira Rajeev  writes:
>>> On 21-Jul-2020, at 9:24 AM, Paul Mackerras  wrote:
>>> On Fri, Jul 17, 2020 at 10:38:14AM -0400, Athira Rajeev wrote:
 Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
 in a flat array in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs
 Split this to give mmcra and mmcrs its own entries in vcpu and
 use a flat array for mmcr0 to mmcr2. This patch implements this
 cleanup to make code easier to read.
>>> 
>>> Changing the way KVM stores these values internally is fine, but
>>> changing the user ABI is not.  This part:
>>> 
 diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
 b/arch/powerpc/include/uapi/asm/kvm.h
 index 264e266..e55d847 100644
 --- a/arch/powerpc/include/uapi/asm/kvm.h
 +++ b/arch/powerpc/include/uapi/asm/kvm.h
 @@ -510,8 +510,8 @@ struct kvm_ppc_cpu_char {
 
 #define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 #define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
 -#define KVM_REG_PPC_MMCRA (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
 -#define KVM_REG_PPC_MMCR2 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
 +#define KVM_REG_PPC_MMCR2 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
 +#define KVM_REG_PPC_MMCRA (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
>>> 
>>> means that existing userspace programs that used to work would now be
>>> broken.  That is not acceptable (breaking the user ABI is only ever
>>> acceptable with a very compelling reason).  So NAK to this part of the
>>> patch.
>> 
>> Hi Paul
>> 
>> Thanks for checking the patch. I understood your point on user ABI breakage 
>> that this particular change can cause.
>> I will retain original KVM_REG_PPC_MMCRA and KVM_REG_PPC_MMCR2 order in 
>> `kvm.h`
>> And with that, additionally I will need below change ( on top of current 
>> patch ) for my clean up updates for kvm cpu MMCR to work,
>> Because now mmcra and mmcrs will have its own entries in vcpu and is not 
>> part of the mmcr[] array
>> Please suggest if this looks good
> 
> I did the same patch I think in my testing branch, it's here:
> 
> https://github.com/linuxppc/linux/commit/daea78154eff1b7e2f36be05a8f95feb5a588912
> 
> 
> Can you please check that matches what you sent.

Hi Michael,

Yes, it matches. Thanks for making this change.

> 
> cheers
> 
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 3f90eee261fc..b10bb404f0d5 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1679,10 +1679,13 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu 
>> *vcpu, u64 id,
>>case KVM_REG_PPC_UAMOR:
>>*val = get_reg_val(id, vcpu->arch.uamor);
>>break;
>> -   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR2:
>> +   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR1:
>>i = id - KVM_REG_PPC_MMCR0;
>>*val = get_reg_val(id, vcpu->arch.mmcr[i]);
>>break;
>> +   case KVM_REG_PPC_MMCR2:
>> +   *val = get_reg_val(id, vcpu->arch.mmcr[2]);
>> +   break;
>>case KVM_REG_PPC_MMCRA:
>>*val = get_reg_val(id, vcpu->arch.mmcra);
>>break;
>> @@ -1906,10 +1909,13 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu 
>> *vcpu, u64 id,
>>case KVM_REG_PPC_UAMOR:
>>vcpu->arch.uamor = set_reg_val(id, *val);
>>break;
>> -   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR2:
>> +   case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR1:
>>i = id - KVM_REG_PPC_MMCR0;
>>vcpu->arch.mmcr[i] = set_reg_val(id, *val);
>>break;
>> +   case KVM_REG_PPC_MMCR2:
>> +   vcpu->arch.mmcr[2] = set_reg_val(id, *val);
>> +   break;
>>case KVM_REG_PPC_MMCRA:
>>vcpu->arch.mmcra = set_reg_val(id, *val);
>>break;
>> --
>> 
>> 
>>> 
>>> Regards,
>>> Paul.