Re: [PATCH] powerpc: Add doorbell tracepoints

2018-10-03 Thread Russell Currey
On Wed, 2018-10-03 at 10:29 +1000, Anton Blanchard wrote:
> When analysing sources of OS jitter, I noticed that doorbells cannot be
> traced.
> 
> Signed-off-by: Anton Blanchard 

Hi Anton,

snowpatch builds failed for this patch on all 64-bit configurations (ppc64e, ppc64
and ppc64le) with the following:

arch/powerpc/kernel/dbell.c:85:9: error: undefined identifier 'trace_doorbell_entry'
arch/powerpc/kernel/dbell.c:96:9: error: undefined identifier 'trace_doorbell_exit'
./arch/powerpc/include/asm/spinlock.h:171:9: warning: context imbalance in 'key_user_put' - unexpected unlock
arch/powerpc/kernel/dbell.c: In function 'doorbell_exception':
arch/powerpc/kernel/dbell.c:85:2: error: implicit declaration of function 'trace_doorbell_entry'; did you mean 'trace_irq_entry'? [-Werror=implicit-function-declaration]
  trace_doorbell_entry(regs);
  ^~~~
  trace_irq_entry
arch/powerpc/kernel/dbell.c:96:2: error: implicit declaration of function 'trace_doorbell_exit'; did you mean 'trace_irq_exit'? [-Werror=implicit-function-declaration]
  trace_doorbell_exit(regs);
  ^~~
  trace_irq_exit
cc1: all warnings being treated as errors
scripts/Makefile.build:305: recipe for target 'arch/powerpc/kernel/dbell.o' failed
make[1]: *** [arch/powerpc/kernel/dbell.o] Error 1
Makefile:1060: recipe for target 'arch/powerpc/kernel' failed

So does something else need to check for CONFIG_PPC_DOORBELL maybe?
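For anyone following along, the undefined symbols would normally come from
tracepoint definitions modelled on the existing irq_entry/irq_exit events in
arch/powerpc/include/asm/trace.h. Purely as an illustrative sketch (not the
actual patch, and whether it additionally needs a CONFIG_PPC_DOORBELL guard is
exactly the open question above):

	/* Sketch only: reuse the existing pt_regs-based event class */
	DEFINE_EVENT(ppc64_interrupt_class, doorbell_entry,
		TP_PROTO(struct pt_regs *regs),
		TP_ARGS(regs)
	);

	DEFINE_EVENT(ppc64_interrupt_class, doorbell_exit,
		TP_PROTO(struct pt_regs *regs),
		TP_ARGS(regs)
	);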

You can see the failures here: http://patchwork.ozlabs.org/patch/978088/ - output
in build_new.log (I know it's not pretty in there yet, you can search for "Error 1"
to find the build failure)

- Russell




Re: [PATCH v2 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting

2018-10-03 Thread Alistair Popple
Reviewed-by: Alistair Popple 

On Wednesday, 3 October 2018 11:51:34 AM AEST Mark Hairgrove wrote:
> This threshold is no longer used now that all invalidates issue a single
> ATSD to each active NPU.
> 
> Signed-off-by: Mark Hairgrove 
> ---
>  arch/powerpc/platforms/powernv/npu-dma.c |   14 --
>  1 files changed, 0 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index e4c0fab..6f60e09 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -17,7 +17,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  #include 
> @@ -43,14 +42,6 @@
>  static DEFINE_SPINLOCK(npu_context_lock);
>  
>  /*
> - * When an address shootdown range exceeds this threshold we invalidate the
> - * entire TLB on the GPU for the given PID rather than each specific address 
> in
> - * the range.
> - */
> -static uint64_t atsd_threshold = 2 * 1024 * 1024;
> -static struct dentry *atsd_threshold_dentry;
> -
> -/*
>   * Other types of TCE cache invalidation are not functional in the
>   * hardware.
>   */
> @@ -966,11 +957,6 @@ int pnv_npu2_init(struct pnv_phb *phb)
>   static int npu_index;
>   uint64_t rc = 0;
>  
> - if (!atsd_threshold_dentry) {
> - atsd_threshold_dentry = debugfs_create_x64("atsd_threshold",
> > -		0600, powerpc_debugfs_root, &atsd_threshold);
> - }
> -
>   phb->npu.nmmu_flush =
>   of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
>   for_each_child_of_node(phb->hose->dn, dn) {
> 




Re: [PATCH v2 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates

2018-10-03 Thread Alistair Popple
Reviewed-By: Alistair Popple 

On Wednesday, 3 October 2018 11:51:33 AM AEST Mark Hairgrove wrote:
> Prior to this change only two types of ATSDs were issued to the NPU:
> invalidates targeting a single page and invalidates targeting the whole
> address space. The crossover point happened at the configurable
> atsd_threshold which defaulted to 2M. Invalidates that size or smaller
> would issue per-page invalidates for the whole range.
> 
> The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
> These invalidates target addresses aligned to their size. 2M is a common
> invalidation size for GPU-enabled applications because that is a GPU
> page size, so reducing the number of invalidates by 32x in that case is a
> clear improvement.
> 
> ATSD latency is high in general so now we always issue a single invalidate
> rather than multiple. This will over-invalidate in some cases, but for any
> invalidation size over 2M it matches or improves the prior behavior.
> There's also an improvement for single-page invalidates since the prior
> version issued two invalidates for that case instead of one.
> 
> With this change all issued ATSDs now perform a flush, so the flush
> parameter has been removed from all the helpers.
> 
> To show the benefit here are some performance numbers from a
> microbenchmark which creates a 1G allocation then uses mprotect with
> PROT_NONE to trigger invalidates in strides across the allocation.
> 
> One NPU (1 GPU):
> 
>  mprotect rate (GB/s)
> Stride   Before   After  Speedup
> 64K         5.3     5.6       5%
> 1M         39.3    57.4      46%
> 2M         49.7    82.6      66%
> 4M        286.6   285.7       0%
> 
> Two NPUs (6 GPUs):
> 
>  mprotect rate (GB/s)
> Stride   Before   After  Speedup
> 64K         6.5     7.4      13%
> 1M         33.4    67.9     103%
> 2M         38.7    93.1     141%
> 4M        356.7   354.6      -1%
> 
> Anything over 2M is roughly the same as before since both cases issue a
> single ATSD.
> 
> Signed-off-by: Mark Hairgrove 
> ---
>  arch/powerpc/platforms/powernv/npu-dma.c |  103 
> --
>  1 files changed, 55 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index c8f438a..e4c0fab 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -458,8 +459,7 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
>  #define XTS_ATSD_AVA    1
>  #define XTS_ATSD_STAT   2
>  
> -static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long 
> psize,
> - bool flush)
> +static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long 
> psize)
>  {
>   unsigned long launch = 0;
>  
> @@ -477,8 +477,7 @@ static unsigned long get_atsd_launch_val(unsigned long 
> pid, unsigned long psize,
>   /* PID */
>   launch |= pid << PPC_BITLSHIFT(38);
>  
> - /* No flush */
> - launch |= !flush << PPC_BITLSHIFT(39);
> + /* Leave "No flush" (bit 39) 0 so every ATSD performs a flush */
>  
>   return launch;
>  }
> @@ -501,23 +500,22 @@ static void mmio_atsd_regs_write(struct mmio_atsd_reg
>  }
>  
>  static void mmio_invalidate_pid(struct mmio_atsd_reg 
> mmio_atsd_reg[NV_MAX_NPUS],
> - unsigned long pid, bool flush)
> + unsigned long pid)
>  {
> - unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT, flush);
> + unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT);
>  
>   /* Invalidating the entire process doesn't use a va */
>   mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
>  }
>  
> -static void mmio_invalidate_va(struct mmio_atsd_reg 
> mmio_atsd_reg[NV_MAX_NPUS],
> - unsigned long va, unsigned long pid, bool flush)
> +static void mmio_invalidate_range(struct mmio_atsd_reg
> + mmio_atsd_reg[NV_MAX_NPUS], unsigned long pid,
> + unsigned long start, unsigned long psize)
>  {
> - unsigned long launch;
> -
> - launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
> + unsigned long launch = get_atsd_launch_val(pid, psize);
>  
>   /* Write all VAs first */
> - mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
> + mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, start);
>  
>   /* Issue one barrier for all address writes */
>   eieio();
> @@ -609,14 +607,36 @@ static void release_atsd_reg(struct mmio_atsd_reg 
> mmio_atsd_reg[NV_MAX_NPUS])
>  }
>  
>  /*
> - * Invalidate either a single address or an entire PID depending on
> - * the value of va.
> + * Invalidate a virtual address range
>   */
> -static void 

Re: [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates

2018-10-03 Thread Alistair Popple
Reviewed-by: Alistair Popple 

On Wednesday, 3 October 2018 11:51:32 AM AEST Mark Hairgrove wrote:
> There are two types of ATSDs issued to the NPU: invalidates targeting a
> specific virtual address and invalidates targeting the whole address
> space. In both cases prior to this change, the sequence was:
> 
> for each NPU
> - Write the target address to the XTS_ATSD_AVA register
> - EIEIO
> - Write the launch value to issue the ATSD
> 
> First, a target address is not required when invalidating the whole
> address space, so that write and the EIEIO have been removed. The AP
> (size) field in the launch is not needed either.
> 
> Second, for per-address invalidates the above sequence is inefficient in
> the common case of multiple NPUs because an EIEIO is issued per NPU. This
> unnecessarily forces the launches of later ATSDs to be ordered with the
> launches of earlier ones. The new sequence only issues a single EIEIO:
> 
> for each NPU
> - Write the target address to the XTS_ATSD_AVA register
> EIEIO
> for each NPU
> - Write the launch value to issue the ATSD
> 
> Performance results were gathered using a microbenchmark which creates a
> 1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
> strides across the allocation.
> 
> With only a single NPU active (one GPU) the difference is in the noise for
> both types of invalidates (+/-1%).
> 
> With two NPUs active (on a 6-GPU system) the effect is more noticeable:
> 
>  mprotect rate (GB/s)
> Stride   Before   After  Speedup
> 64K         5.9     6.5      10%
> 1M         31.2    33.4       7%
> 2M         36.3    38.7       7%
> 4M        322.6   356.7      11%
> 
> Signed-off-by: Mark Hairgrove 
> ---
>  arch/powerpc/platforms/powernv/npu-dma.c |   99 ++---
>  1 files changed, 48 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index 8006c54..c8f438a 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -454,79 +454,76 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
>  }
>  
>  /* MMIO ATSD register offsets */
> -#define XTS_ATSD_AVA  1
> -#define XTS_ATSD_STAT 2
> +#define XTS_ATSD_LAUNCH 0
> +#define XTS_ATSD_AVA1
> +#define XTS_ATSD_STAT   2
>  
> -static void mmio_launch_invalidate(struct mmio_atsd_reg *mmio_atsd_reg,
> - unsigned long launch, unsigned long va)
> +static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long 
> psize,
> + bool flush)
>  {
> - struct npu *npu = mmio_atsd_reg->npu;
> - int reg = mmio_atsd_reg->reg;
> + unsigned long launch = 0;
>  
> - __raw_writeq_be(va, npu->mmio_atsd_regs[reg] + XTS_ATSD_AVA);
> - eieio();
> - __raw_writeq_be(launch, npu->mmio_atsd_regs[reg]);
> + if (psize == MMU_PAGE_COUNT) {
> + /* IS set to invalidate entire matching PID */
> + launch |= PPC_BIT(12);
> + } else {
> + /* AP set to invalidate region of psize */
> + launch |= (u64)mmu_get_ap(psize) << PPC_BITLSHIFT(17);
> + }
> +
> + /* PRS set to process-scoped */
> + launch |= PPC_BIT(13);
> +
> + /* PID */
> + launch |= pid << PPC_BITLSHIFT(38);
> +
> + /* No flush */
> + launch |= !flush << PPC_BITLSHIFT(39);
> +
> + return launch;
>  }
>  
> -static void mmio_invalidate_pid(struct mmio_atsd_reg 
> mmio_atsd_reg[NV_MAX_NPUS],
> - unsigned long pid, bool flush)
> +static void mmio_atsd_regs_write(struct mmio_atsd_reg
> + mmio_atsd_reg[NV_MAX_NPUS], unsigned long offset,
> + unsigned long val)
>  {
> - int i;
> - unsigned long launch;
> + struct npu *npu;
> + int i, reg;
>  
>   for (i = 0; i <= max_npu2_index; i++) {
> - if (mmio_atsd_reg[i].reg < 0)
> + reg = mmio_atsd_reg[i].reg;
> + if (reg < 0)
>   continue;
>  
> - /* IS set to invalidate matching PID */
> - launch = PPC_BIT(12);
> -
> - /* PRS set to process-scoped */
> - launch |= PPC_BIT(13);
> -
> - /* AP */
> - launch |= (u64)
> - mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
> -
> - /* PID */
> - launch |= pid << PPC_BITLSHIFT(38);
> + npu = mmio_atsd_reg[i].npu;
> + __raw_writeq_be(val, npu->mmio_atsd_regs[reg] + offset);
> + }
> +}
>  
> - /* No flush */
> - launch |= !flush << PPC_BITLSHIFT(39);
> +static void mmio_invalidate_pid(struct mmio_atsd_reg 
> mmio_atsd_reg[NV_MAX_NPUS],
> + unsigned long pid, bool flush)
> +{
> + unsigned long launch = 

Re: [PATCH v3 30/33] KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode

2018-10-03 Thread Paul Mackerras
On Wed, Oct 03, 2018 at 04:15:15PM +1000, David Gibson wrote:
> On Tue, Oct 02, 2018 at 09:31:29PM +1000, Paul Mackerras wrote:
> > With this, the KVM-HV module can be loaded in a guest running under
> > KVM-HV, and if the hypervisor supports nested virtualization, this
> > guest can now act as a nested hypervisor and run nested guests.
> > 
> > This also adds some checks to inform userspace that HPT guests are not
> > supported by nested hypervisors, and to prevent userspace from
> > configuring a guest to use HPT mode.
> > 
> > Signed-off-by: Paul Mackerras 
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 20 
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index f630e91..196bff1 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -4237,6 +4237,10 @@ static int kvm_vm_ioctl_get_smmu_info_hv(struct kvm 
> > *kvm,
> >  {
> > struct kvm_ppc_one_seg_page_size *sps;
> >  
> > +   /* If we're a nested hypervisor, we only support radix guests */
> > +   if (kvmhv_on_pseries())
> > +   return -EINVAL;
> > +
> > /*
> >  * POWER7, POWER8 and POWER9 all support 32 storage keys for data.
> >  * POWER7 doesn't support keys for instruction accesses,
> > @@ -4822,11 +4826,15 @@ static int kvmppc_core_emulate_mfspr_hv(struct 
> > kvm_vcpu *vcpu, int sprn,
> >  
> >  static int kvmppc_core_check_processor_compat_hv(void)
> >  {
> > -   if (!cpu_has_feature(CPU_FTR_HVMODE) ||
> > -   !cpu_has_feature(CPU_FTR_ARCH_206))
> > -   return -EIO;
> > +   if (cpu_has_feature(CPU_FTR_HVMODE) &&
> > +   cpu_has_feature(CPU_FTR_ARCH_206))
> > +   return 0;
> >  
> > -   return 0;
> > +   /* Can run as nested hypervisor on POWER9 in radix mode. */
> > +   if (cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled())
> 
> Shouldn't we probe the parent hypervisor for ability to support nested
> guests before we say "yes" here?

Well, we do check that the parent hypervisor can support nested
hypervisors, it's just done later on.  And to match nitpick with
nitpick, this is a function evaluating _processor_ compatibility, and
a POWER9 processor in radix mode does have everything necessary to
support nested hypervisors -- if the parent hypervisor doesn't support
nested hypervisors, that's not a deficiency in the processor.

Paul.


Re: [PATCH v3 23/33] KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings

2018-10-03 Thread Paul Mackerras
On Wed, Oct 03, 2018 at 03:56:37PM +1000, David Gibson wrote:
> On Tue, Oct 02, 2018 at 09:31:22PM +1000, Paul Mackerras wrote:
> > From: Suraj Jitindar Singh 
> > 
> > When a host (L0) page which is mapped into a (L1) guest is in turn
> > mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
> > so that these mappings can be retrieved later.
> > 
> > Whenever we create an entry in a shadow_pgtable for a nested guest we
> > create a corresponding rmap entry and add it to the list for the
> > L1 guest memslot at the index of the L1 guest page it maps. This means
> > at the L1 guest memslot we end up with lists of rmaps.
> > 
> > When we are notified of a host page being invalidated which has been
> > mapped through to a (L1) guest, we can then walk the rmap list for that
> > guest page, and find and invalidate all of the corresponding
> > shadow_pgtable entries.
> > 
> > In order to reduce memory consumption, we compress the information for
> > each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
> > for the guest real page frame number -- which will fit in a single
> > unsigned long.  To avoid a scenario where a guest can trigger
> > unbounded memory allocations, we scan the list when adding an entry to
> > see if there is already an entry with the contents we need.  This can
> > occur, because we don't ever remove entries from the middle of a list.
> > 
> > A struct nested guest rmap is a list pointer and an rmap entry;
> > 
> > ----------------
> > | next pointer |
> > ----------------
> > | rmap entry   |
> > ----------------
> > 
> > Thus the rmap pointer for each guest frame number in the memslot can be
> > either NULL, a single entry, or a pointer to a list of nested rmap entries.
> > 
> > gfn  memslot rmap array
> >     ---------------------
> >  0  | NULL              |   (no rmap entry)
> >     ---------------------
> >  1  | single rmap entry |   (rmap entry with low bit set)
> >     ---------------------
> >  2  | list head pointer |   (list of rmap entries)
> >     ---------------------
> > 
> > The final entry always has the lowest bit set and is stored in the next
> > pointer of the last list entry, or as a single rmap entry.
> > With a list of rmap entries looking like;
> > 
> > -----------------       -----------------       ---------------------
> > | list head ptr  | ---> | next pointer   | ---> | single rmap entry |
> > -----------------       -----------------       ---------------------
> >                         | rmap entry     |      | rmap entry        |
> >                         -----------------       ---------------------
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> > Signed-off-by: Paul Mackerras 
> > ---
> >  arch/powerpc/include/asm/kvm_book3s.h|   3 +
> >  arch/powerpc/include/asm/kvm_book3s_64.h |  70 -
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c   |  44 +++
> >  arch/powerpc/kvm/book3s_hv.c |   1 +
> >  arch/powerpc/kvm/book3s_hv_nested.c  | 130 
> > ++-
> >  5 files changed, 233 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> > b/arch/powerpc/include/asm/kvm_book3s.h
> > index d983778..1d2286d 100644
> > --- a/arch/powerpc/include/asm/kvm_book3s.h
> > +++ b/arch/powerpc/include/asm/kvm_book3s.h
> > @@ -196,6 +196,9 @@ extern int kvmppc_mmu_radix_translate_table(struct 
> > kvm_vcpu *vcpu, gva_t eaddr,
> > int table_index, u64 *pte_ret_p);
> >  extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
> > struct kvmppc_pte *gpte, bool data, bool iswrite);
> > +extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long 
> > gpa,
> > +   unsigned int shift, struct kvm_memory_slot *memslot,
> > +   unsigned int lpid);
> >  extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
> > bool writing, unsigned long gpa,
> > unsigned int lpid);
> > diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> > b/arch/powerpc/include/asm/kvm_book3s_64.h
> > index 5496152..38614f0 100644
> > --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> > +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> > @@ -53,6 +53,66 @@ struct kvm_nested_guest {
> > struct kvm_nested_guest *next;
> >  };
> >  
> > +/*
> > + * We define a nested rmap entry as a single 64-bit quantity
> > + * 0xFFF0  12-bit lpid field
> > + * 0x000FF000  40-bit guest physical address field
> 
> I thought we could potentially support guests with >1TiB of RAM..?

We can, that's really a (4k) page frame number, not a physical
address.  We can support 52-bit guest physical addresses.
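To make the packing concrete, here is a sketch of how such an entry can be
encoded (the shift and mask values are assumptions derived from the 12-bit
lpid / 40-bit page frame number split described above, not copied from the
patch):

	/* Illustrative only: lpid in the top 12 bits, the 40-bit guest page
	 * frame number stored as a 4k-aligned guest physical address below it. */
	#define RMAP_LPID_SHIFT		52
	#define RMAP_LPID_MASK		0xFFFUL
	#define RMAP_GFN_MASK		((1UL << 40) - 1)

	static inline unsigned long nested_rmap_encode(unsigned long lpid,
						       unsigned long gfn)
	{
		return ((lpid & RMAP_LPID_MASK) << RMAP_LPID_SHIFT) |
		       ((gfn & RMAP_GFN_MASK) << 12);
	}

	static inline unsigned long nested_rmap_to_gfn(unsigned long rmap)
	{
		return (rmap >> 12) & RMAP_GFN_MASK;
	}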

Paul.


Re: [PATCH v03 1/5] powerpc/drmem: Export 'dynamic-memory' loader

2018-10-03 Thread Nathan Fontenot
On 10/02/2018 08:00 PM, Michael Ellerman wrote:
> Michael Bringmann  writes:
> 
>> powerpc/drmem: Export many of the functions of DRMEM to parse
>> "ibm,dynamic-memory" and "ibm,dynamic-memory-v2" during hotplug
>> operations and for Post Migration events.
> 
> This isn't a criticism of your patch, but I think the drmem.c code
> should be moved into platforms/pseries.
> 
> That would then make most of it private to platforms/pseries and we
> wouldn't need to export things in arch/powerpc/include/asm.

I don't have an issue with moving it to platform/pseries. I originally
put it in arch/powerpc/mm because the numa code also uses the drmem code.

The numa code was updated so that it could just be given an lmb struct
instead of having to parse the device tree directly for dynamic
reconfiguration memory. With two versions of this dt property to support,
this made more sense.
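As an illustration of what the pair iterator added further down in the patch
enables (hypothetical usage, not code from this series), one could walk the
pre- and post-migration LMB arrays in lock step and flag entries whose
associativity changed:

	struct drmem_lmb *old_lmb, *new_lmb;

	/* Hypothetical: compare the saved copy against the updated array */
	for_each_pair_dinfo_lmb(old_dinfo, old_lmb, new_dinfo, new_lmb) {
		if (old_lmb->aa_index != new_lmb->aa_index)
			pr_info("LMB %u changed affinity\n", new_lmb->drc_index);
	}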

-Nathan

> 
> 
>> Also modify the DRMEM initialization code to allow it to,
>>
>> * Be called after system initialization
>> * Provide a separate user copy of the LMB array that it produces
>> * Free the user copy upon request
> 
> Is there any reason those can't be done as separate patches?
> 
>> In addition, a couple of changes were made to make the creation
>> of additional copies of the LMB array more useful including,
>>
>> * Add new iterator to work through a pair of drmem_info arrays.
>> * Modify DRMEM code to replace usages of dt_root_addr_cells, and
>>   dt_mem_next_cell, as these are only available at first boot.
> 
> Likewise?
> 
> cheers
> 
>> diff --git a/arch/powerpc/include/asm/drmem.h 
>> b/arch/powerpc/include/asm/drmem.h
>> index ce242b9..b0e70fd 100644
>> --- a/arch/powerpc/include/asm/drmem.h
>> +++ b/arch/powerpc/include/asm/drmem.h
>> @@ -35,6 +35,18 @@ struct drmem_lmb_info {
>>  		&drmem_info->lmbs[0],				\
>>  		&drmem_info->lmbs[drmem_info->n_lmbs - 1])
>>  
>> +#define for_each_dinfo_lmb(dinfo, lmb)				\
>> +	for_each_drmem_lmb_in_range((lmb),			\
>> +		&dinfo->lmbs[0],				\
>> +		&dinfo->lmbs[dinfo->n_lmbs - 1])
>> +
>> +#define for_each_pair_dinfo_lmb(dinfo1, lmb1, dinfo2, lmb2)	\
>> +	for ((lmb1) = (&dinfo1->lmbs[0]),			\
>> +	     (lmb2) = (&dinfo2->lmbs[0]);			\
>> +	     ((lmb1) <= (&dinfo1->lmbs[dinfo1->n_lmbs - 1])) &&	\
>> +	     ((lmb2) <= (&dinfo2->lmbs[dinfo2->n_lmbs - 1]));	\
>> +	     (lmb1)++, (lmb2)++)
>> +
>>  /*
>>   * The of_drconf_cell_v1 struct defines the layout of the LMB data
>>   * specified in the ibm,dynamic-memory device tree property.
>> @@ -94,6 +106,9 @@ void __init walk_drmem_lmbs(struct device_node *dn,
>>  void (*func)(struct drmem_lmb *, const __be32 **));
>>  int drmem_update_dt(void);
>>  
>> +struct drmem_lmb_info *drmem_lmbs_init(struct property *prop);
>> +void drmem_lmbs_free(struct drmem_lmb_info *dinfo);
>> +
>>  #ifdef CONFIG_PPC_PSERIES
>>  void __init walk_drmem_lmbs_early(unsigned long node,
>>  void (*func)(struct drmem_lmb *, const __be32 **));
>> diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
>> index 3f18036..13d2abb 100644
>> --- a/arch/powerpc/mm/drmem.c
>> +++ b/arch/powerpc/mm/drmem.c
>> @@ -20,6 +20,7 @@
>>  
>>  static struct drmem_lmb_info __drmem_info;
>>  struct drmem_lmb_info *drmem_info = &__drmem_info;
>> +static int n_root_addr_cells;
>>  
>>  u64 drmem_lmb_memory_max(void)
>>  {
>> @@ -193,12 +194,13 @@ int drmem_update_dt(void)
>>  return rc;
>>  }
>>  
>> -static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
>> +static void read_drconf_v1_cell(struct drmem_lmb *lmb,
>> const __be32 **prop)
>>  {
>>  const __be32 *p = *prop;
>>  
>> -	lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, &p);
>> +lmb->base_addr = of_read_number(p, n_root_addr_cells);
>> +p += n_root_addr_cells;
>>  lmb->drc_index = of_read_number(p++, 1);
>>  
>>  p++; /* skip reserved field */
>> @@ -209,7 +211,7 @@ static void __init read_drconf_v1_cell(struct drmem_lmb 
>> *lmb,
>>  *prop = p;
>>  }
>>  
>> -static void __init __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 
>> *usm,
>> +static void __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 *usm,
>>  void (*func)(struct drmem_lmb *, const __be32 **))
>>  {
>>  struct drmem_lmb lmb;
>> @@ -225,13 +227,14 @@ static void __init __walk_drmem_v1_lmbs(const __be32 
>> *prop, const __be32 *usm,
>>  }
>>  }
>>  
>> -static void __init read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
>> +static void read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
>> const __be32 **prop)
>>  {
>>  const __be32 *p = *prop;
>>  
>>  dr_cell->seq_lmbs = of_read_number(p++, 1);
>> -	dr_cell->base_addr = dt_mem_next_cell(dt_root_addr_cells, &p);
>> +dr_cell->base_addr = of_read_number(p, 

Re: [PATCH] migration/mm: Add WARN_ON to try_offline_node

2018-10-03 Thread Michael Bringmann
On 10/03/2018 06:05 PM, Tyrel Datwyler wrote:
> On 10/03/2018 06:27 AM, Michael Bringmann wrote:
>> On 10/02/2018 02:45 PM, Tyrel Datwyler wrote:
>>> On 10/02/2018 11:13 AM, Michael Bringmann wrote:


 On 10/02/2018 11:04 AM, Michal Hocko wrote:
> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>>> [...]
 When the device-tree affinity attributes have changed for memory,
 the 'nid' affinity calculated points to a different node for the
 memory block than the one used to install it, previously on the
 source system.  The newly calculated 'nid' affinity may not yet
 be initialized on the target system.  The current memory tracking
 mechanisms do not record the node to which a memory block was
 associated when it was added.  Nathan is looking at adding this
 feature to the new implementation of LMBs, but it is not there
 yet, and won't be present in earlier kernels without backporting a
 significant number of changes.
>>>
>>> Then the patch you have proposed here just papers over a real issue, no?
>>> IIUC then you simply do not remove the memory if you lose the race.
>>
>> The problem occurs when removing memory after an affinity change
>> references a node that was previously unreferenced.  Other code
>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
>> node when adding memory to a system.  The 'removing memory' case is
>> specific to systems that perform LPM and allow device-tree changes.
>> The powerpc kernel does not have the option of accepting some PRRN
>> requests and rejecting others.  It must perform them all.
>
> I am sorry, but you are still too cryptic for me. Either there is a
> correctness issue and then the patch doesn't really fix anything or the
> final race doesn't make any difference and then the ppc code should be
> explicit about that. Checking the node inside the hotplug core code just
> looks as a wrong layer to mitigate an arch specific problem. I am not
> saying the patch is a no-go but if anything we want a big fat comment
> explaining how this is possible because right now it just points to an
> incorrect API usage.
>
> That being said, this sounds pretty much ppc specific problem and I
> would _prefer_ it to be handled there (along with a big fat comment of
> course).

 Let me try again.  Regardless of the path to which we get to this 
 condition,
 we currently crash the kernel.  This patch changes that to a WARN_ON notice
 and continues executing the kernel without shutting down the system.  I saw
 the problem during powerpc testing, because that is the focus of my work.
 There are other paths to this function besides powerpc.  I feel that the
 kernel should keep running instead of halting.
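For readers without the patch in front of them, the change under discussion is
essentially a guard of this shape (assumed sketch only; the patch itself is not
quoted in this thread):

	void try_offline_node(int nid)
	{
		pg_data_t *pgdat = NODE_DATA(nid);

		/* Bail out with a warning instead of oopsing when the node
		 * was never initialized (e.g. after a post-migration PRRN
		 * affinity change references a new node). */
		if (WARN_ON(!pgdat))
			return;

		/* ... existing offline checks ... */
	}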
>>>
>>> This is still basically a hack to get around a known race. In itself this 
>>> patch is still worth while in that we shouldn't crash the kernel on a null 
>>> pointer dereference. However, I think the actual problem still needs to be 
>>> addressed. We shouldn't run any PRRN events for the source system on the 
>>> target after a migration. The device tree update should have taken care of 
>>> telling us about new affinities and what not. Can we just throw out any 
>>> queued PRRN events when we wake up on the target?
>>
>> We are not talking about queued events provided on the source system, but 
>> about
>> new PRRN events sent by phyp to the kernel on the target system to update the
>> kernel state after migration.  No way to predict the content.
> 
> Okay, but either way, shouldn't your other proposed patches (updating memory
> affinity by re-adding memory, and changing when topology updates are stopped
> so that the post-mobility updates are included) put things in the right nodes?
> Or am I missing something? I would assume a PRRN on the target would assume
> the target was up-to-date with respect to where things are supposed to be
> located.

This bug only recently came to our attention in a defect on a SLES12 SP3 
platform running PHYP Memory Mover concurrently with LPM.  Not a normal test 
case.

> 
> -Tyrel

Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



[PATCH] dma-direct: Fix return value of dma_direct_supported

2018-10-03 Thread Alexander Duyck
It appears that in commit 9d7a224b463e ("dma-direct: always allow dma mask
<= physiscal memory size") the logic of the test was changed from a "<" to
a ">=" however I don't see any reason for that change. I am assuming that
there was some additional change planned, specifically I suspect the logic
was intended to be reversed and possibly used for a return. Since that is
the case I have gone ahead and done that.

This addresses issues I had on my system that prevented me from booting
with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU.

Fixes: 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory 
size")
Signed-off-by: Alexander Duyck 
---
 kernel/dma/direct.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5a0806b5351b..65872f6c2e93 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -301,9 +301,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
 
min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
 
-   if (mask >= phys_to_dma(dev, min_mask))
-   return 0;
-   return 1;
+   return mask >= phys_to_dma(dev, min_mask);
 }
 
 int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr)



Re: [PATCH 5/5] dma-direct: always allow dma mask <= physiscal memory size

2018-10-03 Thread Alexander Duyck
On Thu, Sep 27, 2018 at 3:38 PM Christoph Hellwig  wrote:
>
> This way an architecture with less than 4G of RAM can support dma_mask
> smaller than 32-bit without a ZONE_DMA.  Apparently that is a common
> case on powerpc.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Robin Murphy 
> ---
>  kernel/dma/direct.c | 28 
>  1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 60c433b880e0..170bd322a94a 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -284,21 +284,25 @@ int dma_direct_map_sg(struct device *dev, struct 
> scatterlist *sgl, int nents,
> return nents;
>  }
>
> +/*
> + * Because 32-bit DMA masks are so common we expect every architecture to be
> + * able to satisfy them - either by not supporting more physical memory, or 
> by
> + * providing a ZONE_DMA32.  If neither is the case, the architecture needs to
> + * use an IOMMU instead of the direct mapping.
> + */
>  int dma_direct_supported(struct device *dev, u64 mask)
>  {
> -#ifdef CONFIG_ZONE_DMA
> -   if (mask < phys_to_dma(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
> -   return 0;
> -#else
> -   /*
> -* Because 32-bit DMA masks are so common we expect every architecture
> -* to be able to satisfy them - either by not supporting more physical
> -* memory, or by providing a ZONE_DMA32.  If neither is the case, the
> -* architecture needs to use an IOMMU instead of the direct mapping.
> -*/
> -   if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
> +   u64 min_mask;
> +
> +   if (IS_ENABLED(CONFIG_ZONE_DMA))
> +   min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
> +   else
> +   min_mask = DMA_BIT_MASK(32);
> +
> +   min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
> +
> +   if (mask >= phys_to_dma(dev, min_mask))
> return 0;
> -#endif
> return 1;
>  }

So I believe I have run into the same issue that Guenter reported, on an
x86_64 system w/ Intel IOMMU. I wasn't able to complete boot and
all probe attempts for various devices were failing with -EIO errors.

I believe the last mask check should be "if (mask < phys_to_dma(dev,
min_mask))" not a ">=" check.

- Alex


Re: [PATCH] migration/mm: Add WARN_ON to try_offline_node

2018-10-03 Thread Tyrel Datwyler
On 10/03/2018 06:27 AM, Michael Bringmann wrote:
> On 10/02/2018 02:45 PM, Tyrel Datwyler wrote:
>> On 10/02/2018 11:13 AM, Michael Bringmann wrote:
>>>
>>>
>>> On 10/02/2018 11:04 AM, Michal Hocko wrote:
 On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>> [...]
>>> When the device-tree affinity attributes have changed for memory,
>>> the 'nid' affinity calculated points to a different node for the
>>> memory block than the one used to install it, previously on the
>>> source system.  The newly calculated 'nid' affinity may not yet
>>> be initialized on the target system.  The current memory tracking
>>> mechanisms do not record the node to which a memory block was
>>> associated when it was added.  Nathan is looking at adding this
>>> feature to the new implementation of LMBs, but it is not there
>>> yet, and won't be present in earlier kernels without backporting a
>>> significant number of changes.
>>
>> Then the patch you have proposed here just papers over a real issue, no?
>> IIUC then you simply do not remove the memory if you lose the race.
>
> The problem occurs when removing memory after an affinity change
> references a node that was previously unreferenced.  Other code
> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
> node when adding memory to a system.  The 'removing memory' case is
> specific to systems that perform LPM and allow device-tree changes.
> The powerpc kernel does not have the option of accepting some PRRN
> requests and rejecting others.  It must perform them all.

 I am sorry, but you are still too cryptic for me. Either there is a
 correctness issue and then the patch doesn't really fix anything or the
 final race doesn't make any difference and then the ppc code should be
 explicit about that. Checking the node inside the hotplug core code just
 looks as a wrong layer to mitigate an arch specific problem. I am not
 saying the patch is a no-go but if anything we want a big fat comment
 explaining how this is possible because right now it just points to an
 incorrect API usage.

 That being said, this sounds pretty much ppc specific problem and I
 would _prefer_ it to be handled there (along with a big fat comment of
 course).
>>>
>>> Let me try again.  Regardless of the path to which we get to this condition,
>>> we currently crash the kernel.  This patch changes that to a WARN_ON notice
>>> and continues executing the kernel without shutting down the system.  I saw
>>> the problem during powerpc testing, because that is the focus of my work.
>>> There are other paths to this function besides powerpc.  I feel that the
>>> kernel should keep running instead of halting.
>>
>> This is still basically a hack to get around a known race. In itself this 
>> patch is still worth while in that we shouldn't crash the kernel on a null 
>> pointer dereference. However, I think the actual problem still needs to be 
>> addressed. We shouldn't run any PRRN events for the source system on the 
>> target after a migration. The device tree update should have taken care of 
>> telling us about new affinities and what not. Can we just throw out any 
>> queued PRRN events when we wake up on the target?
> 
> We are not talking about queued events provided on the source system, but 
> about
> new PRRN events sent by phyp to the kernel on the target system to update the
> kernel state after migration.  No way to predict the content.

Okay, but either way, shouldn't your other proposed patches (updating memory
affinity by re-adding memory, and changing when topology updates are stopped so
that the post-mobility updates are included) put things in the right nodes? Or am I
missing something? I would assume a PRRN on the target would assume the target
was up-to-date with respect to where things are supposed to be located.

-Tyrel

> 
>>
>> -Tyrel
>>>
>>> Regards,
>>>
> 
> Michael
> 



[PATCH] selftests/powerpc: Add check for TM SPRS on coredump

2018-10-03 Thread Breno Leitao
Add a selftest to check if the TM SPRs are being properly saved into a
coredump. The segfault is caused by an illegal instruction and ideally it
happens after load_tm overflowed and TM became lazily disabled.

This test is implemented basically by setting three TM SPRs and sleeping until
load_tm is expected to be zero.  Then it causes a segfault and opens the
coredump to check if the SPRs saved in the coredump notes section contain
the same values that were set by the test.

This test needs root privileges in order to change the coredump file
pattern (/proc/sys/kernel/core_pattern).
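As an aside, redirecting the coredump file name from a test boils down to a
write to that procfs file; a minimal sketch (not the test's actual helper,
function name assumed) could look like:

	#include <stdio.h>

	/* Point coredumps at a predictable file name pattern; needs root. */
	static int set_core_pattern(const char *pattern)
	{
		FILE *f = fopen("/proc/sys/kernel/core_pattern", "w");

		if (!f)
			return -1;
		fprintf(f, "%s\n", pattern);
		return fclose(f);
	}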

Signed-off-by: Breno Leitao 
Signed-off-by: Gustavo Romero 
---
 tools/testing/selftests/powerpc/tm/Makefile  |   3 +-
 tools/testing/selftests/powerpc/tm/tm-core.c | 480 +++
 tools/testing/selftests/powerpc/tm/tm.h  |   4 +
 3 files changed, 486 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-core.c

diff --git a/tools/testing/selftests/powerpc/tm/Makefile 
b/tools/testing/selftests/powerpc/tm/Makefile
index c0e45d2dde25..b06a58fc6f79 100644
--- a/tools/testing/selftests/powerpc/tm/Makefile
+++ b/tools/testing/selftests/powerpc/tm/Makefile
@@ -4,7 +4,7 @@ SIGNAL_CONTEXT_CHK_TESTS := tm-signal-context-chk-gpr 
tm-signal-context-chk-fpu
 
 TEST_GEN_PROGS := tm-resched-dscr tm-syscall tm-signal-msr-resv 
tm-signal-stack \
tm-vmxcopy tm-fork tm-tar tm-tmspr tm-vmx-unavail tm-unavailable 
tm-trap \
-   $(SIGNAL_CONTEXT_CHK_TESTS) tm-sigreturn
+   $(SIGNAL_CONTEXT_CHK_TESTS) tm-sigreturn tm-core
 
 include ../../lib.mk
 
@@ -19,6 +19,7 @@ $(OUTPUT)/tm-vmx-unavail: CFLAGS += -pthread -m64
 $(OUTPUT)/tm-resched-dscr: ../pmu/lib.c
 $(OUTPUT)/tm-unavailable: CFLAGS += -O0 -pthread -m64 -Wno-error=uninitialized 
-mvsx
 $(OUTPUT)/tm-trap: CFLAGS += -O0 -pthread -m64
+$(OUTPUT)/tm-core: CFLAGS += -pthread
 
 SIGNAL_CONTEXT_CHK_TESTS := $(patsubst 
%,$(OUTPUT)/%,$(SIGNAL_CONTEXT_CHK_TESTS))
 $(SIGNAL_CONTEXT_CHK_TESTS): tm-signal.S
diff --git a/tools/testing/selftests/powerpc/tm/tm-core.c 
b/tools/testing/selftests/powerpc/tm/tm-core.c
new file mode 100644
index ..b01a3a1ae1bf
--- /dev/null
+++ b/tools/testing/selftests/powerpc/tm/tm-core.c
@@ -0,0 +1,480 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2018, Breno Leitao, Gustavo Romero, IBM Corp.
+ * Licensed under GPLv2.
+ *
+ * Test case that sets TM SPR and sleeps, waiting kernel load_tm to be
+ * zero. Then causes a segfault to generate a core dump file that will be
+ * analyzed. The coredump needs to have the same HTM SPR values set
+ * initially for a successful test.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tm.h"
+
+/* Default time that causes load_tm = 0 on P8/pseries */
+#define DEFAULT_SLEEP_TIME 0x00d000
+
+/* MAX string length for string concatenation */
+#define LEN_MAX		1024
+
+/* Maximum coredump file size */
+#define CORE_FILE_LIMIT	(5 * 1024 * 1024)
+
+/* Name of the coredump file to be generated */
+#define COREDUMPFILE   "core-tm-spr"
+
+/* Function that returns concatenated strings */
+#define COREDUMP(suffix)   (COREDUMPFILE#suffix)
+
+/* File to change the coredump file name pattern */
+#define CORE_PATTERN_FILE  "/proc/sys/kernel/core_pattern"
+
+/* Logging macros */
+#define err_at_line(status, errnum, format, ...) \
+	error_at_line(status, errnum, __FILE__, __LINE__, format, ##__VA_ARGS__)
+#define pr_err(code, format, ...) err_at_line(1, code, format, ##__VA_ARGS__)
+
+/* Child PID */
+static pid_t child;
+
+/* pthread attribute for both ping and pong threads */
+static pthread_attr_t attr;
+
+/* SPR values to be written to TM SPRs and verified later */
+const unsigned long texasr = 0x31;
+const unsigned long tfiar = 0xdeadbeef0;
+const unsigned long tfhar = 0xbaadf00d0;
+
+struct tm_sprs {
+   unsigned long texasr;
+   unsigned long tfhar;
+   unsigned long tfiar;
+};
+
+struct coremem {
+   void *p;
+   off_t len;
+};
+
+/* Set the process limits to be able to create a coredump file */
+static int increase_core_file_limit(void)
+{
+   struct rlimit rlim;
+   int ret;
+
+	ret = getrlimit(RLIMIT_CORE, &rlim);
+   if (ret != 0) {
+   pr_err(ret, "getrlimit CORE failed\n");
+   return -1;
+   }
+
+   if (rlim.rlim_cur != RLIM_INFINITY && rlim.rlim_cur < CORE_FILE_LIMIT) {
+   rlim.rlim_cur = CORE_FILE_LIMIT;
+
+   if (rlim.rlim_max != RLIM_INFINITY &&
+   rlim.rlim_max < CORE_FILE_LIMIT)
+   rlim.rlim_max = CORE_FILE_LIMIT;
+
+		ret = setrlimit(RLIMIT_CORE, &rlim);
+   if (ret != 0) {
+   pr_err(ret, "setrlimit CORE 

[PATCH v2 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting

2018-10-03 Thread Mark Hairgrove
This threshold is no longer used now that all invalidates issue a single
ATSD to each active NPU.

Signed-off-by: Mark Hairgrove 
---
 arch/powerpc/platforms/powernv/npu-dma.c |   14 --
 1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index e4c0fab..6f60e09 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -17,7 +17,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
@@ -43,14 +42,6 @@
 static DEFINE_SPINLOCK(npu_context_lock);
 
 /*
- * When an address shootdown range exceeds this threshold we invalidate the
- * entire TLB on the GPU for the given PID rather than each specific address in
- * the range.
- */
-static uint64_t atsd_threshold = 2 * 1024 * 1024;
-static struct dentry *atsd_threshold_dentry;
-
-/*
  * Other types of TCE cache invalidation are not functional in the
  * hardware.
  */
@@ -966,11 +957,6 @@ int pnv_npu2_init(struct pnv_phb *phb)
static int npu_index;
uint64_t rc = 0;
 
-   if (!atsd_threshold_dentry) {
-   atsd_threshold_dentry = debugfs_create_x64("atsd_threshold",
-		0600, powerpc_debugfs_root, &atsd_threshold);
-   }
-
phb->npu.nmmu_flush =
of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
for_each_child_of_node(phb->hose->dn, dn) {
-- 
1.7.2.5



[PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates

2018-10-03 Thread Mark Hairgrove
There are two types of ATSDs issued to the NPU: invalidates targeting a
specific virtual address and invalidates targeting the whole address
space. In both cases prior to this change, the sequence was:

for each NPU
- Write the target address to the XTS_ATSD_AVA register
- EIEIO
- Write the launch value to issue the ATSD

First, a target address is not required when invalidating the whole
address space, so that write and the EIEIO have been removed. The AP
(size) field in the launch is not needed either.

Second, for per-address invalidates the above sequence is inefficient in
the common case of multiple NPUs because an EIEIO is issued per NPU. This
unnecessarily forces the launches of later ATSDs to be ordered with the
launches of earlier ones. The new sequence only issues a single EIEIO:

for each NPU
- Write the target address to the XTS_ATSD_AVA register
EIEIO
for each NPU
- Write the launch value to issue the ATSD

Performance results were gathered using a microbenchmark which creates a
1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
strides across the allocation.

With only a single NPU active (one GPU) the difference is in the noise for
both types of invalidates (+/-1%).

With two NPUs active (on a 6-GPU system) the effect is more noticeable:

 mprotect rate (GB/s)
Stride   Before   After  Speedup
64K         5.9     6.5      10%
1M         31.2    33.4       7%
2M         36.3    38.7       7%
4M        322.6   356.7      11%
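(For readers who want to reproduce something similar: a minimal sketch of such
a microbenchmark, not the one used for the numbers above, might look like the
following.)

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t size = 1UL << 30;	/* 1G allocation */
		size_t stride = 2UL << 20;	/* e.g. a 2M stride */
		char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;

		/* Fault the pages in so there are translations to shoot down */
		memset(buf, 1, size);

		/* Each PROT_NONE change invalidates the covered translations */
		for (size_t off = 0; off < size; off += stride)
			mprotect(buf + off, stride, PROT_NONE);

		munmap(buf, size);
		return 0;
	}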

Signed-off-by: Mark Hairgrove 
---
 arch/powerpc/platforms/powernv/npu-dma.c |   99 ++---
 1 files changed, 48 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 8006c54..c8f438a 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -454,79 +454,76 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
 }
 
 /* MMIO ATSD register offsets */
-#define XTS_ATSD_AVA  1
-#define XTS_ATSD_STAT 2
+#define XTS_ATSD_LAUNCH 0
+#define XTS_ATSD_AVA    1
+#define XTS_ATSD_STAT   2
 
-static void mmio_launch_invalidate(struct mmio_atsd_reg *mmio_atsd_reg,
-   unsigned long launch, unsigned long va)
+static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long 
psize,
+   bool flush)
 {
-   struct npu *npu = mmio_atsd_reg->npu;
-   int reg = mmio_atsd_reg->reg;
+   unsigned long launch = 0;
 
-   __raw_writeq_be(va, npu->mmio_atsd_regs[reg] + XTS_ATSD_AVA);
-   eieio();
-   __raw_writeq_be(launch, npu->mmio_atsd_regs[reg]);
+   if (psize == MMU_PAGE_COUNT) {
+   /* IS set to invalidate entire matching PID */
+   launch |= PPC_BIT(12);
+   } else {
+   /* AP set to invalidate region of psize */
+   launch |= (u64)mmu_get_ap(psize) << PPC_BITLSHIFT(17);
+   }
+
+   /* PRS set to process-scoped */
+   launch |= PPC_BIT(13);
+
+   /* PID */
+   launch |= pid << PPC_BITLSHIFT(38);
+
+   /* No flush */
+   launch |= !flush << PPC_BITLSHIFT(39);
+
+   return launch;
 }
 
-static void mmio_invalidate_pid(struct mmio_atsd_reg 
mmio_atsd_reg[NV_MAX_NPUS],
-   unsigned long pid, bool flush)
+static void mmio_atsd_regs_write(struct mmio_atsd_reg
+   mmio_atsd_reg[NV_MAX_NPUS], unsigned long offset,
+   unsigned long val)
 {
-   int i;
-   unsigned long launch;
+   struct npu *npu;
+   int i, reg;
 
for (i = 0; i <= max_npu2_index; i++) {
-   if (mmio_atsd_reg[i].reg < 0)
+   reg = mmio_atsd_reg[i].reg;
+   if (reg < 0)
continue;
 
-   /* IS set to invalidate matching PID */
-   launch = PPC_BIT(12);
-
-   /* PRS set to process-scoped */
-   launch |= PPC_BIT(13);
-
-   /* AP */
-   launch |= (u64)
-   mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
-
-   /* PID */
-   launch |= pid << PPC_BITLSHIFT(38);
+   npu = mmio_atsd_reg[i].npu;
+   __raw_writeq_be(val, npu->mmio_atsd_regs[reg] + offset);
+   }
+}
 
-   /* No flush */
-   launch |= !flush << PPC_BITLSHIFT(39);
+static void mmio_invalidate_pid(struct mmio_atsd_reg 
mmio_atsd_reg[NV_MAX_NPUS],
+   unsigned long pid, bool flush)
+{
+   unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT, flush);
 
-   /* Invalidating the entire process doesn't use a va */
-		mmio_launch_invalidate(&mmio_atsd_reg[i], launch, 0);
-   }
+   /* Invalidating the entire process doesn't use a va */
+   

[PATCH v2 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates

2018-10-03 Thread Mark Hairgrove
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold which defaulted to 2M. Invalidates that size or smaller
would issue per-page invalidates for the whole range.

The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU
page size, so reducing the number of invalidates by 32x in that case is a
clear improvement.

ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.

With this change all issued ATSDs now perform a flush, so the flush
parameter has been removed from all the helpers.

To show the benefit here are some performance numbers from a
microbenchmark which creates a 1G allocation then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.

One NPU (1 GPU):

 mprotect rate (GB/s)
Stride   Before   After  Speedup
64K         5.3     5.6       5%
1M         39.3    57.4      46%
2M         49.7    82.6      66%
4M        286.6   285.7       0%

Two NPUs (6 GPUs):

 mprotect rate (GB/s)
Stride   Before   After  Speedup
64K         6.5     7.4      13%
1M         33.4    67.9     103%
2M         38.7    93.1     141%
4M        356.7   354.6      -1%

Anything over 2M is roughly the same as before since both cases issue a
single ATSD.
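
(Illustrative only, not copied from the patch below: the size selection
described above amounts to picking the smallest supported ATSD size whose
naturally aligned region covers the range, and falling back to a whole-PID
invalidate otherwise. Macro names here are assumed from the usual kernel
headers.)

	static int atsd_psize_for(unsigned long start, unsigned long size)
	{
		unsigned long end = start + size - 1;

		if (ALIGN_DOWN(start, SZ_64K) == ALIGN_DOWN(end, SZ_64K))
			return MMU_PAGE_64K;
		if (ALIGN_DOWN(start, SZ_2M) == ALIGN_DOWN(end, SZ_2M))
			return MMU_PAGE_2M;
		if (ALIGN_DOWN(start, SZ_1G) == ALIGN_DOWN(end, SZ_1G))
			return MMU_PAGE_1G;

		return MMU_PAGE_COUNT;	/* invalidate the whole PID */
	}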

Signed-off-by: Mark Hairgrove 
---
 arch/powerpc/platforms/powernv/npu-dma.c |  103 --
 1 files changed, 55 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index c8f438a..e4c0fab 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -458,8 +459,7 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
 #define XTS_ATSD_AVA    1
 #define XTS_ATSD_STAT   2
 
-static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long 
psize,
-   bool flush)
+static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long 
psize)
 {
unsigned long launch = 0;
 
@@ -477,8 +477,7 @@ static unsigned long get_atsd_launch_val(unsigned long pid, 
unsigned long psize,
/* PID */
launch |= pid << PPC_BITLSHIFT(38);
 
-   /* No flush */
-   launch |= !flush << PPC_BITLSHIFT(39);
+   /* Leave "No flush" (bit 39) 0 so every ATSD performs a flush */
 
return launch;
 }
@@ -501,23 +500,22 @@ static void mmio_atsd_regs_write(struct mmio_atsd_reg
 }
 
 static void mmio_invalidate_pid(struct mmio_atsd_reg 
mmio_atsd_reg[NV_MAX_NPUS],
-   unsigned long pid, bool flush)
+   unsigned long pid)
 {
-   unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT, flush);
+   unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT);
 
/* Invalidating the entire process doesn't use a va */
mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
 }
 
-static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
-   unsigned long va, unsigned long pid, bool flush)
+static void mmio_invalidate_range(struct mmio_atsd_reg
+   mmio_atsd_reg[NV_MAX_NPUS], unsigned long pid,
+   unsigned long start, unsigned long psize)
 {
-   unsigned long launch;
-
-   launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
+   unsigned long launch = get_atsd_launch_val(pid, psize);
 
/* Write all VAs first */
-   mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
+   mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, start);
 
/* Issue one barrier for all address writes */
eieio();
@@ -609,14 +607,36 @@ static void release_atsd_reg(struct mmio_atsd_reg 
mmio_atsd_reg[NV_MAX_NPUS])
 }
 
 /*
- * Invalidate either a single address or an entire PID depending on
- * the value of va.
+ * Invalidate a virtual address range
  */
-static void mmio_invalidate(struct npu_context *npu_context, int va,
-   unsigned long address, bool flush)
+static void mmio_invalidate(struct npu_context *npu_context,
+   unsigned long start, unsigned long size)
 {
struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];

[PATCH v2 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead

2018-10-03 Thread Mark Hairgrove
When ATS is used in a process, all CPU TLB invalidates in that process
also trigger ATSD invalidates via mmu_notifiers. This additional overhead
is noticeable in applications which do heavy memory allocation or page
migration among nodes, particularly to and from GPUs.

This patch set reduces that overhead by rearranging how the ATSDs are
issued and by using size-based ATSD invalidates.

Mark Hairgrove (3):
  powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
  powerpc/powernv/npu: Use size-based ATSD invalidates
  powerpc/powernv/npu: Remove atsd_threshold debugfs setting

 arch/powerpc/platforms/powernv/npu-dma.c |  198 ++
 1 files changed, 94 insertions(+), 104 deletions(-)

-- 
1.7.2.5



Re: [PATCH 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates

2018-10-03 Thread Mark Hairgrove



On Wed, 3 Oct 2018, Alistair Popple wrote:

> > > 
> > > We also support 4K page sizes on PPC. If I am not mistaken this means 
> > > every ATSD
> > > would invalidate the entire GPU TLB for a the given PID on those systems. 
> > > Could
> > > we change the above check to `if (size <= PAGE_64K)` to avoid this?
> > 
> > PPC supports 4K pages but the GPU ATS implementation does not. For that
> > reason I didn't bother handling invalidates smaller than 64K. I'll add a
> > comment on that.
> 
> Interesting, I was not aware of that limitation. Do you know if it is a
> SW/driver limitation or a HW limitation?

HW.

> 
> > I don't know that this requirement is enforced anywhere though. I could
> > add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
> > would be useful.
> 
> Given it's a static kernel build parameter perhaps it makes more sense to do 
> the
> check as part of the driver build in a conftest rather than a runtime failure?

Sure, we can do it in the driver. I'll just stick with a comment for the
kernel.


Re: [RFC PATCH kernel] vfio/spapr_tce: Get rid of possible infinite loop

2018-10-03 Thread Alex Williamson
On Tue,  2 Oct 2018 13:22:31 +1000
Alexey Kardashevskiy  wrote:

> As a part of cleanup, the SPAPR TCE IOMMU subdriver releases preregistered
> memory. If there is a bug in memory release, the loop in
> tce_iommu_release() becomes infinite; this actually happened to me.
> 
> This makes the loop finite and prints a warning on every failure so that
> bugs in the release path are easier to notice.
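
To spell out the failure mode from the diff below: the old cleanup loop was

	while (!list_empty(&container->prereg_list)) {
		tcemem = list_first_entry(&container->prereg_list,
				struct tce_iommu_prereg, next);
		WARN_ON_ONCE(tce_iommu_prereg_free(container, tcemem));
	}

so if tce_iommu_prereg_free() fails, the entry stays on the list and
list_first_entry() keeps returning the same element forever. Iterating once
with list_for_each_entry_safe() bounds the cleanup to a single pass and still
warns on each failure.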
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 10 +++---
>  1 file changed, 3 insertions(+), 7 deletions(-)

Should this have a stable/fixes tag?  Looks like it's relevant to:

4b6fad7097f8 powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

Also, not sure who you're wanting to take this since it was sent to ppc
lists.  If it's me, let me know, otherwise

Acked-by: Alex Williamson 

Thanks,
Alex

> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index b1a8ab3..ece0651 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -371,6 +371,7 @@ static void tce_iommu_release(void *iommu_data)
>  {
>   struct tce_container *container = iommu_data;
>   struct tce_iommu_group *tcegrp;
> + struct tce_iommu_prereg *tcemem, *tmtmp;
>   long i;
>  
>   while (tce_groups_attached(container)) {
> @@ -393,13 +394,8 @@ static void tce_iommu_release(void *iommu_data)
>   tce_iommu_free_table(container, tbl);
>   }
>  
> -	while (!list_empty(&container->prereg_list)) {
> -		struct tce_iommu_prereg *tcemem;
> -
> -		tcemem = list_first_entry(&container->prereg_list,
> -				struct tce_iommu_prereg, next);
> -		WARN_ON_ONCE(tce_iommu_prereg_free(container, tcemem));
> -	}
> +	list_for_each_entry_safe(tcemem, tmtmp, &container->prereg_list, next)
> +		WARN_ON(tce_iommu_prereg_free(container, tcemem));
>  
>   tce_iommu_disable(container);
>   if (container->mm)



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread David Hildenbrand
On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
> Dave Hansen  writes:
> 
>> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>>> It is more than just memmaps (e.g. forking udev process doing memory
>>> onlining also needs memory) but yes, the main idea is to make the
>>> onlining synchronous with hotplug.
>>
>> That's a good theoretical concern.
>>
>> But, is it a problem we need to solve in practice?
> 
> Yes, unfortunately. It was previously discovered that when we try to
> hotplug tons of memory to a low memory system (this is a common scenario
> with VMs) we end up with OOM because for all new memory blocks we need
> to allocate page tables, struct pages, ... and we need memory to do
> that. The userspace program doing memory onlining also needs memory to
> run and in case it prefers to fork to handle hundreds of notifications
> ... well, it may get OOMkilled before it manages to online anything.
> 
> Allocating all kernel objects from the newly hotplugged blocks would
> definitely help to manage the situation but as I said this won't solve
> the 'forking udev' problem completely (it will likely remain in
> 'extreme' cases only. We can probably work around it by onlining with a
> dedicated process which doesn't do memory allocation).
> 

I guess the problem is even worse. We always have two phases

1. add memory - requires memory allocation
2. online memory - might require memory allocations e.g. for slab/slub

So if we just added memory but don't have sufficient memory to start a
user space process to trigger onlining, then we most likely also don't
have sufficient memory to online the memory right away (in some scenarios).

We would have to allocate all new memory for 1 and 2 from the memory to
be onlined. I guess the latter part is less trivial.

So while onlining the memory from the kernel might make things a little
more robust, we would still have the chance for OOM / onlining failing.

-- 

Thanks,

David / dhildenb


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread David Hildenbrand
On 03/10/2018 16:24, Michal Hocko wrote:
> On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
> [...]
>>> As David said some of the memory cannot be onlined without further steps
>>> (e.g. when it is standby as David called it) and then I fail to see how
>>> eBPF help in any way.
>>
>> and also, we can fight till the end of days here trying to come up with
>> an onlining solution which would work for everyone and eBPF would move
>> this decision to distro level.
> 
> The point is that there is _no_ general onlining solution. This is
> basically policy which belongs to the userspace.
> 

As already stated, I guess we should then provide user space with
sufficient information to make a good decision (to implement rules).

The eBPF is basically the same idea, only the rules are formulated
differently and handled directly in the kernel. Still it might be e.g.
relevant whether memory is standby memory (that's what I remember as the
official s390x name), or something else.

Right now, the (udev) rules we have make assumptions based on general
system properties (s390x, HyperV ...).

-- 

Thanks,

David / dhildenb


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread David Hildenbrand
On 03/10/2018 15:54, Michal Hocko wrote:
> On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
>> On 02/10/2018 15:47, Michal Hocko wrote:
> [...]
>>> Zone imbalance is an inherent problem of the highmem zone. It is
>>> essentially the highmem zone we all loved so much back in 32b days.
>>> Yes the movable zone doesn't have any addressing limitations so it is a
>>> bit more relaxed but considering the hotplug scenarios I have seen so
>>> far people just want to have full NUMA nodes movable to allow replacing
>>> DIMMs. And then we are back to square one and the zone imbalance issue.
>>> You have those regardless where memmaps are allocated from.
>>
>> Unfortunately yes. And things get more complicated as you are adding a
>> whole DIMMs and get notifications in the granularity of memory blocks.
>> Usually you are not interested in onlining any memory block of that DIMM
>> as MOVABLE as soon as you would have to online one memory block of that
>> DIMM as NORMAL - because that can already block the whole DIMM.
> 
> For the purpose of the hotremove, yes. But as Dave has noted people are
> (ab)using zone movable for other purposes - e.g. large pages.

That might be right for some very special use cases. For most of users
this is not the case (meaning it should be the default but if the user
wants to change it, he should be allowed to change it).

>  
> [...]
>>> Then the immediate question would be why to use memory hotplug for that
>>> at all? Why don't you simply start with a huge pre-allocated physical
>>> address space and balloon memory in an out per demand. Why do you want
>>> to inject new memory during the runtime?
>>
>> Let's assume you have a guest with 20GB size and eventually want to
>> allow to grow it to 4TB. You would have to allocate metadata for 4TB
>> right from the beginning. That's definitely not what we want. That is
>> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
>> hypervisor even tells you at which places additional memory has been
>> made available.
> 
> Then you have to live with the fact that your hot added memory will be
> self hosted and find a way for ballooning to work with that. The price
> would be that some part of the memory is not really balloonable in the
> end.
> 
>>>> 1. is a reason why distributions usually don't configure
>>>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>>>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>>>> handle all new memory in user space, especially also HyperV memory.
>>>> There, you then have to check for things like "isHyperV()" to decide
>>>> "oh, yes, this should definitely not go to the MOVABLE zone".
>>>
>>> Why do you need a generic hotplug rule in the first place? Why don't you
>>> simply provide different set of rules for different usecases? Let users
>>> decide which usecase they prefer rather than try to be clever which
>>> almost always hits weird corner cases.
>>>
>>
>> Memory hotplug has to work as reliable as we can out of the box. Letting
>> the user make simple decisions like "oh, I am on hyper-V, I want to
>> online memory to the normal zone" does not feel right.
> 
> Users usually know what is their usecase and then it is just a matter of
> plumbing (e.g. distribution can provide proper tools to deploy those
> usecases) to chose the right and for user obscure way to make it work.

I disagree. If we can ship sane defaults, we should do that and allow to
make changes later on. This is how distributions have been working for
ever. But yes, allowing to make modifications is always a good idea to
tailor it to some special case user scenarios. (tuned or whatever we
have in place).

> 
>> But yes, we
>> should definitely allow to make modifications. So some sane default rule
>> + possible modification is usually a good idea.
>>
>> I think Dave has a point with using MOVABLE for huge page use cases. And
>> there might be other corner cases as you correctly state.
>>
>> I wonder if this patch itself minus modifying online/offline might make
>> sense. We can then implement simple rules in user space
>>
>> if (normal) {
>>  /* customers expect hotplugged DIMMs to be unpluggable */
>>  online_movable();
>> } else if (paravirt) {
>>  /* paravirt memory should as default always go to the NORMAL */
>>  online();
>> } else {
>>  /* standby memory will never get onlined automatically */
>> }
>>
>> Compared to having to guess what is to be done (isKVM(), isHyperV,
>> isS390 ...) and failing once this is no longer unique (e.g. virtio-mem
>> and ACPI support for x86 KVM).
> 
> I am worried that exporting a type will just push us even further to the
> corner. The current design is really simple and 2 stage and that is good
> because it allows for very different usecases. The more specific the API
> gets, the more likely we are going to hit "I haven't even dreamed somebody
> would be using hotplug for this thing". And I would bet this will happen
> sooner or later.

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
Dave Hansen  writes:

> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>> It is more than just memmaps (e.g. forking udev process doing memory
>> onlining also needs memory) but yes, the main idea is to make the
>> onlining synchronous with hotplug.
>
> That's a good theoretical concern.
>
> But, is it a problem we need to solve in practice?

Yes, unfortunately. It was previously discovered that when we try to
hotplug tons of memory to a low memory system (this is a common scenario
with VMs) we end up with OOM because for all new memory blocks we need
to allocate page tables, struct pages, ... and we need memory to do
that. The userspace program doing memory onlining also needs memory to
run and in case it prefers to fork to handle hundreds of notifications
... well, it may get OOMkilled before it manages to online anything.

Allocating all kernel objects from the newly hotplugged blocks would
definitely help to manage the situation but as I said this won't solve
the 'forking udev' problem completely (it will likely remain in
'extreme' cases only. We can probably work around it by onlining with a
dedicated process which doesn't do memory allocation).
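For illustration, the onlining itself is only a sysfs write, so such a dedicated helper can be tiny. A rough userspace sketch (assuming the standard /sys/devices/system/memory layout; the block number and policy are just examples):

```c
#include <stdio.h>

/* Online one memory block by writing to its sysfs "state" file.
 * "online_movable" asks for the MOVABLE zone, "online_kernel" for NORMAL,
 * plain "online" lets the kernel decide. */
static int online_block(unsigned long block, const char *how)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%lu/state", block);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(how, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	return online_block(32, "online_movable") ? 1 : 0;
}
```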

-- 
Vitaly


RE: [PATCH v3 4/6] drivers: clk-qoriq: Add clockgen support for lx2160a

2018-10-03 Thread Vabhav Sharma


> -Original Message-
> From: devicetree-ow...@vger.kernel.org 
> On Behalf Of Stephen Boyd
> Sent: Tuesday, October 2, 2018 3:34 AM
> To: Vabhav Sharma ; a...@arndb.de;
> catalin.mari...@arm.com; devicet...@vger.kernel.org;
> gre...@linuxfoundation.org; kstew...@linuxfoundation.org; linux-arm-
> ker...@lists.infradead.org; linux-...@vger.kernel.org; linux-kernel-
> ow...@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> p...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; mark.rutl...@arm.com;
> mturque...@baylibre.com; o...@buserror.net; r...@rjwysocki.net;
> robh...@kernel.org; sudeep.ho...@arm.com; viresh.ku...@linaro.org;
> will.dea...@arm.com; yamada.masah...@socionext.com
> Cc: li...@armlinux.org.uk; Varun Sethi ; Udit Kumar
> ; Yogesh Narayan Gaur
> ; Andy Tang ; Vabhav
> Sharma 
> Subject: Re: [PATCH v3 4/6] drivers: clk-qoriq: Add clockgen support for 
> lx2160a
> 
> Same subject comment.
Ok
> 
> Quoting Vabhav Sharma (2018-09-23 17:08:59)
> > From: Yogesh Gaur 
> >
> > Add clockgen support for lx2160a.


RE: [PATCH v3 3/6] drivers: clk-qoriq: increase array size of cmux_to_group

2018-10-03 Thread Vabhav Sharma


> -Original Message-
> From: Stephen Boyd 
> Sent: Tuesday, October 2, 2018 3:34 AM
> To: Vabhav Sharma ; a...@arndb.de;
> catalin.mari...@arm.com; devicet...@vger.kernel.org;
> gre...@linuxfoundation.org; kstew...@linuxfoundation.org; linux-arm-
> ker...@lists.infradead.org; linux-...@vger.kernel.org; linux-kernel-
> ow...@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> p...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; mark.rutl...@arm.com;
> mturque...@baylibre.com; o...@buserror.net; r...@rjwysocki.net;
> robh...@kernel.org; sudeep.ho...@arm.com; viresh.ku...@linaro.org;
> will.dea...@arm.com; yamada.masah...@socionext.com
> Cc: li...@armlinux.org.uk; Varun Sethi ; Udit Kumar
> ; Yogesh Narayan Gaur
> ; Vabhav Sharma 
> Subject: Re: [PATCH v3 3/6] drivers: clk-qoriq: increase array size of
> cmux_to_group
> 
> Subject should be "clk: qoriq: increase array size ..."
Ok



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
[...]
> > As David said some of the memory cannot be onlined without further steps
> > (e.g. when it is standby as David called it) and then I fail to see how
> > eBPF help in any way.
> 
> and also, we can fight till the end of days here trying to come up with
> an onlining solution which would work for everyone and eBPF would move
> this decision to distro level.

The point is that there is _no_ general onlining solution. This is
basically policy which belongs to the userspace.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v03 4/5] migration/memory: Evaluate LMB assoc changes

2018-10-03 Thread Michael Bringmann
On 10/02/2018 04:08 PM, Tyrel Datwyler wrote:
> On 10/01/2018 06:00 AM, Michael Bringmann wrote:
>> migration/memory: This patch adds code that recognizes changes to
>> the associativity of memory blocks described by the device-tree
>> properties in order to drive equivalent 'hotplug' operations to
>> update local and general kernel data structures to reflect those
>> changes.  These differences may include:
>>
>> * Evaluate 'ibm,dynamic-memory' properties when processing the
>>   updated device-tree properties of the system during Post Migration
>>   events (migration_store).  The new functionality looks for changes
>>   to the aa_index values for each drc_index/LMB to identify any memory
>>   blocks that should be readded.
>>
>> * In an LPAR migration scenario, the "ibm,associativity-lookup-arrays"
>>   property may change.  In the event that a row of the array differs,
>>   locate all assigned memory blocks with that 'aa_index' and 're-add'
>>   them to the system memory block data structures.  In the process of
>>   the 're-add', the system routines will update the corresponding entry
>>   for the memory in the LMB structures and any other relevant kernel
>>   data structures.
>>
>> A number of previous extensions made to the DRMEM code for scanning
>> device-tree properties and creating LMB arrays are used here to
>> ensure that the resulting code is simpler and more usable:
>>
>> * Use new paired list iterator for the DRMEM LMB info arrays to find
>>   differences in old and new versions of properties.
>> * Use new iterator for copies of the DRMEM info arrays to evaluate
>>   completely new structures.
>> * Combine common code for parsing and evaluating memory description
>>   properties based on the DRMEM LMB array model to greatly simplify
>>   extension from the older property 'ibm,dynamic-memory' to the new
>>   property model of 'ibm,dynamic-memory-v2'.
>>
>> Signed-off-by: Michael Bringmann 
>> ---
>> Changes in v03:
>>   -- Modify the code that parses the memory affinity attributes to
>>  mark relevant DRMEM LMB array entries using the internal_flags
>>  mechanism instead of generating unique hotplug actions for each
>>  memory block to be readded.  The change is intended to both
>>  simplify the code, and to require fewer resources on systems
>>  with huge amounts of memory.
>>   -- Save up notices about any/all LMB entries until the end of the
>>  'migration_store' operation at which point a single action is
>>  queued to scan the entire DRMEM array.
>>   -- Add READD_MULTIPLE function for memory that scans the DRMEM
>>  array to identify multiple entries that were marked previously.
>>  The corresponding memory blocks are to be readded to the system
>>  to update relevant data structures outside of the powerpc-
>>  specific code.
>>   -- Change dlpar_memory_pmt_changes_action to directly queue worker
>>  to pseries work queue.
>> ---
>>  arch/powerpc/platforms/pseries/hotplug-memory.c |  220 
>> +++
>>  arch/powerpc/platforms/pseries/mobility.c   |4 
>>  arch/powerpc/platforms/pseries/pseries.h|4 
>>  3 files changed, 194 insertions(+), 34 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index c1578f5..68bde2e 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -561,8 +561,11 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
>>  }
>>  }
>>
>> -if (!lmb_found)
>> -rc = -EINVAL;
>> +if (!lmb_found) {
>> +pr_info("Failed to update memory for drc index %lx\n",
>> +(unsigned long) drc_index);
>> +return -EINVAL;
>> +}
>>
>>  if (rc)
>>  pr_info("Failed to update memory at %llx\n",
>> @@ -573,6 +576,30 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
>>  return rc;
>>  }
>>
>> +static int dlpar_memory_readd_multiple(void)
>> +{
>> +struct drmem_lmb *lmb;
>> +int rc;
>> +
>> +pr_info("Attempting to update multiple LMBs\n");
>> +
>> +for_each_drmem_lmb(lmb) {
>> +if (drmem_lmb_update(lmb)) {
>> +rc = dlpar_remove_lmb(lmb);
>> +
>> +if (!rc) {
>> +rc = dlpar_add_lmb(lmb);
>> +if (rc)
>> +dlpar_release_drc(lmb->drc_index);
>> +}
>> +
>> +drmem_remove_lmb_update(lmb);
>> +}
>> +}
>> +
>> +return rc;
>> +}
>> +
> 
> Actually, in retrospect this function should have been independently 
> implemented in your 3rd patch where you define the flag. So, disregard moving 
> the define into this patch and instead move the function implementation into 
> patch 3 where the commit message talks about adding this operation.

Okay.  It 

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Dave Hansen
On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> It is more than just memmaps (e.g. forking udev process doing memory
> onlining also needs memory) but yes, the main idea is to make the
> onlining synchronous with hotplug.

That's a good theoretical concern.

But, is it a problem we need to solve in practice?



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
> On 02/10/2018 15:47, Michal Hocko wrote:
[...]
> > Zone imbalance is an inherent problem of the highmem zone. It is
> > essentially the highmem zone we all loved so much back in 32b days.
> > Yes the movable zone doesn't have any addressing limitations so it is a
> > bit more relaxed but considering the hotplug scenarios I have seen so
> > far people just want to have full NUMA nodes movable to allow replacing
> > DIMMs. And then we are back to square one and the zone imbalance issue.
> > You have those regardless where memmaps are allocated from.
> 
> Unfortunately yes. And things get more complicated as you are adding a
> whole DIMMs and get notifications in the granularity of memory blocks.
> Usually you are not interested in onlining any memory block of that DIMM
> as MOVABLE as soon as you would have to online one memory block of that
> DIMM as NORMAL - because that can already block the whole DIMM.

For the purpose of the hotremove, yes. But as Dave has noted people are
(ab)using zone movable for other purposes - e.g. large pages.
 
[...]
> > Then the immediate question would be why to use memory hotplug for that
> > at all? Why don't you simply start with a huge pre-allocated physical
> > address space and balloon memory in an out per demand. Why do you want
> > to inject new memory during the runtime?
> 
> Let's assume you have a guest with 20GB size and eventually want to
> allow to grow it to 4TB. You would have to allocate metadata for 4TB
> right from the beginning. That's definitely not what we want. That is
> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
> hypervisor even tells you at which places additional memory has been
> made available.

Then you have to live with the fact that your hot added memory will be
self hosted and find a way for ballooning to work with that. The price
would be that some part of the memory is not really balloonable in the
end.

> >> 1. is a reason why distributions usually don't configure
> >> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
> >> MOVABLE zone. That however implies, that e.g. for x86, you have to
> >> handle all new memory in user space, especially also HyperV memory.
> >> There, you then have to check for things like "isHyperV()" to decide
> >> "oh, yes, this should definitely not go to the MOVABLE zone".
> > 
> > Why do you need a generic hotplug rule in the first place? Why don't you
> > simply provide different set of rules for different usecases? Let users
> > decide which usecase they prefer rather than try to be clever which
> > almost always hits weird corner cases.
> > 
> 
> Memory hotplug has to work as reliable as we can out of the box. Letting
> the user make simple decisions like "oh, I am on hyper-V, I want to
> online memory to the normal zone" does not feel right.

Users usually know what is their usecase and then it is just a matter of
plumbing (e.g. distribution can provide proper tools to deploy those
usecases) to chose the right and for user obscure way to make it work.

> But yes, we
> should definitely allow to make modifications. So some sane default rule
> + possible modification is usually a good idea.
> 
> I think Dave has a point with using MOVABLE for huge page use cases. And
> there might be other corner cases as you correctly state.
> 
> I wonder if this patch itself minus modifying online/offline might make
> sense. We can then implement simple rules in user space
> 
> if (normal) {
>   /* customers expect hotplugged DIMMs to be unpluggable */
>   online_movable();
> } else if (paravirt) {
>   /* paravirt memory should as default always go to the NORMAL */
>   online();
> } else {
>   /* standby memory will never get onlined automatically */
> }
> 
> Compared to having to guess what is to be done (isKVM(), isHyperV,
> isS390 ...) and failing once this is no longer unique (e.g. virtio-mem
> and ACPI support for x86 KVM).

I am worried that exporting a type will just push us even further to the
corner. The current design is really simple and 2 stage and that is good
because it allows for very different usecases. The more specific the API
gets, the more likely we are going to hit "I haven't even dreamed somebody
would be using hotplug for this thing". And I would bet this will happen
sooner or later.

Just look at how the whole auto onlining screwed the API to work around
an implementation detail. It has created a one-purpose behavior that
doesn't suit many usecases. Yet we have to live with that because
somebody really relies on it. Let's not repeat the same errors.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
>> David Hildenbrand  writes:
>> 
>> > On 02/10/2018 15:47, Michal Hocko wrote:
>> ...
>> >> 
>> >> Why do you need a generic hotplug rule in the first place? Why don't you
>> >> simply provide different set of rules for different usecases? Let users
>> >> decide which usecase they prefer rather than try to be clever which
>> >> almost always hits weird corner cases.
>> >> 
>> >
>> > Memory hotplug has to work as reliable as we can out of the box. Letting
>> > the user make simple decisions like "oh, I am on hyper-V, I want to
>> > online memory to the normal zone" does not feel right. But yes, we
>> > should definitely allow to make modifications.
>> 
>> Last time I was thinking about the imperfectness of the auto-online
>> solution we have and any other solution we're able to suggest an idea
>> came to my mind - what if we add an eBPF attach point to the
>> auto-onlining mechanism, effectively offloading decision-making to
>> userspace. We'll of course need to provide all required data (e.g. how
>> memory blocks are aligned with physical DIMMs as it makes no sense to
>> online part of DIMM as normal and the rest as movable as it's going to
>> be impossible to unplug such DIMM anyways).
>
> And how does that differ from the notification mechanism we have? Just
> by not relying on the process scheduling? If yes then this revolves
> around the implementation detail that you care about time-to-hot-add
> vs. time-to-online. And that is a solveable problem - just allocate
> memmaps from the hot-added memory.

It is more than just memmaps (e.g. forking udev process doing memory
onlining also needs memory) but yes, the main idea is to make the
onlining synchronous with hotplug.

>
> As David said some of the memory cannot be onlined without further steps
> (e.g. when it is standby as David called it) and then I fail to see how
> eBPF help in any way.

and also, we can fight till the end of days here trying to come up with
an onlining solution which would work for everyone and eBPF would move
this decision to distro level.

-- 
Vitaly


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
> David Hildenbrand  writes:
> 
> > On 02/10/2018 15:47, Michal Hocko wrote:
> ...
> >> 
> >> Why do you need a generic hotplug rule in the first place? Why don't you
> >> simply provide different set of rules for different usecases? Let users
> >> decide which usecase they prefer rather than try to be clever which
> >> almost always hits weird corner cases.
> >> 
> >
> > Memory hotplug has to work as reliable as we can out of the box. Letting
> > the user make simple decisions like "oh, I am on hyper-V, I want to
> > online memory to the normal zone" does not feel right. But yes, we
> > should definitely allow to make modifications.
> 
> Last time I was thinking about the imperfectness of the auto-online
> solution we have and any other solution we're able to suggest an idea
> came to my mind - what if we add an eBPF attach point to the
> auto-onlining mechanism, effectively offloading decision-making to
> userspace. We'll of course need to provide all required data (e.g. how
> memory blocks are aligned with physical DIMMs as it makes no sense to
> online part of DIMM as normal and the rest as movable as it's going to
> be impossible to unplug such DIMM anyways).

And how does that differ from the notification mechanism we have? Just
by not relying on the process scheduling? If yes then this revolves
around the implementation detail that you care about time-to-hot-add
vs. time-to-online. And that is a solveable problem - just allocate
memmaps from the hot-added memory.

As David said some of the memory cannot be onlined without further steps
(e.g. when it is standby as David called it) and then I fail to see how
eBPF help in any way.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
David Hildenbrand  writes:

> On 02/10/2018 15:47, Michal Hocko wrote:
...
>> 
>> Why do you need a generic hotplug rule in the first place? Why don't you
>> simply provide different set of rules for different usecases? Let users
>> decide which usecase they prefer rather than try to be clever which
>> almost always hits weird corner cases.
>> 
>
> Memory hotplug has to work as reliable as we can out of the box. Letting
> the user make simple decisions like "oh, I am on hyper-V, I want to
> online memory to the normal zone" does not feel right. But yes, we
> should definitely allow to make modifications.

Last time I was thinking about the imperfectness of the auto-online
solution we have and any other solution we're able to suggest an idea
came to my mind - what if we add an eBPF attach point to the
auto-onlining mechanism, effectively offloading decision-making to
userspace. We'll of course need to provide all required data (e.g. how
memory blocks are aligned with physical DIMMs as it makes no sense to
online part of DIMM as normal and the rest as movable as it's going to
be impossible to unplug such DIMM anyways).

-- 
Vitaly


Re: [PATCH v03 3/5] migration/memory: Add hotplug READD_MULTIPLE

2018-10-03 Thread Michael Bringmann
On 10/02/2018 04:03 PM, Tyrel Datwyler wrote:
> On 10/01/2018 05:59 AM, Michael Bringmann wrote:
>> migration/memory: This patch adds a new pseries hotplug action
>> for CPU and memory operations, PSERIES_HP_ELOG_ACTION_READD_MULTIPLE.
>> This is a variant of the READD operation which performs the action
>> upon multiple instances of the resource at one time.  The operation
>> is to be triggered by device-tree analysis of updates by RTAS events
>> analyzed by 'migration_store' during post-migration processing.  It
>> will be used for memory updates, initially.
>>
>> Signed-off-by: Michael Bringmann 
>> ---
>>  arch/powerpc/include/asm/rtas.h |1 +
>>  arch/powerpc/mm/drmem.c |1 +
>>  2 files changed, 2 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/rtas.h 
>> b/arch/powerpc/include/asm/rtas.h
>> index 71e393c..e510d82 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -320,6 +320,7 @@ struct pseries_hp_errorlog {
>>  #define PSERIES_HP_ELOG_ACTION_ADD  1
>>  #define PSERIES_HP_ELOG_ACTION_REMOVE   2
>>  #define PSERIES_HP_ELOG_ACTION_READD3
>> +#define PSERIES_HP_ELOG_ACTION_READD_MULTIPLE   4
> 
> I'm confused, you have only added a define and not the actual implementation. 
> I really think this should be squashed into your 4th patch where the 
> operation is actually implemented.

Okay.

> 
>>
>>  #define PSERIES_HP_ELOG_ID_DRC_NAME 1
>>  #define PSERIES_HP_ELOG_ID_DRC_INDEX2
>> diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
>> index fd2cae92..2228586 100644
>> --- a/arch/powerpc/mm/drmem.c
>> +++ b/arch/powerpc/mm/drmem.c
>> @@ -422,6 +422,7 @@ static void init_drmem_v2_lmbs(const __be32 *prop,
>>
>>  lmb->aa_index = dr_cell.aa_index;
>>  lmb->flags = dr_cell.flags;
>> +lmb->internal_flags = 0;
> 
> And this should have been squashed into the previous patch where you added 
> the internal_flags field to the lmb struct.

Okay.

> 
> -Tyrel 
> 
>>  }
>>  }
>>  }
>>
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH] migration/mm: Add WARN_ON to try_offline_node

2018-10-03 Thread Michael Bringmann
On 10/02/2018 02:45 PM, Tyrel Datwyler wrote:
> On 10/02/2018 11:13 AM, Michael Bringmann wrote:
>>
>>
>> On 10/02/2018 11:04 AM, Michal Hocko wrote:
>>> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
>>>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>>>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>>>>> [...]
>>>>>> When the device-tree affinity attributes have changed for memory,
>>>>>> the 'nid' affinity calculated points to a different node for the
>>>>>> memory block than the one used to install it, previously on the
>>>>>> source system.  The newly calculated 'nid' affinity may not yet
>>>>>> be initialized on the target system.  The current memory tracking
>>>>>> mechanisms do not record the node to which a memory block was
>>>>>> associated when it was added.  Nathan is looking at adding this
>>>>>> feature to the new implementation of LMBs, but it is not there
>>>>>> yet, and won't be present in earlier kernels without backporting a
>>>>>> significant number of changes.
>>>>>
>>>>> Then the patch you have proposed here just papers over a real issue, no?
>>>>> IIUC then you simply do not remove the memory if you lose the race.
>>>>
>>>> The problem occurs when removing memory after an affinity change
>>>> references a node that was previously unreferenced.  Other code
>>>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
>>>> node when adding memory to a system.  The 'removing memory' case is
>>>> specific to systems that perform LPM and allow device-tree changes.
>>>> The powerpc kernel does not have the option of accepting some PRRN
>>>> requests and accepting others.  It must perform them all.
>>>
>>> I am sorry, but you are still too cryptic for me. Either there is a
>>> correctness issue and the the patch doesn't really fix anything or the
>>> final race doesn't make any difference and then the ppc code should be
>>> explicit about that. Checking the node inside the hotplug core code just
>>> looks as a wrong layer to mitigate an arch specific problem. I am not
>>> saying the patch is a no-go but if anything we want a big fat comment
>>> explaining how this is possible because right now it just points to an
>>> incorrect API usage.
>>>
>>> That being said, this sounds pretty much ppc specific problem and I
>>> would _prefer_ it to be handled there (along with a big fat comment of
>>> course).
>>
>> Let me try again.  Regardless of the path to which we get to this condition,
>> we currently crash the kernel.  This patch changes that to a WARN_ON notice
>> and continues executing the kernel without shutting down the system.  I saw
>> the problem during powerpc testing, because that is the focus of my work.
>> There are other paths to this function besides powerpc.  I feel that the
>> kernel should keep running instead of halting.
> 
> This is still basically a hack to get around a known race. In itself this 
> patch is still worth while in that we shouldn't crash the kernel on a null 
> pointer dereference. However, I think the actual problem still needs to be 
> addressed. We shouldn't run any PRRN events for the source system on the 
> target after a migration. The device tree update should have taken care of 
> telling us about new affinities and what not. Can we just throw out any 
> queued PRRN events when we wake up on the target?

We are not talking about queued events provided on the source system, but about
new PRRN events sent by phyp to the kernel on the target system to update the
kernel state after migration.  No way to predict the content.

> 
> -Tyrel
>>
>> Regards,
>>

Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH v03 1/5] powerpc/drmem: Export 'dynamic-memory' loader

2018-10-03 Thread Michael Bringmann
On 10/02/2018 03:56 PM, Tyrel Datwyler wrote:
> On 10/01/2018 05:59 AM, Michael Bringmann wrote:
>> powerpc/drmem: Export many of the functions of DRMEM to parse
>> "ibm,dynamic-memory" and "ibm,dynamic-memory-v2" during hotplug
>> operations and for Post Migration events.
>>
>> Also modify the DRMEM initialization code to allow it to,
>>
>> * Be called after system initialization
>> * Provide a separate user copy of the LMB array that it produces
>> * Free the user copy upon request
>>
>> In addition, a couple of changes were made to make the creation
>> of additional copies of the LMB array more useful including,
>>
>> * Add new iterator to work through a pair of drmem_info arrays.
>> * Modify DRMEM code to replace usages of dt_root_addr_cells, and
>>   dt_mem_next_cell, as these are only available at first boot.
>>
>> Signed-off-by: Michael Bringmann 
>> ---
>>  arch/powerpc/include/asm/drmem.h |   15 
>>  arch/powerpc/mm/drmem.c  |   75 
>> --
>>  2 files changed, 70 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/drmem.h 
>> b/arch/powerpc/include/asm/drmem.h
>> index ce242b9..b0e70fd 100644
>> --- a/arch/powerpc/include/asm/drmem.h
>> +++ b/arch/powerpc/include/asm/drmem.h
>> @@ -35,6 +35,18 @@ struct drmem_lmb_info {
>>  		&drmem_info->lmbs[0],			\
>>  		&drmem_info->lmbs[drmem_info->n_lmbs - 1])
>>
>> +#define for_each_dinfo_lmb(dinfo, lmb)		\
>> +	for_each_drmem_lmb_in_range((lmb),		\
>> +		&dinfo->lmbs[0],			\
>> +		&dinfo->lmbs[dinfo->n_lmbs - 1])
>> +
>> +#define for_each_pair_dinfo_lmb(dinfo1, lmb1, dinfo2, lmb2)	\
>> +	for ((lmb1) = (&dinfo1->lmbs[0]),		\
>> +	     (lmb2) = (&dinfo2->lmbs[0]);		\
>> +	     ((lmb1) <= (&dinfo1->lmbs[dinfo1->n_lmbs - 1])) && \
>> +	     ((lmb2) <= (&dinfo2->lmbs[dinfo2->n_lmbs - 1]));	\
>> +	     (lmb1)++, (lmb2)++)
>> +
>>  /*
>>   * The of_drconf_cell_v1 struct defines the layout of the LMB data
>>   * specified in the ibm,dynamic-memory device tree property.
>> @@ -94,6 +106,9 @@ void __init walk_drmem_lmbs(struct device_node *dn,
>>  void (*func)(struct drmem_lmb *, const __be32 **));
>>  int drmem_update_dt(void);
>>
>> +struct drmem_lmb_info *drmem_lmbs_init(struct property *prop);
>> +void drmem_lmbs_free(struct drmem_lmb_info *dinfo);
>> +
>>  #ifdef CONFIG_PPC_PSERIES
>>  void __init walk_drmem_lmbs_early(unsigned long node,
>>  void (*func)(struct drmem_lmb *, const __be32 **));
>> diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
>> index 3f18036..13d2abb 100644
>> --- a/arch/powerpc/mm/drmem.c
>> +++ b/arch/powerpc/mm/drmem.c
>> @@ -20,6 +20,7 @@
>>
>>  static struct drmem_lmb_info __drmem_info;
>>  struct drmem_lmb_info *drmem_info = &__drmem_info;
>> +static int n_root_addr_cells;
> 
> What is the point of this new global? I see two places were it gets 
> initialized if it is null, and both those initializers simply set it to 
> "dt_root_addr_cells". I also checked the rest of the patches in the series 
> and none of those even reference this variable.

The point of these changes is to introduce code that can be used after
an LPAR migration.  The variable dt_root_addr_cells was observed to be
zero (0) after migration on powerpc.  As it is defined as,

./drivers/of/fdt.c:int __initdata dt_root_addr_cells;

its definition may be replaced by other code/variables after initial boot.
We needed the value to persist, and it was cheaper to cache it.
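In other words, the intent is simply to snapshot the value once while it is still valid; a tiny sketch of the idea (not the actual hunk):

```c
#include <linux/of_fdt.h>	/* declares dt_root_addr_cells (__initdata) */

/* dt_root_addr_cells cannot be relied upon after boot, so keep a private
 * copy that stays valid for post-migration parsing */
static int n_root_addr_cells;

static void cache_root_addr_cells(void)
{
	if (!n_root_addr_cells)
		n_root_addr_cells = dt_root_addr_cells;
}
```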

> 
>>
>>  u64 drmem_lmb_memory_max(void)
>>  {
>> @@ -193,12 +194,13 @@ int drmem_update_dt(void)
>>  return rc;
>>  }
>>
>> -static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
>> +static void read_drconf_v1_cell(struct drmem_lmb *lmb,
>> const __be32 **prop)
>>  {
>>  const __be32 *p = *prop;
>>
>> -	lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, &p);
>> +lmb->base_addr = of_read_number(p, n_root_addr_cells);
>> +p += n_root_addr_cells;
> 
> Unnecessary code churn do to the introduction of "n_root_addr_cells".'

See note above.

> 
>>  lmb->drc_index = of_read_number(p++, 1);
>>
>>  p++; /* skip reserved field */
>> @@ -209,7 +211,7 @@ static void __init read_drconf_v1_cell(struct drmem_lmb 
>> *lmb,
>>  *prop = p;
>>  }
>>
>> -static void __init __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 
>> *usm,
>> +static void __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 *usm,
>>  void (*func)(struct drmem_lmb *, const __be32 **))
>>  {
>>  struct drmem_lmb lmb;
>> @@ -225,13 +227,14 @@ static void __init __walk_drmem_v1_lmbs(const __be32 
>> *prop, const __be32 *usm,
>>  }
>>  }
>>
>> -static void __init read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
>> +static void read_drconf_v2_cell(struct of_drconf_cell_v2 

Re: [PATCH v9 0/3] powerpc: Detection and scheduler optimization for POWER9 bigcore

2018-10-03 Thread Gautham R Shenoy
Hello Dave,

On Mon, Oct 01, 2018 at 07:05:11AM -0700, Dave Hansen wrote:
> On 10/01/2018 06:16 AM, Gautham R. Shenoy wrote:
> > 
> > Patch 3: Creates a pair of sysfs attributes named
> >   /sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings
> >   and
> >   /sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings_list
> >   exposing the small-core siblings that share the L1 cache
> >   to the userspace.
> 
> I really don't think you've justified the existence of a new user/kernel
> interface here.  We already have information about threads share L1
> caches in here:
> 
>   /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list


Hmmm. My bad, I wasn't aware of this sysfs interface. We can use this
to share information about the L1 cache. And currently on the upstream
kernel we report an incorrect value on POWER9 SMT8 cores. I will fix
this. Thanks for calling this out.

> 
> The only question would be if anything would break because it assumes
> that all SMT siblings share all caches.  But, it breaks if your new
> interface is there or not; it's old software that we care about.
>

Up until POWER9 SMT8 cores, the assumption that all SMT siblings share
all caches held true. POWER9 SMT8 cores are the only exception.
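For reference, reading that existing interface needs nothing special; a tiny userspace sketch (cpu0/index0 are just example choices in the standard sysfs cache-topology layout):

```c
#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list", "r");

	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("CPUs sharing cpu0's index0 cache: %s", buf);
	fclose(f);
	return 0;
}
```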

--
Thanks and Regards
gautham.



Re: [PATCH] powerpc: use PTRRELOC during early init

2018-10-03 Thread Andreas Schwab
On Okt 03 2018, Christophe LEROY  wrote:

> Did you try my proposed fix https://patchwork.ozlabs.org/patch/977195/ ?

That works as well.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [RFC PATCH v3 4/7] powerpc: regain entire stack space

2018-10-03 Thread Christophe LEROY




Le 03/10/2018 à 09:07, Nicholas Piggin a écrit :

On Wed, 3 Oct 2018 08:45:25 +0200
Christophe LEROY  wrote:


Le 03/10/2018 à 08:30, Nicholas Piggin a écrit :

On Wed, 3 Oct 2018 07:52:59 +0200
Christophe LEROY  wrote:
   

Le 03/10/2018 à 07:34, Nicholas Piggin a écrit :

On Mon,  1 Oct 2018 12:30:25 + (UTC)
Christophe Leroy  wrote:
  

thread_info is not anymore in the stack, so the entire stack
can now be used.


Nice.
  


In the meantime, all pointers to the stacks are not anymore
pointers to thread_info so this patch changes them to void*


Wasn't this previously effectively already the case with patch
3/7? You had thread_info sized space left there, but it was not
used or initialized right? Does it make sense to move this part
of it to the previous patch?


Not really.

In 3/7 I changed the prototypes of two functions that really used the
pointer as a task pointer only.


I meant 2/7 here sorry.



Here it changes things that before 4/7 were really used as both stack
pointers and thread_info pointers.


And here I meant 3/7



What uses it as a thread_info pointer? It seems more like a stack
with some amount of unused space in it but that's all.


Before 3/7, we have

void do_softirq_own_stack(void)
{
struct thread_info *curtp, *irqtp;

curtp = current_thread_info();
irqtp = softirq_ctx[smp_processor_id()];
irqtp->task = curtp->task;
irqtp->flags = 0;
call_do_softirq(irqtp);
irqtp->task = NULL;

/* Set any flag that may have been set on the
 * alternate stack
 */
if (irqtp->flags)
	set_bits(irqtp->flags, &curtp->flags);
}

After 3/7, we have

   void do_softirq_own_stack(void)
   {
struct thread_info *irqtp;

irqtp = softirq_ctx[smp_processor_id()];
call_do_softirq(irqtp);
   }


So only now can we change irqtp to void*, can't we?


In patch 3 we can, right? That's what I mean by moving from
thread_info * to void * in patch 3 rather than 4.


Ah ok, that's what you meant. Sorry.



But if you prefer not to, it's fine. Maybe it keeps patch 3
a little smaller.


Yes indeed, that's the idea, keep patch 3 to the strict minimum and do 
cleanups afterwards.
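The cleanup then ends up looking roughly like this (a sketch only, assuming softirq_ctx[] and call_do_softirq() are switched to plain pointers in patch 4):

```c
void do_softirq_own_stack(void)
{
	/* the per-cpu softirq stack is now just raw stack space */
	void *irqsp = softirq_ctx[smp_processor_id()];

	call_do_softirq(irqsp);
}
```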


Christophe



Thanks,
Nick



Re: [PATCH kernel] cxl: Remove unused include

2018-10-03 Thread Frederic Barrat




Le 03/10/2018 à 09:46, Alexey Kardashevskiy a écrit :



On 28/09/2018 22:46, Michael Ellerman wrote:

Alexey Kardashevskiy  writes:

The included opal.h gives a wrong idea that CXL makes PPC OPAL calls
while it does not, so let's remove it.


But it does use eg.

   OPAL_PHB_CAPI_MODE_SNOOP_ON
   OPAL_PHB_CAPI_MODE_CAPI

Which come from opal-api.h via opal.h.

So you should at least include opal-api.h.



I'd say that since it includes pnv-pci.h (and this is why this patch
compiles), it should be a different set of values for powernv (which
would map 1:1 to OPAL_PHB_xxx). The powernv platform knowledge is
already intimate enough for a driver to have. Dunno, I found this opal.h
inclusion confusing, that's all.



I agree, it would be the cleaner way of doing it. It's going to be 
pretty low on my list though...


  Fred







cheers


diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index b66d832..8cbcbb7 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -17,7 +17,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  #include 
  #include 
--
2.11.0






Re: [PATCH kernel] cxl: Remove unused include

2018-10-03 Thread Alexey Kardashevskiy



On 28/09/2018 22:46, Michael Ellerman wrote:
> Alexey Kardashevskiy  writes:
>> The included opal.h gives a wrong idea that CXL makes PPC OPAL calls
>> while it does not, so let's remove it.
> 
> But it does use eg.
> 
>   OPAL_PHB_CAPI_MODE_SNOOP_ON
>   OPAL_PHB_CAPI_MODE_CAPI
> 
> Which come from opal-api.h via opal.h.
> 
> So you should at least include opal-api.h.


I'd say that since it includes pnv-pci.h (and this is why this patch
compiles), it should be a different set of values for powernv (which
would map 1:1 to OPAL_PHB_xxx). The powernv platform knowledge is
already intimate enough for a driver to have. Dunno, I found this opal.h
inclusion confusing, that's all.



> 
> cheers
> 
>> diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
>> index b66d832..8cbcbb7 100644
>> --- a/drivers/misc/cxl/pci.c
>> +++ b/drivers/misc/cxl/pci.c
>> @@ -17,7 +17,6 @@
>>  #include 
>>  #include 
>>  #include 
>> -#include 
>>  #include 
>>  #include 
>>  #include 
>> -- 
>> 2.11.0

-- 
Alexey


Re: [PATCH] migration/mm: Add WARN_ON to try_offline_node

2018-10-03 Thread Michal Hocko
On Tue 02-10-18 12:45:50, Tyrel Datwyler wrote:
> On 10/02/2018 11:13 AM, Michael Bringmann wrote:
> > 
> > 
> > On 10/02/2018 11:04 AM, Michal Hocko wrote:
> >> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
> >>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
> >>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
> >>>> [...]
> >>>>> When the device-tree affinity attributes have changed for memory,
> >>>>> the 'nid' affinity calculated points to a different node for the
> >>>>> memory block than the one used to install it, previously on the
> >>>>> source system.  The newly calculated 'nid' affinity may not yet
> >>>>> be initialized on the target system.  The current memory tracking
> >>>>> mechanisms do not record the node to which a memory block was
> >>>>> associated when it was added.  Nathan is looking at adding this
> >>>>> feature to the new implementation of LMBs, but it is not there
> >>>>> yet, and won't be present in earlier kernels without backporting a
> >>>>> significant number of changes.
> >>>>
> >>>> Then the patch you have proposed here just papers over a real issue, no?
> >>>> IIUC then you simply do not remove the memory if you lose the race.
> >>>
> >>> The problem occurs when removing memory after an affinity change
> >>> references a node that was previously unreferenced.  Other code
> >>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
> >>> node when adding memory to a system.  The 'removing memory' case is
> >>> specific to systems that perform LPM and allow device-tree changes.
> >>> The powerpc kernel does not have the option of accepting some PRRN
> >>> requests and accepting others.  It must perform them all.
> >>
> >> I am sorry, but you are still too cryptic for me. Either there is a
> >> correctness issue and the the patch doesn't really fix anything or the
> >> final race doesn't make any difference and then the ppc code should be
> >> explicit about that. Checking the node inside the hotplug core code just
> >> looks as a wrong layer to mitigate an arch specific problem. I am not
> >> saying the patch is a no-go but if anything we want a big fat comment
> >> explaining how this is possible because right now it just points to an
> >> incorrect API usage.
> >>
> >> That being said, this sounds pretty much ppc specific problem and I
> >> would _prefer_ it to be handled there (along with a big fat comment of
> >> course).
> > 
> > Let me try again.  Regardless of the path to which we get to this condition,
> > we currently crash the kernel.  This patch changes that to a WARN_ON notice
> > and continues executing the kernel without shutting down the system.  I saw
> > the problem during powerpc testing, because that is the focus of my work.
> > There are other paths to this function besides powerpc.  I feel that the
> > kernel should keep running instead of halting.
> 
> This is still basically a hack to get around a known race. In itself
> this patch is still worth while in that we shouldn't crash the kernel
> on a null pointer dereference. However, I think the actual problem
> still needs to be addressed. We shouldn't run any PRRN events for the
> source system on the target after a migration. The device tree update
> should have taken care of telling us about new affinities and what
> not. Can we just throw out any queued PRRN events when we wake up on
> the target?

And until a proper fix is developed can we have NODE_DATA test in the
affected code rather than pollute the generic code with something that
is essentially a wrong usage of the API? With a big fat warning
explaining what is going on here?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v3 4/7] powerpc: regain entire stack space

2018-10-03 Thread Nicholas Piggin
On Wed, 3 Oct 2018 08:45:25 +0200
Christophe LEROY  wrote:

> Le 03/10/2018 à 08:30, Nicholas Piggin a écrit :
> > On Wed, 3 Oct 2018 07:52:59 +0200
> > Christophe LEROY  wrote:
> >   
> >> Le 03/10/2018 à 07:34, Nicholas Piggin a écrit :  
> >>> On Mon,  1 Oct 2018 12:30:25 + (UTC)
> >>> Christophe Leroy  wrote:
> >>>  
>  thread_info is not anymore in the stack, so the entire stack
>  can now be used.  
> >>>
> >>> Nice.
> >>>  
> 
>  In the meantime, all pointers to the stacks are not anymore
>  pointers to thread_info so this patch changes them to void*  
> >>>
> >>> Wasn't this previously effectively already the case with patch
> >>> 3/7? You had thread_info sized space left there, but it was not
> >>> used or initialized right? Does it make sense to move this part
> >>> of it to the previous patch?  
> >>
> >> Not really.
> >>
> >> In 3/7 I changed the prototypes of two functions that really used the
> >> pointer as a task pointer only.  
> 
> I meant 2/7 here sorry.
> 
> >>
> >> Here it changes things that before 4/7 were really used as both stack
> >> pointers and thread_info pointers.  
> 
> And here I meant 3/7
> 
> > 
> > What uses it as a thread_info pointer? It seems more like a stack
> > with some amount of unused space in it but that's all.  
> 
> Before 3/7, we have
> 
> void do_softirq_own_stack(void)
> {
>   struct thread_info *curtp, *irqtp;
> 
>   curtp = current_thread_info();
>   irqtp = softirq_ctx[smp_processor_id()];
>   irqtp->task = curtp->task;
>   irqtp->flags = 0;
>   call_do_softirq(irqtp);
>   irqtp->task = NULL;
> 
>   /* Set any flag that may have been set on the
>* alternate stack
>*/
>   if (irqtp->flags)
>   set_bits(irqtp->flags, &curtp->flags);
> }
> 
> After 3/7, we have
> 
>   void do_softirq_own_stack(void)
>   {
>   struct thread_info *irqtp;
> 
>   irqtp = softirq_ctx[smp_processor_id()];
>   call_do_softirq(irqtp);
>   }
> 
> 
> So only now can we change irqtp to void*, can't we?

In patch 3 we can, right? That's what I mean by moving from
thread_info * to void * in patch 3 rather than 4.

But if you prefer not to, it's fine. Maybe it keeps patch 3
a little smaller.

Thanks,
Nick


Re: [RFC PATCH v3 4/7] powerpc: regain entire stack space

2018-10-03 Thread Christophe LEROY




Le 03/10/2018 à 08:30, Nicholas Piggin a écrit :

On Wed, 3 Oct 2018 07:52:59 +0200
Christophe LEROY  wrote:


Le 03/10/2018 à 07:34, Nicholas Piggin a écrit :

On Mon,  1 Oct 2018 12:30:25 + (UTC)
Christophe Leroy  wrote:
   

thread_info is not anymore in the stack, so the entire stack
can now be used.


Nice.
   


In the meantime, all pointers to the stacks are not anymore
pointers to thread_info so this patch changes them to void*


Wasn't this previously effectively already the case with patch
3/7? You had thread_info sized space left there, but it was not
used or initialized right? Does it make sense to move this part
of it to the previous patch?


Not really.

In 3/7 I changed the prototypes of two functions that really used the
pointer as a task pointer only.


I meant 2/7 here sorry.



Here it changes things that before 4/7 were really used as both stack
pointers and thread_info pointers.


And here I meant 3/7



What uses it as a thread_info pointer? It seems more like a stack
with some amount of unused space in it but that's all.


Before 3/7, we have

void do_softirq_own_stack(void)
{
struct thread_info *curtp, *irqtp;

curtp = current_thread_info();
irqtp = softirq_ctx[smp_processor_id()];
irqtp->task = curtp->task;
irqtp->flags = 0;
call_do_softirq(irqtp);
irqtp->task = NULL;

/* Set any flag that may have been set on the
 * alternate stack
 */
if (irqtp->flags)
	set_bits(irqtp->flags, &curtp->flags);
}

After 3/7, we have

 void do_softirq_own_stack(void)
 {
struct thread_info *irqtp;

irqtp = softirq_ctx[smp_processor_id()];
call_do_softirq(irqtp);
 }


So only now can we change irqtp to void*, can't we?



That said, I don't care to nitpick too much about where things go exactly;
if you like it better here, that's fine.


No worry, I may have missed something, your comments are always welcome.

Thanks
Christophe



Thanks,
Nick



Re: [RFC PATCH v3 4/7] powerpc: regain entire stack space

2018-10-03 Thread Nicholas Piggin
On Wed, 3 Oct 2018 07:52:59 +0200
Christophe LEROY  wrote:

> Le 03/10/2018 à 07:34, Nicholas Piggin a écrit :
> > On Mon,  1 Oct 2018 12:30:25 + (UTC)
> > Christophe Leroy  wrote:
> >   
> >> thread_info is not anymore in the stack, so the entire stack
> >> can now be used.  
> > 
> > Nice.
> >   
> >>
> >> In the meantime, all pointers to the stacks are not anymore
> >> pointers to thread_info so this patch changes them to void*  
> > 
> > Wasn't this previously effectively already the case with patch
> > 3/7? You had thread_info sized space left there, but it was not
> > used or initialized right? Does it make sense to move this part
> > of it to the previous patch?  
> 
> Not really.
> 
> In 3/7 I changed the prototypes of two functions that really used the 
> pointer as a task pointer only.
> 
> Here it changes things that before 4/7 were really used as both stack
> pointers and thread_info pointers.

What uses it as a thread_info pointer? It seems more like a stack
with some amount of unused space in it but that's all.

That said, I don't care to nitpick too much about where things go exactly;
if you like it better here, that's fine.

Thanks,
Nick


Re: [RFC PATCH v3 3/7] powerpc: Activate CONFIG_THREAD_INFO_IN_TASK

2018-10-03 Thread Nicholas Piggin
On Wed, 3 Oct 2018 08:04:49 +0200
Christophe LEROY  wrote:

> Le 03/10/2018 à 07:52, Nicholas Piggin a écrit :
> > On Wed, 3 Oct 2018 07:47:05 +0200
> > Christophe LEROY  wrote:
> >   
> >> Le 03/10/2018 à 07:30, Nicholas Piggin a écrit :  
> >>> On Mon,  1 Oct 2018 12:30:23 + (UTC)
> >>> Christophe Leroy  wrote:
> >>>  
>  This patch activates CONFIG_THREAD_INFO_IN_TASK which
>  moves the thread_info into task_struct.
> 
>  Moving thread_info into task_struct has the following advantages:
>  - It protects thread_info from corruption in the case of stack
>  overflows.
>  - Its address is harder to determine if stack addresses are
>  leaked, making a number of attacks more difficult.
> 
>  This has the following consequences:
>  - thread_info is now located at the top of task_struct.  
> >>>
> >>> "top"... I got confused for a minute thinking high address and
> >>> wondering how you can change CURRENT_THREAD_INFO just to point
> >>> to current :)  
> >>
> >> Would 'beginning' be less confusing ?  
> > 
> > Yes, good idea.
> >   
>  @@ -83,7 +83,13 @@ int is_cpu_dead(unsigned int cpu);
> /* 32-bit */
> extern int smp_hw_index[];
> 
>  -#define raw_smp_processor_id()  (current_thread_info()->cpu)
>  +/*
>  + * This is particularly ugly: it appears we can't actually get the 
>  definition
>  + * of task_struct here, but we need access to the CPU this task is 
>  running on.
>  + * Instead of using task_struct we're using _TASK_CPU which is 
>  extracted from
>  + * asm-offsets.h by kbuild to get the current processor ID.
>  + */
>  +#define raw_smp_processor_id()  (*(unsigned 
>  int*)((void*)current + _TASK_CPU))  
> >>>
> >>> This is clever but yes ugly. Can't you include asm-offsets.h? riscv
> >>> seems to.  
> >>
> >> riscv has a clean asm-offsets.h . Our's defines constant with the same
> >> name as those defined in other headers which are included in C files. So
> >> including asm-offsets in C files does create conflicts like:
> >>
> >> ./include/generated/asm-offsets.h:71:0: warning: "TASK_SIZE" redefined
> >>#define TASK_SIZE -2147483648 /* TASK_SIZE */
> >> ./arch/powerpc/include/asm/processor.h:95:0: note: this is the location
> >> of the previous definition
> >>#define TASK_SIZE (CONFIG_TASK_SIZE)
> >>
> >> ./include/generated/asm-offsets.h:98:0: warning: "NSEC_PER_SEC" redefined
> >>#define NSEC_PER_SEC 10 /* NSEC_PER_SEC */
> >> ./include/linux/time64.h:36:0: note: this is the location of the
> >> previous definition
> >>#define NSEC_PER_SEC 10L
> >>
> >> ./arch/powerpc/include/asm/nohash/32/pgtable.h:34:0: warning:
> >> "PGD_TABLE_SIZE" redefined
> >>#define PGD_TABLE_SIZE (sizeof(pgd_t) << PGD_INDEX_SIZE)
> >> ./include/generated/asm-offsets.h:101:0: note: this is the location of
> >> the previous definition
> >>#define PGD_TABLE_SIZE 256 /* PGD_TABLE_SIZE */
> >>
> >> ...  
> > 
> > Okay.
> >   
> >>
> >> In v2, I had a patch to fix those redundancies
> >> (https://patchwork.ozlabs.org/patch/974363/) but I found it unconvenient.  
> > 
> > Because of merge conflicts, or you did not like the new names?  
> 
> Both, because of the amount of changes it implies, and also because of 
> the new names. I find it quite convenient to be able to use same names 
> both in C and ASM.

Yeah that's true. I guess this is okay for a one-off hack.
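For context, a constant like _TASK_CPU comes out of the asm-offsets machinery; roughly like this (a sketch only, assuming the patch adds the entry to arch/powerpc/kernel/asm-offsets.c):

```c
/* sketch of an asm-offsets.c entry (assumption, not the quoted hunk) */
#include <linux/kbuild.h>
#include <linux/sched.h>

int main(void)
{
	/* emits "#define _TASK_CPU <offsetof(struct task_struct, cpu)>"
	 * into include/generated/asm-offsets.h */
	OFFSET(_TASK_CPU, task_struct, cpu);
	return 0;
}
```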

Thanks,
Nick


Re: [PATCH] powerpc: use PTRRELOC during early init

2018-10-03 Thread Christophe LEROY

Hi Andreas,

Le 03/10/2018 à 00:33, Andreas Schwab a écrit :

This fixes a crash on powerpc32 when using global data during early init
without relocating its address.

Fixes: 51c3c62b58 (powerpc: Avoid code patching freed init sections)
Signed-off-by: Andreas Schwab 
---
  arch/powerpc/lib/code-patching.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 6ae2777c22..6192fdae36 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -29,7 +29,7 @@ static int __patch_instruction(unsigned int *exec_addr, 
unsigned int instr,
int err;
  
  	/* Make sure we aren't patching a freed init section */

-   if (init_mem_is_free && init_section_contains(exec_addr, 4)) {
+	if (*PTRRELOC(&init_mem_is_free) && init_section_contains(exec_addr, 4)) {


That's not the best solution. In the past we already did our best to
separate early use of patch_instruction(), that's how
raw_patch_instruction() was born, see
https://patchwork.ozlabs.org/patch/840974/


Here, the idea is similar: this test should not apply to
raw_patch_instruction().


Did you try my proposed fix https://patchwork.ozlabs.org/patch/977195/ ?

Christophe


		pr_debug("Skipping init section patching addr: 0x%px\n", exec_addr);
		return 0;
	}



Re: [PATCH v3 24/33] KVM: PPC: Book3S HV: Emulate Privileged TLBIE for guest hypervisors

2018-10-03 Thread David Gibson
On Tue, Oct 02, 2018 at 09:31:23PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh 
> 
> When running a nested (L2) guest the guest (L1) hypervisor will use
> hypervisor privileged tlb invalidation instructions (to manage the
> partition scoped page tables) which will result in hypervisor
> emulation assistance interrupts. We emulate these instructions on behalf
> of the L1 guest.
> 
> The tlbie instruction can invalidate different scopes:
> 
> Invalidate TLB for a given target address:
> - This invalidates a single L2 -> L1 pte
> - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
>   address space which is being invalidated. This is because a single
>   L2 -> L1 pte may have been mapped with more than one pte in the
>   L2 -> L0 page tables.
> 
> Invalidate the entire TLB for a given LPID or for all LPIDs:
> - Invalidate the entire shadow_pgtable for a given nested guest, or
>   for all nested guests.
> 
> Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
> - We don't cache the PWC, so nothing to do
> 
> Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
> - Here we re-read the partition table entry and remove the nested state
>   for any nested guest for which the first doubleword of the partition
>   table entry is now zero.
> 
> This also implements the H_TLB_INVALIDATE hcall.  It takes as parameters
> the tlbie instruction word (of which the RIC, PRS and R fields are used),
> the rS value (giving the lpid, where required) and the rB value (giving
> the IS, AP and EPN values).
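
A small self-contained illustration of pulling those fields out of the
instruction word. The bit positions are assumed to match the
___PPC_RIC/___PPC_PRS/___PPC_R encodings in ppc-opcode.h (the same ones the
H_TLBIE_P1_ENC macro later in this series composes), so double-check them
against the tree; this is a toy decoder, not the emulation code itself:

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                /* build a tlbie word with RIC=2, PRS=0, R=1, as the macros would */
                uint32_t instr = (2u << 18) | (0u << 17) | (1u << 16);
                unsigned int ric = (instr >> 18) & 0x3; /* 0=TLB, 1=PWC, 2=all */
                unsigned int prs = (instr >> 17) & 0x1; /* process vs partition scope */
                unsigned int r   = (instr >> 16) & 0x1; /* radix */

                printf("ric=%u prs=%u r=%u\n", ric, prs, r);
                return 0;
        }

The emulation then dispatches on ric much as the bullet list above describes.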
> 
> [pau...@ozlabs.org - adapted to having the partition table in guest
> memory, added the H_TLB_INVALIDATE implementation.]
> 
> Signed-off-by: Suraj Jitindar Singh 
> Signed-off-by: Paul Mackerras 

Again, do we need this if we're moving to a paravirt tlbie?

> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  12 ++
>  arch/powerpc/include/asm/kvm_book3s.h |   1 +
>  arch/powerpc/include/asm/ppc-opcode.h |   1 +
>  arch/powerpc/kvm/book3s_emulate.c |   1 -
>  arch/powerpc/kvm/book3s_hv.c  |   3 +
>  arch/powerpc/kvm/book3s_hv_nested.c   | 210 
> +-
>  6 files changed, 225 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index b3520b5..66db23e 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -203,6 +203,18 @@ static inline unsigned int mmu_psize_to_shift(unsigned 
> int mmu_psize)
>   BUG();
>  }
>  
> +static inline unsigned int ap_to_shift(unsigned long ap)
> +{
> + int psize;
> +
> + for (psize = 0; psize < MMU_PAGE_COUNT; psize++) {
> + if (mmu_psize_defs[psize].ap == ap)
> + return mmu_psize_defs[psize].shift;
> + }
> +
> + return -1;
> +}
> +
>  static inline unsigned long get_sllp_encoding(int psize)
>  {
>   unsigned long sllp;
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 1d2286d..210e550 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -301,6 +301,7 @@ long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
>  void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
>  void kvmhv_release_all_nested(struct kvm *kvm);
>  long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
> +long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
>  int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu,
> u64 time_limit, unsigned long lpcr);
>  void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> b/arch/powerpc/include/asm/ppc-opcode.h
> index 665af14..6093bc8 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -104,6 +104,7 @@
>  #define OP_31_XOP_LHZUX 311
>  #define OP_31_XOP_MSGSNDP   142
>  #define OP_31_XOP_MSGCLRP   174
> +#define OP_31_XOP_TLBIE 306
>  #define OP_31_XOP_MFSPR 339
>  #define OP_31_XOP_LWAX  341
>  #define OP_31_XOP_LHAX  343
> diff --git a/arch/powerpc/kvm/book3s_emulate.c 
> b/arch/powerpc/kvm/book3s_emulate.c
> index 2654df2..8c7e933 100644
> --- a/arch/powerpc/kvm/book3s_emulate.c
> +++ b/arch/powerpc/kvm/book3s_emulate.c
> @@ -36,7 +36,6 @@
>  #define OP_31_XOP_MTSR   210
>  #define OP_31_XOP_MTSRIN 242
>  #define OP_31_XOP_TLBIEL 274
> -#define OP_31_XOP_TLBIE  306
>  /* Opcode is officially reserved, reuse it as sc 1 when sc 1 doesn't trap */
>  #define OP_31_XOP_FAKE_SC1   308
>  #define OP_31_XOP_SLBMTE 402
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 6629df4..3aa5d11e 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> 

Re: [PATCH v3 22/33] KVM: PPC: Book3S HV: Handle page fault for a nested guest

2018-10-03 Thread David Gibson
On Wed, Oct 03, 2018 at 03:39:13PM +1000, David Gibson wrote:
> On Tue, Oct 02, 2018 at 09:31:21PM +1000, Paul Mackerras wrote:
> > From: Suraj Jitindar Singh 
> > 
> > Consider a normal (L1) guest running under the main hypervisor (L0),
> > and then a nested guest (L2) running under the L1 guest which is acting
> > as a nested hypervisor. L0 has page tables to map the address space for
> > L1 providing the translation from L1 real address -> L0 real address;
> > 
> > L1
> > |
> > | (L1 -> L0)
> > |
> > ----> L0
> > 
> > There are also page tables in L1 used to map the address space for L2
> > providing the translation from L2 real address -> L1 real address. Since
> > the hardware can only walk a single level of page table, we need to
> > maintain in L0 a "shadow_pgtable" for L2 which provides the translation
> > from L2 real address -> L0 real address. Which looks like;
> > 
> > L2                              L2
> > |                               |
> > | (L2 -> L1)                    |
> > |                               |
> > ----> L1                        | (L2 -> L0)
> >       |                         |
> >       | (L1 -> L0)              |
> >       |                         |
> >       ----> L0                  --------> L0
> > 
> > When a page fault occurs while running a nested (L2) guest we need to
> > insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
> > do this we need to:
> > 
> > 1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
> >provide a page fault to L1 if this mapping doesn't exist.
> > 2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
> >or try to insert a pte for that mapping if it doesn't exist.
> > 3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
> > 
> > Once this mapping exists we can take rc faults when hardware is unable
> > to automatically set the reference and change bits in the pte. On these
> > we need to:
> > 
> > 1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
> >the fault down to L1.
> > 2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
> >host page.
> > 3. Set the rc bits in the L2 -> L0 pte.
> > 
> > As we reuse a large number of functions in book3s_64_mmu_radix.c for
> > this we also needed to refactor a number of these functions to take
> > an lpid parameter so that the correct lpid is used for tlb invalidations.
> > The functionality however has remained the same.
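
To make the shadow table idea above concrete, here is a tiny stand-alone
model of what the composed L2 -> L0 entries end up being. This is purely a
toy (made-up page numbers, no faults, no permissions), not the kernel code:

        #include <stdio.h>

        #define NPAGES 4

        /* L1's page table (lives in L1 memory): L2 real page -> L1 real page */
        static const int l2_to_l1[NPAGES] = { 2, 0, 3, 1 };
        /* L0's page table for the L1 guest: L1 real page -> L0 real page */
        static const int l1_to_l0[NPAGES] = { 7, 5, 6, 4 };

        int main(void)
        {
                int shadow[NPAGES];     /* L0-maintained shadow_pgtable: L2 -> L0 */
                int gfn;

                for (gfn = 0; gfn < NPAGES; gfn++)
                        shadow[gfn] = l1_to_l0[l2_to_l1[gfn]];
                for (gfn = 0; gfn < NPAGES; gfn++)
                        printf("L2 page %d -> L0 page %d\n", gfn, shadow[gfn]);
                return 0;
        }

The real code of course builds these entries lazily on fault, reflecting the
fault to L1 when the L2 -> L1 entry is missing (step 1 above) and inserting
the L1 -> L0 mapping on demand (step 2).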
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> > Signed-off-by: Paul Mackerras 
> 
> Some comments below, but no showstoppers, so,
> 
> Reviewed-by: David Gibson 

One more, again not a showstopper:

> > @@ -393,10 +396,20 @@ struct kvm_nested_guest *kvmhv_alloc_nested(struct 
> > kvm *kvm, unsigned int lpid)
> >   */
> >  static void kvmhv_release_nested(struct kvm_nested_guest *gp)
> >  {
> > +   struct kvm *kvm = gp->l1_host;
> > +
> > +   if (gp->shadow_pgtable) {
> > +   /*
> > +* No vcpu is using this struct and no call to
> > +* kvmhv_remove_nest_rmap can find this struct,

It's kind of dubious that you're referring to kvmhv_remove_nest_rmap()
one patch before it is introduced.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v3 29/33] KVM: PPC: Book3S HV: Handle differing endianness for H_ENTER_NESTED

2018-10-03 Thread David Gibson
On Tue, Oct 02, 2018 at 09:31:28PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh 
> 
> The hcall H_ENTER_NESTED takes as the two parameters the address in
> L1 guest memory of a hv_regs struct and a pt_regs struct which the
> L1 guest would like to use to run a L2 guest and in which are returned
> the exit state of the L2 guest.  For efficiency, these are in the
> endianness of the L1 guest, rather than being always big-endian as is
> usually the case for PAPR hypercalls.

Does that actually make a difference for efficiency?  I thought the
presence of the byte-reversing loads and stores meant byteswapping was
basically zero-cost on POWER.

> When reading/writing these structures, this patch handles the case
> where the endianness of the L1 guest differs from that of the L0
> hypervisor, by byteswapping the structures after reading and before
> writing them back.
> 
> Since all the fields of the pt_regs are of the same type, i.e.,
> unsigned long, we treat it as an array of unsigned longs.  The fields
> of struct hv_guest_state are not all the same, so its fields are
> byteswapped individually.
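
A stand-alone illustration of why the pt_regs case can be a single loop while
hv_guest_state cannot; toy code assuming a 64-bit unsigned long, not the
patch itself:

        #include <stdio.h>

        struct toy_regs { unsigned long gpr0, gpr1, link; };   /* all one width */

        int main(void)
        {
                struct toy_regs r = { 0x0123456789abcdefUL, 1, 2 };
                unsigned long *p;

                /* valid only because every member is an unsigned long */
                for (p = (unsigned long *)&r; p < (unsigned long *)(&r + 1); p++)
                        *p = __builtin_bswap64(*p);

                printf("%016lx %016lx %016lx\n", r.gpr0, r.gpr1, r.link);
                return 0;
        }

A struct mixing 32-bit and 64-bit members, like hv_guest_state with its lpid
and vcpu_token next to 64-bit SPR values, has no uniform stride, hence the
field-by-field swabs below.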
> 
> Signed-off-by: Suraj Jitindar Singh 
> Signed-off-by: Paul Mackerras 

Looks correct, though, so

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/kvm/book3s_hv_nested.c | 51 
> -
>  1 file changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 7b1088a..228dc11 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -51,6 +51,48 @@ void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct 
> hv_guest_state *hr)
>   hr->ppr = vcpu->arch.ppr;
>  }
>  
> +static void byteswap_pt_regs(struct pt_regs *regs)
> +{
> + unsigned long *addr = (unsigned long *) regs;
> +
> + for (; addr < ((unsigned long *) (regs + 1)); addr++)
> + *addr = swab64(*addr);
> +}
> +
> +static void byteswap_hv_regs(struct hv_guest_state *hr)
> +{
> + hr->version = swab64(hr->version);
> + hr->lpid = swab32(hr->lpid);
> + hr->vcpu_token = swab32(hr->vcpu_token);
> + hr->lpcr = swab64(hr->lpcr);
> + hr->pcr = swab64(hr->pcr);
> + hr->amor = swab64(hr->amor);
> + hr->dpdes = swab64(hr->dpdes);
> + hr->hfscr = swab64(hr->hfscr);
> + hr->tb_offset = swab64(hr->tb_offset);
> + hr->dawr0 = swab64(hr->dawr0);
> + hr->dawrx0 = swab64(hr->dawrx0);
> + hr->ciabr = swab64(hr->ciabr);
> + hr->hdec_expiry = swab64(hr->hdec_expiry);
> + hr->purr = swab64(hr->purr);
> + hr->spurr = swab64(hr->spurr);
> + hr->ic = swab64(hr->ic);
> + hr->vtb = swab64(hr->vtb);
> + hr->hdar = swab64(hr->hdar);
> + hr->hdsisr = swab64(hr->hdsisr);
> + hr->heir = swab64(hr->heir);
> + hr->asdr = swab64(hr->asdr);
> + hr->srr0 = swab64(hr->srr0);
> + hr->srr1 = swab64(hr->srr1);
> + hr->sprg[0] = swab64(hr->sprg[0]);
> + hr->sprg[1] = swab64(hr->sprg[1]);
> + hr->sprg[2] = swab64(hr->sprg[2]);
> + hr->sprg[3] = swab64(hr->sprg[3]);
> + hr->pidr = swab64(hr->pidr);
> + hr->cfar = swab64(hr->cfar);
> + hr->ppr = swab64(hr->ppr);
> +}
> +
>  static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
>struct hv_guest_state *hr)
>  {
> @@ -175,6 +217,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
> sizeof(struct hv_guest_state));
>   if (err)
>   return H_PARAMETER;
> + if (kvmppc_need_byteswap(vcpu))
> + byteswap_hv_regs(&l2_hv);
>   if (l2_hv.version != HV_GUEST_STATE_VERSION)
>   return H_P2;
>  
> @@ -183,7 +227,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
> sizeof(struct pt_regs));
>   if (err)
>   return H_PARAMETER;
> -
> + if (kvmppc_need_byteswap(vcpu))
> + byteswap_pt_regs(&l2_regs);
>   if (l2_hv.vcpu_token >= NR_CPUS)
>   return H_PARAMETER;
>  
> @@ -255,6 +300,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>   kvmhv_put_nested(l2);
>  
>   /* copy l2_hv_state and regs back to guest */
> + if (kvmppc_need_byteswap(vcpu)) {
> + byteswap_hv_regs(&l2_hv);
> + byteswap_pt_regs(&l2_regs);
> + }
>   err = kvm_vcpu_write_guest(vcpu, hv_ptr, &l2_hv,
>  sizeof(struct hv_guest_state));
>   if (err)

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v3 30/33] KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode

2018-10-03 Thread David Gibson
On Tue, Oct 02, 2018 at 09:31:29PM +1000, Paul Mackerras wrote:
> With this, the KVM-HV module can be loaded in a guest running under
> KVM-HV, and if the hypervisor supports nested virtualization, this
> guest can now act as a nested hypervisor and run nested guests.
> 
> This also adds some checks to inform userspace that HPT guests are not
> supported by nested hypervisors, and to prevent userspace from
> configuring a guest to use HPT mode.
> 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index f630e91..196bff1 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4237,6 +4237,10 @@ static int kvm_vm_ioctl_get_smmu_info_hv(struct kvm 
> *kvm,
>  {
>   struct kvm_ppc_one_seg_page_size *sps;
>  
> + /* If we're a nested hypervisor, we only support radix guests */
> + if (kvmhv_on_pseries())
> + return -EINVAL;
> +
>   /*
>* POWER7, POWER8 and POWER9 all support 32 storage keys for data.
>* POWER7 doesn't support keys for instruction accesses,
> @@ -4822,11 +4826,15 @@ static int kvmppc_core_emulate_mfspr_hv(struct 
> kvm_vcpu *vcpu, int sprn,
>  
>  static int kvmppc_core_check_processor_compat_hv(void)
>  {
> - if (!cpu_has_feature(CPU_FTR_HVMODE) ||
> - !cpu_has_feature(CPU_FTR_ARCH_206))
> - return -EIO;
> + if (cpu_has_feature(CPU_FTR_HVMODE) &&
> + cpu_has_feature(CPU_FTR_ARCH_206))
> + return 0;
>  
> - return 0;
> + /* Can run as nested hypervisor on POWER9 in radix mode. */
> + if (cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled())

Shouldn't we probe the parent hypervisor for ability to support nested
guests before we say "yes" here?

> + return 0;
> +
> + return -EIO;
>  }
>  
>  #ifdef CONFIG_KVM_XICS
> @@ -5144,6 +5152,10 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct 
> kvm_ppc_mmuv3_cfg *cfg)
>   if (radix && !radix_enabled())
>   return -EINVAL;
>  
> + /* If we're a nested hypervisor, we currently only support radix */
> + if (kvmhv_on_pseries() && !radix)
> + return -EINVAL;
> +
> + mutex_lock(&kvm->lock);
>   if (radix != kvm_is_radix(kvm)) {
>   if (kvm->arch.mmu_ready) {

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v3 23/33] KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings

2018-10-03 Thread David Gibson
On Tue, Oct 02, 2018 at 09:31:22PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh 
> 
> When a host (L0) page which is mapped into a (L1) guest is in turn
> mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
> so that these mappings can be retrieved later.
> 
> Whenever we create an entry in a shadow_pgtable for a nested guest we
> create a corresponding rmap entry and add it to the list for the
> L1 guest memslot at the index of the L1 guest page it maps. This means
> at the L1 guest memslot we end up with lists of rmaps.
> 
> When we are notified of a host page being invalidated which has been
> mapped through to a (L1) guest, we can then walk the rmap list for that
> guest page, and find and invalidate all of the corresponding
> shadow_pgtable entries.
> 
> In order to reduce memory consumption, we compress the information for
> each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
> for the guest real page frame number -- which will fit in a single
> unsigned long.  To avoid a scenario where a guest can trigger
> unbounded memory allocations, we scan the list when adding an entry to
> see if there is already an entry with the contents we need.  This can
> occur, because we don't ever remove entries from the middle of a list.
> 
> > A struct nested guest rmap is a list pointer and an rmap entry;
> > 
> >     ----------------
> >     | next pointer |
> >     ----------------
> >     | rmap entry   |
> >     ----------------
> > 
> > Thus the rmap pointer for each guest frame number in the memslot can be
> > either NULL, a single entry, or a pointer to a list of nested rmap entries.
> > 
> > gfn     memslot rmap array
> >     -------------------------
> >  0  | NULL                  |   (no rmap entry)
> >     -------------------------
> >  1  | single rmap entry     |   (rmap entry with low bit set)
> >     -------------------------
> >  2  | list head pointer     |   (list of rmap entries)
> >     -------------------------
> > 
> > The final entry always has the lowest bit set and is stored in the next
> > pointer of the last list entry, or as a single rmap entry.
> > With a list of rmap entries looking like;
> > 
> >     -----------------        -----------------        ---------------------
> >     | list head ptr  | ----> | next pointer  | ----> | single rmap entry |
> >     -----------------        -----------------        ---------------------
> >                              | rmap entry    |        | rmap entry        |
> >                              -----------------        ---------------------
> 
> Signed-off-by: Suraj Jitindar Singh 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/include/asm/kvm_book3s.h|   3 +
>  arch/powerpc/include/asm/kvm_book3s_64.h |  70 -
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   |  44 +++
>  arch/powerpc/kvm/book3s_hv.c |   1 +
>  arch/powerpc/kvm/book3s_hv_nested.c  | 130 
> ++-
>  5 files changed, 233 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index d983778..1d2286d 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -196,6 +196,9 @@ extern int kvmppc_mmu_radix_translate_table(struct 
> kvm_vcpu *vcpu, gva_t eaddr,
>   int table_index, u64 *pte_ret_p);
>  extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>   struct kvmppc_pte *gpte, bool data, bool iswrite);
> +extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
> + unsigned int shift, struct kvm_memory_slot *memslot,
> + unsigned int lpid);
>  extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
>   bool writing, unsigned long gpa,
>   unsigned int lpid);
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 5496152..38614f0 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -53,6 +53,66 @@ struct kvm_nested_guest {
>   struct kvm_nested_guest *next;
>  };
>  
> +/*
> + * We define a nested rmap entry as a single 64-bit quantity
> > + * 0xFFF0000000000000   12-bit lpid field
> > + * 0x000FFFFFFFFFF000   40-bit guest physical address field

I thought we could potentially support guests with >1TiB of RAM..?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v3 32/33] KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested

2018-10-03 Thread David Gibson
On Tue, Oct 02, 2018 at 09:31:31PM +1000, Paul Mackerras wrote:
> This adds code to call the H_TLB_INVALIDATE hypercall when running as
> a guest, in the cases where we need to invalidate TLBs (or other MMU
> caches) as part of managing the mappings for a nested guest.  Calling
> H_TLB_INVALIDATE is an alternative to doing the tlbie instruction and
> having it be emulated by our hypervisor.

Why is supporting two methods useful, rather than just settling on
one?  Having both just sounds like asking for untested and therefore
broken code paths lying around.

> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h |  5 +
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   | 30 +++--
>  arch/powerpc/kvm/book3s_hv_nested.c  | 33 
> +++-
>  3 files changed, 57 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> b/arch/powerpc/include/asm/kvm_book3s_64.h
> index f0fa14c..95280e2 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_PPC_PSERIES
>  static inline bool kvmhv_on_pseries(void)
> @@ -121,6 +122,10 @@ struct kvm_nested_guest *kvmhv_get_nested(struct kvm 
> *kvm, int l1_lpid,
>  void kvmhv_put_nested(struct kvm_nested_guest *gp);
>  int kvmhv_nested_next_lpid(struct kvm *kvm, int lpid);
>  
> +/* Encoding of first parameter for H_TLB_INVALIDATE */
> +#define H_TLBIE_P1_ENC(ric, prs, r)  (___PPC_RIC(ric) | ___PPC_PRS(prs) | \
> +  ___PPC_R(r))
> +
>  /* Power architecture requires HPT is at least 256kiB, at most 64TiB */
>  #define PPC_MIN_HPT_ORDER18
>  #define PPC_MAX_HPT_ORDER46
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index b74abdd..6c93f5c 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -201,17 +201,43 @@ static void kvmppc_radix_tlbie_page(struct kvm *kvm, 
> unsigned long addr,
>   unsigned int pshift, unsigned int lpid)
>  {
>   unsigned long psize = PAGE_SIZE;
> + int psi;
> + long rc;
> + unsigned long rb;
>  
>   if (pshift)
>   psize = 1UL << pshift;
> + else
> + pshift = PAGE_SHIFT;
>  
>   addr &= ~(psize - 1);
> - radix__flush_tlb_lpid_page(lpid, addr, psize);
> +
> + if (!kvmhv_on_pseries()) {
> + radix__flush_tlb_lpid_page(lpid, addr, psize);
> + return;
> + }
> +
> + psi = shift_to_mmu_psize(pshift);
> + rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
> + rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
> + lpid, rb);
> + if (rc)
> + pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
>  }
>  
>  static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid)
>  {
> - radix__flush_pwc_lpid(lpid);
> + long rc;
> +
> + if (!kvmhv_on_pseries()) {
> + radix__flush_pwc_lpid(lpid);
> + return;
> + }
> +
> + rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
> + lpid, TLBIEL_INVAL_SET_LPID);
> + if (rc)
> + pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
>  }
>  
>  static unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 914ca78..81c81d51 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -368,17 +368,32 @@ void kvmhv_nested_exit(void)
>   }
>  }
>  
> +static void kvmhv_flush_lpid(unsigned int lpid)
> +{
> + long rc;
> +
> + if (!kvmhv_on_pseries()) {
> + radix__flush_tlb_lpid(lpid);
> + return;
> + }
> +
> + rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
> + lpid, TLBIEL_INVAL_SET_LPID);
> + if (rc)
> + pr_err("KVM: TLB LPID invalidation hcall failed, rc=%ld\n", rc);
> +}
> +
>  void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1)
>  {
> - if (cpu_has_feature(CPU_FTR_HVMODE)) {
> + if (!kvmhv_on_pseries()) {
>   mmu_partition_table_set_entry(lpid, dw0, dw1);
> - } else {
> - pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
> - pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
> - /* this will be emulated, L0 will do the necessary barriers */
> - asm volatile(PPC_TLBIE_5(%0, %1, 2, 0, 1) : :
> -  "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
> + return;
>   }
> +
> + pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
> + 

Re: [PATCH v3 33/33] KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization

2018-10-03 Thread David Gibson
On Tue, Oct 02, 2018 at 09:31:32PM +1000, Paul Mackerras wrote:
> With this, userspace can enable a KVM-HV guest to run nested guests
> under it.
> 
> Signed-off-by: Paul Mackerras 
> ---
>  Documentation/virtual/kvm/api.txt  | 14 ++
>  arch/powerpc/include/asm/kvm_ppc.h |  1 +
>  arch/powerpc/kvm/book3s_hv.c   | 17 +
>  arch/powerpc/kvm/powerpc.c | 12 
>  include/uapi/linux/kvm.h   |  1 +
>  5 files changed, 45 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 017d851..a2d4832 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4522,6 +4522,20 @@ hpage module parameter is not set to 1, -EINVAL is 
> returned.
>  While it is generally possible to create a huge page backed VM without
>  this capability, the VM will not be able to run.
>  
> +7.15 KVM_CAP_PPC_NESTED_HV
> +
> +Architectures: ppc
> +Parameters: enable flag (0 to disable, non-zero to enable)
> +Returns: 0 on success, -EINVAL when the implementation doesn't support
> +nested-HV virtualization.
> +
> +HV-KVM on POWER9 and later systems allows for "nested-HV"
> +virtualization, which provides a way for a guest VM to run guests that
> +can run using the CPU's supervisor mode (privileged non-hypervisor
> +state).  Enabling this capability on a VM depends on the CPU having
> +the necessary functionality and on the facility being enabled with a
> +kvm-hv module parameter.
> +
>  8. Other capabilities.
>  --
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index 245e564..80f0091 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -327,6 +327,7 @@ struct kvmppc_ops {
>   int (*set_smt_mode)(struct kvm *kvm, unsigned long mode,
>   unsigned long flags);
>   void (*giveup_ext)(struct kvm_vcpu *vcpu, ulong msr);
> + int (*enable_nested)(struct kvm *kvm, bool enable);
>  };
>  
>  extern struct kvmppc_ops *kvmppc_hv_ops;
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 196bff1..a5b3862 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -118,6 +118,11 @@ module_param_cb(h_ipi_redirect, &module_param_ops, &h_ipi_redirect, 0644);
>  MODULE_PARM_DESC(h_ipi_redirect, "Redirect H_IPI wakeup to a free host 
> core");
>  #endif
>  
> +/* If set, guests are allowed to create and control nested guests */
> +static bool enable_nested = true;
> +module_param(enable_nested, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(enable_nested, "Enable nested virtualization (only on 
> POWER9)");

I'd suggest calling the module parameter just "nested" to match x86.

>  /* If set, the threads on each CPU core have to be in the same MMU mode */
>  static bool no_mixing_hpt_and_radix;
>  
> @@ -5188,6 +5193,17 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct 
> kvm_ppc_mmuv3_cfg *cfg)
>   return err;
>  }
>  
> +static int kvmhv_enable_nested(struct kvm *kvm, bool enable)
> +{
> + if (!(enable_nested && cpu_has_feature(CPU_FTR_ARCH_300)))
> + return -EINVAL;

Maybe EPERM, rather than EINVAL.  There's nothing invalid about the
ioctl() parameters - we just can't do what they want.

> +
> + /* kvm == NULL means the caller is testing if the capability exists */
> + if (kvm)
> + kvm->arch.nested_enable = enable;
> + return 0;
> +}
> +
>  static struct kvmppc_ops kvm_ops_hv = {
>   .get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
>   .set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
> @@ -5227,6 +5243,7 @@ static struct kvmppc_ops kvm_ops_hv = {
>   .configure_mmu = kvmhv_configure_mmu,
>   .get_rmmu_info = kvmhv_get_rmmu_info,
>   .set_smt_mode = kvmhv_set_smt_mode,
> + .enable_nested = kvmhv_enable_nested,
>  };
>  
>  static int kvm_init_subcore_bitmap(void)
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index eba5756..449ae1d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -596,6 +596,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>   case KVM_CAP_PPC_MMU_HASH_V3:
>   r = !!(hv_enabled && cpu_has_feature(CPU_FTR_ARCH_300));
>   break;
> + case KVM_CAP_PPC_NESTED_HV:
> + r = !!(hv_enabled && kvmppc_hv_ops->enable_nested &&
> +!kvmppc_hv_ops->enable_nested(NULL, false));
> + break;
>  #endif
>   case KVM_CAP_SYNC_MMU:
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> @@ -2114,6 +2118,14 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   r = kvm->arch.kvm_ops->set_smt_mode(kvm, mode, flags);
>   break;
>   }
> +
> + case KVM_CAP_PPC_NESTED_HV:
> + r = -EINVAL;
> + if (!is_kvmppc_hv_enabled(kvm) ||
> + 

Re: [PATCH v3 28/33] KVM: PPC: Book3S HV: Sanitise hv_regs on nested guest entry

2018-10-03 Thread David Gibson
On Tue, Oct 02, 2018 at 09:31:27PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh 
> 
> restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
> nested (L2) guest into the vcpu structure. We need to sanitise these
> values to ensure we don't let the L1 guest hypervisor do things we don't
> want it to.
> 
> We don't let data address watchpoints or completed instruction address
> breakpoints be set to match in hypervisor state.
> 
> We also don't let L1 enable features in the hypervisor facility status
> and control register (HFSCR) for L2 which we have disabled for L1. That
> is L2 will get the subset of features which the L0 hypervisor has
> enabled for L1 and the features L1 wants to enable for L2. This could
> mean we give L1 a hypervisor facility unavailable interrupt for a
> facility it thinks it has enabled, however it shouldn't have enabled a
> facility it itself doesn't have for the L2 guest.
> 
> We sanitise the registers when copying in the L2 hv_regs. We don't need
> to sanitise when copying back the L1 hv_regs since these shouldn't be
> able to contain invalid values as they're just what was copied out.
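
A quick stand-alone check of the HFSCR rule described above; the toy values
are made up, only the mask expression mirrors the patch:

        #include <stdio.h>
        #include <stdint.h>

        #define HFSCR_INTR_CAUSE        (0xffULL << 56) /* top byte: interrupt cause */

        int main(void)
        {
                uint64_t l1_hfscr   = 0x00000000000000aaULL;    /* what L0 granted L1 */
                uint64_t l2_request = 0xab000000000000ffULL;    /* what L1 asks for L2 */
                uint64_t l2_hfscr   = l2_request & (HFSCR_INTR_CAUSE | l1_hfscr);

                /* prints 0xab000000000000aa: cause byte kept, features clamped to L1's */
                printf("L2 HFSCR = 0x%016llx\n", (unsigned long long)l2_hfscr);
                return 0;
        }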
> 
> Signed-off-by: Suraj Jitindar Singh 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/reg.h  |  1 +
>  arch/powerpc/kvm/book3s_hv_nested.c | 17 +
>  2 files changed, 18 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 9c42abf..47489f6 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -415,6 +415,7 @@
>  #define   HFSCR_DSCR __MASK(FSCR_DSCR_LG)
>  #define   HFSCR_VECVSX   __MASK(FSCR_VECVSX_LG)
>  #define   HFSCR_FP   __MASK(FSCR_FP_LG)
> +#define   HFSCR_INTR_CAUSE (ASM_CONST(0xFF) << 56)   /* interrupt cause */
>  #define SPRN_TAR 0x32f   /* Target Address Register */
>  #define SPRN_LPCR0x13E   /* LPAR Control Register */
>  #define   LPCR_VPM0  ASM_CONST(0x8000)
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 7656cb3..7b1088a 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -86,6 +86,22 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, 
> int trap,
>   }
>  }
>  
> +static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state 
> *hr)
> +{
> + /*
> +  * Don't let L1 enable features for L2 which we've disabled for L1,
> +  * but preserve the interrupt cause field.
> +  */
> + hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
> +
> + /* Don't let data address watchpoint match in hypervisor state */
> + hr->dawrx0 &= ~DAWRX_HYP;
> +
> + /* Don't let completed instruction address breakpt match in HV state */
> + if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> + hr->ciabr &= ~CIABR_PRIV;
> +}
> +
>  static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
>  {
>   struct kvmppc_vcore *vc = vcpu->arch.vcore;
> @@ -198,6 +214,7 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
>   LPCR_LPES | LPCR_MER;
>   lpcr = (vc->lpcr & ~mask) | (l2_hv.lpcr & mask);
> + sanitise_hv_regs(vcpu, _hv);
>   restore_hv_regs(vcpu, _hv);
>  
>   vcpu->arch.ret = RESUME_GUEST;

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [RFC PATCH v3 5/7] powerpc: 'current_set' is now a table of task_struct pointers

2018-10-03 Thread Nicholas Piggin
On Wed, 3 Oct 2018 08:00:43 +0200
Christophe LEROY  wrote:

> On 03/10/2018 at 07:41, Nicholas Piggin wrote:
> > On Mon,  1 Oct 2018 12:30:27 +0000 (UTC)
> > Christophe Leroy  wrote:
> >   
> >> The table of pointers 'current_set' has been used for retrieving
> >> the stack and current. They used to be thread_info pointers as
> >> they were pointing to the stack and current was taken from the
> >> 'task' field of the thread_info.
> >>
> >> Now, the pointers of the 'current_set' table are both pointers
> >> to task_struct and pointers to thread_info.
> >>
> >> As they are used to get current, and the stack pointer is
> >> retrieved from current's stack field, this patch changes
> >> their type to task_struct, and renames secondary_ti to
> >> secondary_current.  
> > 
> > I'm not sure if current_set is actually needed is it? Because
> > 64-bit already initializes paca->ksave / PACAKSAVE. That might
> > be a cleanup to do after your series.  
> 
> head_64.S contains:
> 
> __secondary_start:
>   /* Set thread priority to MEDIUM */
>   HMT_MEDIUM
> 
>   /* Initialize the kernel stack */
>   LOAD_REG_ADDR(r3, current_set)
>   sldir28,r24,3   /* get current_set[cpu#] */
>   ldx r14,r3,r28
>   addir14,r14,THREAD_SIZE-STACK_FRAME_OVERHEAD
>   std r14,PACAKSAVE(r13)

Right, I don't *think* that's needed because boot CPU should already
have set PACAKSAVE before starting secondaries here. ld r14,PACAKSAVE
should have the same result I think.

But never mind that for your series, just something I saw that could
be cleaned up.

Thanks,
Nick


Re: [RFC PATCH v3 3/7] powerpc: Activate CONFIG_THREAD_INFO_IN_TASK

2018-10-03 Thread Christophe LEROY




On 03/10/2018 at 07:52, Nicholas Piggin wrote:

On Wed, 3 Oct 2018 07:47:05 +0200
Christophe LEROY  wrote:


On 03/10/2018 at 07:30, Nicholas Piggin wrote:

On Mon,  1 Oct 2018 12:30:23 +0000 (UTC)
Christophe Leroy  wrote:
   

This patch activates CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

This has the following consequences:
- thread_info is now located at the top of task_struct.


"top"... I got confused for a minute thinking high address and
wondering how you can change CURRENT_THREAD_INFO just to point
to current :)


Would 'beginning' be less confusing?


Yes, good idea.
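
For reference, with CONFIG_THREAD_INFO_IN_TASK the generic header resolves
current_thread_info() to a plain cast of current, which is exactly why the
thread_info has to sit at the beginning of task_struct:

        /* include/linux/thread_info.h, CONFIG_THREAD_INFO_IN_TASK=y */
        #define current_thread_info() ((struct thread_info *)current)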


@@ -83,7 +83,13 @@ int is_cpu_dead(unsigned int cpu);
   /* 32-bit */
   extern int smp_hw_index[];
   
-#define raw_smp_processor_id()	(current_thread_info()->cpu)

+/*
+ * This is particularly ugly: it appears we can't actually get the definition
+ * of task_struct here, but we need access to the CPU this task is running on.
+ * Instead of using task_struct we're using _TASK_CPU which is extracted from
+ * asm-offsets.h by kbuild to get the current processor ID.
+ */
+#define raw_smp_processor_id() (*(unsigned int*)((void*)current + _TASK_CPU))


This is clever but yes ugly. Can't you include asm-offsets.h? riscv
seems to.


riscv has a clean asm-offsets.h. Ours defines constants with the same
names as those defined in other headers which are included in C files. So
including asm-offsets in C files does create conflicts like:

./include/generated/asm-offsets.h:71:0: warning: "TASK_SIZE" redefined
   #define TASK_SIZE -2147483648 /* TASK_SIZE */
./arch/powerpc/include/asm/processor.h:95:0: note: this is the location
of the previous definition
   #define TASK_SIZE (CONFIG_TASK_SIZE)

./include/generated/asm-offsets.h:98:0: warning: "NSEC_PER_SEC" redefined
   #define NSEC_PER_SEC 1000000000 /* NSEC_PER_SEC */
./include/linux/time64.h:36:0: note: this is the location of the
previous definition
   #define NSEC_PER_SEC 1000000000L

./arch/powerpc/include/asm/nohash/32/pgtable.h:34:0: warning:
"PGD_TABLE_SIZE" redefined
   #define PGD_TABLE_SIZE (sizeof(pgd_t) << PGD_INDEX_SIZE)
./include/generated/asm-offsets.h:101:0: note: this is the location of
the previous definition
   #define PGD_TABLE_SIZE 256 /* PGD_TABLE_SIZE */

...


Okay.



In v2, I had a patch to fix those redundancies
(https://patchwork.ozlabs.org/patch/974363/) but I found it inconvenient.


Because of merge conflicts, or you did not like the new names?


Both, because of the amount of changes it implies, and also because of 
the new names. I find it quite convenient to be able to use the same names 
both in C and ASM. And I didn't want my series to imply big-bang changes in 
unrelated or not directly related topics.


Christophe



Thanks,
Nick



Re: [RFC PATCH v3 7/7] powerpc/64: Modify CURRENT_THREAD_INFO()

2018-10-03 Thread Christophe LEROY




On 03/10/2018 at 07:44, Nicholas Piggin wrote:

On Mon,  1 Oct 2018 12:30:31 +0000 (UTC)
Christophe Leroy  wrote:


CURRENT_THREAD_INFO() now uses the PACA to retrieve 'current' pointer,
it doesn't use 'sp' anymore.


Can you remove this too now? I think it will be clearer what's going on
and easier to read once everyone remembers current is the same offset as
current thread_info.


Ok I will.

Christophe



Overall nice series, thanks for doing this.

Thanks,
Nick



Re: [RFC PATCH v3 5/7] powerpc: 'current_set' is now a table of task_struct pointers

2018-10-03 Thread Christophe LEROY




On 03/10/2018 at 07:41, Nicholas Piggin wrote:

On Mon,  1 Oct 2018 12:30:27 +0000 (UTC)
Christophe Leroy  wrote:


The table of pointers 'current_set' has been used for retrieving
the stack and current. They used to be thread_info pointers as
they were pointing to the stack and current was taken from the
'task' field of the thread_info.

Now, the pointers of the 'current_set' table are both pointers
to task_struct and pointers to thread_info.

As they are used to get current, and the stack pointer is
retrieved from current's stack field, this patch changes
their type to task_struct, and renames secondary_ti to
secondary_current.


I'm not sure if current_set is actually needed is it? Because
64-bit already initializes paca->ksave / PACAKSAVE. That might
be a cleanup to do after your series.


head_64.S contains:

__secondary_start:
/* Set thread priority to MEDIUM */
HMT_MEDIUM

/* Initialize the kernel stack */
LOAD_REG_ADDR(r3, current_set)
sldir28,r24,3   /* get current_set[cpu#] */
ldx r14,r3,r28
addir14,r14,THREAD_SIZE-STACK_FRAME_OVERHEAD
std r14,PACAKSAVE(r13)


32-bit doesn't seem to use it; it only uses secondary_ti.



Reviewed-by: Nicholas Piggin 



Christophe