Re: [PATCH] powerpc/highmem: change BUG_ON() to WARN_ON()

2019-03-20 Thread Michael Ellerman
Christophe Leroy  writes:
> In arch/powerpc/mm/highmem.c, BUG_ON() is called only when
> CONFIG_DEBUG_HIGHMEM is selected. This means the BUG_ON() is
> not vital and can be replaced by a WARN_ON().
>
> At the same time, use IS_ENABLED() instead of #ifdef to clean up a bit.
>
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/highmem.c | 12 
>  1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/mm/highmem.c b/arch/powerpc/mm/highmem.c
> index 82a0e37557a5..b68c9f20fbdf 100644
> --- a/arch/powerpc/mm/highmem.c
> +++ b/arch/powerpc/mm/highmem.c
> @@ -56,7 +54,7 @@ EXPORT_SYMBOL(kmap_atomic_prot);
>  void __kunmap_atomic(void *kvaddr)
>  {
>   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
> - int type __maybe_unused;
> + int type;

Why don't we move type into the block below?

eg:

> @@ -66,12 +64,11 @@ void __kunmap_atomic(void *kvaddr)
>  
-   type = kmap_atomic_idx();
>  
> -#ifdef CONFIG_DEBUG_HIGHMEM
> - {
> + if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM)) {
int type = kmap_atomic_idx();
>   unsigned int idx;
>  
>   idx = type + KM_TYPE_NR * smp_processor_id();
> - BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
> + WARN_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
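
Putting the suggestion together, the resulting function might look roughly
like this (a sketch based on the current arch/powerpc/mm/highmem.c code,
not the final patch):

void __kunmap_atomic(void *kvaddr)
{
	unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;

	if (vaddr < __fix_to_virt(FIX_KMAP_END)) {
		pagefault_enable();
		preempt_enable();
		return;
	}

	if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM)) {
		int type = kmap_atomic_idx();
		unsigned int idx = type + KM_TYPE_NR * smp_processor_id();

		WARN_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));

		/*
		 * Force other mappings to Oops if they try to access this
		 * pte without first remapping it.
		 */
		pte_clear(&init_mm, vaddr, kmap_pte - idx);
		local_flush_tlb_page(NULL, vaddr);
	}

	kmap_atomic_idx_pop();
	pagefault_enable();
	preempt_enable();
}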


cheers


Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section

2019-03-20 Thread Michael Ellerman
Catalin Marinas  writes:
> On Thu, Mar 21, 2019 at 12:15:46AM +1100, Michael Ellerman wrote:
>> Catalin Marinas  writes:
>> > On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote:
>> >> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void)
>> >>  
>> >>   /* data/bss scanning */
>> >>   scan_large_block(_sdata, _edata);
>> >> - scan_large_block(__bss_start, __bss_stop);
>> >> +
>> >> + if (bss_hole_start) {
>> >> + scan_large_block(__bss_start, bss_hole_start);
>> >> + scan_large_block(bss_hole_stop, __bss_stop);
>> >> + } else {
>> >> + scan_large_block(__bss_start, __bss_stop);
>> >> + }
>> >> +
>> >>   scan_large_block(__start_ro_after_init, __end_ro_after_init);
>> >
>> > I'm not a fan of this approach but I couldn't come up with anything
>> > better. I was hoping we could check for PageReserved() in scan_block()
>> > but on arm64 it ends up not scanning the .bss at all.
>> >
>> > Until another user appears, I'm ok with this patch.
>> >
>> > Acked-by: Catalin Marinas 
>> 
>> I actually would like to rework this kvm_tmp thing to not be in bss at
>> all. It's a bit of a hack and is incompatible with strict RWX.
>> 
>> If we size it a bit more conservatively we can hopefully just reserve
>> some space in the text section for it.
>> 
>> I'm not going to have time to work on that immediately though, so if
>> people want this fixed now then this patch could go in as a temporary
>> solution.
>
> I think I have a simpler idea. Kmemleak allows punching holes in
> allocated objects, so just turn the data/bss sections into dedicated
> kmemleak objects. This happens when kmemleak is initialised, before the
> initcalls are invoked. The kvm_free_tmp() would just free the
> corresponding part of the bss.
>
> Patch below, only tested briefly on arm64. Qian, could you give it a try
> on powerpc? Thanks.
>
> 8<--
> diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
> index 683b5b3805bd..c4b8cb3c298d 100644
> --- a/arch/powerpc/kernel/kvm.c
> +++ b/arch/powerpc/kernel/kvm.c
> @@ -712,6 +712,8 @@ static void kvm_use_magic_page(void)
>  
>  static __init void kvm_free_tmp(void)
>  {
> +	kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
> +			   ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
> 	free_reserved_area(&kvm_tmp[kvm_tmp_index],
> 			   &kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
>  }
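
For context, a rough sketch of the other half of the idea: registering the
data/bss sections as ordinary kmemleak objects at init time, so that holes
can later be punched out of them with kmemleak_free_part(). This is an
illustration using the public kmemleak_alloc() API rather than the exact
hunk from the patch above; a min_count of 0 means the sections are scanned
for pointers but never reported as leaks themselves:

	/* somewhere in kmemleak_init(), once the allocator is up */
	kmemleak_alloc(_sdata, _edata - _sdata, 0, GFP_ATOMIC);
	kmemleak_alloc(__bss_start, __bss_stop - __bss_start, 0, GFP_ATOMIC);
	kmemleak_alloc(__start_ro_after_init,
		       __end_ro_after_init - __start_ro_after_init,
		       0, GFP_ATOMIC);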

Fine by me as long as it works (sounds like it does).

Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [RFC PATCH 1/1] KVM: PPC: Report single stepping capability

2019-03-20 Thread Alexey Kardashevskiy



On 21/03/2019 05:39, Fabiano Rosas wrote:
> When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request
> the next instruction to be single stepped via the
> KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure.
> 
> We currently don't have support for guest single stepping implemented
> in Book3S HV.
> 
> This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order
> to inform userspace about the state of single stepping support.
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  arch/powerpc/kvm/powerpc.c | 5 +
>  include/uapi/linux/kvm.h   | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 8885377ec3e0..5ba990b0ec74 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -538,6 +538,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>   case KVM_CAP_IMMEDIATE_EXIT:
>   r = 1;
>   break;
> + case KVM_CAP_PPC_GUEST_DEBUG_SSTEP:
> +#ifdef CONFIG_BOOKE


In the cover letter (which is not really required for a single patch)
you say the capability will be present for BookE and PR KVM (which is
Book3S), but here it is BookE only. Is that intentional?

Also, you need to update Documentation/virtual/kvm/api.txt for the new
capability. After reading it I started wondering whether we could just
use the existing KVM_CAP_GUEST_DEBUG_HW_BPS.
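
For illustration, one way the check could cover both BookE and Book3S PR
while leaving the capability unreported for Book3S HV (a sketch only;
whether to key this off CONFIG options or the VM type is up to the
author):

	case KVM_CAP_PPC_GUEST_DEBUG_SSTEP:
		/*
		 * Sketch: single stepping is implemented for BookE and
		 * Book3S PR, but not (yet) for Book3S HV.
		 */
#if defined(CONFIG_BOOKE) || defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
		r = 1;
#else
		r = 0;
#endif
		break;

On a kernel built with both PR and HV support, a run-time check on the VM
type would be needed instead of the compile-time test above.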


> + r = 1;
> + break;
> +#endif
>   case KVM_CAP_PPC_PAIRED_SINGLES:
>   case KVM_CAP_PPC_OSI:
>   case KVM_CAP_PPC_GET_PVINFO:
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6d4ea4b6c922..33e8a4db867e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
>  #define KVM_CAP_HYPERV_CPUID 167
> +#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 168
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> 

-- 
Alexey


Re: [PATCH 1/2] ibmvscsi: Protect ibmvscsi_head from concurrent modification

2019-03-20 Thread Martin K. Petersen


Tyrel,

> For each ibmvscsi host created during a probe or destroyed during a
> remove we either add or remove that host to/from the global
> ibmvscsi_head list. This runs the risk of concurrent modification.
>
> This patch adds a simple spinlock around the list modification calls
> to prevent concurrent updates as is done similarly in the ibmvfc
> driver and ipr driver.

Applied to 5.1/scsi-fixes.

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH] powerpc/security: Fix spectre_v2 reporting

2019-03-20 Thread Michael Ellerman
When I updated the spectre_v2 reporting to handle software count cache
flush I got the logic wrong when there's no software count cache
enabled at all.

The result is that on systems with the software count cache flush
disabled we print:

  Mitigation: Indirect branch cache disabled, Software count cache flush

Which correctly indicates that the count cache is disabled, but
incorrectly says the software count cache flush is enabled.

The root of the problem is that we are trying to handle all
combinations of options. But we know now that we only expect to see
the software count cache flush enabled if the other options are false.

So split the two cases, which simplifies the logic and fixes the bug.
We were also missing a space before "(hardware accelerated)".

The result is we see one of:

  Mitigation: Indirect branch serialisation (kernel only)
  Mitigation: Indirect branch cache disabled
  Mitigation: Software count cache flush
  Mitigation: Software count cache flush (hardware accelerated)

Fixes: ee13cb249fab ("powerpc/64s: Add support for software count cache flush")
Cc: sta...@vger.kernel.org # v4.19+
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/security.c | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index 9b8631533e02..b33bafb8fcea 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -190,29 +190,22 @@ ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, c
 	bcs = security_ftr_enabled(SEC_FTR_BCCTRL_SERIALISED);
 	ccd = security_ftr_enabled(SEC_FTR_COUNT_CACHE_DISABLED);
 
-	if (bcs || ccd || count_cache_flush_type != COUNT_CACHE_FLUSH_NONE) {
-		bool comma = false;
+	if (bcs || ccd) {
 		seq_buf_printf(&s, "Mitigation: ");
 
-		if (bcs) {
+		if (bcs)
 			seq_buf_printf(&s, "Indirect branch serialisation (kernel only)");
-			comma = true;
-		}
 
-		if (ccd) {
-			if (comma)
-				seq_buf_printf(&s, ", ");
-			seq_buf_printf(&s, "Indirect branch cache disabled");
-			comma = true;
-		}
-
-		if (comma)
+		if (bcs && ccd)
 			seq_buf_printf(&s, ", ");
 
-		seq_buf_printf(&s, "Software count cache flush");
+		if (ccd)
+			seq_buf_printf(&s, "Indirect branch cache disabled");
+	} else if (count_cache_flush_type != COUNT_CACHE_FLUSH_NONE) {
+		seq_buf_printf(&s, "Mitigation: Software count cache flush");
 
 		if (count_cache_flush_type == COUNT_CACHE_FLUSH_HW)
-			seq_buf_printf(&s, "(hardware accelerated)");
+			seq_buf_printf(&s, " (hardware accelerated)");
 	} else if (btb_flush_enabled) {
 		seq_buf_printf(&s, "Mitigation: Branch predictor state flush");
 	} else {
-- 
2.20.1



Re: [PATCH 1/4] add generic builtin command line

2019-03-20 Thread Andrew Morton
On Wed, 20 Mar 2019 16:23:28 -0700 Daniel Walker  wrote:

> On Wed, Mar 20, 2019 at 03:53:19PM -0700, Andrew Morton wrote:
> > On Tue, 19 Mar 2019 16:24:45 -0700 Daniel Walker  wrote:
> > 
> > > This code allows architectures to use a generic builtin command line.
> > 
> > I wasn't cc'ed on [2/4].  No mailing lists were cc'ed on [0/4] but it
> > didn't say anything useful anyway ;)
> > 
> > I'll queue them up for testing and shall await feedback from the
> > powerpc developers.
> > 
> 
> You weren't CC'd, but it was To: you:
> 
>  35 From: Daniel Walker 
>  36 To: Andrew Morton ,
>  37 Christophe Leroy ,
>  38 Michael Ellerman ,
>  39 Rob Herring , xe-linux-exter...@cisco.com,
>  40 linuxppc-dev@lists.ozlabs.org, Frank Rowand 
> 
>  41 Cc: devicet...@vger.kernel.org, linux-ker...@vger.kernel.org
>  42 Subject: [PATCH 2/4] drivers: of: generic command line support

hm.

> Thanks for picking it up.

The patches (or some version of them) are already in linux-next,
which messes me up.  I'll disable them for now.



Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-20 Thread Dan Williams
On Wed, Mar 20, 2019 at 8:09 PM Oliver  wrote:
>
> On Thu, Mar 21, 2019 at 7:57 AM Dan Williams  wrote:
> >
> > On Wed, Mar 20, 2019 at 8:34 AM Dan Williams  
> > wrote:
> > >
> > > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
> > >  wrote:
> > > >
> > > > Aneesh Kumar K.V  writes:
> > > >
> > > > > Dan Williams  writes:
> > > > >
> > > > >>
> > > > >>> Now what will be page size used for mapping vmemmap?
> > > > >>
> > > > >> That's up to the architecture's vmemmap_populate() implementation.
> > > > >>
> > > > >>> Architectures
> > > > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
> > > > >>> device-dax with struct page in the device will have pfn reserve 
> > > > >>> area aligned
> > > > >>> to PAGE_SIZE with the above example? We can't map that using
> > > > >>> PMD_SIZE page size?
> > > > >>
> > > > >> IIUC, that's a different alignment. Currently that's handled by
> > > > >> padding the reservation area up to a section (128MB on x86) boundary,
> > > > >> but I'm working on patches to allow sub-section sized ranges to be
> > > > >> mapped.
> > > > >
> > > > > I am missing something w.r.t code. The below code align that using 
> > > > > nd_pfn->align
> > > > >
> > > > >	if (nd_pfn->mode == PFN_MODE_PMEM) {
> > > > >		unsigned long memmap_size;
> > > > >
> > > > >		/*
> > > > >		 * vmemmap_populate_hugepages() allocates the memmap array in
> > > > >		 * HPAGE_SIZE chunks.
> > > > >		 */
> > > > >		memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> > > > >		offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> > > > >			       nd_pfn->align) - start;
> > > > >	}
> > > > >
> > > > > IIUC that is finding the offset where to put vmemmap start. And that 
> > > > > has
> > > > > to be aligned to the page size with which we may end up mapping 
> > > > > vmemmap
> > > > > area right?
> > >
> > > Right, that's the physical offset of where the vmemmap ends, and the
> > > memory to be mapped begins.
> > >
> > > > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > > > > is to compute how many pfns we should map for this pfn dev right?
> > > > >
> > > >
> > > > Also i guess those 4K assumptions there is wrong?
> > >
> > > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
> > > needs to be revved and the PAGE_SIZE needs to be recorded in the
> > > info-block.
> >
> > How often does a system change page-size. Is it fixed or do
> > environment change it from one boot to the next? I'm thinking through
> > the behavior of what do when the recorded PAGE_SIZE in the info-block
> > does not match the current system page size. The simplest option is to
> > just fail the device and require it to be reconfigured. Is that
> > acceptable?
>
> The kernel page size is set at build time and as far as I know every
> distro configures their ppc64(le) kernel for 64K. I've used 4K kernels
> a few times in the past to debug PAGE_SIZE dependent problems, but I'd
> be surprised if anyone is using 4K in production.

Ah, ok.

> Anyway, my view is that using 4K here isn't really a problem since
> it's just the accounting unit of the pfn superblock format. The kernel
> reading from it should understand that and scale it to whatever
> accounting unit it wants to use internally. Currently we don't, so that
> should probably be fixed, but that doesn't seem to cause any real
> issues. As far as I can tell the only user of npfns is
> __nvdimm_setup_pfn(), which prints the "number of pfns truncated"
> message.
>
> Am I missing something?

No, I don't think so. The only time it would break is if a system with
64K page size laid down an info-block with not enough reserved
capacity when the page-size is 4K (npfns too small). However, that
sounds like an exceptional case which is why no problems have been
reported to date.
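
To illustrate the "fail the device" option: if a future revision of the
pfn info-block recorded the creating kernel's page size, validation could
refuse to enable the namespace on a mismatch. The page_size field below is
hypothetical; it is not part of the current struct nd_pfn_sb layout:

	/* hypothetical check in the pfn info-block validation path */
	u32 recorded = le32_to_cpu(pfn_sb->page_size);	/* assumed new field */

	if (recorded != PAGE_SIZE) {
		dev_err(&nd_pfn->dev,
			"init failed: info-block page size %u does not match system page size %lu\n",
			recorded, PAGE_SIZE);
		return -EOPNOTSUPP;	/* fail the device; require reconfiguration */
	}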


Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-20 Thread Oliver
On Thu, Mar 21, 2019 at 7:57 AM Dan Williams  wrote:
>
> On Wed, Mar 20, 2019 at 8:34 AM Dan Williams  wrote:
> >
> > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
> >  wrote:
> > >
> > > Aneesh Kumar K.V  writes:
> > >
> > > > Dan Williams  writes:
> > > >
> > > >>
> > > >>> Now what will be page size used for mapping vmemmap?
> > > >>
> > > >> That's up to the architecture's vmemmap_populate() implementation.
> > > >>
> > > >>> Architectures
> > > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
> > > >>> device-dax with struct page in the device will have pfn reserve area 
> > > >>> aligned
> > > >>> to PAGE_SIZE with the above example? We can't map that using
> > > >>> PMD_SIZE page size?
> > > >>
> > > >> IIUC, that's a different alignment. Currently that's handled by
> > > >> padding the reservation area up to a section (128MB on x86) boundary,
> > > >> but I'm working on patches to allow sub-section sized ranges to be
> > > >> mapped.
> > > >
> > > > I am missing something w.r.t code. The below code align that using 
> > > > nd_pfn->align
> > > >
> > > >	if (nd_pfn->mode == PFN_MODE_PMEM) {
> > > >		unsigned long memmap_size;
> > > >
> > > >		/*
> > > >		 * vmemmap_populate_hugepages() allocates the memmap array in
> > > >		 * HPAGE_SIZE chunks.
> > > >		 */
> > > >		memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> > > >		offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> > > >			       nd_pfn->align) - start;
> > > >	}
> > > >
> > > > IIUC that is finding the offset where to put vmemmap start. And that has
> > > > to be aligned to the page size with which we may end up mapping vmemmap
> > > > area right?
> >
> > Right, that's the physical offset of where the vmemmap ends, and the
> > memory to be mapped begins.
> >
> > > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > > > is to compute how many pfns we should map for this pfn dev right?
> > > >
> > >
> > > Also i guess those 4K assumptions there is wrong?
> >
> > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
> > needs to be revved and the PAGE_SIZE needs to be recorded in the
> > info-block.
>
> How often does a system change page-size. Is it fixed or do
> environment change it from one boot to the next? I'm thinking through
> the behavior of what do when the recorded PAGE_SIZE in the info-block
> does not match the current system page size. The simplest option is to
> just fail the device and require it to be reconfigured. Is that
> acceptable?

The kernel page size is set at build time and as far as I know every
distro configures their ppc64(le) kernel for 64K. I've used 4K kernels
a few times in the past to debug PAGE_SIZE dependent problems, but I'd
be surprised if anyone is using 4K in production.

Anyway, my view is that using 4K here isn't really a problem since
it's just the accounting unit of the pfn superblock format. The kernel
reading from it should understand that and scale it to whatever
accounting unit it wants to use internally. Currently we don't, so that
should probably be fixed, but that doesn't seem to cause any real
issues. As far as I can tell the only user of npfns is
__nvdimm_setup_pfn(), which prints the "number of pfns truncated"
message.

Am I missing something?

> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v2 13/13] syscall_get_arch: add "struct task_struct *" argument

2019-03-20 Thread Paul Moore
On Sun, Mar 17, 2019 at 7:30 PM Dmitry V. Levin  wrote:
>
> This argument is required to extend the generic ptrace API with
> PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going
> to be called from ptrace_request() along with syscall_get_nr(),
> syscall_get_arguments(), syscall_get_error(), and
> syscall_get_return_value() functions with a tracee as their argument.
>
> The primary intent is that the triple (audit_arch, syscall_nr, arg1..arg6)
> should describe what system call is being called and what its arguments
> are.
>
> Reverts: 5e937a9ae913 ("syscall_get_arch: remove useless function arguments")
> Reverts: 1002d94d3076 ("syscall.h: fix doc text for syscall_get_arch()")
> Reviewed-by: Andy Lutomirski  # for x86
> Reviewed-by: Palmer Dabbelt 
> Acked-by: Paul Moore 
> Acked-by: Paul Burton  # MIPS parts
> Acked-by: Michael Ellerman  (powerpc)
> Acked-by: Kees Cook  # seccomp parts
> Acked-by: Mark Salter  # for the c6x bit
> Cc: Elvira Khabirova 
> Cc: Eugene Syromyatnikov 
> Cc: Oleg Nesterov 
> Cc: x...@kernel.org
> Cc: linux-al...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c6x-...@linux-c6x.org
> Cc: uclinux-h8-de...@lists.sourceforge.jp
> Cc: linux-hexa...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-m...@vger.kernel.org
> Cc: nios2-...@lists.rocketboards.org
> Cc: openr...@lists.librecores.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@lists.infradead.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-au...@redhat.com
> Signed-off-by: Dmitry V. Levin 
> ---
>
> Notes:
> v2: unchanged
>
>  arch/alpha/include/asm/syscall.h  |  2 +-
>  arch/arc/include/asm/syscall.h|  2 +-
>  arch/arm/include/asm/syscall.h|  2 +-
>  arch/arm64/include/asm/syscall.h  |  4 ++--
>  arch/c6x/include/asm/syscall.h|  2 +-
>  arch/csky/include/asm/syscall.h   |  2 +-
>  arch/h8300/include/asm/syscall.h  |  2 +-
>  arch/hexagon/include/asm/syscall.h|  2 +-
>  arch/ia64/include/asm/syscall.h   |  2 +-
>  arch/m68k/include/asm/syscall.h   |  2 +-
>  arch/microblaze/include/asm/syscall.h |  2 +-
>  arch/mips/include/asm/syscall.h   |  6 +++---
>  arch/mips/kernel/ptrace.c |  2 +-
>  arch/nds32/include/asm/syscall.h  |  2 +-
>  arch/nios2/include/asm/syscall.h  |  2 +-
>  arch/openrisc/include/asm/syscall.h   |  2 +-
>  arch/parisc/include/asm/syscall.h |  4 ++--
>  arch/powerpc/include/asm/syscall.h| 10 --
>  arch/riscv/include/asm/syscall.h  |  2 +-
>  arch/s390/include/asm/syscall.h   |  4 ++--
>  arch/sh/include/asm/syscall_32.h  |  2 +-
>  arch/sh/include/asm/syscall_64.h  |  2 +-
>  arch/sparc/include/asm/syscall.h  |  5 +++--
>  arch/unicore32/include/asm/syscall.h  |  2 +-
>  arch/x86/include/asm/syscall.h|  8 +---
>  arch/x86/um/asm/syscall.h |  2 +-
>  arch/xtensa/include/asm/syscall.h |  2 +-
>  include/asm-generic/syscall.h |  5 +++--
>  kernel/auditsc.c  |  4 ++--
>  kernel/seccomp.c  |  4 ++--
>  30 files changed, 52 insertions(+), 42 deletions(-)

Merged into audit/next, thanks everyone.

-- 
paul moore
www.paul-moore.com


Re: [PATCH 2/2] ibmvscsi: Fix empty event pool access during host removal

2019-03-20 Thread Martin K. Petersen


Tyrel,

> The event pool used for queueing commands is destroyed fairly early in
> the ibmvscsi_remove() code path. Since this happens prior to the call
> to scsi_remove_host(), it is possible for further calls to queuecommand
> to be processed which manifest as a panic due to a NULL pointer
> dereference as seen here:

Applied to 5.1/scsi-fixes. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH 4/4] ibmvfc: Clean up transport events

2019-03-20 Thread Tyrel Datwyler
No change to functionality. Simply make transport event messages a little
clearer, and rework CRQ format enums such that we have separate enums
for INIT messages and XPORT events.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 8 +---
 drivers/scsi/ibmvscsi/ibmvfc.h | 7 ++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 33dda4d32f65..3ad997ac3510 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -2756,16 +2756,18 @@ static void ibmvfc_handle_crq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost)
 		ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_NONE);
 		if (crq->format == IBMVFC_PARTITION_MIGRATED) {
 			/* We need to re-setup the interpartition connection */
-			dev_info(vhost->dev, "Re-enabling adapter\n");
+			dev_info(vhost->dev, "Partition migrated, Re-enabling adapter\n");
 			vhost->client_migrated = 1;
 			ibmvfc_purge_requests(vhost, DID_REQUEUE);
 			ibmvfc_link_down(vhost, IBMVFC_LINK_DOWN);
 			ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_REENABLE);
-		} else {
-			dev_err(vhost->dev, "Virtual adapter failed (rc=%d)\n", crq->format);
+		} else if (crq->format == IBMVFC_PARTNER_FAILED || crq->format == IBMVFC_PARTNER_DEREGISTER) {
+			dev_err(vhost->dev, "Host partner adapter deregistered or failed (rc=%d)\n", crq->format);
 			ibmvfc_purge_requests(vhost, DID_ERROR);
 			ibmvfc_link_down(vhost, IBMVFC_LINK_DOWN);
 			ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_RESET);
+		} else {
+			dev_err(vhost->dev, "Received unknown transport event from partner (rc=%d)\n", crq->format);
}
return;
case IBMVFC_CRQ_CMD_RSP:
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index b81a53c4a9a8..459cc288ba1d 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -78,9 +78,14 @@ enum ibmvfc_crq_valid {
IBMVFC_CRQ_XPORT_EVENT  = 0xFF,
 };
 
-enum ibmvfc_crq_format {
+enum ibmvfc_crq_init_msg {
IBMVFC_CRQ_INIT = 0x01,
IBMVFC_CRQ_INIT_COMPLETE= 0x02,
+};
+
+enum ibmvfc_crq_xport_evts {
+   IBMVFC_PARTNER_FAILED   = 0x01,
+   IBMVFC_PARTNER_DEREGISTER   = 0x02,
IBMVFC_PARTITION_MIGRATED   = 0x06,
 };
 
-- 
2.12.3



[PATCH 3/4] ibmvfc: Byte swap status and error codes when logging

2019-03-20 Thread Tyrel Datwyler
Status and error codes are returned in big endian from the VIOS. The
values are translated into a human readable format when logged, but the
raw values are also logged. This patch byte swaps those raw values so
that they are consistent between BE and LE platforms.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 18ee2a8ec3d5..33dda4d32f65 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -1497,7 +1497,7 @@ static void ibmvfc_log_error(struct ibmvfc_event *evt)
 
scmd_printk(KERN_ERR, cmnd, "Command (%02X) : %s (%x:%x) "
"flags: %x fcp_rsp: %x, resid=%d, scsi_status: %x\n",
-   cmnd->cmnd[0], err, vfc_cmd->status, vfc_cmd->error,
+		    cmnd->cmnd[0], err, be16_to_cpu(vfc_cmd->status), be16_to_cpu(vfc_cmd->error),
 		    rsp->flags, rsp_code, scsi_get_resid(cmnd), rsp->scsi_status);
 }
 
@@ -2023,7 +2023,7 @@ static int ibmvfc_reset_device(struct scsi_device *sdev, 
int type, char *desc)
sdev_printk(KERN_ERR, sdev, "%s reset failed: %s (%x:%x) "
"flags: %x fcp_rsp: %x, scsi_status: %x\n", desc,

ibmvfc_get_cmd_error(be16_to_cpu(rsp_iu.cmd.status), 
be16_to_cpu(rsp_iu.cmd.error)),
-   rsp_iu.cmd.status, rsp_iu.cmd.error, fc_rsp->flags, 
rsp_code,
+   be16_to_cpu(rsp_iu.cmd.status), 
be16_to_cpu(rsp_iu.cmd.error), fc_rsp->flags, rsp_code,
fc_rsp->scsi_status);
rsp_rc = -EIO;
} else
@@ -2382,7 +2382,7 @@ static int ibmvfc_abort_task_set(struct scsi_device *sdev)
sdev_printk(KERN_ERR, sdev, "Abort failed: %s (%x:%x) "
"flags: %x fcp_rsp: %x, scsi_status: %x\n",

ibmvfc_get_cmd_error(be16_to_cpu(rsp_iu.cmd.status), 
be16_to_cpu(rsp_iu.cmd.error)),
-   rsp_iu.cmd.status, rsp_iu.cmd.error, fc_rsp->flags, 
rsp_code,
+   be16_to_cpu(rsp_iu.cmd.status), 
be16_to_cpu(rsp_iu.cmd.error), fc_rsp->flags, rsp_code,
fc_rsp->scsi_status);
rsp_rc = -EIO;
} else
@@ -3349,7 +3349,7 @@ static void ibmvfc_tgt_prli_done(struct ibmvfc_event *evt)
 
tgt_log(tgt, level, "Process Login failed: %s (%x:%x) 
rc=0x%02X\n",
ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), 
be16_to_cpu(rsp->error)),
-   rsp->status, rsp->error, status);
+   be16_to_cpu(rsp->status), be16_to_cpu(rsp->error), 
status);
break;
}
 
@@ -3447,9 +3447,10 @@ static void ibmvfc_tgt_plogi_done(struct ibmvfc_event 
*evt)
ibmvfc_set_tgt_action(tgt, IBMVFC_TGT_ACTION_DEL_RPORT);
 
tgt_log(tgt, level, "Port Login failed: %s (%x:%x) %s (%x) %s 
(%x) rc=0x%02X\n",
-   ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), 
be16_to_cpu(rsp->error)), rsp->status, rsp->error,
-   ibmvfc_get_fc_type(be16_to_cpu(rsp->fc_type)), 
rsp->fc_type,
-   ibmvfc_get_ls_explain(be16_to_cpu(rsp->fc_explain)), 
rsp->fc_explain, status);
+   ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), 
be16_to_cpu(rsp->error)),
+be16_to_cpu(rsp->status), 
be16_to_cpu(rsp->error),
+   ibmvfc_get_fc_type(be16_to_cpu(rsp->fc_type)), 
be16_to_cpu(rsp->fc_type),
+   ibmvfc_get_ls_explain(be16_to_cpu(rsp->fc_explain)), 
be16_to_cpu(rsp->fc_explain), status);
break;
}
 
@@ -3620,7 +3621,7 @@ static void ibmvfc_tgt_adisc_done(struct ibmvfc_event 
*evt)
fc_explain = (be32_to_cpu(mad->fc_iu.response[1]) & 0xff00) 
>> 8;
tgt_info(tgt, "ADISC failed: %s (%x:%x) %s (%x) %s (%x) 
rc=0x%02X\n",
 ibmvfc_get_cmd_error(be16_to_cpu(mad->iu.status), 
be16_to_cpu(mad->iu.error)),
-mad->iu.status, mad->iu.error,
+be16_to_cpu(mad->iu.status), 
be16_to_cpu(mad->iu.error),
 ibmvfc_get_fc_type(fc_reason), fc_reason,
 ibmvfc_get_ls_explain(fc_explain), fc_explain, status);
break;
@@ -3832,9 +3833,10 @@ static void ibmvfc_tgt_query_target_done(struct 
ibmvfc_event *evt)
 
tgt_log(tgt, level, "Query Target failed: %s (%x:%x) %s (%x) %s 
(%x) rc=0x%02X\n",
ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), 
be16_to_cpu(rsp->error)),
-   rsp->status, rsp->error, 
ibmvfc_get_fc_type(be16_to_cpu(rsp->fc_type)),
-   rsp->fc_type, 

[PATCH 2/4] ibmvfc: Add failed PRLI to cmd_status lookup array

2019-03-20 Thread Tyrel Datwyler
The VIOS uses the SCSI_ERROR class to report PRLI failures. These
errors are indicated with the combination of an IBMVFC_FC_SCSI_ERROR
return status and a 0x8000 error code. Add these codes to cmd_status[]
with appropriate human readable error message.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index c3ce27039552..18ee2a8ec3d5 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -139,6 +139,7 @@ static const struct {
 	{ IBMVFC_FC_FAILURE, IBMVFC_VENDOR_SPECIFIC, DID_ERROR, 1, 1, "vendor specific" },
 
 	{ IBMVFC_FC_SCSI_ERROR, 0, DID_OK, 1, 0, "SCSI error" },
+	{ IBMVFC_FC_SCSI_ERROR, IBMVFC_COMMAND_FAILED, DID_ERROR, 0, 1, "PRLI to device failed." },
 };
 
 static void ibmvfc_npiv_login(struct ibmvfc_host *);
-- 
2.12.3



[PATCH 1/4] ibmvfc: Remove "failed" from logged errors

2019-03-20 Thread Tyrel Datwyler
The text of messages logged with ibmvfc_log_error() always contains
the term "failed". In the case of cancelled commands during EH they
are reported back by the VIOS using error codes. This can be
confusing to somebody looking at these log messages as to whether
a command was successfully cancelled. The following real log
message for example it is unclear if the transaction was actaully
cancelled.

<6>sd 0:0:1:1: Cancelling outstanding commands.
<3>sd 0:0:1:1: [sde] Command (28) failed: transaction cancelled (2:6) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0

Remove prefixing of "failed" to all error logged messages. The
ibmvfc_log_error() function translates the returned error/status
codes to a human readable message already.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index dbaa4f131433..c3ce27039552 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -1494,7 +1494,7 @@ static void ibmvfc_log_error(struct ibmvfc_event *evt)
if (rsp->flags & FCP_RSP_LEN_VALID)
rsp_code = rsp->data.info.rsp_code;
 
-   scmd_printk(KERN_ERR, cmnd, "Command (%02X) failed: %s (%x:%x) "
+   scmd_printk(KERN_ERR, cmnd, "Command (%02X) : %s (%x:%x) "
"flags: %x fcp_rsp: %x, resid=%d, scsi_status: %x\n",
cmnd->cmnd[0], err, vfc_cmd->status, vfc_cmd->error,
 		    rsp->flags, rsp_code, scsi_get_resid(cmnd), rsp->scsi_status);
-- 
2.12.3



[PATCH] powerpc: vmlinux.lds: Drop Binutils 2.18 workarounds

2019-03-20 Thread Joel Stanley
Segher added some workarounds for GCC 4.2 and binutils 2.18. We now set
4.6 and 2.20 as the minimum, so they can be dropped.

This is mostly a revert of c6995fe4 ("powerpc: Fix build bug with
binutils < 2.18 and GCC < 4.2").

Signed-off-by: Joel Stanley 
---
 arch/powerpc/kernel/vmlinux.lds.S | 35 ---
 1 file changed, 4 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 060a1acd7c6d..0551e9846676 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -17,25 +17,6 @@
 
 ENTRY(_stext)
 
-PHDRS {
-   kernel PT_LOAD FLAGS(7); /* RWX */
-   notes PT_NOTE FLAGS(0);
-   dummy PT_NOTE FLAGS(0);
-
-   /* binutils < 2.18 has a bug that makes it misbehave when taking an
-  ELF file with all segments at load address 0 as input.  This
-  happens when running "strip" on vmlinux, because of the AT() magic
-  in this linker script.  People using GCC >= 4.2 won't run into
-  this problem, because the "build-id" support will put some data
-  into the "notes" segment (at a non-zero load address).
-
-  To work around this, we force some data into both the "dummy"
-  segment and the kernel segment, so the dummy segment will get a
-  non-zero load address.  It's not enough to always create the
-  "notes" segment, since if nothing gets assigned to it, its load
-  address will be zero.  */
-}
-
 #ifdef CONFIG_PPC64
 OUTPUT_ARCH(powerpc:common64)
 jiffies = jiffies_64;
@@ -77,7 +58,7 @@ SECTIONS
 #else /* !CONFIG_PPC64 */
HEAD_TEXT
 #endif
-   } :kernel
+   }
 
__head_end = .;
 
@@ -126,7 +107,7 @@ SECTIONS
__got2_end = .;
 #endif /* CONFIG_PPC32 */
 
-   } :kernel
+   }
 
. = ALIGN(ETEXT_ALIGN_SIZE);
_etext = .;
@@ -177,15 +158,7 @@ SECTIONS
 #endif
EXCEPTION_TABLE(0)
 
-   NOTES :kernel :notes
-
-   /* The dummy segment contents for the bug workaround mentioned above
-  near PHDRS.  */
-   .dummy : AT(ADDR(.dummy) - LOAD_OFFSET) {
-   LONG(0)
-   LONG(0)
-   LONG(0)
-   } :kernel :dummy
+   NOTES
 
 /*
  * Init sections discarded at runtime
@@ -200,7 +173,7 @@ SECTIONS
 #ifdef CONFIG_PPC64
*(.tramp.ftrace.init);
 #endif
-   } :kernel
+   }
 
/* .exit.text is discarded at runtime, not link time,
 * to deal with references from __bug_table
-- 
2.20.1



Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-20 Thread David Gibson
On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> On Wed, 20 Mar 2019 15:38:24 +1100
> David Gibson  wrote:
> 
> > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > Alexey Kardashevskiy  wrote:
> > >   
> > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > 
> > > > However the user may want to pass individual GPUs to the userspace so
> > > > in order to do so we need to put them into separate IOMMU groups and
> > > > cut off the interconnects.
> > > > 
> > > > Thankfully V100 GPUs implement an interface to do so by programming a link
> > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > > issued to the GPU.
> > > > 
> > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > determines what links need to be disabled. This relies on presence
> > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling 
> > > > which
> > > > PCI peers it is connected to (which includes NVLink bridges or peer 
> > > > GPUs).
> > > > 
> > > > This does not change the existing behaviour and instead adds
> > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > 
> > > > The alternative approaches would be:
> > > > 
> > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > to tell skiboot via an additional OPAL call whether or not we want this
> > > > isolation - skiboot is unaware of IOMMU groups.
> > > > 
> > > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > > the problem with that is at that point the device is not enabled, i.e.
> > > > config space is not restored so we need to enable the device (i.e. MMIO
> > > > bit in CMD register + program valid address to BAR0) in order to disable
> > > > links and then perhaps undo all this initialization to bring the device
> > > > back to the state where pci_try_reset_function() expects it to be.  
> > > 
> > > The trouble seems to be that this approach only maintains the isolation
> > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > incorporating groups in various ways.  
> > 
> > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > necessarily represent devices which *are* isolated, just devices which
> > *can be* isolated.  There are plenty of instances when we don't need
> > to isolate devices in different IOMMU groups: passing both groups to
> > the same guest or userspace VFIO driver for example, or indeed when
> > both groups are owned by regular host kernel drivers.
> > 
> > In at least some of those cases we also don't want to isolate the
> > devices when we don't have to, usually for performance reasons.
> 
> I see IOMMU groups as representing the current isolation of the device,
> not just the possible isolation.  If there are ways to break down that
> isolation then ideally the group would be updated to reflect it.  The
> ACS disable patches seem to support this, at boot time we can choose to
> disable ACS at certain points in the topology to favor peer-to-peer
> performance over isolation.  This is then reflected in the group
> composition, because even though ACS *can be* enabled at the given
> isolation points, it's intentionally not with this option.  Whether or
> not a given user who owns multiple devices needs that isolation is
> really beside the point, the user can choose to connect groups via IOMMU
> mappings or reconfigure the system to disable ACS and potentially more
> direct routing.  The IOMMU groups are still accurately reflecting the
> topology and IOMMU based isolation.

Huh, ok, I think we need to straighten this out.  Thinking of iommu
groups as possible rather than current isolation was a conscious
decision on my part when we were first coming up with them.  The
rationale was that that way iommu groups could be static for the
lifetime of boot, with more dynamic isolation state layered on top.

Now, that was based on analogy with PAPR's concept of "Partitionable
Endpoints" which are decided by firmware before boot.  However, I
think it makes sense in other contexts too: if iommu groups represent
current isolation, then we need some other way to advertise possible
isolation - otherwise how will the admin (and/or tools) know how it
can configure the iommu groups.

VFIO already has the container, which represents explicitly a "group
of groups" that we don't care to isolate from each other.  I don't
actually know what other uses of the iommu group infrastructure we
have at 

[PATCH 2/2] ibmvscsi: Fix empty event pool access during host removal

2019-03-20 Thread Tyrel Datwyler
The event pool used for queueing commands is destroyed fairly early in
the ibmvscsi_remove() code path. Since this happens prior to the call
to scsi_remove_host(), it is possible for further calls to queuecommand
to be processed which manifest as a panic due to a NULL pointer
dereference as seen here:

PANIC: "Unable to handle kernel paging request for data at address
0x"

Context process backtrace:

DSISR: 4200 Syscall Result: 
4 [c2cb3820] memcpy_power7 at c0064204
[Link Register] [c2cb3820] ibmvscsi_send_srp_event at d3ed14a4
5 [c2cb3920] ibmvscsi_send_srp_event at d3ed14a4 [ibmvscsi] 
?(unreliable)
6 [c2cb39c0] ibmvscsi_queuecommand at d3ed2388 [ibmvscsi]
7 [c2cb3a70] scsi_dispatch_cmd at d395c2d8 [scsi_mod]
8 [c2cb3af0] scsi_request_fn at d395ef88 [scsi_mod]
9 [c2cb3be0] __blk_run_queue at c0429860
10 [c2cb3c10] blk_delay_work at c042a0ec
11 [c2cb3c40] process_one_work at c00dac30
12 [c2cb3cd0] worker_thread at c00db110
13 [c2cb3d80] kthread at c00e3378
14 [c2cb3e30] ret_from_kernel_thread at c000982c

The kernel buffer log is overfilled with this log:

[11261.952732] ibmvscsi: found no event struct in pool!

This patch reorders the operations during host teardown. Start by
calling the SRP transport and Scsi_Host remove functions to flush any
outstanding work and set the host offline. LLDD teardown follows
including destruction of the event pool, freeing the Command Response
Queue (CRQ), and unmapping any persistent buffers. The event pool
destruction is protected by the scsi_host lock, and the pool is purged
prior of any requests for which we never received a response. Finally,
move the removal of the scsi host from our global list to the end so
that the host is easily locatable for debugging purposes during
teardown.

Cc:  # v2.6.12+
Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvscsi.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 2b22969f3f63..8cec5230fe31 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -2295,17 +2295,27 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 static int ibmvscsi_remove(struct vio_dev *vdev)
 {
 	struct ibmvscsi_host_data *hostdata = dev_get_drvdata(&vdev->dev);
-	spin_lock(&ibmvscsi_driver_lock);
-	list_del(&hostdata->host_list);
-	spin_unlock(&ibmvscsi_driver_lock);
-	unmap_persist_bufs(hostdata);
+	unsigned long flags;
+
+	srp_remove_host(hostdata->host);
+	scsi_remove_host(hostdata->host);
+
+	purge_requests(hostdata, DID_ERROR);
+
+	spin_lock_irqsave(hostdata->host->host_lock, flags);
 	release_event_pool(&hostdata->pool, hostdata);
+	spin_unlock_irqrestore(hostdata->host->host_lock, flags);
+
 	ibmvscsi_release_crq_queue(&hostdata->queue, hostdata,
 				   max_events);
 
 	kthread_stop(hostdata->work_thread);
-	srp_remove_host(hostdata->host);
-	scsi_remove_host(hostdata->host);
+	unmap_persist_bufs(hostdata);
+
+	spin_lock(&ibmvscsi_driver_lock);
+	list_del(&hostdata->host_list);
+	spin_unlock(&ibmvscsi_driver_lock);
+
 	scsi_host_put(hostdata->host);
 
 	return 0;
-- 
2.12.3



[PATCH 1/2] ibmvscsi: Protect ibmvscsi_head from concurrent modification

2019-03-20 Thread Tyrel Datwyler
For each ibmvscsi host created during a probe or destroyed during a
remove we either add or remove that host to/from the global ibmvscsi_head
list. This runs the risk of concurrent modification.

This patch adds a simple spinlock around the list modification calls to
prevent concurrent updates as is done similarly in the ibmvfc driver and
ipr driver.

Fixes: 32d6e4b6e4ea ("scsi: ibmvscsi: add vscsi hosts to global list_head")
Cc:  # v4.10+
Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvscsi.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 1135e74646e2..2b22969f3f63 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -96,6 +96,7 @@ static int client_reserve = 1;
 static char partition_name[96] = "UNKNOWN";
 static unsigned int partition_number = -1;
 static LIST_HEAD(ibmvscsi_head);
+static DEFINE_SPINLOCK(ibmvscsi_driver_lock);
 
 static struct scsi_transport_template *ibmvscsi_transport_template;
 
@@ -2270,7 +2271,9 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 	}
 
 	dev_set_drvdata(&vdev->dev, hostdata);
+	spin_lock(&ibmvscsi_driver_lock);
 	list_add_tail(&hostdata->host_list, &ibmvscsi_head);
+	spin_unlock(&ibmvscsi_driver_lock);
 	return 0;
 
       add_srp_port_failed:
@@ -2292,7 +2295,9 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 static int ibmvscsi_remove(struct vio_dev *vdev)
 {
 	struct ibmvscsi_host_data *hostdata = dev_get_drvdata(&vdev->dev);
+	spin_lock(&ibmvscsi_driver_lock);
 	list_del(&hostdata->host_list);
+	spin_unlock(&ibmvscsi_driver_lock);
 	unmap_persist_bufs(hostdata);
 	release_event_pool(&hostdata->pool, hostdata);
 	ibmvscsi_release_crq_queue(&hostdata->queue, hostdata,
-- 
2.12.3



Re: [PATCH 1/4] add generic builtin command line

2019-03-20 Thread Daniel Walker
On Wed, Mar 20, 2019 at 03:53:19PM -0700, Andrew Morton wrote:
> On Tue, 19 Mar 2019 16:24:45 -0700 Daniel Walker  wrote:
> 
> > This code allows architectures to use a generic builtin command line.
> 
> I wasn't cc'ed on [2/4].  No mailing lists were cc'ed on [0/4] but it
> didn't say anything useful anyway ;)
> 
> I'll queue them up for testing and shall await feedback from the
> powerpc developers.
> 

You weren't CC'd, but it was To: you:

 35 From: Daniel Walker 
 36 To: Andrew Morton ,
 37 Christophe Leroy ,
 38 Michael Ellerman ,
 39 Rob Herring , xe-linux-exter...@cisco.com,
 40 linuxppc-dev@lists.ozlabs.org, Frank Rowand 
 41 Cc: devicet...@vger.kernel.org, linux-ker...@vger.kernel.org
 42 Subject: [PATCH 2/4] drivers: of: generic command line support

and the first one [0/4] should have gone to linuxppc-dev and
xe-linux-external. Maybe our git-send-email isn't working with our mail
servers.

Thanks for picking it up.

Daniel


Re: [PATCH v4 06/17] KVM: PPC: Book3S HV: XIVE: add controls for the EQ configuration

2019-03-20 Thread David Gibson
On Wed, Mar 20, 2019 at 09:37:40AM +0100, Cédric Le Goater wrote:
> These controls will be used by the H_INT_SET_QUEUE_CONFIG and
> H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
> Event Queue in the XIVE IC. They will also be used to restore the
> configuration of the XIVE EQs and to capture the internal run-time
> state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
> the EQ toggle bit and EQ index which are updated by the XIVE IC when
> event notifications are enqueued in the EQ.
> 
> The value of the guest physical address of the event queue is saved in
> the XIVE internal xive_q structure for later use. That is when
> migration needs to mark the EQ pages dirty to capture a consistent
> memory state of the VM.
> 
> To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra
> OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
> but restoring the EQ state will.
> 
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  Changes since v3 :
> 
>  - fix the test ont the initial setting of the EQ toggle bit : 0 -> 1
>  - renamed qsize to qshift
>  - renamed qpage to qaddr
>  - checked host page size
>  - limited flags to KVM_XIVE_EQ_ALWAYS_NOTIFY to fit sPAPR specs
>  
>  Changes since v2 :
>  
>  - fixed comments on the KVM device attribute definitions
>  - fixed check on supported EQ size to restrict to 64K pages
>  - checked kvm_eq.flags that need to be zero
>  - removed the OPAL call when EQ qtoggle bit and index are zero. 
> 
>  arch/powerpc/include/asm/xive.h|   2 +
>  arch/powerpc/include/uapi/asm/kvm.h|  19 ++
>  arch/powerpc/kvm/book3s_xive.h |   2 +
>  arch/powerpc/kvm/book3s_xive.c |  15 +-
>  arch/powerpc/kvm/book3s_xive_native.c  | 242 +
>  Documentation/virtual/kvm/devices/xive.txt |  34 +++
>  6 files changed, 308 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index b579a943407b..c4e88abd3b67 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -73,6 +73,8 @@ struct xive_q {
>   u32 esc_irq;
>   atomic_tcount;
>   atomic_tpending_count;
> + u64 guest_qaddr;
> + u32 guest_qshift;
>  };
>  
>  /* Global enable flags for the XIVE support */
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index e8161e21629b..85005400fd86 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char {
>  #define KVM_DEV_XIVE_GRP_CTRL		1
>  #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
>  #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG	3	/* 64-bit source identifier */
> +#define KVM_DEV_XIVE_GRP_EQ_CONFIG   4   /* 64-bit EQ identifier */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0)
> @@ -696,4 +697,22 @@ struct kvm_ppc_cpu_char {
>  #define KVM_XIVE_SOURCE_EISN_SHIFT   33
>  #define KVM_XIVE_SOURCE_EISN_MASK0xfffeULL
>  
> +/* Layout of 64-bit EQ identifier */
> +#define KVM_XIVE_EQ_PRIORITY_SHIFT   0
> +#define KVM_XIVE_EQ_PRIORITY_MASK0x7
> +#define KVM_XIVE_EQ_SERVER_SHIFT 3
> +#define KVM_XIVE_EQ_SERVER_MASK  0xfff8ULL
> +
> +/* Layout of EQ configuration values (64 bytes) */
> +struct kvm_ppc_xive_eq {
> + __u32 flags;
> + __u32 qshift;
> + __u64 qaddr;
> + __u32 qtoggle;
> + __u32 qindex;
> + __u8  pad[40];
> +};
> +
> +#define KVM_XIVE_EQ_ALWAYS_NOTIFY0x0001
> +
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index ae26fe653d98..622f594d93e1 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -272,6 +272,8 @@ struct kvmppc_xive_src_block 
> *kvmppc_xive_create_src_block(
>   struct kvmppc_xive *xive, int irq);
>  void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
>  int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
> +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
> +   bool single_escalation);
>  
>  #endif /* CONFIG_KVM_XICS */
>  #endif /* _KVM_PPC_BOOK3S_XICS_H */
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index e09f3addffe5..c1b7aa7dbc28 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -166,7 +166,8 @@ static irqreturn_t xive_esc_irq(int irq, void *data)
>   return IRQ_HANDLED;
>  }
>  
> -static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
> +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
> +   bool 

Re: [PATCH 1/4] add generic builtin command line

2019-03-20 Thread Andrew Morton
On Tue, 19 Mar 2019 16:24:45 -0700 Daniel Walker  wrote:

> This code allows architectures to use a generic builtin command line.

I wasn't cc'ed on [2/4].  No mailing lists were cc'ed on [0/4] but it
didn't say anything useful anyway ;)

I'll queue them up for testing and shall await feedback from the
powerpc developers.



Re: [PATCH] hotplug/drc-info: initialize fndit to zero

2019-03-20 Thread Bjorn Helgaas
[+cc Michael B (original author)]

On Sat, Mar 16, 2019 at 09:40:16PM +, Colin King wrote:
> From: Colin Ian King 
> 
> Currently the variable fndit is not initialized and contains a
> garbage value; later it is set to 1 if a drc entry is found.
> Ensure fndit does not contain garbage by initializing it to
> zero. Also remove an extraneous space at the end of a
> sprintf call.
> 
> Detected by static analysis with cppcheck.
> 
> Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info 
> property")
> Signed-off-by: Colin Ian King 

Michael E, I assume you'll take this since you took the original?
Let me know if you want me to.

> ---
>  drivers/pci/hotplug/rpaphp_core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/rpaphp_core.c 
> b/drivers/pci/hotplug/rpaphp_core.c
> index bcd5d357ca23..28213f44f64a 100644
> --- a/drivers/pci/hotplug/rpaphp_core.c
> +++ b/drivers/pci/hotplug/rpaphp_core.c
> @@ -230,7 +230,7 @@ static int rpaphp_check_drc_props_v2(struct device_node 
> *dn, char *drc_name,
>   struct of_drc_info drc;
>   const __be32 *value;
>   char cell_drc_name[MAX_DRC_NAME_LEN];
> - int j, fndit;
> + int j, fndit = 0;
>  
>   info = of_find_property(dn->parent, "ibm,drc-info", NULL);
>   if (info == NULL)
> @@ -254,7 +254,7 @@ static int rpaphp_check_drc_props_v2(struct device_node 
> *dn, char *drc_name,
>   /* Found it */
>  
>   if (fndit)
> - sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, 
> + sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix,
>   my_index);
>  
>   if (((drc_name == NULL) ||
> -- 
> 2.20.1
> 


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 01:13:41PM -0300, Thiago Jung Bauermann wrote:
> >> Another way of looking at this issue which also explains our reluctance
> >> is that the only difference between a secure guest and a regular guest
> >> (at least regarding virtio) is that the former uses swiotlb while the
> >> latter doens't.
> >
> > But swiotlb is just one implementation. It's a guest internal thing. The
> > issue is that memory isn't host accessible.
> 
> From what I understand of the ACCESS_PLATFORM definition, the host will
> only ever try to access memory addresses that are supplied to it by the
> guest, so all of the secure guest memory that the host cares about is
> accessible:
> 
> If this feature bit is set to 0, then the device has same access to
> memory addresses supplied to it as the driver has. In particular,
> the device will always use physical addresses matching addresses
> used by the driver (typically meaning physical addresses used by the
> CPU) and not translated further, and can access any address supplied
> to it by the driver. When clear, this overrides any
> platform-specific description of whether device access is limited or
> translated in any way, e.g. whether an IOMMU may be present.
> 
> All of the above is true for POWER guests, whether they are secure
> guests or not.
> 
> Or are you saying that a virtio device may want to access memory
> addresses that weren't supplied to it by the driver?

Your logic would apply to IOMMUs as well.  For your mode, there are
specific encrypted memory regions that driver has access to but device
does not. that seems to violate the constraint.


> >> And from the device's point of view they're
> >> indistinguishable. It can't tell one guest that is using swiotlb from
> >> one that isn't. And that implies that secure guest vs regular guest
> >> isn't a virtio interface issue, it's "guest internal affairs". So
> >> there's no reason to reflect that in the feature flags.
> >
> > So don't. The way not to reflect that in the feature flags is
> > to set ACCESS_PLATFORM.  Then you say *I don't care let platform device*.
> >
> >
> > Without ACCESS_PLATFORM
> > virtio has a very specific opinion about the security of the
> > device, and that opinion is that device is part of the guest
> > supervisor security domain.
> 
> Sorry for being a bit dense, but not sure what "the device is part of
> the guest supervisor security domain" means. In powerpc-speak,
> "supervisor" is the operating system so perhaps that explains my
> confusion. Are you saying that without ACCESS_PLATFORM, the guest
> considers the host to be part of the guest operating system's security
> domain?

I think so. The spec says "device has same access as driver".

> If so, does that have any other implication besides "the host
> can access any address supplied to it by the driver"? If that is the
> case, perhaps the definition of ACCESS_PLATFORM needs to be amended to
> include that information because it's not part of the current
> definition.
> 
> >> That said, we still would like to arrive at a proper design for this
> >> rather than add yet another hack if we can avoid it. So here's another
> >> proposal: considering that the dma-direct code (in kernel/dma/direct.c)
> >> automatically uses swiotlb when necessary (thanks to Christoph's recent
> >> DMA work), would it be ok to replace virtio's own direct-memory code
> >> that is used in the !ACCESS_PLATFORM case with the dma-direct code? That
> >> way we'll get swiotlb even with !ACCESS_PLATFORM, and virtio will get a
> >> code cleanup (replace open-coded stuff with calls to existing
> >> infrastructure).
> >
> > Let's say I have some doubts that there's an API that
> > matches what virtio with its bag of legacy compatibility exactly.
> 
> Ok.
> 
> >> > But the name "sev_active" makes me scared because at least AMD guys who
> >> > were doing the sensible thing and setting ACCESS_PLATFORM
> >>
> >> My understanding is, AMD guest-platform knows in advance that their
> >> guest will run in secure mode and hence sets the flag at the time of VM
> >> instantiation. Unfortunately we don't have that luxury on our platforms.
> >
> > Well you do have that luxury. It looks like that there are existing
> > guests that already acknowledge ACCESS_PLATFORM and you are not happy
> > with how that path is slow. So you are trying to optimize for
> > them by clearing ACCESS_PLATFORM and then you have lost ability
> > to invoke DMA API.
> >
> > For example if there was another flag just like ACCESS_PLATFORM
> > just not yet used by anyone, you would be all fine using that right?
> 
> Yes, a new flag sounds like a great idea. What about the definition
> below?
> 
> VIRTIO_F_ACCESS_PLATFORM_NO_IOMMU This feature has the same meaning as
> VIRTIO_F_ACCESS_PLATFORM both when set and when not set, with the
> exception that the IOMMU is explicitly defined to be off or bypassed
> when accessing memory addresses supplied to 

Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-20 Thread Dan Williams
On Wed, Mar 20, 2019 at 8:34 AM Dan Williams  wrote:
>
> On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
>  wrote:
> >
> > Aneesh Kumar K.V  writes:
> >
> > > Dan Williams  writes:
> > >
> > >>
> > >>> Now what will be page size used for mapping vmemmap?
> > >>
> > >> That's up to the architecture's vmemmap_populate() implementation.
> > >>
> > >>> Architectures
> > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
> > >>> device-dax with struct page in the device will have pfn reserve area 
> > >>> aligned
> > >>> to PAGE_SIZE with the above example? We can't map that using
> > >>> PMD_SIZE page size?
> > >>
> > >> IIUC, that's a different alignment. Currently that's handled by
> > >> padding the reservation area up to a section (128MB on x86) boundary,
> > >> but I'm working on patches to allow sub-section sized ranges to be
> > >> mapped.
> > >
> > > I am missing something w.r.t code. The below code align that using 
> > > nd_pfn->align
> > >
> > >   if (nd_pfn->mode == PFN_MODE_PMEM) {
> > >   unsigned long memmap_size;
> > >
> > >   /*
> > >* vmemmap_populate_hugepages() allocates the memmap array 
> > > in
> > >* HPAGE_SIZE chunks.
> > >*/
> > >   memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> > >   offset = ALIGN(start + SZ_8K + memmap_size + 
> > > dax_label_reserve,
> > >   nd_pfn->align) - start;
> > >   }
> > >
> > > IIUC that is finding the offset where to put vmemmap start. And that has
> > > to be aligned to the page size with which we may end up mapping vmemmap
> > > area right?
>
> Right, that's the physical offset of where the vmemmap ends, and the
> memory to be mapped begins.
>
> > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > > is to compute howmany pfns we should map for this pfn dev right?
> > >
> >
> > Also i guess those 4K assumptions there is wrong?
>
> Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
> needs to be revved and the PAGE_SIZE needs to be recorded in the
> info-block.

How often does a system change page size? Is it fixed, or do
environments change it from one boot to the next? I'm thinking through
the behavior of what do when the recorded PAGE_SIZE in the info-block
does not match the current system page size. The simplest option is to
just fail the device and require it to be reconfigured. Is that
acceptable?
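
For illustration only (the helper and the 'page_size' info-block field below
are assumptions, not the current libnvdimm code), the mismatch handling could
be as simple as refusing to enable the namespace:

/*
 * Hypothetical sketch: 'recorded_page_size' stands in for a page-size
 * value stored in a revved pfn info-block; the field does not exist in
 * the current superblock format.
 */
static int nd_pfn_validate_page_size(u32 recorded_page_size)
{
	if (recorded_page_size != PAGE_SIZE) {
		pr_err("pfn: info-block page size %u != system page size %lu\n",
		       recorded_page_size, PAGE_SIZE);
		return -EOPNOTSUPP;	/* fail the device, require reconfiguration */
	}
	return 0;
}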


[RFC PATCH 1/1] KVM: PPC: Report single stepping capability

2019-03-20 Thread Fabiano Rosas
When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request
the next instruction to be single stepped via the
KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure.

We currently don't have support for guest single stepping implemented
in Book3S HV.

This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order
to inform userspace about the state of single stepping support.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/kvm/powerpc.c | 5 +
 include/uapi/linux/kvm.h   | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8885377ec3e0..5ba990b0ec74 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -538,6 +538,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_IMMEDIATE_EXIT:
r = 1;
break;
+   case KVM_CAP_PPC_GUEST_DEBUG_SSTEP:
+#ifdef CONFIG_BOOKE
+   r = 1;
+   break;
+#endif
case KVM_CAP_PPC_PAIRED_SINGLES:
case KVM_CAP_PPC_OSI:
case KVM_CAP_PPC_GET_PVINFO:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..33e8a4db867e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
 #define KVM_CAP_HYPERV_CPUID 167
+#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 168
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.20.1



[RFC PATCH 0/1] KVM: PPC: Inform userspace about singlestep support

2019-03-20 Thread Fabiano Rosas
I am looking for a way to inform userspace about the lack of an
implementation in KVM HV for single stepping of instructions
(KVM_GUESTDBG_SINGLESTEP bit from the KVM_SET_GUEST_DEBUG ioctl).

This will be used by QEMU to decide whether to attempt a call to the
set_guest_debug ioctl (for BookE, KVM PR) or fall back to a QEMU-only
implementation (for KVM HV).
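
A rough sketch of the userspace side (not part of this series, and the actual
QEMU integration would differ in detail) is just a KVM_CHECK_EXTENSION probe:

/* Sketch only: probe the proposed capability before attempting
 * KVM_GUESTDBG_SINGLESTEP, and fall back to emulation when absent. */
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int kvm_has_guestdbg_sstep(int kvm_fd)
{
	/* kvm_fd: the /dev/kvm (or VM) file descriptor */
	int ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_GUEST_DEBUG_SSTEP);

	return ret > 0;	/* > 0 means the capability is available */
}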

QEMU thread:
http://patchwork.ozlabs.org/cover/1049811/

My current proposal is to introduce a ppc-specific capability for
this. However I'm not sure if this would be better as a cap common for
all architectures or even if it should report on all of the possible
set_guest_debug flags to cover for the future.

Please comment. Thanks.


Fabiano Rosas (1):
  KVM: PPC: Report single stepping capability

 arch/powerpc/kvm/powerpc.c | 5 +
 include/uapi/linux/kvm.h   | 1 +
 2 files changed, 6 insertions(+)

--
2.20.1



Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section

2019-03-20 Thread Qian Cai
On Wed, 2019-03-20 at 18:16 +, Catalin Marinas wrote:
> I think I have a simpler idea. Kmemleak allows punching holes in
> allocated objects, so just turn the data/bss sections into dedicated
> kmemleak objects. This happens when kmemleak is initialised, before the
> initcalls are invoked. The kvm_free_tmp() would just free the
> corresponding part of the bss.
> 
> Patch below, only tested briefly on arm64. Qian, could you give it a try
> on powerpc? Thanks.

It works great so far!


Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-20 Thread Alex Williamson
On Wed, 20 Mar 2019 15:38:24 +1100
David Gibson  wrote:

> On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > On Fri, 15 Mar 2019 19:18:35 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > platform puts all interconnected GPUs to the same IOMMU group.
> > > 
> > > However the user may want to pass individual GPUs to the userspace so
> > > in order to do so we need to put them into separate IOMMU groups and
> > > cut off the interconnects.
> > > 
> > > Thankfully V100 GPUs implement an interface to do by programming link
> > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > issued to the GPU.
> > > 
> > > This defines a reset_done() handler for V100 NVlink2 device which
> > > determines what links need to be disabled. This relies on presence
> > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > 
> > > This does not change the existing behaviour and instead adds
> > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > 
> > > The alternative approaches would be:
> > > 
> > > 1. do this in the system firmware (skiboot) but for that we would need
> > > to tell skiboot via an additional OPAL call whether or not we want this
> > > isolation - skiboot is unaware of IOMMU groups.
> > > 
> > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > the problem with that is at that point the device is not enabled, i.e.
> > > config space is not restored so we need to enable the device (i.e. MMIO
> > > bit in CMD register + program valid address to BAR0) in order to disable
> > > links and then perhaps undo all this initialization to bring the device
> > > back to the state where pci_try_reset_function() expects it to be.  
> > 
> > The trouble seems to be that this approach only maintains the isolation
> > exposed by the IOMMU group when vfio-pci is the active driver for the
> > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > incorporating groups in various ways.  
> 
> I don't think that reasoning is quite right.  An IOMMU group doesn't
> necessarily represent devices which *are* isolated, just devices which
> *can be* isolated.  There are plenty of instances when we don't need
> to isolate devices in different IOMMU groups: passing both groups to
> the same guest or userspace VFIO driver for example, or indeed when
> both groups are owned by regular host kernel drivers.
> 
> In at least some of those cases we also don't want to isolate the
> devices when we don't have to, usually for performance reasons.

I see IOMMU groups as representing the current isolation of the device,
not just the possible isolation.  If there are ways to break down that
isolation then ideally the group would be updated to reflect it.  The
ACS disable patches seem to support this, at boot time we can choose to
disable ACS at certain points in the topology to favor peer-to-peer
performance over isolation.  This is then reflected in the group
composition, because even though ACS *can be* enabled at the given
isolation points, it's intentionally not with this option.  Whether or
not a given user who owns multiple devices needs that isolation is
really beside the point, the user can choose to connect groups via IOMMU
mappings or reconfigure the system to disable ACS and potentially more
direct routing.  The IOMMU groups are still accurately reflecting the
topology and IOMMU based isolation.

> > So, if there's a device specific
> > way to configure the isolation reported in the group, which requires
> > some sort of active management against things like secondary bus
> > resets, then I think we need to manage it above the attached endpoint
> > driver.  
> 
> The problem is that above the endpoint driver, we don't actually have
> enough information about what should be isolated.  For VFIO we want to
> isolate things if they're in different containers, for most regular
> host kernel drivers we don't need to isolate at all (although we might
> as well when it doesn't have a cost).

This idea that we only want to isolate things if they're in different
containers is bogus, imo.  There are performance reasons why we might
not want things isolated, but there are also address space reasons why
we do.  If there are direct routes between devices, the user needs to
be aware of the IOVA pollution, if we maintain singleton groups, they
don't.  Granted we don't really account for this well in most
userspaces and fumble through it by luck of the address space layout
and lack of devices really attempting peer to peer access.


Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section

2019-03-20 Thread Catalin Marinas
On Thu, Mar 21, 2019 at 12:15:46AM +1100, Michael Ellerman wrote:
> Catalin Marinas  writes:
> > On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote:
> >> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void)
> >>  
> >>/* data/bss scanning */
> >>scan_large_block(_sdata, _edata);
> >> -  scan_large_block(__bss_start, __bss_stop);
> >> +
> >> +  if (bss_hole_start) {
> >> +  scan_large_block(__bss_start, bss_hole_start);
> >> +  scan_large_block(bss_hole_stop, __bss_stop);
> >> +  } else {
> >> +  scan_large_block(__bss_start, __bss_stop);
> >> +  }
> >> +
> >>scan_large_block(__start_ro_after_init, __end_ro_after_init);
> >
> > I'm not a fan of this approach but I couldn't come up with anything
> > better. I was hoping we could check for PageReserved() in scan_block()
> > but on arm64 it ends up not scanning the .bss at all.
> >
> > Until another user appears, I'm ok with this patch.
> >
> > Acked-by: Catalin Marinas 
> 
> I actually would like to rework this kvm_tmp thing to not be in bss at
> all. It's a bit of a hack and is incompatible with strict RWX.
> 
> If we size it a bit more conservatively we can hopefully just reserve
> some space in the text section for it.
> 
> I'm not going to have time to work on that immediately though, so if
> people want this fixed now then this patch could go in as a temporary
> solution.

I think I have a simpler idea. Kmemleak allows punching holes in
allocated objects, so just turn the data/bss sections into dedicated
kmemleak objects. This happens when kmemleak is initialised, before the
initcalls are invoked. The kvm_free_tmp() would just free the
corresponding part of the bss.

Patch below, only tested briefly on arm64. Qian, could you give it a try
on powerpc? Thanks.

8<--
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 683b5b3805bd..c4b8cb3c298d 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -712,6 +712,8 @@ static void kvm_use_magic_page(void)
 
 static __init void kvm_free_tmp(void)
 {
+   kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
+  ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
free_reserved_area(&kvm_tmp[kvm_tmp_index],
   &kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
 }
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 707fa5579f66..0f6adcbfc2c7 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1529,11 +1529,6 @@ static void kmemleak_scan(void)
}
rcu_read_unlock();
 
-   /* data/bss scanning */
-   scan_large_block(_sdata, _edata);
-   scan_large_block(__bss_start, __bss_stop);
-   scan_large_block(__start_ro_after_init, __end_ro_after_init);
-
 #ifdef CONFIG_SMP
/* per-cpu sections scanning */
for_each_possible_cpu(i)
@@ -2071,6 +2066,15 @@ void __init kmemleak_init(void)
}
local_irq_restore(flags);
 
+   /* register the data/bss sections */
+   create_object((unsigned long)_sdata, _edata - _sdata,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+   create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+   create_object((unsigned long)__start_ro_after_init,
+ __end_ro_after_init - __start_ro_after_init,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+
/*
 * This is the point where tracking allocations is safe. Automatic
 * scanning is started during the late initcall. Add the early logged


Re: [PATCH v3 2/5] ocxl: Clean up printf formats

2019-03-20 Thread Joe Perches
On Wed, 2019-03-20 at 16:34 +1100, Alastair D'Silva wrote:
> From: Alastair D'Silva 
> 
> Use %# instead of using a literal '0x'

I do not suggest this as reasonable.

There are 10's of thousands of uses of 0x%x in the kernel
and converting them to save a byte seems unnecessary.

$ git grep -P '0x%[\*\d\.]*[xX]' | wc -l
26120

And the %#x style is by far the lesser used form

$ git grep -P '%#[\*\d\.]*[xX]' | wc -l
2726

Also, the sized form of %#[size]x is frequently misused
where the size does not account for the initial 0x output.
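
For illustration (not from the original mail), the width interaction looks
like this:

#include <stdio.h>

int main(void)
{
	unsigned int v = 0x1234;

	printf("%#08x\n", v);	/* "0x001234"   - the 0x prefix eats two of the 8 columns */
	printf("0x%08x\n", v);	/* "0x00001234" - a full 8 hex digits after the 0x */
	return 0;
}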

> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
[]
> @@ -178,9 +178,9 @@ static int read_dvsec_vendor(struct pci_dev *dev)
>   pci_read_config_dword(dev, pos + OCXL_DVSEC_VENDOR_DLX_VERS, &dlx);
>  
>   dev_dbg(&dev->dev, "Vendor specific DVSEC:\n");
> - dev_dbg(&dev->dev, "  CFG version = 0x%x\n", cfg);
> - dev_dbg(&dev->dev, "  TLX version = 0x%x\n", tlx);
> - dev_dbg(&dev->dev, "  DLX version = 0x%x\n", dlx);
> + dev_dbg(&dev->dev, "  CFG version = %#x\n", cfg);
> + dev_dbg(&dev->dev, "  TLX version = %#x\n", tlx);
> + dev_dbg(&dev->dev, "  DLX version = %#x\n", dlx);

etc...




Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-03-20 Thread Thiago Jung Bauermann


Hello Michael,

Sorry for the delay in responding. We had some internal discussions on
this.

Michael S. Tsirkin  writes:

> On Mon, Feb 04, 2019 at 04:14:20PM -0200, Thiago Jung Bauermann wrote:
>>
>> Hello Michael,
>>
>> Michael S. Tsirkin  writes:
>>
>> > On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:
>> So while ACCESS_PLATFORM solves our problems for secure guests, we can't
>> turn it on by default because we can't affect legacy systems. Doing so
>> would penalize existing systems that can access all memory. They would
>> all have to unnecessarily go through address translations, and take a
>> performance hit.
>
> So as step one, you just give hypervisor admin an option to run legacy
> systems faster by blocking secure mode. I don't see why that is
> so terrible.

There are a few reasons why:

1. It's bad user experience to require people to fiddle with knobs for
obscure reasons if it's possible to design things such that they Just
Work.

2. "User" in this case can be a human directly calling QEMU, but could
also be libvirt or one of its users, or some other framework. This means
having to adjust and/or educate an open-ended number of people and
software. It's best avoided if possible.

3. The hypervisor admin and the admin of the guest system don't
necessarily belong to the same organization (e.g., cloud provider and
cloud customer), so there may be some friction when they need to
coordinate to get this right.

4. A feature of our design is that the guest may or may not decide to
"go secure" at boot time, so it's best not to depend on flags that may
or may not have been set at the time QEMU was started.

>> The semantics of ACCESS_PLATFORM assume that the hypervisor/QEMU knows
>> in advance - right when the VM is instantiated - that it will not have
>> access to all guest memory.
>
> Not quite. It just means that hypervisor can live with not having
> access to all memory. If platform wants to give it access
> to all memory that is quite all right.

Except that on powerpc it also means "there's an IOMMU present" and
there's no way to say "bypass IOMMU translation". :-/

>> Another way of looking at this issue which also explains our reluctance
>> is that the only difference between a secure guest and a regular guest
>> (at least regarding virtio) is that the former uses swiotlb while the
>> latter doens't.
>
> But swiotlb is just one implementation. It's a guest internal thing. The
> issue is that memory isn't host accessible.

>From what I understand of the ACCESS_PLATFORM definition, the host will
only ever try to access memory addresses that are supplied to it by the
guest, so all of the secure guest memory that the host cares about is
accessible:

If this feature bit is set to 0, then the device has same access to
memory addresses supplied to it as the driver has. In particular,
the device will always use physical addresses matching addresses
used by the driver (typically meaning physical addresses used by the
CPU) and not translated further, and can access any address supplied
to it by the driver. When clear, this overrides any
platform-specific description of whether device access is limited or
translated in any way, e.g. whether an IOMMU may be present.

All of the above is true for POWER guests, whether they are secure
guests or not.

Or are you saying that a virtio device may want to access memory
addresses that weren't supplied to it by the driver?

>> And from the device's point of view they're
>> indistinguishable. It can't tell one guest that is using swiotlb from
>> one that isn't. And that implies that secure guest vs regular guest
>> isn't a virtio interface issue, it's "guest internal affairs". So
>> there's no reason to reflect that in the feature flags.
>
> So don't. The way not to reflect that in the feature flags is
> to set ACCESS_PLATFORM.  Then you say *I don't care let platform device*.
>
>
> Without ACCESS_PLATFORM
> virtio has a very specific opinion about the security of the
> device, and that opinion is that device is part of the guest
> supervisor security domain.

Sorry for being a bit dense, but not sure what "the device is part of
the guest supervisor security domain" means. In powerpc-speak,
"supervisor" is the operating system so perhaps that explains my
confusion. Are you saying that without ACCESS_PLATFORM, the guest
considers the host to be part of the guest operating system's security
domain? If so, does that have any other implication besides "the host
can access any address supplied to it by the driver"? If that is the
case, perhaps the definition of ACCESS_PLATFORM needs to be amended to
include that information because it's not part of the current
definition.

>> That said, we still would like to arrive at a proper design for this
>> rather than add yet another hack if we can avoid it. So here's another
>> proposal: considering that the dma-direct code (in kernel/dma/direct.c)
>> automatically uses 

Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-20 Thread Dan Williams
On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
 wrote:
>
> Aneesh Kumar K.V  writes:
>
> > Dan Williams  writes:
> >
> >>
> >>> Now what will be page size used for mapping vmemmap?
> >>
> >> That's up to the architecture's vmemmap_populate() implementation.
> >>
> >>> Architectures
> >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
> >>> device-dax with struct page in the device will have pfn reserve area 
> >>> aligned
> >>> to PAGE_SIZE with the above example? We can't map that using
> >>> PMD_SIZE page size?
> >>
> >> IIUC, that's a different alignment. Currently that's handled by
> >> padding the reservation area up to a section (128MB on x86) boundary,
> >> but I'm working on patches to allow sub-section sized ranges to be
> >> mapped.
> >
> > I am missing something w.r.t code. The below code align that using 
> > nd_pfn->align
> >
> >   if (nd_pfn->mode == PFN_MODE_PMEM) {
> >   unsigned long memmap_size;
> >
> >   /*
> >* vmemmap_populate_hugepages() allocates the memmap array in
> >* HPAGE_SIZE chunks.
> >*/
> >   memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> >   offset = ALIGN(start + SZ_8K + memmap_size + 
> > dax_label_reserve,
> >   nd_pfn->align) - start;
> >   }
> >
> > IIUC that is finding the offset where to put vmemmap start. And that has
> > to be aligned to the page size with which we may end up mapping vmemmap
> > area right?

Right, that's the physical offset of where the vmemmap ends, and the
memory to be mapped begins.

> > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > is to compute howmany pfns we should map for this pfn dev right?
> >
>
> Also i guess those 4K assumptions there is wrong?

Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
needs to be revved and the PAGE_SIZE needs to be recorded in the
info-block.


Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING

2019-03-20 Thread Arnd Bergmann
On Wed, Mar 20, 2019 at 10:41 AM Arnd Bergmann  wrote:
>
> I've added your patch to my randconfig test setup and will let you
> know if I see anything noticeable. I'm currently testing clang-arm32,
> clang-arm64 and gcc-x86.

This is the only additional bug that has come up so far:

`.exit.text' referenced in section `.alt.smp.init' of
drivers/char/ipmi/ipmi_msghandler.o: defined in discarded section
`exit.text' of drivers/char/ipmi/ipmi_msghandler.o

diff --git a/arch/arm/kernel/atags.h b/arch/arm/kernel/atags.h
index 201100226301..84b12e33104d 100644
--- a/arch/arm/kernel/atags.h
+++ b/arch/arm/kernel/atags.h
@@ -5,7 +5,7 @@ void convert_to_tag_list(struct tag *tags);
 const struct machine_desc *setup_machine_tags(phys_addr_t __atags_pointer,
unsigned int machine_nr);
 #else
-static inline const struct machine_desc *
+static __always_inline const struct machine_desc *
 setup_machine_tags(phys_addr_t __atags_pointer, unsigned int machine_nr)
 {
early_print("no ATAGS support: can't continue\n");


Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section

2019-03-20 Thread Michael Ellerman
Catalin Marinas  writes:
> Hi Qian,
>
> On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote:
>> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void)
>>  
>>  /* data/bss scanning */
>>  scan_large_block(_sdata, _edata);
>> -scan_large_block(__bss_start, __bss_stop);
>> +
>> +if (bss_hole_start) {
>> +scan_large_block(__bss_start, bss_hole_start);
>> +scan_large_block(bss_hole_stop, __bss_stop);
>> +} else {
>> +scan_large_block(__bss_start, __bss_stop);
>> +}
>> +
>>  scan_large_block(__start_ro_after_init, __end_ro_after_init);
>
> I'm not a fan of this approach but I couldn't come up with anything
> better. I was hoping we could check for PageReserved() in scan_block()
> but on arm64 it ends up not scanning the .bss at all.
>
> Until another user appears, I'm ok with this patch.
>
> Acked-by: Catalin Marinas 

I actually would like to rework this kvm_tmp thing to not be in bss at
all. It's a bit of a hack and is incompatible with strict RWX.

If we size it a bit more conservatively we can hopefully just reserve
some space in the text section for it.
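
A rough sketch of that direction (an assumption only, not an existing patch)
would be to size kvm_tmp conservatively and place it in .text instead of .bss:

/* Sketch only: fixed 64K budget, placed in the kernel text mapping. */
#define KVM_TMP_SIZE	(64 * 1024)

static char kvm_tmp[KVM_TMP_SIZE] __attribute__((__section__(".text")));
static int kvm_tmp_index;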

I'm not going to have time to work on that immediately though, so if
people want this fixed now then this patch could go in as a temporary
solution.

cheers


Re: [PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-20 Thread Laurent Vivier
On 20/03/2019 13:47, Michael Ellerman wrote:
> Laurent Vivier  writes:
>> Hi Michael,
>>
>> as it seems good now, could you pick up this patch for merging?
> 
> I'll start picking up patches for next starting after rc2, so next week.
> 
> If you think it's a bug fix I can put it into fixes now, but I don't
> think it's a bug fix is it?

No, it's only cosmetic.

Thanks,
Laurent


Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING

2019-03-20 Thread Arnd Bergmann
On Wed, Mar 20, 2019 at 11:19 AM Masahiro Yamada
 wrote:
> On Wed, Mar 20, 2019 at 6:39 PM Arnd Bergmann  wrote:
> >
> > On Wed, Mar 20, 2019 at 7:41 AM Masahiro Yamada
> >  wrote:
> >
> > > It is unclear to me how to fix it.
> > > That's why I ended up with "depends on !MIPS".
> > >
> > >
> > >   MODPOST vmlinux.o
> > > arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2':
> > > sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base'
> > > sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base'
> > > sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base'
> > > sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base'
> > > sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base'
> > > arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined
> > > references to `mips_gcr_base'
> > >
> > >
> > > Perhaps, MIPS folks may know how to fix it.
> >
> > I would guess like this:
> >
> > diff --git a/arch/mips/include/asm/mips-cm.h 
> > b/arch/mips/include/asm/mips-cm.h
> > index 8bc5df49b0e1..a27483fedb7d 100644
> > --- a/arch/mips/include/asm/mips-cm.h
> > +++ b/arch/mips/include/asm/mips-cm.h
> > @@ -79,7 +79,7 @@ static inline int mips_cm_probe(void)
> >   *
> >   * Returns true if a CM is present in the system, else false.
> >   */
> > -static inline bool mips_cm_present(void)
> > +static __always_inline bool mips_cm_present(void)
> >  {
> >  #ifdef CONFIG_MIPS_CM
> > return mips_gcr_base != NULL;
> > @@ -93,7 +93,7 @@ static inline bool mips_cm_present(void)
> >   *
> >   * Returns true if the system implements an L2-only sync region, else 
> > false.
> >   */
> > -static inline bool mips_cm_has_l2sync(void)
> > +static __always_inline bool mips_cm_has_l2sync(void)
> >  {
> >  #ifdef CONFIG_MIPS_CM
> > return mips_cm_l2sync_base != NULL;
> >
>
>
> Thanks, I applied the above, but I still see
>  undefined reference to `mips_gcr_base'
>
>
> I attached .config to produce this error.
>
> I use prebuilt mips-linux-gcc from
> https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/

I got to this patch experimentally, it fixes the problem for me:

diff --git a/arch/mips/mm/sc-mips.c b/arch/mips/mm/sc-mips.c
index 394673991bab..d70d02da038b 100644
--- a/arch/mips/mm/sc-mips.c
+++ b/arch/mips/mm/sc-mips.c
@@ -181,7 +181,7 @@ static int __init mips_sc_probe_cm3(void)
return 0;
 }

-static inline int __init mips_sc_probe(void)
+static __always_inline int __init mips_sc_probe(void)
 {
struct cpuinfo_mips *c = _cpu_data;
unsigned int config1, config2;
diff --git a/arch/mips/include/asm/bitops.h b/arch/mips/include/asm/bitops.h
index 830c93a010c3..186c28463bf3 100644
--- a/arch/mips/include/asm/bitops.h
+++ b/arch/mips/include/asm/bitops.h
@@ -548,7 +548,7 @@ static inline unsigned long __fls(unsigned long word)
  * Returns 0..SZLONG-1
  * Undefined if no bit exists, so code should check against 0 first.
  */
-static inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __ffs(unsigned long word)
 {
return __fls(word & -word);
 }


It does look like a gcc bug though, as at least some of the references
are from a function that got split out from an inlined function but that
has no remaining call sites.

   Arnd


Re: [PATCH v5 05/10] powerpc: Add a framework for Kernel Userspace Access Protection

2019-03-20 Thread Christophe Leroy




Le 20/03/2019 à 13:57, Michael Ellerman a écrit :

Christophe Leroy  writes:

Le 08/03/2019 à 02:16, Michael Ellerman a écrit :

From: Christophe Leroy 

This patch implements a framework for Kernel Userspace Access
Protection.

Then subarches will have the possibility to provide their own
implementation by providing setup_kuap() and
allow/prevent_user_access().

Some platforms will need to know the area accessed and whether it is
accessed from read, write or both. Therefore source, destination and
size and handed over to the two functions.

mpe: Rename to allow/prevent rather than unlock/lock, and add
read/write wrappers. Drop the 32-bit code for now until we have an
implementation for it. Add kuap to pt_regs for 64-bit as well as
32-bit. Don't split strings, use pr_crit_ratelimited().

Signed-off-by: Christophe Leroy 
Signed-off-by: Russell Currey 
Signed-off-by: Michael Ellerman 
---
v5: Futex ops need read/write so use allow_user_acccess() there.
  Use #ifdef CONFIG_PPC64 in kup.h to fix build errors.
  Allow subarch to override allow_read/write_from/to_user().


Those little helpers that will just call allow_user_access() when
distinct read/write handling is not performed looks overkill to me.

Can't the subarch do it by itself based on the nullity of from/to ?

static inline void allow_user_access(void __user *to, const void __user
*from,
 unsigned long size)
{
if (to & from)
set_kuap(0);
else if (to)
set_kuap(AMR_KUAP_BLOCK_READ);
else if (from)
set_kuap(AMR_KUAP_BLOCK_WRITE);
}


You could implement it that way, but it reads better at the call sites
if we have:

allow_write_to_user(uaddr, sizeof(*uaddr));
vs:
allow_user_access(uaddr, NULL, sizeof(*uaddr));

So I'm inclined to keep them. It should all end up inlined and generate
the same code at the end of the day.



I was not suggesting to completely remove allow_write_to_user(); I fully
agree that it reads better at the call sites.


I was just thinking that allow_write_to_user() could remain generic and
call the subarch-specific allow_user_access(), instead of each subarch
providing its own allow_write_to_user().
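
For illustration, a minimal sketch of that shape (assuming only
allow_user_access()/prevent_user_access() are provided per subarch) could be:

/* Sketch only: generic read/write wrappers in asm/kup.h. */
static inline void allow_read_from_user(const void __user *from, unsigned long size)
{
	allow_user_access(NULL, from, size);
}

static inline void allow_write_to_user(void __user *to, unsigned long size)
{
	allow_user_access(to, NULL, size);
}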


But both solutions are OK for me in the end.

Christophe


Re: powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038

2019-03-20 Thread Michael Ellerman
On Wed, 2019-03-13 at 13:14:38 UTC, Michael Ellerman wrote:
> Jakub Drnec reported:
>   Setting the realtime clock can sometimes make the monotonic clock go
>   back by over a hundred years. Decreasing the realtime clock across
>   the y2k38 threshold is one reliable way to reproduce. Allegedly this
>   can also happen just by running ntpd, I have not managed to
>   reproduce that other than booting with rtc at >2038 and then running
>   ntp. When this happens, anything with timers (e.g. openjdk) breaks
>   rather badly.
> 
> And included a test case (slightly edited for brevity):
>   #define _POSIX_C_SOURCE 199309L
>   #include 
>   #include 
>   #include 
>   #include 
> 
>   long get_time(void) {
> struct timespec tp;
> clock_gettime(CLOCK_MONOTONIC, &tp);
> return tp.tv_sec + tp.tv_nsec / 1000000000;
>   }
> 
>   int main(void) {
> long last = get_time();
> while(1) {
>   long now = get_time();
>   if (now < last) {
> printf("clock went backwards by %ld seconds!\n", last - now);
>   }
>   last = now;
>   sleep(1);
> }
> return 0;
>   }
> 
> Which when run concurrently with:
>  # date -s 2040-1-1
>  # date -s 2037-1-1
> 
> Will detect the clock going backward.
> 
> The root cause is that wtom_clock_sec in struct vdso_data is only a
> 32-bit signed value, even though we set its value to be equal to
> tk->wall_to_monotonic.tv_sec which is 64-bits.
> 
> Because the monotonic clock starts at zero when the system boots the
> wall_to_montonic.tv_sec offset is negative for current and future
> dates. Currently on a freshly booted system the offset will be in the
> vicinity of negative 1.5 billion seconds.
> 
> However if the wall clock is set past the Y2038 boundary, the offset
> from wall to monotonic becomes less than negative 2^31, and no longer
> fits in 32-bits. When that value is assigned to wtom_clock_sec it is
> truncated and becomes positive, causing the VDSO assembly code to
> calculate CLOCK_MONOTONIC incorrectly.
> 
> That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
> it is not meant to do. Worse, if the time is then set back before the
> Y2038 boundary CLOCK_MONOTONIC will jump backward.
> 
> We can fix it simply by storing the full 64-bit offset in the
> vdso_data, and using that in the VDSO assembly code. We also shuffle
> some of the fields in vdso_data to avoid creating a hole.
> 
> The original commit that added the CLOCK_MONOTONIC support to the VDSO
> did actually use a 64-bit value for wtom_clock_sec, see commit
> a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support to
> 32 bits kernel") (Nov 2005). However just 3 days later it was
> converted to 32-bits in commit 0c37ec2aa88b ("[PATCH] powerpc: vdso
> fixes (take #2)"), and the bug has existed since then AFAICS.
> 
> Fixes: 0c37ec2aa88b ("[PATCH] powerpc: vdso fixes (take #2)")
> Cc: sta...@vger.kernel.org # v2.6.15+
> Link: http://lkml.kernel.org/r/hac.zfes.62bwlnvavmp.1st...@seznam.cz
> Reported-by: Jakub Drnec 
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/b5b4453e7912f056da1ca7572574cada

cheers


Re: [v2, 01/10] powerpc/6xx: fix setup and use of SPRN_SPRG_PGDIR for hash32

2019-03-20 Thread Michael Ellerman
On Mon, 2019-03-11 at 08:30:27 UTC, Christophe Leroy wrote:
> Not only the 603 but all 6xx need SPRN_SPRG_PGDIR to be initialised at
> startup. This patch move it from __setup_cpu_603() to start_here()
> and __secondary_start(), close to the initialisation of SPRN_THREAD.
> 
> Previously, virt addr of PGDIR was retrieved from thread struct.
> Now that it is the phys addr which is stored in SPRN_SPRG_PGDIR,
> hash_page() shall not convert it to phys anymore.
> This patch removes the conversion.
> 
> Fixes: 93c4a162b014("powerpc/6xx: Store PGDIR physical address in a SPRG")
> Reported-by: Guenter Roeck 
> Tested-by: Guenter Roeck 
> Signed-off-by: Christophe Leroy 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/4622a2d43101ea2e3d54a2af090f25a5

cheers


Re: [PATCH v5 02/10] powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle

2019-03-20 Thread Michael Ellerman
Akshay Adiga  writes:

> On Fri, Mar 08, 2019 at 12:16:11PM +1100, Michael Ellerman wrote:
>> In order to implement KUAP (Kernel Userspace Access Protection) on
>> Power9 we will be using the AMR, and therefore indirectly the
>> UAMOR/AMOR.
>> 
>> So save/restore these regs in the idle code.
>> 
>> Signed-off-by: Michael Ellerman 
>> ---
>> v5: Unchanged.
>> v4: New.
>> 
>>  arch/powerpc/kernel/idle_book3s.S | 27 +++
>>  1 file changed, 23 insertions(+), 4 deletions(-)
>
> Opps.. i posted a comment on the v4. 
>
> It would be good if we can make AMOR/UAMOR/AMR save-restore
> code power9 only.

Yes that would be a good optimisation.

If you can send an incremental patch against this one I'll squash it in.
If not I'll try and get it done at some point before merging.

cheers


Re: [PATCH v5 05/10] powerpc: Add a framework for Kernel Userspace Access Protection

2019-03-20 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 08/03/2019 à 02:16, Michael Ellerman a écrit :
>> From: Christophe Leroy 
>> 
>> This patch implements a framework for Kernel Userspace Access
>> Protection.
>> 
>> Then subarches will have the possibility to provide their own
>> implementation by providing setup_kuap() and
>> allow/prevent_user_access().
>> 
>> Some platforms will need to know the area accessed and whether it is
>> accessed from read, write or both. Therefore source, destination and
>> size and handed over to the two functions.
>> 
>> mpe: Rename to allow/prevent rather than unlock/lock, and add
>> read/write wrappers. Drop the 32-bit code for now until we have an
>> implementation for it. Add kuap to pt_regs for 64-bit as well as
>> 32-bit. Don't split strings, use pr_crit_ratelimited().
>> 
>> Signed-off-by: Christophe Leroy 
>> Signed-off-by: Russell Currey 
>> Signed-off-by: Michael Ellerman 
>> ---
>> v5: Futex ops need read/write so use allow_user_acccess() there.
>>  Use #ifdef CONFIG_PPC64 in kup.h to fix build errors.
>>  Allow subarch to override allow_read/write_from/to_user().
>
> Those little helpers that will just call allow_user_access() when 
> distinct read/write handling is not performed looks overkill to me.
>
> Can't the subarch do it by itself based on the nullity of from/to ?
>
> static inline void allow_user_access(void __user *to, const void __user 
> *from,
>unsigned long size)
> {
>   if (to & from)
>   set_kuap(0);
>   else if (to)
>   set_kuap(AMR_KUAP_BLOCK_READ);
>   else if (from)
>   set_kuap(AMR_KUAP_BLOCK_WRITE);
> }

You could implement it that way, but it reads better at the call sites
if we have:

allow_write_to_user(uaddr, sizeof(*uaddr));
vs:
allow_user_access(uaddr, NULL, sizeof(*uaddr));

So I'm inclined to keep them. It should all end up inlined and generate
the same code at the end of the day.

cheers


[PATCH] powerpc/dts/fsl: add crypto node alias for B4

2019-03-20 Thread Horia Geantă
crypto node alias is needed by U-boot to identify the node and
perform fix-ups, like adding "fsl,sec-era" property.

Signed-off-by: Horia Geantă 
---
 arch/powerpc/boot/dts/fsl/b4qds.dtsi | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/boot/dts/fsl/b4qds.dtsi 
b/arch/powerpc/boot/dts/fsl/b4qds.dtsi
index 999efd3bc167..05be919f3545 100644
--- a/arch/powerpc/boot/dts/fsl/b4qds.dtsi
+++ b/arch/powerpc/boot/dts/fsl/b4qds.dtsi
@@ -40,6 +40,7 @@
interrupt-parent = <&mpic>;
 
aliases {
+   crypto = &crypto;
phy_sgmii_10 = &phy_sgmii_10;
phy_sgmii_11 = &phy_sgmii_11;
phy_sgmii_1c = &phy_sgmii_1c;
-- 
2.17.1



Re: [PATCH] powerpc: Make some functions static

2019-03-20 Thread Michael Ellerman
Mathieu Malaterre  writes:
> On Tue, Mar 12, 2019 at 10:14 PM Christophe Leroy
>  wrote:
>>
>>
>>
>> Le 12/03/2019 à 21:31, Mathieu Malaterre a écrit :
>> > In commit cb9e4d10c448 ("[POWERPC] Add support for 750CL Holly board")
>> > new functions were added. Since these functions can be made static,
>> > make it so. While doing so, it turns out that holly_power_off and
>> > holly_halt are unused, so remove them.
>>
>> I would have said 'since these functions are only used in this C file,
>> make them static'.
>>
>> I think this could be split in two patches:
>> 1/ Remove unused functions, ie holly_halt() and holly_power_off().
>> 2/ Make the other ones static.
>
> Michael do you want two patches ?

That would be better if it's not too much trouble. A patch with a title
of "Make some functions static" shouldn't really be deleting functions
entirely.

cheers


Re: Disable kcov for slb routines.

2019-03-20 Thread Michael Ellerman
Mahesh Jagannath Salgaonkar  writes:

> On 3/14/19 5:13 PM, Michael Ellerman wrote:
>> On Mon, 2019-03-04 at 08:25:51 UTC, Mahesh J Salgaonkar wrote:
>>> From: Mahesh Salgaonkar 
>>>
>>> The kcov instrumentation inside SLB routines causes duplicate SLB entries
>>> to be added resulting into SLB multihit machine checks.
>>> Disable kcov instrumentation on slb.o
>>>
>>> Signed-off-by: Mahesh Salgaonkar 
>>> Acked-by: Andrew Donnellan 
>>> Tested-by: Satheesh Rajendran 
>> 
>> Applied to powerpc next, thanks.
>> 
>> https://git.kernel.org/powerpc/c/19d6907521b04206676741b26e05a152
>> 
>> cheers
>> 
>
> There was a v2 at http://patchwork.ozlabs.org/patch/1051718/, looks like
> v1 got picked up. But I see the applied commit does address Andrew's
> comments.

Sorry not sure how I missed v2.

cheers



Re: [PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-20 Thread Michael Ellerman
Laurent Vivier  writes:
> Hi Michael,
>
> as it seems good now, could you pick up this patch for merging?

I'll start picking up patches for next starting after rc2, so next week.

If you think it's a bug fix I can put it into fixes now, but I don't
think it's a bug fix is it?

cheers


Re: Shift overflow warnings in arch/powerpc/boot/addnote.c on 32-bit builds

2019-03-20 Thread Michael Ellerman
Mark Cave-Ayland  writes:

> Hi all,
>
> Whilst building the latest git master on my G4 I noticed the following shift 
> overflow
> warnings in the build log for arch/powerpc/boot/addnote.c:
>
>
> arch/powerpc/boot/addnote.c: In function ‘main’:
> arch/powerpc/boot/addnote.c:75:47: warning: right shift count >= width of type
> [-Wshift-count-overflow]
>  #define PUT_64BE(off, v)((PUT_32BE((off), (v) >> 32L), \
>^~
> arch/powerpc/boot/addnote.c:72:39: note: in definition of macro ‘PUT_16BE’
>  #define PUT_16BE(off, v)(buf[off] = ((v) >> 8) & 0xff, \
>^
> arch/powerpc/boot/addnote.c:75:27: note: in expansion of macro ‘PUT_32BE’
>  #define PUT_64BE(off, v)((PUT_32BE((off), (v) >> 32L), \
>^~~~
> arch/powerpc/boot/addnote.c:94:50: note: in expansion of macro ‘PUT_64BE’
>  #define PUT_64(off, v)  (e_data == ELFDATA2MSB ? PUT_64BE(off, v) : \
>   ^~~~
> arch/powerpc/boot/addnote.c:183:3: note: in expansion of macro ‘PUT_64’
>PUT_64(ph + PH_OFFSET, ns);
>^~


I don't think there's any situation in which a 32-bit addnote will be
run against a 64-bit ELF is there?

So I don't think there's an actual bug, but it would be good if we could
make the warning go away.
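
One untested way to do that (an assumption on how addnote.c could be changed,
not a submitted fix) is to force the shift onto an explicitly 64-bit value so
the ">> 32" is well defined even in a 32-bit build:

/* Sketch only: cast before shifting instead of shifting the raw value. */
#define PUT_64BE(off, v)	(PUT_32BE((off), (unsigned long long)(v) >> 32), \
				 PUT_32BE((off) + 4, (v)))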

cheers


Re: [RESEND PATCH v2] powerpc: mute unused-but-set-variable warnings

2019-03-20 Thread Michael Ellerman
Qian Cai  writes:
> On 3/19/19 5:21 AM, Christophe Leroy wrote:
>> Is there a reason for resending ? AFAICS, both are identical and still marked
>> new in patchwork:
>> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=76055
>> 
>
> "RESEND" because of no maintainer response for more than one week.

I don't know who told you to RESEND after a week, but especially at this
point in the development cycle a week is *way* too short.

And for trivial patches like this I may not get to them for several
weeks, I have other problems to fix like time going backward :)

In future please check patchwork and then if the patch is still new
after several weeks just send a ping in reply to that patch. A full
RESEND means I now have two identical patches to deal with in patchwork,
which makes more work for me.

cheers


Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING

2019-03-20 Thread Masahiro Yamada
Hi Arnd,


On Wed, Mar 20, 2019 at 6:39 PM Arnd Bergmann  wrote:
>
> On Wed, Mar 20, 2019 at 7:41 AM Masahiro Yamada
>  wrote:
>
> > It is unclear to me how to fix it.
> > That's why I ended up with "depends on !MIPS".
> >
> >
> >   MODPOST vmlinux.o
> > arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2':
> > sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base'
> > sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base'
> > sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base'
> > sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base'
> > sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base'
> > arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined
> > references to `mips_gcr_base'
> >
> >
> > Perhaps, MIPS folks may know how to fix it.
>
> I would guess like this:
>
> diff --git a/arch/mips/include/asm/mips-cm.h b/arch/mips/include/asm/mips-cm.h
> index 8bc5df49b0e1..a27483fedb7d 100644
> --- a/arch/mips/include/asm/mips-cm.h
> +++ b/arch/mips/include/asm/mips-cm.h
> @@ -79,7 +79,7 @@ static inline int mips_cm_probe(void)
>   *
>   * Returns true if a CM is present in the system, else false.
>   */
> -static inline bool mips_cm_present(void)
> +static __always_inline bool mips_cm_present(void)
>  {
>  #ifdef CONFIG_MIPS_CM
> return mips_gcr_base != NULL;
> @@ -93,7 +93,7 @@ static inline bool mips_cm_present(void)
>   *
>   * Returns true if the system implements an L2-only sync region, else false.
>   */
> -static inline bool mips_cm_has_l2sync(void)
> +static __always_inline bool mips_cm_has_l2sync(void)
>  {
>  #ifdef CONFIG_MIPS_CM
> return mips_cm_l2sync_base != NULL;
>


Thanks, I applied the above, but I still see
 undefined reference to `mips_gcr_base'


I attached .config to produce this error.

I use prebuilt mips-linux-gcc from
https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/


-- 
Best Regards
Masahiro Yamada


config.gz
Description: application/gzip


[PATCH v1 27/27] powerpc/mm: flatten function __find_linux_pte() step 3

2019-03-20 Thread Christophe Leroy
__find_linux_pte() is full of if/else which is hard to
follow although the handling is pretty simple.

Previous patches left a { } block. This patch removes it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/pgtable.c | 98 +++
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c1c6d0b79baa..db4a6253df92 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -348,59 +348,59 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
hpdp = (hugepd_t *)&pgd;
goto out_huge;
}
-   {
-   /*
-* Even if we end up with an unmap, the pgtable will not
-* be freed, because we do an rcu free and here we are
-* irq disabled
-*/
-   pdshift = PUD_SHIFT;
-   pudp = pud_offset(&pgd, ea);
-   pud  = READ_ONCE(*pudp);
 
-   if (pud_none(pud))
-   return NULL;
+   /*
+* Even if we end up with an unmap, the pgtable will not
+* be freed, because we do an rcu free and here we are
+* irq disabled
+*/
+   pdshift = PUD_SHIFT;
+   pudp = pud_offset(&pgd, ea);
+   pud  = READ_ONCE(*pudp);
 
-   if (pud_huge(pud)) {
-   ret_pte = (pte_t *) pudp;
-   goto out;
-   }
-   if (is_hugepd(__hugepd(pud_val(pud)))) {
-   hpdp = (hugepd_t *)&pud;
-   goto out_huge;
-   }
-   pdshift = PMD_SHIFT;
-   pmdp = pmd_offset(&pud, ea);
-   pmd  = READ_ONCE(*pmdp);
-   /*
-* A hugepage collapse is captured by pmd_none, because
-* it mark the pmd none and do a hpte invalidate.
-*/
-   if (pmd_none(pmd))
-   return NULL;
-
-   if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
-   if (is_thp)
-   *is_thp = true;
-   ret_pte = (pte_t *)pmdp;
-   goto out;
-   }
-   /*
-* pmd_large check below will handle the swap pmd pte
-* we need to do both the check because they are config
-* dependent.
-*/
-   if (pmd_huge(pmd) || pmd_large(pmd)) {
-   ret_pte = (pte_t *)pmdp;
-   goto out;
-   }
-   if (is_hugepd(__hugepd(pmd_val(pmd)))) {
-   hpdp = (hugepd_t *)&pmd;
-   goto out_huge;
-   }
+   if (pud_none(pud))
+   return NULL;
 
-   return pte_offset_kernel(, ea);
+   if (pud_huge(pud)) {
+   ret_pte = (pte_t *)pudp;
+   goto out;
}
+   if (is_hugepd(__hugepd(pud_val(pud)))) {
+   hpdp = (hugepd_t *)&pud;
+   goto out_huge;
+   }
+   pdshift = PMD_SHIFT;
+   pmdp = pmd_offset(&pud, ea);
+   pmd  = READ_ONCE(*pmdp);
+   /*
+* A hugepage collapse is captured by pmd_none, because
+* it mark the pmd none and do a hpte invalidate.
+*/
+   if (pmd_none(pmd))
+   return NULL;
+
+   if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+   if (is_thp)
+   *is_thp = true;
+   ret_pte = (pte_t *)pmdp;
+   goto out;
+   }
+   /*
+* pmd_large check below will handle the swap pmd pte
+* we need to do both the check because they are config
+* dependent.
+*/
+   if (pmd_huge(pmd) || pmd_large(pmd)) {
+   ret_pte = (pte_t *)pmdp;
+   goto out;
+   }
+   if (is_hugepd(__hugepd(pmd_val(pmd)))) {
+   hpdp = (hugepd_t *)&pmd;
+   goto out_huge;
+   }
+
+   return pte_offset_kernel(&pmd, ea);
+
 out_huge:
if (!hpdp)
return NULL;
-- 
2.13.3



[PATCH v1 26/27] powerpc/mm: flatten function __find_linux_pte() step 2

2019-03-20 Thread Christophe Leroy
__find_linux_pte() is full of if/else which is hard to
follow although the handling is pretty simple.

Previous patch left { } blocks. This patch removes the first one
by shifting its content to the left.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/pgtable.c | 62 +++
 1 file changed, 30 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index d332abeedf0a..c1c6d0b79baa 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -369,39 +369,37 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
hpdp = (hugepd_t *)&pud;
goto out_huge;
}
-   {
-   pdshift = PMD_SHIFT;
-   pmdp = pmd_offset(&pud, ea);
-   pmd  = READ_ONCE(*pmdp);
-   /*
-* A hugepage collapse is captured by pmd_none, because
-* it mark the pmd none and do a hpte invalidate.
-*/
-   if (pmd_none(pmd))
-   return NULL;
-
-   if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
-   if (is_thp)
-   *is_thp = true;
-   ret_pte = (pte_t *) pmdp;
-   goto out;
-   }
-   /*
-* pmd_large check below will handle the swap pmd pte
-* we need to do both the check because they are config
-* dependent.
-*/
-   if (pmd_huge(pmd) || pmd_large(pmd)) {
-   ret_pte = (pte_t *) pmdp;
-   goto out;
-   }
-   if (is_hugepd(__hugepd(pmd_val(pmd)))) {
-   hpdp = (hugepd_t *)&pmd;
-   goto out_huge;
-   }
-
-   return pte_offset_kernel(&pmd, ea);
+   pdshift = PMD_SHIFT;
+   pmdp = pmd_offset(&pud, ea);
+   pmd  = READ_ONCE(*pmdp);
+   /*
+* A hugepage collapse is captured by pmd_none, because
+* it mark the pmd none and do a hpte invalidate.
+*/
+   if (pmd_none(pmd))
+   return NULL;
+
+   if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+   if (is_thp)
+   *is_thp = true;
+   ret_pte = (pte_t *)pmdp;
+   goto out;
+   }
+   /*
+* pmd_large check below will handle the swap pmd pte
+* we need to do both the check because they are config
+* dependent.
+*/
+   if (pmd_huge(pmd) || pmd_large(pmd)) {
+   ret_pte = (pte_t *)pmdp;
+   goto out;
}
+   if (is_hugepd(__hugepd(pmd_val(pmd)))) {
+   hpdp = (hugepd_t *)&pmd;
+   goto out_huge;
+   }
+
+   return pte_offset_kernel(&pmd, ea);
}
 out_huge:
if (!hpdp)
-- 
2.13.3



[PATCH v1 25/27] powerpc/mm: flatten function __find_linux_pte()

2019-03-20 Thread Christophe Leroy
__find_linux_pte() is full of if/else which is hard to
follow although the handling is pretty simple.

This patch flattens the function by getting rid of as much if/else
as possible. In order to ease the review, this is done in two steps.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/pgtable.c | 32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9f4ccd15849f..d332abeedf0a 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -339,12 +339,16 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 */
if (pgd_none(pgd))
return NULL;
-   else if (pgd_huge(pgd)) {
-   ret_pte = (pte_t *) pgdp;
+
+   if (pgd_huge(pgd)) {
+   ret_pte = (pte_t *)pgdp;
goto out;
-   } else if (is_hugepd(__hugepd(pgd_val(pgd))))
+   }
+   if (is_hugepd(__hugepd(pgd_val(pgd)))) {
hpdp = (hugepd_t *)&pgd;
-   else {
+   goto out_huge;
+   }
+   {
/*
 * Even if we end up with an unmap, the pgtable will not
 * be freed, because we do an rcu free and here we are
@@ -356,12 +360,16 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 
if (pud_none(pud))
return NULL;
-   else if (pud_huge(pud)) {
+
+   if (pud_huge(pud)) {
ret_pte = (pte_t *) pudp;
goto out;
-   } else if (is_hugepd(__hugepd(pud_val(pud))))
+   }
+   if (is_hugepd(__hugepd(pud_val(pud)))) {
hpdp = (hugepd_t *)&pud;
-   else {
+   goto out_huge;
+   }
+   {
pdshift = PMD_SHIFT;
pmdp = pmd_offset(&pud, ea);
pmd  = READ_ONCE(*pmdp);
@@ -386,12 +394,16 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
if (pmd_huge(pmd) || pmd_large(pmd)) {
ret_pte = (pte_t *) pmdp;
goto out;
-   } else if (is_hugepd(__hugepd(pmd_val(pmd))))
+   }
+   if (is_hugepd(__hugepd(pmd_val(pmd)))) {
hpdp = (hugepd_t *)&pmd;
-   else
-   return pte_offset_kernel(&pmd, ea);
+   goto out_huge;
+   }
+
+   return pte_offset_kernel(&pmd, ea);
}
}
+out_huge:
if (!hpdp)
return NULL;
 
-- 
2.13.3



[PATCH v1 24/27] powerpc: define subarch SLB_ADDR_LIMIT_DEFAULT

2019-03-20 Thread Christophe Leroy
This patch defines a subarch specific SLB_ADDR_LIMIT_DEFAULT
to remove the #ifdefs around the setup of mm->context.slb_addr_limit

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/slice.h | 2 ++
 arch/powerpc/include/asm/nohash/32/slice.h | 2 ++
 arch/powerpc/kernel/setup-common.c | 8 +---
 arch/powerpc/mm/slice.c| 6 +-
 4 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/slice.h 
b/arch/powerpc/include/asm/book3s/64/slice.h
index af498b0da21a..8da15958dcd1 100644
--- a/arch/powerpc/include/asm/book3s/64/slice.h
+++ b/arch/powerpc/include/asm/book3s/64/slice.h
@@ -13,6 +13,8 @@
 #define SLICE_NUM_HIGH (H_PGTABLE_RANGE >> SLICE_HIGH_SHIFT)
 #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT)
 
+#define SLB_ADDR_LIMIT_DEFAULT DEFAULT_MAP_WINDOW_USER64
+
 #else /* CONFIG_PPC_MM_SLICES */
 
 #define get_slice_psize(mm, addr)  ((mm)->context.user_psize)
diff --git a/arch/powerpc/include/asm/nohash/32/slice.h 
b/arch/powerpc/include/asm/nohash/32/slice.h
index 777d62e40ac0..39eb0154ae2d 100644
--- a/arch/powerpc/include/asm/nohash/32/slice.h
+++ b/arch/powerpc/include/asm/nohash/32/slice.h
@@ -13,6 +13,8 @@
 #define SLICE_NUM_HIGH 0ul
 #define GET_HIGH_SLICE_INDEX(addr) (addr & 0)
 
+#define SLB_ADDR_LIMIT_DEFAULT DEFAULT_MAP_WINDOW
+
 #endif /* CONFIG_PPC_MM_SLICES */
 
 #endif /* _ASM_POWERPC_NOHASH_32_SLICE_H */
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 2e5dfb6e0823..af2682d052a2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -948,14 +948,8 @@ void __init setup_arch(char **cmdline_p)
init_mm.brk = klimit;
 
 #ifdef CONFIG_PPC_MM_SLICES
-#ifdef CONFIG_PPC64
if (!radix_enabled())
-   init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64;
-#elif defined(CONFIG_PPC_8xx)
-   init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW;
-#else
-#error "context.addr_limit not initialized."
-#endif
+   init_mm.context.slb_addr_limit = SLB_ADDR_LIMIT_DEFAULT;
 #endif
 
 #ifdef CONFIG_SPAPR_TCE_IOMMU
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 50b1a5528384..64513cf47e5b 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -652,11 +652,7 @@ void slice_init_new_context_exec(struct mm_struct *mm)
 * case of fork it is just inherited from the mm being
 * duplicated.
 */
-#ifdef CONFIG_PPC64
-   mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64;
-#else
-   mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW;
-#endif
+   mm->context.slb_addr_limit = SLB_ADDR_LIMIT_DEFAULT;
 
mm->context.user_psize = psize;
 
-- 
2.13.3



[PATCH v1 23/27] powerpc/mm: remove a couple of #ifdef CONFIG_PPC_64K_PAGES in mm/slice.c

2019-03-20 Thread Christophe Leroy
This patch replaces a couple of #ifdef CONFIG_PPC_64K_PAGES
by IS_ENABLED(CONFIG_PPC_64K_PAGES) to improve code maintainability.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/slice.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 357d64e14757..50b1a5528384 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -558,14 +558,13 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
newaddr = slice_find_area(mm, len, &potential_mask,
  psize, topdown, high_limit);
 
-#ifdef CONFIG_PPC_64K_PAGES
-   if (newaddr == -ENOMEM && psize == MMU_PAGE_64K) {
+   if (IS_ENABLED(CONFIG_PPC_64K_PAGES) && newaddr == -ENOMEM &&
+   psize == MMU_PAGE_64K) {
/* retry the search with 4k-page slices included */
slice_or_mask(&potential_mask, &potential_mask, compat_maskp);
newaddr = slice_find_area(mm, len, &potential_mask,
  psize, topdown, high_limit);
}
-#endif
 
if (newaddr == -ENOMEM)
return -ENOMEM;
@@ -731,9 +730,9 @@ int slice_is_hugepage_only_range(struct mm_struct *mm, 
unsigned long addr,
VM_BUG_ON(radix_enabled());
 
maskp = slice_mask_for_size(&mm->context, psize);
-#ifdef CONFIG_PPC_64K_PAGES
+
/* We need to account for 4k slices too */
-   if (psize == MMU_PAGE_64K) {
+   if (IS_ENABLED(CONFIG_PPC_64K_PAGES) && psize == MMU_PAGE_64K) {
const struct slice_mask *compat_maskp;
struct slice_mask available;
 
@@ -741,7 +740,6 @@ int slice_is_hugepage_only_range(struct mm_struct *mm, 
unsigned long addr,
slice_or_mask(&available, maskp, compat_maskp);
return !slice_check_range_fits(mm, &available, addr, len);
}
-#endif
 
return !slice_check_range_fits(mm, maskp, addr, len);
 }
-- 
2.13.3



[PATCH v1 22/27] powerpc/mm: move slice_mask_for_size() into mmu.h

2019-03-20 Thread Christophe Leroy
Move slice_mask_for_size() into subarch mmu.h

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/mmu.h | 22 +
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 18 ++
 arch/powerpc/mm/slice.c  | 36 
 3 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 1ceee000c18d..927e3714b0d8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -123,7 +123,6 @@ typedef struct {
/* NPU NMMU context */
struct npu_context *npu_context;
 
-#ifdef CONFIG_PPC_MM_SLICES
 /* SLB page size encodings*/
unsigned char low_slices_psize[BITS_PER_LONG / BITS_PER_BYTE];
unsigned char high_slices_psize[SLICE_ARRAY_SIZE];
@@ -136,9 +135,6 @@ typedef struct {
struct slice_mask mask_16m;
struct slice_mask mask_16g;
 # endif
-#else
-   u16 sllp;   /* SLB page size encoding */
-#endif
unsigned long vdso_base;
 #ifdef CONFIG_PPC_SUBPAGE_PROT
struct subpage_prot_table spt;
@@ -172,6 +168,24 @@ extern int mmu_vmalloc_psize;
 extern int mmu_vmemmap_psize;
 extern int mmu_io_psize;
 
+static inline struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int 
psize)
+{
+#ifdef CONFIG_PPC_64K_PAGES
+   if (psize == MMU_PAGE_64K)
+   return >mask_64k;
+#endif
+   if (psize == MMU_PAGE_4K)
+   return >mask_4k;
+#ifdef CONFIG_HUGETLB_PAGE
+   if (psize == MMU_PAGE_16M)
+   return >mask_16m;
+   if (psize == MMU_PAGE_16G)
+   return >mask_16g;
+#endif
+   WARN_ON(true);
+   return NULL;
+}
+
 /* MMU initialization */
 void mmu_early_init_devtree(void);
 void hash__early_init_devtree(void);
diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h 
b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
index 0a1a3fc54e54..4ba92c48b3a5 100644
--- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
@@ -255,4 +255,22 @@ extern s32 patch__itlbmiss_perf, patch__dtlbmiss_perf;
 
 #define mmu_linear_psize   MMU_PAGE_8M
 
+#ifndef __ASSEMBLY__
+#ifdef CONFIG_PPC_MM_SLICES
+static inline struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int 
psize)
+{
+   if (psize == mmu_virtual_psize)
+   return >mask_base_psize;
+#ifdef CONFIG_HUGETLB_PAGE
+   if (psize == MMU_PAGE_512K)
+   return >mask_512k;
+   if (psize == MMU_PAGE_8M)
+   return >mask_8m;
+#endif
+   WARN_ON(true);
+   return NULL;
+}
+#endif
+#endif /* !__ASSEMBLY__ */
+
 #endif /* _ASM_POWERPC_MMU_8XX_H_ */
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 231fd88d97e2..357d64e14757 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -126,42 +126,6 @@ static void slice_mask_for_free(struct mm_struct *mm, 
struct slice_mask *ret,
__set_bit(i, ret->high_slices);
 }
 
-#ifdef CONFIG_PPC_BOOK3S_64
-static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize)
-{
-#ifdef CONFIG_PPC_64K_PAGES
-   if (psize == MMU_PAGE_64K)
-   return >mask_64k;
-#endif
-   if (psize == MMU_PAGE_4K)
-   return >mask_4k;
-#ifdef CONFIG_HUGETLB_PAGE
-   if (psize == MMU_PAGE_16M)
-   return >mask_16m;
-   if (psize == MMU_PAGE_16G)
-   return >mask_16g;
-#endif
-   WARN_ON(true);
-   return NULL;
-}
-#elif defined(CONFIG_PPC_8xx)
-static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize)
-{
-   if (psize == mmu_virtual_psize)
-   return >mask_base_psize;
-#ifdef CONFIG_HUGETLB_PAGE
-   if (psize == MMU_PAGE_512K)
-   return >mask_512k;
-   if (psize == MMU_PAGE_8M)
-   return >mask_8m;
-#endif
-   WARN_ON(true);
-   return NULL;
-}
-#else
-#error "Must define the slice masks for page sizes supported by the platform"
-#endif
-
 static bool slice_check_range_fits(struct mm_struct *mm,
   const struct slice_mask *available,
   unsigned long start, unsigned long len)
-- 
2.13.3
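
The idea behind this move, sketched stand-alone below: common code keeps
calling one helper name and each subarch header supplies its own static
inline definition, so no #ifdef/#elif chain is needed at the call site. All
names in the sketch are invented; in the kernel the two real definitions end
up in book3s/64/mmu.h and nohash/32/mmu-8xx.h while slice.c simply calls
slice_mask_for_size().

#include <stdio.h>

#define CONFIG_SUBARCH_A 1      /* pretend Kconfig selected subarch A */

struct ctx { int mask_small, mask_big; };

#ifdef CONFIG_SUBARCH_A
/* what subarch A's header would provide */
static inline int *mask_for_size(struct ctx *c, int big)
{
        return big ? &c->mask_big : &c->mask_small;
}
#else
/* what subarch B's header would provide: it only has one mask */
static inline int *mask_for_size(struct ctx *c, int big)
{
        return &c->mask_small;
}
#endif

/* common code: no #ifdef needed at the call site */
int main(void)
{
        struct ctx c = { .mask_small = 1, .mask_big = 2 };

        printf("%d\n", *mask_for_size(&c, 1));   /* 2 on subarch A */
        return 0;
}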



[PATCH v1 21/27] powerpc/mm: hand a context_t over to slice_mask_for_size() instead of mm_struct

2019-03-20 Thread Christophe Leroy
slice_mask_for_size() only uses mm->context, so hand it a pointer
to the context directly. This will help move the function into the
subarch mmu.h in the next patch by avoiding having to include the
definition of struct mm_struct.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/slice.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index f98b9e812c62..231fd88d97e2 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -127,33 +127,33 @@ static void slice_mask_for_free(struct mm_struct *mm, 
struct slice_mask *ret,
 }
 
 #ifdef CONFIG_PPC_BOOK3S_64
-static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
+static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize)
 {
 #ifdef CONFIG_PPC_64K_PAGES
if (psize == MMU_PAGE_64K)
-   return >context.mask_64k;
+   return >mask_64k;
 #endif
if (psize == MMU_PAGE_4K)
-   return >context.mask_4k;
+   return >mask_4k;
 #ifdef CONFIG_HUGETLB_PAGE
if (psize == MMU_PAGE_16M)
-   return >context.mask_16m;
+   return >mask_16m;
if (psize == MMU_PAGE_16G)
-   return >context.mask_16g;
+   return >mask_16g;
 #endif
WARN_ON(true);
return NULL;
 }
 #elif defined(CONFIG_PPC_8xx)
-static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
+static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize)
 {
if (psize == mmu_virtual_psize)
-   return >context.mask_base_psize;
+   return >mask_base_psize;
 #ifdef CONFIG_HUGETLB_PAGE
if (psize == MMU_PAGE_512K)
-   return >context.mask_512k;
+   return >mask_512k;
if (psize == MMU_PAGE_8M)
-   return >context.mask_8m;
+   return >mask_8m;
 #endif
WARN_ON(true);
return NULL;
@@ -221,7 +221,7 @@ static void slice_convert(struct mm_struct *mm,
unsigned long i, flags;
int old_psize;
 
-   psize_mask = slice_mask_for_size(mm, psize);
+   psize_mask = slice_mask_for_size(>context, psize);
 
/* We need to use a spinlock here to protect against
 * concurrent 64k -> 4k demotion ...
@@ -238,7 +238,7 @@ static void slice_convert(struct mm_struct *mm,
 
/* Update the slice_mask */
old_psize = (lpsizes[index] >> (mask_index * 4)) & 0xf;
-   old_mask = slice_mask_for_size(mm, old_psize);
+   old_mask = slice_mask_for_size(>context, old_psize);
old_mask->low_slices &= ~(1u << i);
psize_mask->low_slices |= 1u << i;
 
@@ -257,7 +257,7 @@ static void slice_convert(struct mm_struct *mm,
 
/* Update the slice_mask */
old_psize = (hpsizes[index] >> (mask_index * 4)) & 0xf;
-   old_mask = slice_mask_for_size(mm, old_psize);
+   old_mask = slice_mask_for_size(>context, old_psize);
__clear_bit(i, old_mask->high_slices);
__set_bit(i, psize_mask->high_slices);
 
@@ -504,7 +504,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
/* First make up a "good" mask of slices that have the right size
 * already
 */
-   maskp = slice_mask_for_size(mm, psize);
+   maskp = slice_mask_for_size(>context, psize);
 
/*
 * Here "good" means slices that are already the right page size,
@@ -531,7 +531,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
 * a pointer to good mask for the next code to use.
 */
if (IS_ENABLED(CONFIG_PPC_64K_PAGES) && psize == MMU_PAGE_64K) {
-   compat_maskp = slice_mask_for_size(mm, MMU_PAGE_4K);
+   compat_maskp = slice_mask_for_size(>context, MMU_PAGE_4K);
if (fixed)
slice_or_mask(_mask, maskp, compat_maskp);
else
@@ -709,7 +709,7 @@ void slice_init_new_context_exec(struct mm_struct *mm)
/*
 * Slice mask cache starts zeroed, fill the default size cache.
 */
-   mask = slice_mask_for_size(mm, psize);
+   mask = slice_mask_for_size(>context, psize);
mask->low_slices = ~0UL;
if (SLICE_NUM_HIGH)
bitmap_fill(mask->high_slices, SLICE_NUM_HIGH);
@@ -766,14 +766,14 @@ int slice_is_hugepage_only_range(struct mm_struct *mm, 
unsigned long addr,
 
VM_BUG_ON(radix_enabled());
 
-   maskp = slice_mask_for_size(mm, psize);
+   maskp = slice_mask_for_size(>context, psize);
 #ifdef CONFIG_PPC_64K_PAGES
/* We need to account for 4k slices too */
if (psize == MMU_PAGE_64K) {
const struct slice_mask *compat_maskp;
struct slice_mask available;
 
-   compat_maskp = 

[PATCH v1 19/27] powerpc/mm: drop slice DEBUG

2019-03-20 Thread Christophe Leroy
slice is now a mature functionality. Drop the DEBUG stuff.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/slice.c | 62 -
 1 file changed, 4 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 011d470ea340..99983dc4e484 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -41,28 +41,6 @@
 
 static DEFINE_SPINLOCK(slice_convert_lock);
 
-#ifdef DEBUG
-int _slice_debug = 1;
-
-static void slice_print_mask(const char *label, const struct slice_mask *mask)
-{
-   if (!_slice_debug)
-   return;
-   pr_devel("%s low_slice: %*pbl\n", label,
-   (int)SLICE_NUM_LOW, >low_slices);
-   pr_devel("%s high_slice: %*pbl\n", label,
-   (int)SLICE_NUM_HIGH, mask->high_slices);
-}
-
-#define slice_dbg(fmt...) do { if (_slice_debug) pr_devel(fmt); } while (0)
-
-#else
-
-static void slice_print_mask(const char *label, const struct slice_mask *mask) 
{}
-#define slice_dbg(fmt...)
-
-#endif
-
 static inline bool slice_addr_is_low(unsigned long addr)
 {
u64 tmp = (u64)addr;
@@ -245,9 +223,6 @@ static void slice_convert(struct mm_struct *mm,
unsigned long i, flags;
int old_psize;
 
-   slice_dbg("slice_convert(mm=%p, psize=%d)\n", mm, psize);
-   slice_print_mask(" mask", mask);
-
psize_mask = slice_mask_for_size(mm, psize);
 
/* We need to use a spinlock here to protect against
@@ -293,10 +268,6 @@ static void slice_convert(struct mm_struct *mm,
(((unsigned long)psize) << (mask_index * 4));
}
 
-   slice_dbg(" lsps=%lx, hsps=%lx\n",
- (unsigned long)mm->context.low_slices_psize,
- (unsigned long)mm->context.high_slices_psize);
-
spin_unlock_irqrestore(_convert_lock, flags);
 
copro_flush_all_slbs(mm);
@@ -523,14 +494,9 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
BUG_ON(mm->context.slb_addr_limit == 0);
VM_BUG_ON(radix_enabled());
 
-   slice_dbg("slice_get_unmapped_area(mm=%p, psize=%d...\n", mm, psize);
-   slice_dbg(" addr=%lx, len=%lx, flags=%lx, topdown=%d\n",
- addr, len, flags, topdown);
-
/* If hint, make sure it matches our alignment restrictions */
if (!fixed && addr) {
addr = _ALIGN_UP(addr, page_size);
-   slice_dbg(" aligned addr=%lx\n", addr);
/* Ignore hint if it's too large or overlaps a VMA */
if (addr > high_limit - len || addr < mmap_min_addr ||
!slice_area_is_free(mm, addr, len))
@@ -576,17 +542,12 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
slice_copy_mask(_mask, maskp);
}
 
-   slice_print_mask(" good_mask", _mask);
-   if (compat_maskp)
-   slice_print_mask(" compat_mask", compat_maskp);
-
/* First check hint if it's valid or if we have MAP_FIXED */
if (addr != 0 || fixed) {
/* Check if we fit in the good mask. If we do, we just return,
 * nothing else to do
 */
if (slice_check_range_fits(mm, _mask, addr, len)) {
-   slice_dbg(" fits good !\n");
newaddr = addr;
goto return_addr;
}
@@ -596,13 +557,10 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
 */
newaddr = slice_find_area(mm, len, _mask,
  psize, topdown, high_limit);
-   if (newaddr != -ENOMEM) {
-   /* Found within the good mask, we don't have to setup,
-* we thus return directly
-*/
-   slice_dbg(" found area at 0x%lx\n", newaddr);
+
+   /* Found within good mask, don't have to setup, thus return 
directly */
+   if (newaddr != -ENOMEM)
goto return_addr;
-   }
}
/*
 * We don't fit in the good mask, check what other slices are
@@ -610,11 +568,9 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
 */
slice_mask_for_free(mm, _mask, high_limit);
slice_or_mask(_mask, _mask, _mask);
-   slice_print_mask(" potential", _mask);
 
if (addr != 0 || fixed) {
if (slice_check_range_fits(mm, _mask, addr, len)) {
-   slice_dbg(" fits potential !\n");
newaddr = addr;
goto convert;
}
@@ -624,18 +580,14 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
if (fixed)
return -EBUSY;
 
-   slice_dbg(" search...\n");
-
/* If we 

[PATCH v1 20/27] powerpc/mm: remove unnecessary #ifdef CONFIG_PPC64

2019-03-20 Thread Christophe Leroy
For PPC32 that's a noop, and gcc is smart enough to optimise it away.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/slice.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 99983dc4e484..f98b9e812c62 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -96,13 +96,11 @@ static int slice_high_has_vma(struct mm_struct *mm, 
unsigned long slice)
unsigned long start = slice << SLICE_HIGH_SHIFT;
unsigned long end = start + (1ul << SLICE_HIGH_SHIFT);
 
-#ifdef CONFIG_PPC64
/* Hack, so that each addresses is controlled by exactly one
 * of the high or low area bitmaps, the first high area starts
 * at 4GB, not 0 */
if (start == 0)
-   start = SLICE_LOW_TOP;
-#endif
+   start = (unsigned long)SLICE_LOW_TOP;
 
return !slice_area_is_free(mm, start, end - start);
 }
-- 
2.13.3



[PATCH v1 18/27] powerpc/mm: cleanup remaining ifdef mess in hugetlbpage.c

2019-03-20 Thread Christophe Leroy
Only 3 subarches support huge pages. So when it is one of two of
them, it is not the third one.

And mmu_has_feature() is known by all subarches so IS_ENABLED() can
be used instead of #ifdef

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/hugetlbpage.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index dd62006e1243..a463ebf276b6 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -226,7 +226,7 @@ int __init alloc_bootmem_huge_page(struct hstate *h)
return __alloc_bootmem_huge_page(h);
 }
 
-#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
+#ifndef CONFIG_PPC_BOOK3S_64
 #define HUGEPD_FREELIST_SIZE \
((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
 
@@ -596,10 +596,10 @@ static int __init hugetlbpage_init(void)
return 0;
}
 
-#if !defined(CONFIG_PPC_FSL_BOOK3E) && !defined(CONFIG_PPC_8xx)
-   if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled() &&
+   !mmu_has_feature(MMU_FTR_16M_PAGE))
return -ENODEV;
-#endif
+
for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
unsigned shift;
unsigned pdshift;
@@ -637,10 +637,8 @@ static int __init hugetlbpage_init(void)
pgtable_cache_add(PTE_INDEX_SIZE);
else if (pdshift > shift)
pgtable_cache_add(pdshift - shift);
-#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
-   else
+   else if (IS_ENABLED(CONFIG_PPC_FSL_BOOK3E) || 
IS_ENABLED(CONFIG_PPC_8xx))
pgtable_cache_add(PTE_T_ORDER);
-#endif
}
 
if (IS_ENABLED(HUGETLB_PAGE_SIZE_VARIABLE))
-- 
2.13.3



[PATCH v1 17/27] powerpc/mm: cleanup HPAGE_SHIFT setup

2019-03-20 Thread Christophe Leroy
Only book3s/64 may select the default among several HPAGE_SHIFT values at runtime.
8xx always defines 512K pages as the default.
FSL_BOOK3E always defines 4M pages as the default.

This patch limits HUGETLB_PAGE_SIZE_VARIABLE to book3s/64 and
moves the definitions into the subarch files.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig |  2 +-
 arch/powerpc/include/asm/hugetlb.h   |  2 ++
 arch/powerpc/include/asm/page.h  | 11 ---
 arch/powerpc/mm/hugetlbpage-hash64.c | 16 
 arch/powerpc/mm/hugetlbpage.c| 23 +++
 5 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5d8e692d6470..7815eb0cc2a5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -390,7 +390,7 @@ source "kernel/Kconfig.hz"
 
 config HUGETLB_PAGE_SIZE_VARIABLE
bool
-   depends on HUGETLB_PAGE
+   depends on HUGETLB_PAGE && PPC_BOOK3S_64
default y
 
 config MATH_EMULATION
diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 84598c6b0959..20a101046cff 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -15,6 +15,8 @@
 
 extern bool hugetlb_disabled;
 
+void hugetlbpage_init_default(void);
+
 void flush_dcache_icache_hugepage(struct page *page);
 
 int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 0c11a7513919..eef10fe0e06f 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -28,10 +28,15 @@
 #define PAGE_SIZE  (ASM_CONST(1) << PAGE_SHIFT)
 
 #ifndef __ASSEMBLY__
-#ifdef CONFIG_HUGETLB_PAGE
-extern unsigned int HPAGE_SHIFT;
-#else
+#ifndef CONFIG_HUGETLB_PAGE
 #define HPAGE_SHIFT PAGE_SHIFT
+#elif defined(CONFIG_PPC_BOOK3S_64)
+extern unsigned int hpage_shift;
+#define HPAGE_SHIFT hpage_shift
+#elif defined(CONFIG_PPC_8xx)
+#define HPAGE_SHIFT19  /* 512k pages */
+#elif defined(CONFIG_PPC_FSL_BOOK3E)
+#define HPAGE_SHIFT22  /* 4M pages */
 #endif
 #define HPAGE_SIZE ((1UL) << HPAGE_SHIFT)
 #define HPAGE_MASK (~(HPAGE_SIZE - 1))
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c 
b/arch/powerpc/mm/hugetlbpage-hash64.c
index b0d9209d9a86..7a58204c3688 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -15,6 +15,9 @@
 #include 
 #include 
 
+unsigned int hpage_shift;
+EXPORT_SYMBOL(hpage_shift);
+
 extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
  unsigned long pa, unsigned long rlags,
  unsigned long vflags, int psize, int ssize);
@@ -145,3 +148,16 @@ void huge_ptep_modify_prot_commit(struct vm_area_struct 
*vma, unsigned long addr
   old_pte, pte);
set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
 }
+
+void hugetlbpage_init_default(void)
+{
+   /* Set default large page size. Currently, we pick 16M or 1M
+* depending on what is available
+*/
+   if (mmu_psize_defs[MMU_PAGE_16M].shift)
+   hpage_shift = mmu_psize_defs[MMU_PAGE_16M].shift;
+   else if (mmu_psize_defs[MMU_PAGE_1M].shift)
+   hpage_shift = mmu_psize_defs[MMU_PAGE_1M].shift;
+   else if (mmu_psize_defs[MMU_PAGE_2M].shift)
+   hpage_shift = mmu_psize_defs[MMU_PAGE_2M].shift;
+}
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 3b449c9d4e47..dd62006e1243 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -28,9 +28,6 @@
 
 bool hugetlb_disabled = false;
 
-unsigned int HPAGE_SHIFT;
-EXPORT_SYMBOL(HPAGE_SHIFT);
-
 #define hugepd_none(hpd)   (hpd_val(hpd) == 0)
 
 #define PTE_T_ORDER(__builtin_ffs(sizeof(pte_t)) - 
__builtin_ffs(sizeof(void *)))
@@ -646,23 +643,9 @@ static int __init hugetlbpage_init(void)
 #endif
}
 
-#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
-   /* Default hpage size = 4M on FSL_BOOK3E and 512k on 8xx */
-   if (mmu_psize_defs[MMU_PAGE_4M].shift)
-   HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
-   else if (mmu_psize_defs[MMU_PAGE_512K].shift)
-   HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_512K].shift;
-#else
-   /* Set default large page size. Currently, we pick 16M or 1M
-* depending on what is available
-*/
-   if (mmu_psize_defs[MMU_PAGE_16M].shift)
-   HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
-   else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-   HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
-   else if (mmu_psize_defs[MMU_PAGE_2M].shift)
-   HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift;
-#endif
+   if (IS_ENABLED(HUGETLB_PAGE_SIZE_VARIABLE))
+   

[PATCH v1 16/27] powerpc/mm: move hugetlb_disabled into asm/hugetlb.h

2019-03-20 Thread Christophe Leroy
No need to have this in asm/page.h, move it into asm/hugetlb.h

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/hugetlb.h | 2 ++
 arch/powerpc/include/asm/page.h| 1 -
 arch/powerpc/kernel/fadump.c   | 1 +
 arch/powerpc/mm/hash_utils_64.c| 1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index fd5c0873a57d..84598c6b0959 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -13,6 +13,8 @@
 #include 
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
+extern bool hugetlb_disabled;
+
 void flush_dcache_icache_hugepage(struct page *page);
 
 int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index ed870468ef6f..0c11a7513919 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -29,7 +29,6 @@
 
 #ifndef __ASSEMBLY__
 #ifdef CONFIG_HUGETLB_PAGE
-extern bool hugetlb_disabled;
 extern unsigned int HPAGE_SHIFT;
 #else
 #define HPAGE_SHIFT PAGE_SHIFT
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 45a8d0be1c96..25f063f56ec5 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0a4f939a8161..16ce13af6b9c 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
-- 
2.13.3



[PATCH v1 15/27] powerpc/mm: cleanup ifdef mess in add_huge_page_size()

2019-03-20 Thread Christophe Leroy
Introduce a subarch-specific helper check_and_get_huge_psize()
to check the huge page sizes and clean up the ifdef mess in
add_huge_page_size().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 27 +
 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h |  5 
 arch/powerpc/include/asm/nohash/hugetlb-book3e.h |  8 +
 arch/powerpc/mm/hugetlbpage.c| 37 ++--
 4 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index 177c81079209..4522a56a6269 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -108,4 +108,31 @@ static inline void hugepd_populate(hugepd_t *hpdp, pte_t 
*new, unsigned int pshi
 
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 
+static inline int check_and_get_huge_psize(int shift)
+{
+   int mmu_psize;
+
+   if (shift > SLICE_HIGH_SHIFT)
+   return -EINVAL;
+
+   mmu_psize = shift_to_mmu_psize(shift);
+
+   /*
+* We need to make sure that for different page sizes reported by
+* firmware we only add hugetlb support for page sizes that can be
+* supported by linux page table layout.
+* For now we have
+* Radix: 2M and 1G
+* Hash: 16M and 16G
+*/
+   if (radix_enabled()) {
+   if (mmu_psize != MMU_PAGE_2M && mmu_psize != MMU_PAGE_1G)
+   return -EINVAL;
+   } else {
+   if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
+   return -EINVAL;
+   }
+   return mmu_psize;
+}
+
 #endif
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h 
b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index eb90c2db7601..a442b499d5c8 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -37,4 +37,9 @@ static inline void hugepd_populate(hugepd_t *hpdp, pte_t 
*new, unsigned int pshi
 (pshift == PAGE_SHIFT_8M ? _PMD_PAGE_8M : 
_PMD_PAGE_512K));
 }
 
+static inline int check_and_get_huge_psize(int shift)
+{
+   return shift_to_mmu_psize(shift);
+}
+
 #endif /* _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H */
diff --git a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h 
b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h
index 51439bcfe313..ecd8694cb229 100644
--- a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h
+++ b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h
@@ -34,4 +34,12 @@ static inline void hugepd_populate(hugepd_t *hpdp, pte_t 
*new, unsigned int pshi
*hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | pshift);
 }
 
+static inline int check_and_get_huge_psize(int shift)
+{
+   if (shift & 1)  /* Not a power of 4 */
+   return -EINVAL;
+
+   return shift_to_mmu_psize(shift);
+}
+
 #endif /* _ASM_POWERPC_NOHASH_HUGETLB_BOOK3E_H */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 87358b89513e..3b449c9d4e47 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -549,13 +549,6 @@ unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
return vma_kernel_pagesize(vma);
 }
 
-static inline bool is_power_of_4(unsigned long x)
-{
-   if (is_power_of_2(x))
-   return (__ilog2(x) % 2) ? false : true;
-   return false;
-}
-
 static int __init add_huge_page_size(unsigned long long size)
 {
int shift = __ffs(size);
@@ -563,37 +556,13 @@ static int __init add_huge_page_size(unsigned long long 
size)
 
/* Check that it is a page size supported by the hardware and
 * that it fits within pagetable and slice limits. */
-   if (size <= PAGE_SIZE)
-   return -EINVAL;
-#if defined(CONFIG_PPC_FSL_BOOK3E)
-   if (!is_power_of_4(size))
+   if (size <= PAGE_SIZE || !is_power_of_2(size))
return -EINVAL;
-#elif !defined(CONFIG_PPC_8xx)
-   if (!is_power_of_2(size) || (shift > SLICE_HIGH_SHIFT))
-   return -EINVAL;
-#endif
 
-   if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
+   mmu_psize = check_and_get_huge_psize(size);
+   if (mmu_psize < 0)
return -EINVAL;
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   /*
-* We need to make sure that for different page sizes reported by
-* firmware we only add hugetlb support for page sizes that can be
-* supported by linux page table layout.
-* For now we have
-* Radix: 2M and 1G
-* Hash: 16M and 16G
-*/
-   if (radix_enabled()) {
-   if (mmu_psize != MMU_PAGE_2M && mmu_psize != MMU_PAGE_1G)
-   return -EINVAL;
-   } else {
-   if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
-   return -EINVAL;

[PATCH v1 14/27] powerpc/mm: no slice for nohash/64

2019-03-20 Thread Christophe Leroy
Only nohash/32 and book3s/64 support mm slices.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/64/slice.h | 7 ---
 arch/powerpc/include/asm/slice.h   | 4 +---
 arch/powerpc/platforms/Kconfig.cputype | 4 
 3 files changed, 5 insertions(+), 10 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/nohash/64/slice.h

diff --git a/arch/powerpc/include/asm/nohash/64/slice.h 
b/arch/powerpc/include/asm/nohash/64/slice.h
deleted file mode 100644
index 30adfdd4afde..
--- a/arch/powerpc/include/asm/nohash/64/slice.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_POWERPC_NOHASH_64_SLICE_H
-#define _ASM_POWERPC_NOHASH_64_SLICE_H
-
-#define get_slice_psize(mm, addr)  MMU_PAGE_4K
-
-#endif /* _ASM_POWERPC_NOHASH_64_SLICE_H */
diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h
index d85c85422fdf..49d950a14e25 100644
--- a/arch/powerpc/include/asm/slice.h
+++ b/arch/powerpc/include/asm/slice.h
@@ -4,9 +4,7 @@
 
 #ifdef CONFIG_PPC_BOOK3S_64
 #include 
-#elif defined(CONFIG_PPC64)
-#include 
-#elif defined(CONFIG_PPC_MMU_NOHASH)
+#elif defined(CONFIG_PPC_MMU_NOHASH_32)
 #include 
 #endif
 
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 842b2c7e156a..51ceeb046867 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -354,6 +354,10 @@ config PPC_MMU_NOHASH
def_bool y
depends on !PPC_BOOK3S
 
+config PPC_MMU_NOHASH_32
+   def_bool y
+   depends on PPC_MMU_NOHASH && PPC32
+
 config PPC_BOOK3E_MMU
def_bool y
depends on FSL_BOOKE || PPC_BOOK3E
-- 
2.13.3



[PATCH v1 13/27] powerpc/mm: define get_slice_psize() all the time

2019-03-20 Thread Christophe Leroy
get_slice_psize() can be defined regardless of CONFIG_PPC_MM_SLICES
to avoid ifdefs

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/slice.h | 4 
 arch/powerpc/mm/hugetlbpage.c| 4 +---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h
index 44816cbc4198..d85c85422fdf 100644
--- a/arch/powerpc/include/asm/slice.h
+++ b/arch/powerpc/include/asm/slice.h
@@ -38,6 +38,10 @@ void slice_setup_new_exec(void);
 
 static inline void slice_init_new_context_exec(struct mm_struct *mm) {}
 
+#ifndef get_slice_psize
+#define get_slice_psize(mm, addr)  MMU_PAGE_4K
+#endif
+
 #endif /* CONFIG_PPC_MM_SLICES */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26a57ebaf5cf..87358b89513e 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -540,14 +540,12 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr,
 
 unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
 {
-#ifdef CONFIG_PPC_MM_SLICES
/* With radix we don't use slice, so derive it from vma*/
-   if (!radix_enabled()) {
+   if (IS_ENABLED(CONFIG_PPC_MM_SLICES) && !radix_enabled()) {
unsigned int psize = get_slice_psize(vma->vm_mm, vma->vm_start);
 
return 1UL << mmu_psize_to_shift(psize);
}
-#endif
return vma_kernel_pagesize(vma);
 }
 
-- 
2.13.3
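
The #ifndef above is a fallback: a subarch header that defines its own
get_slice_psize() wins, and the generic MMU_PAGE_4K version is only used
when nothing else was defined, so callers never need an #ifdef of their
own. A stand-alone sketch of that fallback-macro pattern, with invented
names and a made-up SUBARCH_HAS_SLICES switch:

#include <stdio.h>

#define SUBARCH_HAS_SLICES 0    /* flip to 1 to mimic a slice-aware subarch */

/* --- what a slice-aware subarch header would provide ------------------ */
#if SUBARCH_HAS_SLICES
static int lookup_psize(unsigned long addr)
{
        return (addr & 0xf0000000UL) ? 65536 : 4096;
}
#define page_size_of(addr)      lookup_psize(addr)
#endif

/* --- generic header: fallback used only when nothing was defined ------ */
#ifndef page_size_of
#define page_size_of(addr)      4096    /* only one page size exists */
#endif

/* --- common code uses the macro without any #ifdef -------------------- */
int main(void)
{
        printf("%d\n", page_size_of(0x1000UL));
        return 0;
}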



[PATCH v1 11/27] powerpc/mm: split asm/hugetlb.h into dedicated subarch files

2019-03-20 Thread Christophe Leroy
Three subarches support hugepages:
- fsl book3e
- book3s/64
- 8xx

This patch splits asm/hugetlb.h to reduce the #ifdef mess.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 41 +++
 arch/powerpc/include/asm/hugetlb.h   | 89 ++--
 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 32 +
 arch/powerpc/include/asm/nohash/hugetlb-book3e.h | 31 +
 4 files changed, 108 insertions(+), 85 deletions(-)
 create mode 100644 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
 create mode 100644 arch/powerpc/include/asm/nohash/hugetlb-book3e.h

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index ec2a55a553c7..2f9cf2bc601c 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -62,4 +62,45 @@ extern pte_t huge_ptep_modify_prot_start(struct 
vm_area_struct *vma,
 extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 unsigned long addr, pte_t *ptep,
 pte_t old_pte, pte_t new_pte);
+/*
+ * This should work for other subarchs too. But right now we use the
+ * new format only for 64bit book3s
+ */
+static inline pte_t *hugepd_page(hugepd_t hpd)
+{
+   if (WARN_ON(!hugepd_ok(hpd)))
+   return NULL;
+   /*
+* We have only four bits to encode, MMU page size
+*/
+   BUILD_BUG_ON((MMU_PAGE_COUNT - 1) > 0xf);
+   return __va(hpd_val(hpd) & HUGEPD_ADDR_MASK);
+}
+
+static inline unsigned int hugepd_mmu_psize(hugepd_t hpd)
+{
+   return (hpd_val(hpd) & HUGEPD_SHIFT_MASK) >> 2;
+}
+
+static inline unsigned int hugepd_shift(hugepd_t hpd)
+{
+   return mmu_psize_to_shift(hugepd_mmu_psize(hpd));
+}
+static inline void flush_hugetlb_page(struct vm_area_struct *vma,
+ unsigned long vmaddr)
+{
+   if (radix_enabled())
+   return radix__flush_hugetlb_page(vma, vmaddr);
+}
+
+static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
+   unsigned int pdshift)
+{
+   unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> 
hugepd_shift(hpd);
+
+   return hugepd_page(hpd) + idx;
+}
+
+void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
+
 #endif
diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 48c29686c78e..fd5c0873a57d 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -6,85 +6,13 @@
 #include 
 
 #ifdef CONFIG_PPC_BOOK3S_64
-
 #include 
-/*
- * This should work for other subarchs too. But right now we use the
- * new format only for 64bit book3s
- */
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-   if (WARN_ON(!hugepd_ok(hpd)))
-   return NULL;
-   /*
-* We have only four bits to encode, MMU page size
-*/
-   BUILD_BUG_ON((MMU_PAGE_COUNT - 1) > 0xf);
-   return __va(hpd_val(hpd) & HUGEPD_ADDR_MASK);
-}
-
-static inline unsigned int hugepd_mmu_psize(hugepd_t hpd)
-{
-   return (hpd_val(hpd) & HUGEPD_SHIFT_MASK) >> 2;
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-   return mmu_psize_to_shift(hugepd_mmu_psize(hpd));
-}
-static inline void flush_hugetlb_page(struct vm_area_struct *vma,
- unsigned long vmaddr)
-{
-   if (radix_enabled())
-   return radix__flush_hugetlb_page(vma, vmaddr);
-}
-
-#else
-
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-   if (WARN_ON(!hugepd_ok(hpd)))
-   return NULL;
-#ifdef CONFIG_PPC_8xx
-   return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
-#else
-   return (pte_t *)((hpd_val(hpd) &
- ~HUGEPD_SHIFT_MASK) | PD_HUGE);
-#endif
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-#ifdef CONFIG_PPC_8xx
-   return ((hpd_val(hpd) & _PMD_PAGE_MASK) >> 1) + 17;
-#else
-   return hpd_val(hpd) & HUGEPD_SHIFT_MASK;
-#endif
-}
-
+#elif defined(CONFIG_PPC_FSL_BOOK3E)
+#include 
+#elif defined(CONFIG_PPC_8xx)
+#include 
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
-
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
-   unsigned pdshift)
-{
-   /*
-* On FSL BookE, we have multiple higher-level table entries that
-* point to the same hugepte.  Just use the first one since they're all
-* identical.  So for that case, idx=0.
-*/
-   unsigned long idx = 0;
-
-   pte_t *dir = hugepd_page(hpd);
-#ifdef CONFIG_PPC_8xx
-   idx = (addr & ((1UL << pdshift) - 1)) >> PAGE_SHIFT;
-#elif !defined(CONFIG_PPC_FSL_BOOK3E)
-   idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(hpd);
-#endif
-
-   return dir + idx;
-}
-
 void flush_dcache_icache_hugepage(struct page *page);
 
 int 

[PATCH v1 12/27] powerpc/mm: add a helper to populate hugepd

2019-03-20 Thread Christophe Leroy
This patch adds a subarch helper to populate hugepd.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h |  5 +
 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h |  8 
 arch/powerpc/include/asm/nohash/hugetlb-book3e.h |  6 ++
 arch/powerpc/mm/hugetlbpage.c| 20 +---
 4 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index 2f9cf2bc601c..177c81079209 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -101,6 +101,11 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned 
long addr,
return hugepd_page(hpd) + idx;
 }
 
+static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int 
pshift)
+{
+   *hpdp = __hugepd(__pa(new) | HUGEPD_VAL_BITS | 
(shift_to_mmu_psize(pshift) << 2));
+}
+
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 
 #endif
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h 
b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 209e6a219835..eb90c2db7601 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H
 #define _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H
 
+#define PAGE_SHIFT_8M  23
+
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
if (WARN_ON(!hugepd_ok(hpd)))
@@ -29,4 +31,10 @@ static inline void flush_hugetlb_page(struct vm_area_struct 
*vma,
flush_tlb_page(vma, vmaddr);
 }
 
+static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int 
pshift)
+{
+   *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT |
+(pshift == PAGE_SHIFT_8M ? _PMD_PAGE_8M : 
_PMD_PAGE_512K));
+}
+
 #endif /* _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H */
diff --git a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h 
b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h
index e94f1cd048ee..51439bcfe313 100644
--- a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h
+++ b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h
@@ -28,4 +28,10 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned 
long addr,
 
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 
+static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int 
pshift)
+{
+   /* We use the old format for PPC_FSL_BOOK3E */
+   *hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | pshift);
+}
+
 #endif /* _ASM_POWERPC_NOHASH_HUGETLB_BOOK3E_H */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 29d1568c7775..26a57ebaf5cf 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -26,12 +26,6 @@
 #include 
 #include 
 
-#define PAGE_SHIFT_64K 16
-#define PAGE_SHIFT_512K19
-#define PAGE_SHIFT_8M  23
-#define PAGE_SHIFT_16M 24
-#define PAGE_SHIFT_16G 34
-
 bool hugetlb_disabled = false;
 
 unsigned int HPAGE_SHIFT;
@@ -95,19 +89,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
for (i = 0; i < num_hugepd; i++, hpdp++) {
if (unlikely(!hugepd_none(*hpdp)))
break;
-   else {
-#ifdef CONFIG_PPC_BOOK3S_64
-   *hpdp = __hugepd(__pa(new) | HUGEPD_VAL_BITS |
-(shift_to_mmu_psize(pshift) << 2));
-#elif defined(CONFIG_PPC_8xx)
-   *hpdp = __hugepd(__pa(new) | _PMD_USER |
-(pshift == PAGE_SHIFT_8M ? 
_PMD_PAGE_8M :
- _PMD_PAGE_512K) | _PMD_PRESENT);
-#else
-   /* We use the old format for PPC_FSL_BOOK3E */
-   *hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | 
pshift);
-#endif
-   }
+   hugepd_populate(hpdp, new, pshift);
}
/* If we bailed from the for loop early, an error occurred, clean up */
if (i < num_hugepd) {
-- 
2.13.3



[PATCH v1 10/27] powerpc/mm: make gup_hugepte() static

2019-03-20 Thread Christophe Leroy
gup_huge_pd() is the only user of gup_hugepte() and it is
located in the same file. This patch moves gup_huge_pd()
after gup_hugepte() and makes gup_hugepte() static.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/pgtable.h |  3 ---
 arch/powerpc/mm/hugetlbpage.c  | 38 +++---
 2 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 505550fb2935..c51846da41a7 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -89,9 +89,6 @@ extern void paging_init(void);
  */
 extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t *);
 
-extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-  unsigned long end, int write,
-  struct page **pages, int *nr);
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_large(pmd) 0
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 6d9751b188c1..29d1568c7775 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -539,23 +539,6 @@ static unsigned long hugepte_addr_end(unsigned long addr, 
unsigned long end,
return (__boundary - 1 < end - 1) ? __boundary : end;
 }
 
-int gup_huge_pd(hugepd_t hugepd, unsigned long addr, unsigned pdshift,
-   unsigned long end, int write, struct page **pages, int *nr)
-{
-   pte_t *ptep;
-   unsigned long sz = 1UL << hugepd_shift(hugepd);
-   unsigned long next;
-
-   ptep = hugepte_offset(hugepd, addr, pdshift);
-   do {
-   next = hugepte_addr_end(addr, end, sz);
-   if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr))
-   return 0;
-   } while (ptep++, addr = next, addr != end);
-
-   return 1;
-}
-
 #ifdef CONFIG_PPC_MM_SLICES
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
unsigned long len, unsigned long pgoff,
@@ -754,8 +737,8 @@ void flush_dcache_icache_hugepage(struct page *page)
}
 }
 
-int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-   unsigned long end, int write, struct page **pages, int *nr)
+static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+  unsigned long end, int write, struct page **pages, int 
*nr)
 {
unsigned long pte_end;
struct page *head, *page;
@@ -801,3 +784,20 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned 
long addr,
 
return 1;
 }
+
+int gup_huge_pd(hugepd_t hugepd, unsigned long addr, unsigned int pdshift,
+   unsigned long end, int write, struct page **pages, int *nr)
+{
+   pte_t *ptep;
+   unsigned long sz = 1UL << hugepd_shift(hugepd);
+   unsigned long next;
+
+   ptep = hugepte_offset(hugepd, addr, pdshift);
+   do {
+   next = hugepte_addr_end(addr, end, sz);
+   if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr))
+   return 0;
+   } while (ptep++, addr = next, addr != end);
+
+   return 1;
+}
-- 
2.13.3
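
One detail worth noting: once gup_hugepte() loses its prototype in
asm/pgtable.h and becomes static, it must be defined (or forward-declared)
before its only caller in the file to avoid an implicit-declaration
warning, which is why gup_huge_pd() is moved below it. A stand-alone sketch
of that ordering, with invented names:

#include <stdio.h>

/* helper first: visible to everything below without a separate prototype */
static int helper(int x)
{
        return x * x;
}

/* the only caller comes after the helper, mirroring the new file layout */
static int caller(int x)
{
        return helper(x) + 1;
}

int main(void)
{
        printf("%d\n", caller(3));      /* 10 */
        return 0;
}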



[PATCH v1 09/27] powerpc/mm: make hugetlbpage.c depend on CONFIG_HUGETLB_PAGE

2019-03-20 Thread Christophe Leroy
The only function in hugetlbpage.c which doesn't depend on
CONFIG_HUGETLB_PAGE is gup_hugepte(), and this function is
only called from gup_huge_pd(), which depends on
CONFIG_HUGETLB_PAGE, so all the content of hugetlbpage.c
depends on CONFIG_HUGETLB_PAGE.

This patch modifies Makefile to only compile hugetlbpage.c
when CONFIG_HUGETLB_PAGE is set.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/Makefile  | 2 +-
 arch/powerpc/mm/hugetlbpage.c | 5 -
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 2c23d1ece034..20b900537fc9 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_PPC_FSL_BOOK3E)  += fsl_booke_mmu.o
 obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_SPLPAR)   += vphn.o
 obj-$(CONFIG_PPC_MM_SLICES)+= slice.o
-obj-y  += hugetlbpage.o
+obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 ifdef CONFIG_HUGETLB_PAGE
 obj-$(CONFIG_PPC_BOOK3S_64)+= hugetlbpage-hash64.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= hugetlbpage-radix.o
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 202ae006aa39..6d9751b188c1 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -26,9 +26,6 @@
 #include 
 #include 
 
-
-#ifdef CONFIG_HUGETLB_PAGE
-
 #define PAGE_SHIFT_64K 16
 #define PAGE_SHIFT_512K19
 #define PAGE_SHIFT_8M  23
@@ -757,8 +754,6 @@ void flush_dcache_icache_hugepage(struct page *page)
}
 }
 
-#endif /* CONFIG_HUGETLB_PAGE */
-
 int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
 {
-- 
2.13.3



[PATCH v1 08/27] powerpc/mm: move __find_linux_pte() out of hugetlbpage.c

2019-03-20 Thread Christophe Leroy
__find_linux_pte() is the only function in hugetlbpage.c
which is compiled in regardless of CONFIG_HUGETLB_PAGE.

This patch moves it into pgtable.c.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/hugetlbpage.c | 103 -
 arch/powerpc/mm/pgtable.c | 104 ++
 2 files changed, 104 insertions(+), 103 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index cf2978e235f3..202ae006aa39 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -759,109 +759,6 @@ void flush_dcache_icache_hugepage(struct page *page)
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
-/*
- * We have 4 cases for pgds and pmds:
- * (1) invalid (all zeroes)
- * (2) pointer to next table, as normal; bottom 6 bits == 0
- * (3) leaf pte for huge page _PAGE_PTE set
- * (4) hugepd pointer, _PAGE_PTE = 0 and bits [2..6] indicate size of table
- *
- * So long as we atomically load page table pointers we are safe against 
teardown,
- * we can follow the address down to the the page and take a ref on it.
- * This function need to be called with interrupts disabled. We use this 
variant
- * when we have MSR[EE] = 0 but the paca->irq_soft_mask = IRQS_ENABLED
- */
-pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
-   bool *is_thp, unsigned *hpage_shift)
-{
-   pgd_t pgd, *pgdp;
-   pud_t pud, *pudp;
-   pmd_t pmd, *pmdp;
-   pte_t *ret_pte;
-   hugepd_t *hpdp = NULL;
-   unsigned pdshift = PGDIR_SHIFT;
-
-   if (hpage_shift)
-   *hpage_shift = 0;
-
-   if (is_thp)
-   *is_thp = false;
-
-   pgdp = pgdir + pgd_index(ea);
-   pgd  = READ_ONCE(*pgdp);
-   /*
-* Always operate on the local stack value. This make sure the
-* value don't get updated by a parallel THP split/collapse,
-* page fault or a page unmap. The return pte_t * is still not
-* stable. So should be checked there for above conditions.
-*/
-   if (pgd_none(pgd))
-   return NULL;
-   else if (pgd_huge(pgd)) {
-   ret_pte = (pte_t *) pgdp;
-   goto out;
-   } else if (is_hugepd(__hugepd(pgd_val(pgd
-   hpdp = (hugepd_t *)
-   else {
-   /*
-* Even if we end up with an unmap, the pgtable will not
-* be freed, because we do an rcu free and here we are
-* irq disabled
-*/
-   pdshift = PUD_SHIFT;
-   pudp = pud_offset(, ea);
-   pud  = READ_ONCE(*pudp);
-
-   if (pud_none(pud))
-   return NULL;
-   else if (pud_huge(pud)) {
-   ret_pte = (pte_t *) pudp;
-   goto out;
-   } else if (is_hugepd(__hugepd(pud_val(pud
-   hpdp = (hugepd_t *)
-   else {
-   pdshift = PMD_SHIFT;
-   pmdp = pmd_offset(, ea);
-   pmd  = READ_ONCE(*pmdp);
-   /*
-* A hugepage collapse is captured by pmd_none, because
-* it mark the pmd none and do a hpte invalidate.
-*/
-   if (pmd_none(pmd))
-   return NULL;
-
-   if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
-   if (is_thp)
-   *is_thp = true;
-   ret_pte = (pte_t *) pmdp;
-   goto out;
-   }
-   /*
-* pmd_large check below will handle the swap pmd pte
-* we need to do both the check because they are config
-* dependent.
-*/
-   if (pmd_huge(pmd) || pmd_large(pmd)) {
-   ret_pte = (pte_t *) pmdp;
-   goto out;
-   } else if (is_hugepd(__hugepd(pmd_val(pmd
-   hpdp = (hugepd_t *)
-   else
-   return pte_offset_kernel(, ea);
-   }
-   }
-   if (!hpdp)
-   return NULL;
-
-   ret_pte = hugepte_offset(*hpdp, ea, pdshift);
-   pdshift = hugepd_shift(*hpdp);
-out:
-   if (hpage_shift)
-   *hpage_shift = pdshift;
-   return ret_pte;
-}
-EXPORT_SYMBOL_GPL(__find_linux_pte);
-
 int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
 {
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index d3d61d29b4f1..9f4ccd15849f 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -30,6 +30,7 @@
 

[PATCH v1 07/27] powerpc/book3e: hugetlbpage is only for CONFIG_PPC_FSL_BOOK3E

2019-03-20 Thread Christophe Leroy
As per Kconfig.cputype, only CONFIG_PPC_FSL_BOOK3E gets to
select SYS_SUPPORTS_HUGETLBFS so simplify accordingly.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/Makefile |  2 +-
 arch/powerpc/mm/hugetlbpage-book3e.c | 47 +++-
 2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 3c1bd9fa23cd..2c23d1ece034 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -37,7 +37,7 @@ obj-y += hugetlbpage.o
 ifdef CONFIG_HUGETLB_PAGE
 obj-$(CONFIG_PPC_BOOK3S_64)+= hugetlbpage-hash64.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= hugetlbpage-radix.o
-obj-$(CONFIG_PPC_BOOK3E_MMU)   += hugetlbpage-book3e.o
+obj-$(CONFIG_PPC_FSL_BOOK3E)   += hugetlbpage-book3e.o
 endif
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += hugepage-hash64.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
diff --git a/arch/powerpc/mm/hugetlbpage-book3e.c 
b/arch/powerpc/mm/hugetlbpage-book3e.c
index c911fe9bfa0e..61915f4d3c7f 100644
--- a/arch/powerpc/mm/hugetlbpage-book3e.c
+++ b/arch/powerpc/mm/hugetlbpage-book3e.c
@@ -11,8 +11,9 @@
 
 #include 
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
 #ifdef CONFIG_PPC64
+#include 
+
 static inline int tlb1_next(void)
 {
struct paca_struct *paca = get_paca();
@@ -29,28 +30,6 @@ static inline int tlb1_next(void)
tcd->esel_next = next;
return this;
 }
-#else
-static inline int tlb1_next(void)
-{
-   int index, ncams;
-
-   ncams = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
-
-   index = this_cpu_read(next_tlbcam_idx);
-
-   /* Just round-robin the entries and wrap when we hit the end */
-   if (unlikely(index == ncams - 1))
-   __this_cpu_write(next_tlbcam_idx, tlbcam_index);
-   else
-   __this_cpu_inc(next_tlbcam_idx);
-
-   return index;
-}
-#endif /* !PPC64 */
-#endif /* FSL */
-
-#if defined(CONFIG_PPC_FSL_BOOK3E) && defined(CONFIG_PPC64)
-#include 
 
 static inline void book3e_tlb_lock(void)
 {
@@ -93,6 +72,23 @@ static inline void book3e_tlb_unlock(void)
paca->tcd_ptr->lock = 0;
 }
 #else
+static inline int tlb1_next(void)
+{
+   int index, ncams;
+
+   ncams = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
+
+   index = this_cpu_read(next_tlbcam_idx);
+
+   /* Just round-robin the entries and wrap when we hit the end */
+   if (unlikely(index == ncams - 1))
+   __this_cpu_write(next_tlbcam_idx, tlbcam_index);
+   else
+   __this_cpu_inc(next_tlbcam_idx);
+
+   return index;
+}
+
 static inline void book3e_tlb_lock(void)
 {
 }
@@ -134,10 +130,7 @@ void book3e_hugetlb_preload(struct vm_area_struct *vma, 
unsigned long ea,
unsigned long psize, tsize, shift;
unsigned long flags;
struct mm_struct *mm;
-
-#ifdef CONFIG_PPC_FSL_BOOK3E
int index;
-#endif
 
if (unlikely(is_kernel_addr(ea)))
return;
@@ -161,11 +154,9 @@ void book3e_hugetlb_preload(struct vm_area_struct *vma, 
unsigned long ea,
return;
}
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
/* We have to use the CAM(TLB1) on FSL parts for hugepages */
index = tlb1_next();
mtspr(SPRN_MAS0, MAS0_ESEL(index) | MAS0_TLBSEL(1));
-#endif
 
mas1 = MAS1_VALID | MAS1_TID(mm->context.id) | MAS1_TSIZE(tsize);
mas2 = ea & ~((1UL << shift) - 1);
-- 
2.13.3



[PATCH v1 06/27] powerpc/64: only book3s/64 supports CONFIG_PPC_64K_PAGES

2019-03-20 Thread Christophe Leroy
CONFIG_PPC_64K_PAGES cannot be selected by nohash/64

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig |  1 -
 arch/powerpc/include/asm/nohash/64/pgalloc.h |  3 ---
 arch/powerpc/include/asm/nohash/64/pgtable.h |  4 
 arch/powerpc/include/asm/nohash/64/slice.h   |  4 
 arch/powerpc/include/asm/nohash/pte-book3e.h |  5 -
 arch/powerpc/include/asm/pgtable-be-types.h  |  7 ++-
 arch/powerpc/include/asm/pgtable-types.h |  7 ++-
 arch/powerpc/include/asm/task_size_64.h  |  2 +-
 arch/powerpc/mm/tlb_low_64e.S| 31 
 arch/powerpc/mm/tlb_nohash.c | 13 
 10 files changed, 5 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d0be82c3061..5d8e692d6470 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -375,7 +375,6 @@ config ZONE_DMA
 config PGTABLE_LEVELS
int
default 2 if !PPC64
-   default 3 if PPC_64K_PAGES && !PPC_BOOK3S_64
default 4
 
 source "arch/powerpc/sysdev/Kconfig"
diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h 
b/arch/powerpc/include/asm/nohash/64/pgalloc.h
index 66d086f85bd5..ded453f9b5a8 100644
--- a/arch/powerpc/include/asm/nohash/64/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h
@@ -171,12 +171,9 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, 
pgtable_t table,
 
 #define __pmd_free_tlb(tlb, pmd, addr)   \
pgtable_free_tlb(tlb, pmd, PMD_CACHE_INDEX)
-#ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)   \
pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
 
-#endif /* CONFIG_PPC_64K_PAGES */
-
 #define check_pgt_cache()  do { } while (0)
 
 #endif /* _ASM_POWERPC_PGALLOC_64_H */
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index e77ed9761632..3efbd8a1720a 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -10,10 +10,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_PPC_64K_PAGES
-#error "Page size not supported"
-#endif
-
 #define FIRST_USER_ADDRESS 0UL
 
 /*
diff --git a/arch/powerpc/include/asm/nohash/64/slice.h 
b/arch/powerpc/include/asm/nohash/64/slice.h
index 1a32d1fae6af..30adfdd4afde 100644
--- a/arch/powerpc/include/asm/nohash/64/slice.h
+++ b/arch/powerpc/include/asm/nohash/64/slice.h
@@ -2,10 +2,6 @@
 #ifndef _ASM_POWERPC_NOHASH_64_SLICE_H
 #define _ASM_POWERPC_NOHASH_64_SLICE_H
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define get_slice_psize(mm, addr)  MMU_PAGE_64K
-#else /* CONFIG_PPC_64K_PAGES */
 #define get_slice_psize(mm, addr)  MMU_PAGE_4K
-#endif /* !CONFIG_PPC_64K_PAGES */
 
 #endif /* _ASM_POWERPC_NOHASH_64_SLICE_H */
diff --git a/arch/powerpc/include/asm/nohash/pte-book3e.h 
b/arch/powerpc/include/asm/nohash/pte-book3e.h
index dd40d200f274..813918f40765 100644
--- a/arch/powerpc/include/asm/nohash/pte-book3e.h
+++ b/arch/powerpc/include/asm/nohash/pte-book3e.h
@@ -60,13 +60,8 @@
 #define _PAGE_SPECIAL  _PAGE_SW0
 
 /* Base page size */
-#ifdef CONFIG_PPC_64K_PAGES
-#define _PAGE_PSIZE_PAGE_PSIZE_64K
-#define PTE_RPN_SHIFT  (28)
-#else
 #define _PAGE_PSIZE_PAGE_PSIZE_4K
 #definePTE_RPN_SHIFT   (24)
-#endif
 
 #define PTE_WIMGE_SHIFT (19)
 #define PTE_BAP_SHIFT  (2)
diff --git a/arch/powerpc/include/asm/pgtable-be-types.h 
b/arch/powerpc/include/asm/pgtable-be-types.h
index a89c67b62680..5932a9883eb7 100644
--- a/arch/powerpc/include/asm/pgtable-be-types.h
+++ b/arch/powerpc/include/asm/pgtable-be-types.h
@@ -34,10 +34,8 @@ static inline __be64 pmd_raw(pmd_t x)
 }
 
 /*
- * 64 bit hash always use 4 level table. Everybody else use 4 level
- * only for 4K page size.
+ * 64 bit always use 4 level table
  */
-#if defined(CONFIG_PPC_BOOK3S_64) || !defined(CONFIG_PPC_64K_PAGES)
 typedef struct { __be64 pud; } pud_t;
 #define __pud(x)   ((pud_t) { cpu_to_be64(x) })
 #define __pud_raw(x)   ((pud_t) { (x) })
@@ -51,7 +49,6 @@ static inline __be64 pud_raw(pud_t x)
return x.pud;
 }
 
-#endif /* CONFIG_PPC_BOOK3S_64 || !CONFIG_PPC_64K_PAGES */
 #endif /* CONFIG_PPC64 */
 
 /* PGD level */
@@ -77,7 +74,7 @@ typedef struct { unsigned long pgprot; } pgprot_t;
  * With hash config 64k pages additionally define a bigger "real PTE" type that
  * gathers the "second half" part of the PTE for pseudo 64k pages
  */
-#if defined(CONFIG_PPC_64K_PAGES) && defined(CONFIG_PPC_BOOK3S_64)
+#ifdef CONFIG_PPC_64K_PAGES
 typedef struct { pte_t pte; unsigned long hidx; } real_pte_t;
 #else
 typedef struct { pte_t pte; } real_pte_t;
diff --git a/arch/powerpc/include/asm/pgtable-types.h 
b/arch/powerpc/include/asm/pgtable-types.h
index 3b0edf041b2e..02e75e89c93e 100644
--- a/arch/powerpc/include/asm/pgtable-types.h
+++ b/arch/powerpc/include/asm/pgtable-types.h
@@ -24,17 +24,14 @@ static inline unsigned long pmd_val(pmd_t x)
 }
 
 /*
- * 64 bit hash always use 4 

[PATCH v1 05/27] powerpc/mm: drop slice_set_user_psize()

2019-03-20 Thread Christophe Leroy
slice_set_user_psize() is not used anymore, drop it.

Fixes: 1753dd183036 ("powerpc/mm/slice: Simplify and optimise slice context 
initialisation")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/slice.h | 5 -
 arch/powerpc/include/asm/nohash/64/slice.h | 1 -
 2 files changed, 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/slice.h 
b/arch/powerpc/include/asm/book3s/64/slice.h
index db0dedab65ee..af498b0da21a 100644
--- a/arch/powerpc/include/asm/book3s/64/slice.h
+++ b/arch/powerpc/include/asm/book3s/64/slice.h
@@ -16,11 +16,6 @@
 #else /* CONFIG_PPC_MM_SLICES */
 
 #define get_slice_psize(mm, addr)  ((mm)->context.user_psize)
-#define slice_set_user_psize(mm, psize)\
-do {   \
-   (mm)->context.user_psize = (psize); \
-   (mm)->context.sllp = SLB_VSID_USER | mmu_psize_defs[(psize)].sllp; \
-} while (0)
 
 #endif /* CONFIG_PPC_MM_SLICES */
 
diff --git a/arch/powerpc/include/asm/nohash/64/slice.h 
b/arch/powerpc/include/asm/nohash/64/slice.h
index ad0d6e3cc1c5..1a32d1fae6af 100644
--- a/arch/powerpc/include/asm/nohash/64/slice.h
+++ b/arch/powerpc/include/asm/nohash/64/slice.h
@@ -7,6 +7,5 @@
 #else /* CONFIG_PPC_64K_PAGES */
 #define get_slice_psize(mm, addr)  MMU_PAGE_4K
 #endif /* !CONFIG_PPC_64K_PAGES */
-#define slice_set_user_psize(mm, psize)do { BUG(); } while (0)
 
 #endif /* _ASM_POWERPC_NOHASH_64_SLICE_H */
-- 
2.13.3



[PATCH v1 02/27] powerpc/mm: don't BUG in add_huge_page_size()

2019-03-20 Thread Christophe Leroy
No reason to BUG() in add_huge_page_size(). Just WARN and
reject the add.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/hugetlbpage.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9e732bb2c84a..cf2978e235f3 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -634,7 +634,8 @@ static int __init add_huge_page_size(unsigned long long 
size)
}
 #endif
 
-   BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
+   if (WARN_ON(mmu_psize_defs[mmu_psize].shift != shift))
+   return -EINVAL;
 
/* Return if huge page size has already been setup */
if (size_to_hstate(size))
-- 
2.13.3
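
The idiom works because WARN_ON(condition) evaluates to the condition
itself (and, in the kernel, also prints a warning with a backtrace when it
is true), so it can sit directly inside an if () and let the caller bail
out gracefully instead of killing the machine like BUG_ON(). A stand-alone
sketch with a simplified WARN_ON() that keeps only that property (GNU C
statement expression, invented helper name):

#include <errno.h>
#include <stdio.h>

#define WARN_ON(cond) ({                                        \
        int __c = !!(cond);                                     \
        if (__c)                                                \
                fprintf(stderr, "WARNING: %s\n", #cond);        \
        __c;                                                    \
})

static int add_size(unsigned long shift, unsigned long expected)
{
        /* warn and reject instead of crashing like BUG_ON() would */
        if (WARN_ON(shift != expected))
                return -EINVAL;

        return 0;
}

int main(void)
{
        printf("%d\n", add_size(24, 24));       /* 0 */
        printf("%d\n", add_size(20, 24));       /* -EINVAL, plus a warning */
        return 0;
}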



[PATCH v1 00/27] Reduce ifdef mess in hugetlbpage.c and slice.c

2019-03-20 Thread Christophe Leroy
The main purpose of this series is to reduce the amount of #ifdefs in
hugetlbpage.c and slice.c

At the same time, it does some cleanup by reducing the number of BUG_ON()
and dropping unused functions.

It also removes the 64k pages related code in nohash/64 as 64k pages
can only be selected on book3s/64.

Christophe Leroy (27):
  powerpc/mm: Don't BUG() in hugepd_page()
  powerpc/mm: don't BUG in add_huge_page_size()
  powerpc/mm: don't BUG() in slice_mask_for_size()
  powerpc/book3e: drop mmu_get_tsize()
  powerpc/mm: drop slice_set_user_psize()
  powerpc/64: only book3s/64 supports CONFIG_PPC_64K_PAGES
  powerpc/book3e: hugetlbpage is only for CONFIG_PPC_FSL_BOOK3E
  powerpc/mm: move __find_linux_pte() out of hugetlbpage.c
  powerpc/mm: make hugetlbpage.c depend on CONFIG_HUGETLB_PAGE
  powerpc/mm: make gup_hugepte() static
  powerpc/mm: split asm/hugetlb.h into dedicated subarch files
  powerpc/mm: add a helper to populate hugepd
  powerpc/mm: define get_slice_psize() all the time
  powerpc/mm: no slice for nohash/64
  powerpc/mm: cleanup ifdef mess in add_huge_page_size()
  powerpc/mm: move hugetlb_disabled into asm/hugetlb.h
  powerpc/mm: cleanup HPAGE_SHIFT setup
  powerpc/mm: cleanup remaining ifdef mess in hugetlbpage.c
  powerpc/mm: drop slice DEBUG
  powerpc/mm: remove unnecessary #ifdef CONFIG_PPC64
  powerpc/mm: hand a context_t over to slice_mask_for_size() instead of
mm_struct
  powerpc/mm: move slice_mask_for_size() into mmu.h
  powerpc/mm: remove a couple of #ifdef CONFIG_PPC_64K_PAGES in
mm/slice.c
  powerpc: define subarch SLB_ADDR_LIMIT_DEFAULT
  powerpc/mm: flatten function __find_linux_pte()
  powerpc/mm: flatten function __find_linux_pte() step 2
  powerpc/mm: flatten function __find_linux_pte() step 3

 arch/powerpc/Kconfig |   3 +-
 arch/powerpc/include/asm/book3s/64/hugetlb.h |  73 +++
 arch/powerpc/include/asm/book3s/64/mmu.h |  22 +-
 arch/powerpc/include/asm/book3s/64/slice.h   |   7 +-
 arch/powerpc/include/asm/hugetlb.h   |  87 +---
 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h |  45 +
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h |  18 ++
 arch/powerpc/include/asm/nohash/32/slice.h   |   2 +
 arch/powerpc/include/asm/nohash/64/pgalloc.h |   3 -
 arch/powerpc/include/asm/nohash/64/pgtable.h |   4 -
 arch/powerpc/include/asm/nohash/64/slice.h   |  12 --
 arch/powerpc/include/asm/nohash/hugetlb-book3e.h |  45 +
 arch/powerpc/include/asm/nohash/pte-book3e.h |   5 -
 arch/powerpc/include/asm/page.h  |  12 +-
 arch/powerpc/include/asm/pgtable-be-types.h  |   7 +-
 arch/powerpc/include/asm/pgtable-types.h |   7 +-
 arch/powerpc/include/asm/pgtable.h   |   3 -
 arch/powerpc/include/asm/slice.h |   8 +-
 arch/powerpc/include/asm/task_size_64.h  |   2 +-
 arch/powerpc/kernel/fadump.c |   1 +
 arch/powerpc/kernel/setup-common.c   |   8 +-
 arch/powerpc/mm/Makefile |   4 +-
 arch/powerpc/mm/hash_utils_64.c  |   1 +
 arch/powerpc/mm/hugetlbpage-book3e.c |  52 ++---
 arch/powerpc/mm/hugetlbpage-hash64.c |  16 ++
 arch/powerpc/mm/hugetlbpage.c| 245 ---
 arch/powerpc/mm/pgtable.c| 114 +++
 arch/powerpc/mm/slice.c  | 132 ++--
 arch/powerpc/mm/tlb_low_64e.S|  31 ---
 arch/powerpc/mm/tlb_nohash.c |  13 --
 arch/powerpc/platforms/Kconfig.cputype   |   4 +
 31 files changed, 438 insertions(+), 548 deletions(-)
 create mode 100644 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
 delete mode 100644 arch/powerpc/include/asm/nohash/64/slice.h
 create mode 100644 arch/powerpc/include/asm/nohash/hugetlb-book3e.h

-- 
2.13.3



[PATCH v1 01/27] powerpc/mm: Don't BUG() in hugepd_page()

2019-03-20 Thread Christophe Leroy
Don't BUG(), just warn and return NULL.
If the NULL value is not handled, it will get caught anyway.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/hugetlb.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 8d40565ad0c3..48c29686c78e 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -14,7 +14,8 @@
  */
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-   BUG_ON(!hugepd_ok(hpd));
+   if (WARN_ON(!hugepd_ok(hpd)))
+   return NULL;
/*
 * We have only four bits to encode, MMU page size
 */
@@ -42,7 +43,8 @@ static inline void flush_hugetlb_page(struct vm_area_struct 
*vma,
 
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-   BUG_ON(!hugepd_ok(hpd));
+   if (WARN_ON(!hugepd_ok(hpd)))
+   return NULL;
 #ifdef CONFIG_PPC_8xx
return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
 #else
-- 
2.13.3



[PATCH v1 04/27] powerpc/book3e: drop mmu_get_tsize()

2019-03-20 Thread Christophe Leroy
This function is not used anymore, drop it.

Fixes: b42279f0165c ("powerpc/mm/nohash: MM_SLICE is only used by book3s 64")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/hugetlbpage-book3e.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage-book3e.c 
b/arch/powerpc/mm/hugetlbpage-book3e.c
index f84ec46cdb26..c911fe9bfa0e 100644
--- a/arch/powerpc/mm/hugetlbpage-book3e.c
+++ b/arch/powerpc/mm/hugetlbpage-book3e.c
@@ -49,11 +49,6 @@ static inline int tlb1_next(void)
 #endif /* !PPC64 */
 #endif /* FSL */
 
-static inline int mmu_get_tsize(int psize)
-{
-   return mmu_psize_defs[psize].enc;
-}
-
 #if defined(CONFIG_PPC_FSL_BOOK3E) && defined(CONFIG_PPC64)
 #include 
 
-- 
2.13.3



[PATCH v1 03/27] powerpc/mm: don't BUG() in slice_mask_for_size()

2019-03-20 Thread Christophe Leroy
When no mask is found for the page size, WARN() and return NULL
instead of BUG()ing.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/slice.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index aec91dbcdc0b..011d470ea340 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -165,7 +165,8 @@ static struct slice_mask *slice_mask_for_size(struct 
mm_struct *mm, int psize)
if (psize == MMU_PAGE_16G)
return &mm->context.mask_16g;
 #endif
-   BUG();
+   WARN_ON(true);
+   return NULL;
 }
 #elif defined(CONFIG_PPC_8xx)
 static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
@@ -178,7 +179,8 @@ static struct slice_mask *slice_mask_for_size(struct 
mm_struct *mm, int psize)
if (psize == MMU_PAGE_8M)
return &mm->context.mask_8m;
 #endif
-   BUG();
+   WARN_ON(true);
+   return NULL;
 }
 #else
 #error "Must define the slice masks for page sizes supported by the platform"
-- 
2.13.3



Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING

2019-03-20 Thread Arnd Bergmann
On Wed, Mar 20, 2019 at 7:21 AM Masahiro Yamada
 wrote:
>
> Commit 60a3cdd06394 ("x86: add optimized inlining") introduced
> CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.
>
> The idea is obviously arch-agnostic although we need some code fixups.
> This commit moves the config entry from arch/x86/Kconfig.debug to
> lib/Kconfig.debug so that all architectures (except MIPS for now) can
> benefit from it.
>
> At this moment, I added "depends on !MIPS" because fixing 0day bot reports
> for MIPS was complex to me.
>
> I tested this patch on my arm/arm64 boards.
>
> This can make a huge difference in kernel image size especially when
> CONFIG_OPTIMIZE_FOR_SIZE is enabled.
>
> For example, I got 3.5% smaller arm64 kernel image for v5.1-rc1.
>
>   dec   file
>   18983424  arch/arm64/boot/Image.before
>   18321920  arch/arm64/boot/Image.after
>
> This also slightly improves the "Kernel hacking" Kconfig menu.
> Commit e61aca5158a8 ("Merge branch 'kconfig-diet' from Dave Hansen')
> mentioned this config option would be a good fit in the "compiler option"
> menu. I did so.

I think this is a good idea in general, but it is likely to cause a lot of
new warnings. Especially the -Wmaybe-uninitialized warnings get
new false positives every time we get substantially different inlining
decisions.

I've added your patch to my randconfig test setup and will let you
know if I see anything noticeable. I'm currently testing clang-arm32,
clang-arm64 and gcc-x86.

  Arnd


Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING

2019-03-20 Thread Arnd Bergmann
On Wed, Mar 20, 2019 at 7:41 AM Masahiro Yamada
 wrote:

> It is unclear to me how to fix it.
> That's why I ended up with "depends on !MIPS".
>
>
>   MODPOST vmlinux.o
> arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2':
> sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base'
> sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base'
> sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base'
> sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base'
> sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base'
> arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined
> references to `mips_gcr_base'
>
>
> Perhaps, MIPS folks may know how to fix it.

I would guess like this:

diff --git a/arch/mips/include/asm/mips-cm.h b/arch/mips/include/asm/mips-cm.h
index 8bc5df49b0e1..a27483fedb7d 100644
--- a/arch/mips/include/asm/mips-cm.h
+++ b/arch/mips/include/asm/mips-cm.h
@@ -79,7 +79,7 @@ static inline int mips_cm_probe(void)
  *
  * Returns true if a CM is present in the system, else false.
  */
-static inline bool mips_cm_present(void)
+static __always_inline bool mips_cm_present(void)
 {
 #ifdef CONFIG_MIPS_CM
return mips_gcr_base != NULL;
@@ -93,7 +93,7 @@ static inline bool mips_cm_present(void)
  *
  * Returns true if the system implements an L2-only sync region, else false.
  */
-static inline bool mips_cm_has_l2sync(void)
+static __always_inline bool mips_cm_has_l2sync(void)
 {
 #ifdef CONFIG_MIPS_CM
return mips_cm_l2sync_base != NULL;
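
To make the failure mode concrete, here is a reduced sketch (not taken
from the kernel tree, all names are made up): with plain 'inline' the
compiler may emit an out-of-line copy of the helper, and that copy still
references the conditionally-defined symbol even though every call site
would have folded the #ifdef branch away, hence the undefined references.
__always_inline forces the fold at each call site.

extern int opt_base;			/* only defined when CONFIG_OPT=y */

static __always_inline int opt_present(void)
{
#ifdef CONFIG_OPT
	return opt_base != 0;		/* constant-folded at the call site */
#else
	return 0;
#endif
}

int caller(void)
{
	/* with plain inline, an out-of-line opt_present() may survive and
	 * keep the opt_base reference alive when CONFIG_OPT=n */
	return opt_present();
}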


[PATCH v4 11/17] KVM: introduce a 'mmap' method for KVM devices

2019-03-20 Thread Cédric Le Goater
Some KVM devices will want to handle special mappings related to the
underlying HW. For instance, the XIVE interrupt controller of the
POWER9 processor has MMIO pages for thread interrupt management and
for interrupt source control that need to be exposed to the guest when
the OS has the required support.

Cc: Paolo Bonzini 
Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c  | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d55c63db09b..831d963451d8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1245,6 +1245,7 @@ struct kvm_device_ops {
int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
  unsigned long arg);
+   int (*mmap)(struct kvm_device *dev, struct vm_area_struct *vma);
 };
 
 void kvm_device_get(struct kvm_device *dev);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f25aa98a94df..5e2fa5c7dd1a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2884,6 +2884,16 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
 }
 #endif
 
+static int kvm_device_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct kvm_device *dev = filp->private_data;
+
+   if (dev->ops->mmap)
+   return dev->ops->mmap(dev, vma);
+
+   return -ENODEV;
+}
+
 static int kvm_device_ioctl_attr(struct kvm_device *dev,
 int (*accessor)(struct kvm_device *dev,
 struct kvm_device_attr *attr),
@@ -2933,6 +2943,7 @@ static const struct file_operations kvm_device_fops = {
.unlocked_ioctl = kvm_device_ioctl,
.release = kvm_device_release,
KVM_COMPAT(kvm_device_ioctl),
+   .mmap = kvm_device_mmap,
 };
 
 struct kvm_device *kvm_device_from_filp(struct file *filp)
-- 
2.20.1
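
As a usage note, a minimal userspace sketch of the new method (assuming
a device created with KVM_CREATE_DEVICE; the page offset and length are
defined by each device; 'vm_fd' and 'page_size' are assumptions of the
example and error handling is omitted):

struct kvm_create_device cd = {
	.type = KVM_DEV_TYPE_XIVE,	/* any device implementing ->mmap */
	.flags = 0,
};

ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);	/* cd.fd is the device fd */

void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
	       cd.fd, 0 /* device-defined page offset */);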



[PATCH v4 13/17] KVM: PPC: Book3S HV: XIVE: add a mapping for the source ESB pages

2019-03-20 Thread Cédric Le Goater
Each source is associated with an Event State Buffer (ESB) with an
even/odd pair of pages which provide commands to manage the source:
to trigger, to EOI, to turn off the source for instance.

The custom VM fault handler will deduce the guest IRQ number from the
offset of the fault, and the ESB page of the associated XIVE interrupt
will be inserted into the VMA using the internal structure caching
information on the interrupts.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 57 ++
 Documentation/virtual/kvm/devices/xive.txt |  7 +++
 3 files changed, 65 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 0998e8edc91a..b0f72dea8b11 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -721,5 +721,6 @@ struct kvm_ppc_xive_eq {
 #define KVM_XIVE_EQ_ALWAYS_NOTIFY  0x0001
 
 #define KVM_XIVE_TIMA_PAGE_OFFSET  0
+#define KVM_XIVE_ESB_PAGE_OFFSET   4
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 0cfad45d8b75..d0a055030efd 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -165,6 +165,59 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
return rc;
 }
 
+static vm_fault_t xive_native_esb_fault(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct kvm_device *dev = vma->vm_file->private_data;
+   struct kvmppc_xive *xive = dev->private;
+   struct kvmppc_xive_src_block *sb;
+   struct kvmppc_xive_irq_state *state;
+   struct xive_irq_data *xd;
+   u32 hw_num;
+   u16 src;
+   u64 page;
+   unsigned long irq;
+   u64 page_offset;
+
+   /*
+* Linux/KVM uses a two pages ESB setting, one for trigger and
+* one for EOI
+*/
+   page_offset = vmf->pgoff - vma->vm_pgoff;
+   irq = page_offset / 2;
+
+   sb = kvmppc_xive_find_source(xive, irq, &src);
+   if (!sb) {
+   pr_devel("%s: source %lx not found !\n", __func__, irq);
+   return VM_FAULT_SIGBUS;
+   }
+
+   state = &sb->irq_state[src];
+   kvmppc_xive_select_irq(state, &hw_num, &xd);
+
+   arch_spin_lock(&sb->lock);
+
+   /*
+* first/even page is for trigger
+* second/odd page is for EOI and management.
+*/
+   page = page_offset % 2 ? xd->eoi_page : xd->trig_page;
+   arch_spin_unlock(&sb->lock);
+
+   if (WARN_ON(!page)) {
+   pr_err("%s: acessing invalid ESB page for source %lx !\n",
+  __func__, irq);
+   return VM_FAULT_SIGBUS;
+   }
+
+   vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
+   return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct xive_native_esb_vmops = {
+   .fault = xive_native_esb_fault,
+};
+
 static vm_fault_t xive_native_tima_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
@@ -194,6 +247,10 @@ static int kvmppc_xive_native_mmap(struct kvm_device *dev,
if (vma_pages(vma) > 4)
return -EINVAL;
vma->vm_ops = &xive_native_tima_vmops;
+   } else if (vma->vm_pgoff == KVM_XIVE_ESB_PAGE_OFFSET) {
+   if (vma_pages(vma) > KVMPPC_XIVE_NR_IRQS * 2)
+   return -EINVAL;
+   vma->vm_ops = &xive_native_esb_vmops;
} else {
return -EINVAL;
}
diff --git a/Documentation/virtual/kvm/devices/xive.txt 
b/Documentation/virtual/kvm/devices/xive.txt
index 944fd0971b13..2d795805b39e 100644
--- a/Documentation/virtual/kvm/devices/xive.txt
+++ b/Documentation/virtual/kvm/devices/xive.txt
@@ -36,6 +36,13 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
   third (operating system) and the fourth (user level) are exposed the
   guest.
 
+  2. Event State Buffer (ESB)
+
+  Each source is associated with an Event State Buffer (ESB) with
+  an even/odd pair of pages which provide commands to
+  manage the source: to trigger, to EOI, to turn off the source for
+  instance.
+
 * Groups:
 
   1. KVM_DEV_XIVE_GRP_CTRL
-- 
2.20.1
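
As an illustration of the layout (a sketch; 'xive_fd', 'irq' and
'page_size' are assumptions of the example): the two ESB pages of guest
interrupt 'irq' start at page offset KVM_XIVE_ESB_PAGE_OFFSET + irq * 2,
the even page being the trigger page and the odd page the EOI/management
page.

off_t pgoff = KVM_XIVE_ESB_PAGE_OFFSET + (off_t)irq * 2;
void *esb = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
		 MAP_SHARED, xive_fd, pgoff * page_size);

/* esb + 0         : trigger page (even) */
/* esb + page_size : EOI/management page (odd) */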



[PATCH v4 10/17] KVM: PPC: Book3S HV: XIVE: add get/set accessors for the VP XIVE state

2019-03-20 Thread Cédric Le Goater
The state of the thread interrupt management registers needs to be
collected for migration. These registers are cached under the
'xive_saved_state.w01' field of the VCPU when the VCPU context is
pulled from the HW thread. An OPAL call retrieves the backup of the
IPB register in the underlying XIVE NVT structure and merges it in the
KVM state.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 
 Changes since v3 :

 - Fixed xive_timaval description in documentation
 
 Changes since v2 :

 - reduced the size of kvmppc_one_reg timaval attribute to two u64s
 - stopped returning of the OS CAM line value

 arch/powerpc/include/asm/kvm_ppc.h | 11 
 arch/powerpc/include/uapi/asm/kvm.h|  2 +
 arch/powerpc/kvm/book3s.c  | 24 +++
 arch/powerpc/kvm/book3s_xive_native.c  | 76 ++
 Documentation/virtual/kvm/devices/xive.txt | 17 +
 5 files changed, 130 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 6928a35ac3c7..0579c9b253db 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -273,6 +273,7 @@ union kvmppc_one_reg {
u64 addr;
u64 length;
}   vpaval;
+   u64 xive_timaval[2];
 };
 
 struct kvmppc_ops {
@@ -605,6 +606,10 @@ extern int kvmppc_xive_native_connect_vcpu(struct 
kvm_device *dev,
 extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_native_init_module(void);
 extern void kvmppc_xive_native_exit_module(void);
+extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu,
+union kvmppc_one_reg *val);
+extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu,
+union kvmppc_one_reg *val);
 
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
@@ -637,6 +642,12 @@ static inline int kvmppc_xive_native_connect_vcpu(struct 
kvm_device *dev,
 static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
 static inline void kvmppc_xive_native_init_module(void) { }
 static inline void kvmppc_xive_native_exit_module(void) { }
+static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu,
+   union kvmppc_one_reg *val)
+{ return 0; }
+static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu,
+   union kvmppc_one_reg *val)
+{ return -ENOENT; }
 
 #endif /* CONFIG_KVM_XIVE */
 
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 12744608a61c..cd3f16b70a2e 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -482,6 +482,8 @@ struct kvm_ppc_cpu_char {
 #define  KVM_REG_PPC_ICP_PPRI_SHIFT16  /* pending irq priority */
 #define  KVM_REG_PPC_ICP_PPRI_MASK 0xff
 
+#define KVM_REG_PPC_VP_STATE   (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x8d)
+
 /* Device control API: PPC-specific devices */
 #define KVM_DEV_MPIC_GRP_MISC  1
 #define   KVM_DEV_MPIC_BASE_ADDR   0   /* 64-bit */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 7c3348fa27e1..efd15101eef0 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -651,6 +651,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
*val = get_reg_val(id, 
kvmppc_xics_get_icp(vcpu));
break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+   case KVM_REG_PPC_VP_STATE:
+   if (!vcpu->arch.xive_vcpu) {
+   r = -ENXIO;
+   break;
+   }
+   if (xive_enabled())
+   r = kvmppc_xive_native_get_vp(vcpu, val);
+   else
+   r = -ENXIO;
+   break;
+#endif /* CONFIG_KVM_XIVE */
case KVM_REG_PPC_FSCR:
*val = get_reg_val(id, vcpu->arch.fscr);
break;
@@ -724,6 +736,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, 
*val));
break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+   case KVM_REG_PPC_VP_STATE:
+   if (!vcpu->arch.xive_vcpu) {
+   r = -ENXIO;
+   break;
+   }
+   if (xive_enabled())
+   r = kvmppc_xive_native_set_vp(vcpu, val);
+   else
+   r = -ENXIO;
+   break;
+#endif /* CONFIG_KVM_XIVE */
case KVM_REG_PPC_FSCR:
vcpu->arch.fscr = 

[PATCH v4 12/17] KVM: PPC: Book3S HV: XIVE: add a TIMA mapping

2019-03-20 Thread Cédric Le Goater
Each thread has an associated Thread Interrupt Management context
composed of a set of registers. These registers let the thread handle
priority management and interrupt acknowledgment. The most important
are :

- Interrupt Pending Buffer (IPB)
- Current Processor Priority   (CPPR)
- Notification Source Register (NSR)

They are exposed to software in four different pages each proposing a
view with a different privilege. The first page is for the physical
thread context and the second for the hypervisor. Only the third
(operating system) and the fourth (user level) are exposed to the guest.

A custom VM fault handler will populate the VMA with the appropriate
pages, which should only be the OS page for now.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/xive.h|  1 +
 arch/powerpc/include/uapi/asm/kvm.h|  2 ++
 arch/powerpc/kvm/book3s_xive_native.c  | 39 ++
 arch/powerpc/sysdev/xive/native.c  | 11 ++
 Documentation/virtual/kvm/devices/xive.txt | 23 +
 5 files changed, 76 insertions(+)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index c4e88abd3b67..eaf76f57023a 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -23,6 +23,7 @@
  * same offset regardless of where the code is executing
  */
 extern void __iomem *xive_tima;
+extern unsigned long xive_tima_os;
 
 /*
  * Offset in the TM area of our current execution level (provided by
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index cd3f16b70a2e..0998e8edc91a 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -720,4 +720,6 @@ struct kvm_ppc_xive_eq {
 
 #define KVM_XIVE_EQ_ALWAYS_NOTIFY  0x0001
 
+#define KVM_XIVE_TIMA_PAGE_OFFSET  0
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index a8c62e07ebee..0cfad45d8b75 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -165,6 +165,44 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
return rc;
 }
 
+static vm_fault_t xive_native_tima_fault(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+
+   switch (vmf->pgoff - vma->vm_pgoff) {
+   case 0: /* HW - forbid access */
+   case 1: /* HV - forbid access */
+   return VM_FAULT_SIGBUS;
+   case 2: /* OS */
+   vmf_insert_pfn(vma, vmf->address, xive_tima_os >> PAGE_SHIFT);
+   return VM_FAULT_NOPAGE;
+   case 3: /* USER - TODO */
+   default:
+   return VM_FAULT_SIGBUS;
+   }
+}
+
+static const struct vm_operations_struct xive_native_tima_vmops = {
+   .fault = xive_native_tima_fault,
+};
+
+static int kvmppc_xive_native_mmap(struct kvm_device *dev,
+  struct vm_area_struct *vma)
+{
+   /* We only allow mappings at fixed offset for now */
+   if (vma->vm_pgoff == KVM_XIVE_TIMA_PAGE_OFFSET) {
+   if (vma_pages(vma) > 4)
+   return -EINVAL;
+   vma->vm_ops = &xive_native_tima_vmops;
+   } else {
+   return -EINVAL;
+   }
+
+   vma->vm_flags |= VM_IO | VM_PFNMAP;
+   vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
+   return 0;
+}
+
 static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
 u64 addr)
 {
@@ -1043,6 +1081,7 @@ struct kvm_device_ops kvm_xive_native_ops = {
.set_attr = kvmppc_xive_native_set_attr,
.get_attr = kvmppc_xive_native_get_attr,
.has_attr = kvmppc_xive_native_has_attr,
+   .mmap = kvmppc_xive_native_mmap,
 };
 
 void kvmppc_xive_native_init_module(void)
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 0c037e933e55..7782201e5fe8 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -521,6 +521,9 @@ u32 xive_native_default_eq_shift(void)
 }
 EXPORT_SYMBOL_GPL(xive_native_default_eq_shift);
 
+unsigned long xive_tima_os;
+EXPORT_SYMBOL_GPL(xive_tima_os);
+
 bool __init xive_native_init(void)
 {
struct device_node *np;
@@ -573,6 +576,14 @@ bool __init xive_native_init(void)
for_each_possible_cpu(cpu)
kvmppc_set_xive_tima(cpu, r.start, tima);
 
+   /* Resource 2 is OS window */
+   if (of_address_to_resource(np, 2, &r)) {
+   pr_err("Failed to get thread mgmnt area resource\n");
+   return false;
+   }
+
+   xive_tima_os = r.start;
+
/* Grab size of provisionning pages */
xive_parse_provisioning(np);
 
diff --git a/Documentation/virtual/kvm/devices/xive.txt 
b/Documentation/virtual/kvm/devices/xive.txt
index 702836d5ad7a..944fd0971b13 100644
--- 
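
A usage sketch of the TIMA window (assuming the fault handler above;
only the OS page, at index 2 of the 4-page window, is populated for now;
'xive_fd' and 'page_size' are assumptions of the example):

char *tima = mmap(NULL, 4 * page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		  xive_fd, KVM_XIVE_TIMA_PAGE_OFFSET * page_size);
char *tima_os = tima + 2 * page_size;	/* OS view of the thread context */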

[PATCH v4 15/17] KVM: PPC: Book3S HV: XIVE: activate XIVE exploitation mode

2019-03-20 Thread Cédric Le Goater
Full support for the XIVE native exploitation mode is now available,
advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on
PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM
Hypervisor) is not yet available. XIVE should also have been activated,
which is the default setting on POWER9 systems running a recent Linux
kernel.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 arch/powerpc/kvm/powerpc.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0858ee61460..f54926c78320 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -573,10 +573,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
 #ifdef CONFIG_KVM_XIVE
case KVM_CAP_PPC_IRQ_XIVE:
/*
-* Return false until all the XIVE infrastructure is
-* in place including support for migration.
+* We need XIVE to be enabled on the platform (implies
+* a POWER9 processor) and the PowerNV platform, as
+* nested is not yet supported.
 */
-   r = 0;
+   r = xive_enabled() && !!cpu_has_feature(CPU_FTR_HVMODE);
break;
 #endif
 
-- 
2.20.1



[PATCH v4 05/17] KVM: PPC: Book3S HV: XIVE: add a control to configure a source

2019-03-20 Thread Cédric Le Goater
This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
QEMU to configure the target of a source and also to restore the
configuration of a source when migrating the VM.

The XIVE source interrupt structure is extended with the value of the
Effective Interrupt Source Number. The EISN is the interrupt number
pushed in the event queue that the guest OS will use to dispatch
events internally. Caching the EISN value in KVM eases the test when
checking if a reconfiguration is indeed needed.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2:

 - fixed comments on the KVM device attribute definitions
 - handled MASKED EAS configuration
 - fixed locking on source block
 
 arch/powerpc/include/uapi/asm/kvm.h| 11 +++
 arch/powerpc/kvm/book3s_xive.h |  4 +
 arch/powerpc/kvm/book3s_xive.c |  5 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 97 ++
 Documentation/virtual/kvm/devices/xive.txt | 21 +
 5 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index d468294c2a67..e8161e21629b 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -680,9 +680,20 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
 #define KVM_DEV_XIVE_GRP_SOURCE2   /* 64-bit source 
identifier */
+#define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
 #define KVM_XIVE_LEVEL_ASSERTED(1ULL << 1)
 
+/* Layout of 64-bit XIVE source configuration attribute values */
+#define KVM_XIVE_SOURCE_PRIORITY_SHIFT 0
+#define KVM_XIVE_SOURCE_PRIORITY_MASK  0x7
+#define KVM_XIVE_SOURCE_SERVER_SHIFT   3
+#define KVM_XIVE_SOURCE_SERVER_MASK0xfff8ULL
+#define KVM_XIVE_SOURCE_MASKED_SHIFT   32
+#define KVM_XIVE_SOURCE_MASKED_MASK0x1ULL
+#define KVM_XIVE_SOURCE_EISN_SHIFT 33
+#define KVM_XIVE_SOURCE_EISN_MASK  0xfffeULL
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 1be921cb5dcb..ae26fe653d98 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -61,6 +61,9 @@ struct kvmppc_xive_irq_state {
bool saved_p;
bool saved_q;
u8 saved_scan_prio;
+
+   /* Xive native */
+   u32 eisn;   /* Guest Effective IRQ number */
 };
 
 /* Select the "right" interrupt (IPI vs. passthrough) */
@@ -268,6 +271,7 @@ int kvmppc_xive_debug_show_queues(struct seq_file *m, 
struct kvm_vcpu *vcpu);
 struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
struct kvmppc_xive *xive, int irq);
 void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 6c9f9fd0855f..e09f3addffe5 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -342,7 +342,7 @@ static int xive_try_pick_queue(struct kvm_vcpu *vcpu, u8 
prio)
return atomic_add_unless(&q->count, 1, max) ? 0 : -EBUSY;
 }
 
-static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
 {
struct kvm_vcpu *vcpu;
int i, rc;
@@ -530,7 +530,7 @@ static int xive_target_interrupt(struct kvm *kvm,
 * priority. The count for that new target will have
 * already been incremented.
 */
-   rc = xive_select_target(kvm, &server, prio);
+   rc = kvmppc_xive_select_target(kvm, &server, prio);
 
/*
 * We failed to find a target ? Not much we can do
@@ -1504,6 +1504,7 @@ struct kvmppc_xive_src_block 
*kvmppc_xive_create_src_block(
 
for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
sb->irq_state[i].number = (bid << KVMPPC_XICS_ICS_SHIFT) | i;
+   sb->irq_state[i].eisn = 0;
sb->irq_state[i].guest_priority = MASKED;
sb->irq_state[i].saved_priority = MASKED;
sb->irq_state[i].act_priority = MASKED;
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 5f2bd6c137b7..492825a35958 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -242,6 +242,99 @@ static int kvmppc_xive_native_set_source(struct 
kvmppc_xive *xive, long irq,
return rc;
 }
 
+static int kvmppc_xive_native_update_source_config(struct kvmppc_xive *xive,
+   struct kvmppc_xive_src_block *sb,
+   struct 
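
To illustrate the encoding, a sketch of what the QEMU side could do when
configuring source 'irq' (the 64-bit value is passed through a pointer
in 'addr'; 'xive_fd' and the field values are assumptions of the
example):

__u64 kvm_cfg = ((__u64)priority << KVM_XIVE_SOURCE_PRIORITY_SHIFT) |
		((__u64)server   << KVM_XIVE_SOURCE_SERVER_SHIFT)   |
		((__u64)masked   << KVM_XIVE_SOURCE_MASKED_SHIFT)   |
		((__u64)eisn     << KVM_XIVE_SOURCE_EISN_SHIFT);

struct kvm_device_attr attr = {
	.group = KVM_DEV_XIVE_GRP_SOURCE_CONFIG,
	.attr  = irq,				/* 64-bit source identifier */
	.addr  = (__u64)(uintptr_t)&kvm_cfg,	/* pointer to the value */
};

ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr);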

[PATCH v4 09/17] KVM: PPC: Book3S HV: XIVE: add a control to dirty the XIVE EQ pages

2019-03-20 Thread Cédric Le Goater
When migration of a VM is initiated, a first copy of the RAM is
transferred to the destination before the VM is stopped, but there is
no guarantee that the EQ pages in which the event notifications are
queued have not been modified.

To make sure migration will capture a consistent memory state, the
XIVE device should perform a XIVE quiesce sequence to stop the flow of
event notifications and stabilize the EQs. This is the purpose of the
KVM_DEV_XIVE_EQ_SYNC control, which will also mark the EQ pages dirty
to force their transfer.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2 :

 - Extra comments
 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 85 ++
 Documentation/virtual/kvm/devices/xive.txt | 29 
 3 files changed, 115 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index e4abe30f6fc6..12744608a61c 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
 #define   KVM_DEV_XIVE_RESET   1
+#define   KVM_DEV_XIVE_EQ_SYNC 2
 #define KVM_DEV_XIVE_GRP_SOURCE2   /* 64-bit source 
identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index d45dc2ec0557..44ce74086550 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -674,6 +674,88 @@ static int kvmppc_xive_reset(struct kvmppc_xive *xive)
return 0;
 }
 
+static void kvmppc_xive_native_sync_sources(struct kvmppc_xive_src_block *sb)
+{
+   int j;
+
+   for (j = 0; j < KVMPPC_XICS_IRQ_PER_ICS; j++) {
+   struct kvmppc_xive_irq_state *state = &sb->irq_state[j];
+   struct xive_irq_data *xd;
+   u32 hw_num;
+
+   if (!state->valid)
+   continue;
+
+   /*
+* The struct kvmppc_xive_irq_state reflects the state
+* of the EAS configuration and not the state of the
+* source. The source is masked setting the PQ bits to
+* '-Q', which is what is being done before calling
+* the KVM_DEV_XIVE_EQ_SYNC control.
+*
+* If a source EAS is configured, OPAL syncs the XIVE
+* IC of the source and the XIVE IC of the previous
+* target if any.
+*
+* So it should be fine ignoring MASKED sources as
+* they have been synced already.
+*/
+   if (state->act_priority == MASKED)
+   continue;
+
+   kvmppc_xive_select_irq(state, &hw_num, &xd);
+   xive_native_sync_source(hw_num);
+   xive_native_sync_queue(hw_num);
+   }
+}
+
+static int kvmppc_xive_native_vcpu_eq_sync(struct kvm_vcpu *vcpu)
+{
+   struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+   unsigned int prio;
+
+   if (!xc)
+   return -ENOENT;
+
+   for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+   struct xive_q *q = &xc->queues[prio];
+
+   if (!q->qpage)
+   continue;
+
+   /* Mark EQ page dirty for migration */
+   mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qaddr));
+   }
+   return 0;
+}
+
+static int kvmppc_xive_native_eq_sync(struct kvmppc_xive *xive)
+{
+   struct kvm *kvm = xive->kvm;
+   struct kvm_vcpu *vcpu;
+   unsigned int i;
+
+   pr_devel("%s\n", __func__);
+
+   mutex_lock(&kvm->lock);
+   for (i = 0; i <= xive->max_sbid; i++) {
+   struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+   if (sb) {
+   arch_spin_lock(&sb->lock);
+   kvmppc_xive_native_sync_sources(sb);
+   arch_spin_unlock(&sb->lock);
+   }
+   }
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   kvmppc_xive_native_vcpu_eq_sync(vcpu);
+   }
+   mutex_unlock(&kvm->lock);
+
+   return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
   struct kvm_device_attr *attr)
 {
@@ -684,6 +766,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device 
*dev,
switch (attr->attr) {
case KVM_DEV_XIVE_RESET:
return kvmppc_xive_reset(xive);
+   case KVM_DEV_XIVE_EQ_SYNC:
+   return kvmppc_xive_native_eq_sync(xive);
}
break;
case 
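
For completeness, a sketch of the userspace side (a control with no
extra data, issued on the XIVE device fd before the final RAM transfer;
'xive_fd' is an assumption of the example):

struct kvm_device_attr attr = {
	.group = KVM_DEV_XIVE_GRP_CTRL,
	.attr  = KVM_DEV_XIVE_EQ_SYNC,
};

ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr);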

Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode

2019-03-20 Thread Ondrej Mosnáček
Hi Daniel,

On Fri 15 Mar 2019 at 3:09, Daniel Axtens  wrote:
> The original assembly imported from OpenSSL has two copy-paste
> errors in handling CTR mode. When dealing with a 2 or 3 block tail,
> the code branches to the CBC decryption exit path, rather than to
> the CTR exit path.
>
> This leads to corruption of the IV, which leads to subsequent blocks
> being corrupted.
>
> This can be detected with libkcapi test suite, which is available at
> https://github.com/smuellerDD/libkcapi
>
> Reported-by: Ondrej Mosnáček 
> Fixes: 5c380d623ed3 ("crypto: vmx - Add support for VMS instructions by ASM")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Daniel Axtens 

Thank you for looking into this and for posting the patch(es)! I
tested the patch yesterday and I can confirm that it makes the
libkcapi tests/reproducer pass.

Assuming you will want to cover the other failures from the new
testmgr tests by a separate patch:

Tested-by: Ondrej Mosnacek 

> ---
>  drivers/crypto/vmx/aesp8-ppc.pl | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/crypto/vmx/aesp8-ppc.pl b/drivers/crypto/vmx/aesp8-ppc.pl
> index d6a9f63d65ba..de78282b8f44 100644
> --- a/drivers/crypto/vmx/aesp8-ppc.pl
> +++ b/drivers/crypto/vmx/aesp8-ppc.pl
> @@ -1854,7 +1854,7 @@ Lctr32_enc8x_three:
> stvx_u  $out1,$x10,$out
> stvx_u  $out2,$x20,$out
> addi$out,$out,0x30
> -   b   Lcbc_dec8x_done
> +   b   Lctr32_enc8x_done
>
>  .align 5
>  Lctr32_enc8x_two:
> @@ -1866,7 +1866,7 @@ Lctr32_enc8x_two:
> stvx_u  $out0,$x00,$out
> stvx_u  $out1,$x10,$out
> addi$out,$out,0x20
> -   b   Lcbc_dec8x_done
> +   b   Lctr32_enc8x_done
>
>  .align 5
>  Lctr32_enc8x_one:
> --
> 2.19.1
>


[PATCH v4 17/17] KVM: PPC: Book3S HV: XIVE: clear the vCPU interrupt presenters

2019-03-20 Thread Cédric Le Goater
When the VM boots, the CAS negotiation process determines which
interrupt mode to use and invokes a machine reset. At that time, the
previous KVM interrupt device is 'destroyed' before the chosen one is
created. Upon destruction, the vCPU interrupt presenters using the KVM
device should be cleared first; the machine will reconnect them to the
new device later, after it is created.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2 :

 - removed comments on possible race in kvmppc_native_connect_vcpu()
   for the XIVE KVM device. This is still an issue in the
   XICS-over-XIVE device.
   
 arch/powerpc/kvm/book3s_xics.c| 19 +
 arch/powerpc/kvm/book3s_xive.c| 39 +--
 arch/powerpc/kvm/book3s_xive_native.c | 12 +
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index f27ee57ab46e..81cdabf4295f 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1342,6 +1342,25 @@ static void kvmppc_xics_free(struct kvm_device *dev)
struct kvmppc_xics *xics = dev->private;
int i;
struct kvm *kvm = xics->kvm;
+   struct kvm_vcpu *vcpu;
+
+   /*
+* When destroying the VM, the vCPUs are destroyed first and
+* the vCPU list should be empty. If this is not the case,
+* then we are simply destroying the device and we should
+* clean up the vCPU interrupt presenters first.
+*/
+   if (atomic_read(&kvm->online_vcpus) != 0) {
+   /*
+* call kick_all_cpus_sync() to ensure that all CPUs
+* have executed any pending interrupts
+*/
+   if (is_kvmppc_hv_enabled(kvm))
+   kick_all_cpus_sync();
+
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   kvmppc_xics_free_icp(vcpu);
+   }
 
debugfs_remove(xics->dentry);
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 480a3fc6b9fd..cf6a4c6c5a28 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1100,11 +1100,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct 
kvm_vcpu *vcpu)
 void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
-   struct kvmppc_xive *xive = xc->xive;
+   struct kvmppc_xive *xive;
int i;
 
+   if (!kvmppc_xics_enabled(vcpu))
+   return;
+
+   if (!xc)
+   return;
+
pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num);
 
+   xive = xc->xive;
+
/* Ensure no interrupt is still routed to that VP */
xc->valid = false;
kvmppc_xive_disable_vcpu_interrupts(vcpu);
@@ -1141,6 +1149,10 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
}
/* Free the VP */
kfree(xc);
+
+   /* Cleanup the vcpu */
+   vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+   vcpu->arch.xive_vcpu = NULL;
 }
 
 int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
@@ -1158,7 +1170,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
}
if (xive->kvm != vcpu->kvm)
return -EPERM;
-   if (vcpu->arch.irq_type)
+   if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
return -EBUSY;
if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
pr_devel("Duplicate !\n");
@@ -1828,8 +1840,31 @@ static void kvmppc_xive_free(struct kvm_device *dev)
 {
struct kvmppc_xive *xive = dev->private;
struct kvm *kvm = xive->kvm;
+   struct kvm_vcpu *vcpu;
int i;
 
+   /*
+* When destroying the VM, the vCPUs are destroyed first and
+* the vCPU list should be empty. If this is not the case,
+* then we are simply destroying the device and we should
+* clean up the vCPU interrupt presenters first.
+*/
+   if (atomic_read(&kvm->online_vcpus) != 0) {
+   /*
+* call kick_all_cpus_sync() to ensure that all CPUs
+* have executed any pending interrupts
+*/
+   if (is_kvmppc_hv_enabled(kvm))
+   kick_all_cpus_sync();
+
+   /*
+* TODO: There is still a race window with the early
+* checks in kvmppc_native_connect_vcpu()
+*/
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   kvmppc_xive_cleanup_vcpu(vcpu);
+   }
+
debugfs_remove(xive->dentry);
 
if (kvm)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 6a502eee6744..96e6b5c50eb3 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -961,8 +961,20 @@ static void kvmppc_xive_native_free(struct kvm_device *dev)
 {
struct kvmppc_xive *xive = 

[PATCH v4 16/17] KVM: introduce a KVM_DESTROY_DEVICE ioctl

2019-03-20 Thread Cédric Le Goater
The 'destroy' method is currently used to destroy all devices when the
VM is destroyed after the vCPUs have been freed.

This new KVM ioctl exposes the same KVM device method. It acts as a
software reset of the VM to 'destroy' selected devices when necessary
and to perform the required cleanups on the vCPUs. It is called with
the kvm->lock held.

The 'destroy' method could be improved by returning an error code.

Cc: Paolo Bonzini 
Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v3 :

 - Removed temporary TODO comment in kvm_ioctl_destroy_device()
   regarding kvm_put_kvm()
 
 Changes since v2 :

 - checked that device is owned by VM
 
 include/uapi/linux/kvm.h  |  7 ++
 virt/kvm/kvm_main.c   | 41 +++
 Documentation/virtual/kvm/api.txt | 20 +++
 3 files changed, 68 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52bf74a1616e..d78fafa54274 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1183,6 +1183,11 @@ struct kvm_create_device {
__u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
 };
 
+struct kvm_destroy_device {
+   __u32   fd; /* in: device handle */
+   __u32   flags;  /* in: unused */
+};
+
 struct kvm_device_attr {
__u32   flags;  /* no flags currently defined */
__u32   group;  /* device-defined */
@@ -1331,6 +1336,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+#define KVM_DESTROY_DEVICE   _IOWR(KVMIO,  0xf0, struct kvm_destroy_device)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e2fa5c7dd1a..9601c2ddecc5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3032,6 +3032,33 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
return 0;
 }
 
+static int kvm_ioctl_destroy_device(struct kvm *kvm,
+   struct kvm_destroy_device *dd)
+{
+   struct fd f;
+   struct kvm_device *dev;
+
+   f = fdget(dd->fd);
+   if (!f.file)
+   return -EBADF;
+
+   dev = kvm_device_from_filp(f.file);
+   fdput(f);
+
+   if (!dev)
+   return -ENODEV;
+
+   if (dev->kvm != kvm)
+   return -EPERM;
+
+   mutex_lock(&kvm->lock);
+   list_del(&dev->vm_node);
+   dev->ops->destroy(dev);
+   mutex_unlock(&kvm->lock);
+
+   return 0;
+}
+
 static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
switch (arg) {
@@ -3276,6 +3303,20 @@ static long kvm_vm_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_DESTROY_DEVICE: {
+   struct kvm_destroy_device dd;
+
+   r = -EFAULT;
+   if (copy_from_user(&dd, argp, sizeof(dd)))
+   goto out;
+
+   r = kvm_ioctl_destroy_device(kvm, &dd);
+   if (r)
+   goto out;
+
+   r = 0;
+   break;
+   }
case KVM_CHECK_EXTENSION:
r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
break;
diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 8022ecce2c47..abe8433adf4f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3874,6 +3874,26 @@ number of valid entries in the 'entries' array, which is 
then filled.
 'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
 userspace should not expect to get any particular value there.
 
+4.119 KVM_DESTROY_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_destroy_device (in)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EPERM: The device does not belong to the VM
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Destroys an emulated device in the kernel.
+
+struct kvm_destroy_device {
+   __u32   fd; /* in: device handle */
+   __u32   flags;  /* unused */
+};
+
 5. The kvm_run structure
 
 
-- 
2.20.1
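
A short userspace sketch of the new ioctl (the fd is the device handle
returned by KVM_CREATE_DEVICE; 'vm_fd' and 'xive_fd' are assumptions of
the example):

struct kvm_destroy_device dd = {
	.fd    = xive_fd,	/* device handle from KVM_CREATE_DEVICE */
	.flags = 0,
};

ioctl(vm_fd, KVM_DESTROY_DEVICE, &dd);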



[PATCH v4 14/17] KVM: PPC: Book3S HV: XIVE: add passthrough support

2019-03-20 Thread Cédric Le Goater
The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
implement an IRQ space for the guest using the generic IPI interrupts
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
Interrupt management is performed in the XIVE way: using loads and
stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information
on the interrupts, among which the xive_irq_data struct containing the
addresses of the IPI ESB pages and an extra one in case of pass-through.
The latter contains the addresses of the ESB pages of the underlying HW
controller interrupts, PHB4 in all cases for now.

A guest, when running in the XICS legacy interrupt mode, lets the KVM
XICS-over-XIVE device "handle" interrupt management, that is to
perform the loads and stores on the addresses of the ESB pages of the
guest interrupts. However, when running in XIVE native exploitation
mode, the KVM XIVE native device exposes the interrupt ESB pages to
the guest and lets the guest perform directly the loads and stores.

The VMA exposing the ESB pages makes use of a custom VM fault handler
whose role is to populate the VMA with the appropriate pages. When a fault
occurs, the guest IRQ number is deduced from the offset, and the ESB
pages of the associated XIVE IPI interrupt are inserted in the VMA (using
the internal structure caching information on the interrupts).

Supporting device passthrough in the guest running in XIVE native
exploitation mode adds some extra refinements because the ESB pages
of a different HW controller (PHB4) need to be exposed to the guest
along with the initial IPI ESB pages of the XIVE IC controller. But
the overall mechanic is the same.

When the device HW irqs are mapped into or unmapped from the guest
IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped(), are called to record or clear the
passthrough interrupt information and to perform the switch.

The approach taken by this patch is to clear the ESB pages of the
guest IRQ number being mapped and let the VM fault handler repopulate.
The handler will insert the ESB page corresponding to the HW interrupt
of the device being passed-through or the initial IPI ESB page if the
device is being removed.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2 :

 - extra comment in documentation

 arch/powerpc/kvm/book3s_xive.h |  9 +
 arch/powerpc/kvm/book3s_xive.c | 15 
 arch/powerpc/kvm/book3s_xive_native.c  | 41 ++
 Documentation/virtual/kvm/devices/xive.txt | 19 ++
 4 files changed, 84 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 622f594d93e1..e011622dc038 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -94,6 +94,11 @@ struct kvmppc_xive_src_block {
struct kvmppc_xive_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS];
 };
 
+struct kvmppc_xive;
+
+struct kvmppc_xive_ops {
+   int (*reset_mapped)(struct kvm *kvm, unsigned long guest_irq);
+};
 
 struct kvmppc_xive {
struct kvm *kvm;
@@ -132,6 +137,10 @@ struct kvmppc_xive {
 
/* Flags */
u8  single_escalation;
+
+   struct kvmppc_xive_ops *ops;
+   struct address_space   *mapping;
+   struct mutex mapping_lock;
 };
 
 #define KVMPPC_XIVE_Q_COUNT8
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index c1b7aa7dbc28..480a3fc6b9fd 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -937,6 +937,13 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
/* Turn the IPI hard off */
xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
 
+   /*
+* Reset ESB guest mapping. Needed when ESB pages are exposed
+* to the guest in XIVE native mode
+*/
+   if (xive->ops && xive->ops->reset_mapped)
+   xive->ops->reset_mapped(kvm, guest_irq);
+
/* Grab info about irq */
state->pt_number = hw_irq;
state->pt_data = irq_data_get_irq_handler_data(host_data);
@@ -1022,6 +1029,14 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned 
long guest_irq,
state->pt_number = 0;
state->pt_data = NULL;
 
+   /*
+* Reset ESB guest mapping. Needed when ESB pages are exposed
+* to the guest in XIVE native mode
+*/
+   if (xive->ops && xive->ops->reset_mapped) {
+   xive->ops->reset_mapped(kvm, guest_irq);
+   }
+
/* Reconfigure the IPI */
xive_native_configure_irq(state->ipi_number,
  kvmppc_xive_vp(xive, state->act_server),
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index d0a055030efd..6a502eee6744 100644
--- 
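
The reset_mapped() implementation is cut off above; as a rough sketch of
the mechanism described in the changelog (assuming the 'mapping' and
'mapping_lock' fields added to struct kvmppc_xive, with a helper name
made up for the example), invalidating the two ESB pages of a guest IRQ
so that the fault handler repopulates them could look like:

static int xive_native_drop_esb_pages(struct kvmppc_xive *xive,
				      unsigned long guest_irq)
{
	pgoff_t pgoff = KVM_XIVE_ESB_PAGE_OFFSET + guest_irq * 2;

	mutex_lock(&xive->mapping_lock);
	if (xive->mapping)
		/* unmap both ESB pages; the next access faults them back in */
		unmap_mapping_range(xive->mapping, pgoff << PAGE_SHIFT,
				    2ull << PAGE_SHIFT, 1);
	mutex_unlock(&xive->mapping_lock);

	return 0;
}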

[PATCH v4 08/17] KVM: PPC: Book3S HV: XIVE: add a control to sync the sources

2019-03-20 Thread Cédric Le Goater
This control will be used by the H_INT_SYNC hcall from QEMU to flush
event notifications on the XIVE IC owning the source.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2 :

 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 36 ++
 Documentation/virtual/kvm/devices/xive.txt |  8 +
 3 files changed, 45 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index f045f9dee42e..e4abe30f6fc6 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -683,6 +683,7 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_SOURCE2   /* 64-bit source 
identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
+#define KVM_DEV_XIVE_GRP_SOURCE_SYNC   5   /* 64-bit source identifier */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index b54d6fa978fe..d45dc2ec0557 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -335,6 +335,38 @@ static int kvmppc_xive_native_set_source_config(struct 
kvmppc_xive *xive,
   priority, masked, eisn);
 }
 
+static int kvmppc_xive_native_sync_source(struct kvmppc_xive *xive,
+ long irq, u64 addr)
+{
+   struct kvmppc_xive_src_block *sb;
+   struct kvmppc_xive_irq_state *state;
+   struct xive_irq_data *xd;
+   u32 hw_num;
+   u16 src;
+   int rc = 0;
+
+   pr_devel("%s irq=0x%lx", __func__, irq);
+
+   sb = kvmppc_xive_find_source(xive, irq, &src);
+   if (!sb)
+   return -ENOENT;
+
+   state = &sb->irq_state[src];
+
+   rc = -EINVAL;
+
+   arch_spin_lock(&sb->lock);
+
+   if (state->valid) {
+   kvmppc_xive_select_irq(state, &hw_num, &xd);
+   xive_native_sync_source(hw_num);
+   rc = 0;
+   }
+
+   arch_spin_unlock(&sb->lock);
+   return rc;
+}
+
 static int xive_native_validate_queue_size(u32 qshift)
 {
/*
@@ -663,6 +695,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device 
*dev,
case KVM_DEV_XIVE_GRP_EQ_CONFIG:
return kvmppc_xive_native_set_queue_config(xive, attr->attr,
   attr->addr);
+   case KVM_DEV_XIVE_GRP_SOURCE_SYNC:
+   return kvmppc_xive_native_sync_source(xive, attr->attr,
+ attr->addr);
}
return -ENXIO;
 }
@@ -692,6 +727,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device 
*dev,
break;
case KVM_DEV_XIVE_GRP_SOURCE:
case KVM_DEV_XIVE_GRP_SOURCE_CONFIG:
+   case KVM_DEV_XIVE_GRP_SOURCE_SYNC:
if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
attr->attr < KVMPPC_XIVE_NR_IRQS)
return 0;
diff --git a/Documentation/virtual/kvm/devices/xive.txt 
b/Documentation/virtual/kvm/devices/xive.txt
index acd5cb9d1339..26fc918b02fb 100644
--- a/Documentation/virtual/kvm/devices/xive.txt
+++ b/Documentation/virtual/kvm/devices/xive.txt
@@ -92,3 +92,11 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
 -EINVAL: Invalid queue address
 -EFAULT: Invalid user pointer for attr->addr.
 -EIO:Configuration of the underlying HW failed
+
+  5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
+  Synchronize the source to flush event notifications
+  Attributes:
+Interrupt source number  (64-bit)
+  Errors:
+-ENOENT: Unknown source number
+-EINVAL: Not initialized source number
-- 
2.20.1



[PATCH v4 07/17] KVM: PPC: Book3S HV: XIVE: add a global reset control

2019-03-20 Thread Cédric Le Goater
This control is to be used by the H_INT_RESET hcall from QEMU. Its
purpose is to clear all configuration of the sources and EQs. This is
necessary in case of a kexec (for a kdump kernel for instance) to make
sure that no remaining configuration is left from the previous boot
setup so that the new kernel can start safely from a clean state.

Queue 7 is ignored when the XIVE device is configured to run in
single escalation mode, as priority 7 is used by escalations.

The XIVE VP is kept enabled as the vCPU is still active and connected
to the XIVE device.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2 :

 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 85 ++
 Documentation/virtual/kvm/devices/xive.txt |  5 ++
 3 files changed, 91 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 85005400fd86..f045f9dee42e 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -679,6 +679,7 @@ struct kvm_ppc_cpu_char {
 
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
+#define   KVM_DEV_XIVE_RESET   1
 #define KVM_DEV_XIVE_GRP_SOURCE2   /* 64-bit source 
identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 2c335454da72..b54d6fa978fe 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -565,6 +565,83 @@ static int kvmppc_xive_native_get_queue_config(struct 
kvmppc_xive *xive,
return 0;
 }
 
+static void kvmppc_xive_reset_sources(struct kvmppc_xive_src_block *sb)
+{
+   int i;
+
+   for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+   struct kvmppc_xive_irq_state *state = &sb->irq_state[i];
+
+   if (!state->valid)
+   continue;
+
+   if (state->act_priority == MASKED)
+   continue;
+
+   state->eisn = 0;
+   state->act_server = 0;
+   state->act_priority = MASKED;
+   xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
+   xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
+   if (state->pt_number) {
+   xive_vm_esb_load(state->pt_data, XIVE_ESB_SET_PQ_01);
+   xive_native_configure_irq(state->pt_number,
+ 0, MASKED, 0);
+   }
+   }
+}
+
+static int kvmppc_xive_reset(struct kvmppc_xive *xive)
+{
+   struct kvm *kvm = xive->kvm;
+   struct kvm_vcpu *vcpu;
+   unsigned int i;
+
+   pr_devel("%s\n", __func__);
+
+   mutex_lock(&kvm->lock);
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+   unsigned int prio;
+
+   if (!xc)
+   continue;
+
+   kvmppc_xive_disable_vcpu_interrupts(vcpu);
+
+   for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+
+   /* Single escalation, no queue 7 */
+   if (prio == 7 && xive->single_escalation)
+   break;
+
+   if (xc->esc_virq[prio]) {
+   free_irq(xc->esc_virq[prio], vcpu);
+   irq_dispose_mapping(xc->esc_virq[prio]);
+   kfree(xc->esc_virq_names[prio]);
+   xc->esc_virq[prio] = 0;
+   }
+
+   kvmppc_xive_native_cleanup_queue(vcpu, prio);
+   }
+   }
+
+   for (i = 0; i <= xive->max_sbid; i++) {
+   struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+   if (sb) {
+   arch_spin_lock(&sb->lock);
+   kvmppc_xive_reset_sources(sb);
+   arch_spin_unlock(&sb->lock);
+   }
+   }
+
+   mutex_unlock(&kvm->lock);
+
+   return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
   struct kvm_device_attr *attr)
 {
@@ -572,6 +649,10 @@ static int kvmppc_xive_native_set_attr(struct kvm_device 
*dev,
 
switch (attr->group) {
case KVM_DEV_XIVE_GRP_CTRL:
+   switch (attr->attr) {
+   case KVM_DEV_XIVE_RESET:
+   return kvmppc_xive_reset(xive);
+   }
break;
case KVM_DEV_XIVE_GRP_SOURCE:
return kvmppc_xive_native_set_source(xive, attr->attr,
@@ -604,6 +685,10 @@ static int kvmppc_xive_native_has_attr(struct kvm_device 
*dev,
 {
 

[PATCH v4 06/17] KVM: PPC: Book3S HV: XIVE: add controls for the EQ configuration

2019-03-20 Thread Cédric Le Goater
These controls will be used by the H_INT_SET_QUEUE_CONFIG and
H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
Event Queue in the XIVE IC. They will also be used to restore the
configuration of the XIVE EQs and to capture the internal run-time
state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
the EQ toggle bit and EQ index which are updated by the XIVE IC when
event notifications are enqueued in the EQ.

The value of the guest physical address of the event queue is saved in
the XIVE internal xive_q structure for later use. That is when
migration needs to mark the EQ pages dirty to capture a consistent
memory state of the VM.

Note that H_INT_SET_QUEUE_CONFIG does not require the extra OPAL call
that sets the EQ toggle bit and EQ index to configure the EQ, but
restoring the EQ state does.
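
As an illustration only, here is a minimal userspace sketch of how QEMU
could drive this control when servicing H_INT_SET_QUEUE_CONFIG. Only
struct kvm_ppc_xive_eq, KVM_DEV_XIVE_GRP_EQ_CONFIG and the
KVM_XIVE_EQ_* layout macros come from this patch; the helper itself,
the error handling and the availability of the updated uapi headers are
assumptions.

/* Hypothetical helper, not part of this patch. */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <asm/kvm.h>

static int xive_native_set_queue(int xive_fd, uint32_t server, uint8_t prio,
				 uint64_t qaddr, uint32_t qshift)
{
	struct kvm_ppc_xive_eq eq;
	struct kvm_device_attr attr;
	uint64_t eq_id;

	/* 64-bit EQ identifier: priority in bits 0-2, server number above */
	eq_id = (prio & KVM_XIVE_EQ_PRIORITY_MASK) |
		((uint64_t)server << KVM_XIVE_EQ_SERVER_SHIFT);

	memset(&eq, 0, sizeof(eq));
	eq.flags  = KVM_XIVE_EQ_ALWAYS_NOTIFY;	/* only flag allowed by sPAPR */
	eq.qshift = qshift;			/* e.g. 16 for a 64K EQ page */
	eq.qaddr  = qaddr;			/* guest physical address of the EQ */
	/* qtoggle/qindex left at 0: fresh EQ, no run-time state to restore */

	memset(&attr, 0, sizeof(attr));
	attr.group = KVM_DEV_XIVE_GRP_EQ_CONFIG;
	attr.attr  = eq_id;
	attr.addr  = (uint64_t)(uintptr_t)&eq;

	return ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr);
}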

Signed-off-by: Cédric Le Goater 
---

 Changes since v3 :

 - fix the test on the initial setting of the EQ toggle bit : 0 -> 1
 - renamed qsize to qshift
 - renamed qpage to qaddr
 - checked host page size
 - limited flags to KVM_XIVE_EQ_ALWAYS_NOTIFY to fit sPAPR specs
 
 Changes since v2 :
 
 - fixed comments on the KVM device attribute definitions
 - fixed check on supported EQ size to restrict to 64K pages
 - checked kvm_eq.flags that need to be zero
 - removed the OPAL call when EQ qtoggle bit and index are zero. 

 arch/powerpc/include/asm/xive.h|   2 +
 arch/powerpc/include/uapi/asm/kvm.h|  19 ++
 arch/powerpc/kvm/book3s_xive.h |   2 +
 arch/powerpc/kvm/book3s_xive.c |  15 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 242 +
 Documentation/virtual/kvm/devices/xive.txt |  34 +++
 6 files changed, 308 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index b579a943407b..c4e88abd3b67 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -73,6 +73,8 @@ struct xive_q {
u32 esc_irq;
 	atomic_t		count;
 	atomic_t		pending_count;
+   u64 guest_qaddr;
+   u32 guest_qshift;
 };
 
 /* Global enable flags for the XIVE support */
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index e8161e21629b..85005400fd86 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_CTRL  1
 #define KVM_DEV_XIVE_GRP_SOURCE	2	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
+#define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
@@ -696,4 +697,22 @@ struct kvm_ppc_cpu_char {
 #define KVM_XIVE_SOURCE_EISN_SHIFT 33
 #define KVM_XIVE_SOURCE_EISN_MASK  0xfffeULL
 
+/* Layout of 64-bit EQ identifier */
+#define KVM_XIVE_EQ_PRIORITY_SHIFT 0
+#define KVM_XIVE_EQ_PRIORITY_MASK  0x7
+#define KVM_XIVE_EQ_SERVER_SHIFT   3
+#define KVM_XIVE_EQ_SERVER_MASK0xfff8ULL
+
+/* Layout of EQ configuration values (64 bytes) */
+struct kvm_ppc_xive_eq {
+   __u32 flags;
+   __u32 qshift;
+   __u64 qaddr;
+   __u32 qtoggle;
+   __u32 qindex;
+   __u8  pad[40];
+};
+
+#define KVM_XIVE_EQ_ALWAYS_NOTIFY  0x0001
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index ae26fe653d98..622f594d93e1 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -272,6 +272,8 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
struct kvmppc_xive *xive, int irq);
 void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
 int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
+int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
+ bool single_escalation);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e09f3addffe5..c1b7aa7dbc28 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -166,7 +166,8 @@ static irqreturn_t xive_esc_irq(int irq, void *data)
return IRQ_HANDLED;
 }
 
-static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
+int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
+ bool single_escalation)
 {
struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
	struct xive_q *q = &xc->queues[prio];
@@ -185,7 +186,7 @@ static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
return -EIO;
}
 
-   if 

[PATCH v4 04/17] KVM: PPC: Book3S HV: XIVE: add a control to initialize a source

2019-03-20 Thread Cédric Le Goater
The XIVE KVM device maintains a list of interrupt sources for the VM
which are allocated in the pool of generic interrupts (IPIs) of the
main XIVE IC controller. These are used for the CPU IPIs as well as
for virtual device interrupts. The IRQ number space is defined by
QEMU.

The XIVE device reuses the source structures of the XICS-on-XIVE
device for the source blocks (2-level tree) and for the source
interrupts. Under XIVE native, the source interrupt structure mostly
caches configuration information and is used less than under the
XICS-on-XIVE device, in which hcalls are still necessary at run-time.

When a source is initialized in KVM, an IPI interrupt source is simply
allocated at the OPAL level and then MASKED. KVM only needs to know
about its type: LSI or MSI.
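
For illustration, a minimal userspace sketch of creating such a source;
only KVM_DEV_XIVE_GRP_SOURCE and the KVM_XIVE_LEVEL_SENSITIVE flag come
from this patch, the helper name and the use of the attribute address
to pass the 64-bit value are assumptions.

/* Hypothetical helper, not part of this patch. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <asm/kvm.h>

static int xive_native_create_source(int xive_fd, uint32_t girq, int lsi)
{
	uint64_t val = lsi ? KVM_XIVE_LEVEL_SENSITIVE : 0;
	struct kvm_device_attr attr = {
		.group = KVM_DEV_XIVE_GRP_SOURCE,
		.attr  = girq,				/* guest IRQ number */
		.addr  = (uint64_t)(uintptr_t)&val,	/* LSI or MSI */
	};

	/* The device allocates an IPI at the OPAL level and leaves it MASKED */
	return ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr);
}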

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2:

 - extra documentation in commit log
 - fixed comments on XIVE IRQ number space
 - removed usage of the __x_* macros
 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|   5 +
 arch/powerpc/kvm/book3s_xive.h |  10 ++
 arch/powerpc/kvm/book3s_xive.c |   8 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 106 +
 Documentation/virtual/kvm/devices/xive.txt |  15 +++
 5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index be0ce1f17625..d468294c2a67 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -679,5 +679,10 @@ struct kvm_ppc_cpu_char {
 
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
+#define KVM_DEV_XIVE_GRP_SOURCE	2	/* 64-bit source identifier */
+
+/* Layout of 64-bit XIVE source attribute values */
+#define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
+#define KVM_XIVE_LEVEL_ASSERTED(1ULL << 1)
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index d366df69b9cb..1be921cb5dcb 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -12,6 +12,13 @@
 #ifdef CONFIG_KVM_XICS
 #include "book3s_xics.h"
 
+/*
+ * The XIVE Interrupt source numbers are within the range 0 to
+ * KVMPPC_XICS_NR_IRQS.
+ */
+#define KVMPPC_XIVE_FIRST_IRQ  0
+#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
+
 /*
  * State for one guest irq source.
  *
@@ -258,6 +265,9 @@ extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
  */
 void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
 int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
+struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
+   struct kvmppc_xive *xive, int irq);
+void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e7f1ada1c3de..6c9f9fd0855f 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1480,8 +1480,8 @@ static int xive_get_source(struct kvmppc_xive *xive, long irq, u64 addr)
return 0;
 }
 
-static struct kvmppc_xive_src_block *xive_create_src_block(struct kvmppc_xive *xive,
-  int irq)
+struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
+   struct kvmppc_xive *xive, int irq)
 {
struct kvm *kvm = xive->kvm;
struct kvmppc_xive_src_block *sb;
@@ -1560,7 +1560,7 @@ static int xive_set_source(struct kvmppc_xive *xive, long irq, u64 addr)
	sb = kvmppc_xive_find_source(xive, irq, &idx);
if (!sb) {
pr_devel("No source, creating source block...\n");
-   sb = xive_create_src_block(xive, irq);
+   sb = kvmppc_xive_create_src_block(xive, irq);
if (!sb) {
pr_devel("Failed to create block...\n");
return -ENOMEM;
@@ -1784,7 +1784,7 @@ static void kvmppc_xive_cleanup_irq(u32 hw_num, struct xive_irq_data *xd)
xive_cleanup_irq_data(xd);
 }
 
-static void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
+void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
 {
int i;
 
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 6fa73cfd9d9c..5f2bd6c137b7 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -26,6 +26,17 @@
 
 #include "book3s_xive.h"
 
+static u8 xive_vm_esb_load(struct xive_irq_data *xd, u32 offset)
+{
+   u64 val;
+
+   if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
+   offset |= offset << 4;
+
+   val = in_be64(xd->eoi_mmio + offset);
+   return (u8)val;
+}
+
 static void kvmppc_xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
 {
   

[PATCH v4 03/17] KVM: PPC: Book3S HV: XIVE: introduce a new capability KVM_CAP_PPC_IRQ_XIVE

2019-03-20 Thread Cédric Le Goater
The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
let QEMU connect the vCPU presenters to the XIVE KVM device if
required. The capability is not advertised for now as the full support
for the XIVE native exploitation mode is not yet available. When it
is, the capability will be advertised on PowerNV Hypervisors
only. Nested guests (pseries KVM Hypervisor) are not supported.

Internally, the interface to the new KVM device is protected with a
new interrupt mode: KVMPPC_IRQ_XIVE.
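
As an illustration, a sketch of how QEMU could connect one vCPU once
the capability is advertised. The KVM_ENABLE_CAP argument layout
(args[0] = XIVE device fd, args[1] = vCPU server number) is an
assumption modelled on the existing XICS capability; only
KVM_CAP_PPC_IRQ_XIVE itself comes from this patch.

/* Hypothetical helper, not part of this patch. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int xive_native_connect_vcpu(int kvm_fd, int vcpu_fd, int xive_fd,
				    uint32_t server)
{
	struct kvm_enable_cap cap = {
		.cap  = KVM_CAP_PPC_IRQ_XIVE,
		.args = { xive_fd, server },
	};

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_IRQ_XIVE) <= 0)
		return -1;	/* not advertised yet, stay on the XICS path */

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}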

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2:

 - made use of the xive_vp() macro to compute VP identifiers
 - reworked locking in kvmppc_xive_native_connect_vcpu() to fix races 
 - stop advertising KVM_CAP_PPC_IRQ_XIVE as support is not fully
   available yet 

 arch/powerpc/include/asm/kvm_host.h   |   1 +
 arch/powerpc/include/asm/kvm_ppc.h|  13 +++
 arch/powerpc/kvm/book3s_xive.h|  11 ++
 include/uapi/linux/kvm.h  |   1 +
 arch/powerpc/kvm/book3s_xive.c|  88 ---
 arch/powerpc/kvm/book3s_xive_native.c | 150 ++
 arch/powerpc/kvm/powerpc.c|  36 +++
 Documentation/virtual/kvm/api.txt |   9 ++
 8 files changed, 268 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 008523224e7a..9cc6abdce1b9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -450,6 +450,7 @@ struct kvmppc_passthru_irqmap {
 #define KVMPPC_IRQ_DEFAULT 0
 #define KVMPPC_IRQ_MPIC		1
 #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
+#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
 
 #define MMIO_HPTE_CACHE_SIZE   4
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f3383e76017a..6928a35ac3c7 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -595,6 +595,14 @@ extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
 
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
+}
+
+extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+  struct kvm_vcpu *vcpu, u32 cpu);
+extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_native_init_module(void);
 extern void kvmppc_xive_native_exit_module(void);
 
@@ -622,6 +630,11 @@ static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 ir
  int level, bool line_status) { return -ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
 
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+   { return 0; }
+static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+ struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
+static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
 static inline void kvmppc_xive_native_init_module(void) { }
 static inline void kvmppc_xive_native_exit_module(void) { }
 
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index a08ae6fd4c51..d366df69b9cb 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -198,6 +198,11 @@ static inline struct kvmppc_xive_src_block *kvmppc_xive_find_source(struct kvmpp
return xive->src_blocks[bid];
 }
 
+static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
+{
+   return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
+}
+
 /*
  * Mapping between guest priorities and host priorities
  * is as follow.
@@ -248,5 +253,11 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server,
 extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr);
 extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
 
+/*
+ * Common Xive routines for XICS-over-XIVE and XIVE native
+ */
+void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
+int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
+
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e6368163d3a0..52bf74a1616e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
 #define KVM_CAP_HYPERV_CPUID 167
+#define KVM_CAP_PPC_IRQ_XIVE 168
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index f78d002f0fe0..e7f1ada1c3de 

[PATCH v4 02/17] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-03-20 Thread Cédric Le Goater
This is the basic framework for the new KVM device supporting the XIVE
native exploitation mode. The user interface exposes a new KVM device
to be created by QEMU, only available when running on an L0 hypervisor.
Support for nested guests is not available yet.

The XIVE device reuses the device structure of the XICS-on-XIVE device
as they have a lot in common. That could possibly change in the future
if the need arises.
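
For illustration, a minimal sketch of how QEMU could instantiate the
device; only KVM_DEV_TYPE_XIVE comes from this patch, the surrounding
error handling is assumed.

/* Hypothetical helper, not part of this patch. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int xive_native_create_device(int vm_fd)
{
	struct kvm_create_device cd = {
		.type = KVM_DEV_TYPE_XIVE,
	};

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0) {
		perror("KVM_CREATE_DEVICE");
		return -1;
	}

	/* cd.fd is the device fd used for the KVM_DEV_XIVE_GRP_* controls */
	return cd.fd;
}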

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

Changes since v3:

 - removed a couple of useless includes
 
 Changes since v2:

 - removed ->q_order setting. Only useful in the XICS-on-XIVE KVM
   device which allocates the EQs on behalf of the guest.
 - returned -ENXIO when VP base is invalid

 arch/powerpc/include/asm/kvm_host.h|   1 +
 arch/powerpc/include/asm/kvm_ppc.h |   8 +
 arch/powerpc/include/uapi/asm/kvm.h|   3 +
 include/uapi/linux/kvm.h   |   2 +
 arch/powerpc/kvm/book3s.c  |   7 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 179 +
 Documentation/virtual/kvm/devices/xive.txt |  19 +++
 arch/powerpc/kvm/Makefile  |   2 +-
 8 files changed, 219 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
 create mode 100644 Documentation/virtual/kvm/devices/xive.txt

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e6b5bb012ccb..008523224e7a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -222,6 +222,7 @@ extern struct kvm_device_ops kvm_xics_ops;
 struct kvmppc_xive;
 struct kvmppc_xive_vcpu;
 extern struct kvm_device_ops kvm_xive_ops;
+extern struct kvm_device_ops kvm_xive_native_ops;
 
 struct kvmppc_passthru_irqmap;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index ac22b28ae78d..f3383e76017a 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -594,6 +594,10 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
+
+extern void kvmppc_xive_native_init_module(void);
+extern void kvmppc_xive_native_exit_module(void);
+
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
   u32 priority) { return -1; }
@@ -617,6 +621,10 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
  int level, bool line_status) { return -ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
+
+static inline void kvmppc_xive_native_init_module(void) { }
+static inline void kvmppc_xive_native_exit_module(void) { }
+
 #endif /* CONFIG_KVM_XIVE */
 
 #if defined(CONFIG_PPC_POWERNV) && defined(CONFIG_KVM_BOOK3S_64_HANDLER)
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 26ca425f4c2c..be0ce1f17625 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -677,4 +677,7 @@ struct kvm_ppc_cpu_char {
 #define  KVM_XICS_PRESENTED	(1ULL << 43)
 #define  KVM_XICS_QUEUED   (1ULL << 44)
 
+/* POWER9 XIVE Native Interrupt Controller */
+#define KVM_DEV_XIVE_GRP_CTRL  1
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..e6368163d3a0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1211,6 +1211,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_V3   KVM_DEV_TYPE_ARM_VGIC_V3
KVM_DEV_TYPE_ARM_VGIC_ITS,
 #define KVM_DEV_TYPE_ARM_VGIC_ITS  KVM_DEV_TYPE_ARM_VGIC_ITS
+   KVM_DEV_TYPE_XIVE,
+#define KVM_DEV_TYPE_XIVE  KVM_DEV_TYPE_XIVE
KVM_DEV_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 10c5579d20ce..7c3348fa27e1 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1050,6 +1050,9 @@ static int kvmppc_book3s_init(void)
if (xics_on_xive()) {
kvmppc_xive_init_module();
	kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
+   kvmppc_xive_native_init_module();
+   kvm_register_device_ops(&kvm_xive_native_ops,
+   KVM_DEV_TYPE_XIVE);
} else
 #endif
	kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
@@ -1060,8 +1063,10 @@ static int kvmppc_book3s_init(void)
 static void kvmppc_book3s_exit(void)
 {
 #ifdef CONFIG_KVM_XICS
-   if (xics_on_xive())
+   if (xics_on_xive()) {
kvmppc_xive_exit_module();
+   

[PATCH v4 01/17] powerpc/xive: add OPAL extensions for the XIVE native exploitation support

2019-03-20 Thread Cédric Le Goater
The support for XIVE native exploitation mode in Linux/KVM needs a
couple more OPAL calls to get and set the state of the XIVE internal
structures being used by a sPAPR guest.
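
For illustration, a small kernel-side sketch of how a caller could
capture the run-time state of one EQ with the new helpers; the
prototypes are the ones added to asm/xive.h below, the caller context
is assumed.

/* Hypothetical caller, not part of this patch. */
#include <linux/printk.h>
#include <linux/types.h>
#include <asm/xive.h>

static int capture_eq_state(u32 vp_id, u8 prio, u32 *qtoggle, u32 *qindex)
{
	int rc;

	/* Read back the EQ toggle bit and index as updated by the XIVE IC */
	rc = xive_native_get_queue_state(vp_id, prio, qtoggle, qindex);
	if (rc)
		pr_err("Failed to get EQ state for VP %x priority %d\n",
		       vp_id, prio);
	return rc;
}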

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v3:
 
 - rebased on 5.1-rc1 
 
 Changes since v2:
 
 - remove extra OPAL call definitions
 
 arch/powerpc/include/asm/opal-api.h|  7 +-
 arch/powerpc/include/asm/opal.h|  7 ++
 arch/powerpc/include/asm/xive.h| 14 +++
 arch/powerpc/platforms/powernv/opal-call.c |  3 +
 arch/powerpc/sysdev/xive/native.c  | 99 ++
 5 files changed, 127 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 870fb7b239ea..e1d118ac61dc 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -186,8 +186,8 @@
 #define OPAL_XIVE_FREE_IRQ 140
 #define OPAL_XIVE_SYNC 141
 #define OPAL_XIVE_DUMP 142
-#define OPAL_XIVE_RESERVED3143
-#define OPAL_XIVE_RESERVED4144
+#define OPAL_XIVE_GET_QUEUE_STATE  143
+#define OPAL_XIVE_SET_QUEUE_STATE  144
 #define OPAL_SIGNAL_SYSTEM_RESET   145
 #define OPAL_NPU_INIT_CONTEXT  146
 #define OPAL_NPU_DESTROY_CONTEXT   147
@@ -210,7 +210,8 @@
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   165
 #define OPAL_NX_COPROC_INIT		167
-#define OPAL_LAST  167
+#define OPAL_XIVE_GET_VP_STATE 170
+#define OPAL_LAST  170
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a55b01c90bb1..4e978d4dea5c 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -279,6 +279,13 @@ int64_t opal_xive_allocate_irq(uint32_t chip_id);
 int64_t opal_xive_free_irq(uint32_t girq);
 int64_t opal_xive_sync(uint32_t type, uint32_t id);
 int64_t opal_xive_dump(uint32_t type, uint32_t id);
+int64_t opal_xive_get_queue_state(uint64_t vp, uint32_t prio,
+ __be32 *out_qtoggle,
+ __be32 *out_qindex);
+int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t prio,
+ uint32_t qtoggle,
+ uint32_t qindex);
+int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01);
 int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target,
uint64_t desc, uint16_t pe_number);
 
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 3c704f5dd3ae..b579a943407b 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -109,12 +109,26 @@ extern int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
 extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio);
 
 extern void xive_native_sync_source(u32 hw_irq);
+extern void xive_native_sync_queue(u32 hw_irq);
 extern bool is_xive_irq(struct irq_chip *chip);
 extern int xive_native_enable_vp(u32 vp_id, bool single_escalation);
 extern int xive_native_disable_vp(u32 vp_id);
 extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id);
 extern bool xive_native_has_single_escalation(void);
 
+extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio,
+ u64 *out_qpage,
+ u64 *out_qsize,
+ u64 *out_qeoi_page,
+ u32 *out_escalate_irq,
+ u64 *out_qflags);
+
+extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle,
+  u32 *qindex);
+extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle,
+  u32 qindex);
+extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
+
 #else
 
 static inline bool xive_enabled(void) { return false; }
diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c
index daad8c45c8e7..7472244e7f30 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -260,6 +260,9 @@ OPAL_CALL(opal_xive_get_vp_info,	OPAL_XIVE_GET_VP_INFO);
 OPAL_CALL(opal_xive_set_vp_info,   OPAL_XIVE_SET_VP_INFO);
 OPAL_CALL(opal_xive_sync,  OPAL_XIVE_SYNC);
 OPAL_CALL(opal_xive_dump,  OPAL_XIVE_DUMP);
+OPAL_CALL(opal_xive_get_queue_state,   

[PATCH v4 00/17] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-03-20 Thread Cédric Le Goater
Hello,

On the POWER9 processor, the XIVE interrupt controller can control
interrupt sources using MMIOs to trigger events, to EOI or to turn off
the sources. Priority management and interrupt acknowledgment are also
controlled by MMIO in the CPU presenter sub-engine.

PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
special support from the hypervisor to do the same. This is called the
XIVE native exploitation mode and today, it can be activated under the
PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
and still offers the old interrupt mode interface using a KVM device
implementing the XICS hcalls over XIVE.

The following series is a proposal to add the same support under KVM.

A new KVM device is introduced for the XIVE native exploitation
mode. It reuses most of the XICS-over-XIVE glue implementation
structures which are internal to KVM but has a completely different
interface. A set of KVM device ioctls provides support for the
hypervisor calls, all handled in QEMU, to configure the sources and
the event queues. From there, all interrupt control is transferred to
the guest which can use MMIOs.

These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
similarly to VFIO, and the associated VMAs are populated dynamically
with the appropriate pages using a fault handler. These are now
implemented using mmap()s of the KVM device fd.
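
As an illustration only, the QEMU side of such a mapping could look
like the sketch below; the length and page offset conventions within
the device fd are assumptions, only the "mmap() of the KVM device fd"
mechanism is taken from this cover letter.

/* Hypothetical helper, not part of this series. */
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>

static void *map_xive_esb(int xive_fd, size_t len, off_t offset)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		       xive_fd, offset);

	return p == MAP_FAILED ? NULL : p;
}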

Migration has its own specific needs regarding memory. The patchset
provides a specific control to quiesce XIVE before capturing the
memory. The save and restore of the internal state is based on the
same ioctls used for the hcalls.

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with an
interrupt controller using the XICS legacy model, as found on POWER8,
or in XIVE exploitation mode. Which means that the KVM interrupt
device should be created at run-time, after the machine has started.
This requires extra support from KVM to destroy KVM devices. It is
introduced at the end of the patchset as it still requires some
attention and a XIVE-only VM would not need it.

This is based on 5.1-rc1 and should be a candidate for 5.2 now. The
OPAL patches have not yet been merged.


GitHub trees available here :
 
QEMU sPAPR:

  https://github.com/legoater/qemu/commits/xive-next
  
Linux/KVM:

  https://github.com/legoater/linux/commits/xive-5.1

OPAL:

  https://github.com/legoater/skiboot/commits/xive

Thanks,

C.

Caveats :

 - We should introduce a set of definitions common to XIVE and XICS
 - The XICS-over-XIVE device file book3s_xive.c could be renamed to
   book3s_xics_on_xive.c or book3s_xics_p9.c
 - The XICS-over-XIVE device has locking issues in the setup. 

Changes since v3:

 - removed a couple of useless includes
 - fix the test on the initial setting of the EQ toggle bit : 0 -> 1
 - renamed qsize to qshift
 - renamed qpage to qaddr
 - checked host page size
 - limited flags to KVM_XIVE_EQ_ALWAYS_NOTIFY to fit sPAPR specs
 - Fixed xive_timaval description in documentation

Changes since v2:

 - removed extra OPAL call definitions
 - removed ->q_order setting. Only useful in the XICS-on-XIVE KVM
   device which allocates the EQs on behalf of the guest.
 - returned -ENXIO when VP base is invalid
 - made use of the xive_vp() macro to compute VP identifiers
 - reworked locking in kvmppc_xive_native_connect_vcpu() to fix races 
 - stop advertising KVM_CAP_PPC_IRQ_XIVE as support is not fully
   available yet
 - fixed comment on XIVE IRQ number space
 - removed usage of the __x_* macros
 - fixed locking on source block
 - fixed comments on the KVM device attribute definitions
 - handled MASKED EAS configuration
 - fixed check on supported EQ size to restrict to 64K pages
 - checked kvm_eq.flags that need to be zero
 - removed the OPAL call when EQ qtoggle bit and index are zero. 
 - reduced the size of kvmppc_one_reg timaval attribute to two u64s
 - stopped returning of the OS CAM line value
 
Changes since v1:

 - Better documentation (was missing)
 - Nested support. XIVE not advertised on non PowerNV platforms. This
   is a good way to test the fallback on QEMU emulated devices.
 - ESB and TIMA special mapping done using the KVM device fd
 - All hcalls moved to QEMU. Dropped the patch moving the hcall flags.
 - Reworked of the KVM device ioctl controls to support hcalls and
   migration needs to capture/save states
 - Merged the control syncing XIVE and marking the EQ page dirty
 - Fixed passthrough support using the KVM device file address_space
   to clear the ESB pages from the mapping
 - Misc enhancements and fixes 

Cédric Le Goater (17):
  powerpc/xive: add OPAL extensions for the XIVE native exploitation
support
  KVM: PPC: Book3S HV: add a new KVM device for the XIVE native
exploitation mode
  KVM: PPC: Book3S HV: XIVE: introduce a new capability
KVM_CAP_PPC_IRQ_XIVE
  KVM: PPC: Book3S HV: XIVE: add a control to initialize a source
  KVM: 

Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-20 Thread Aneesh Kumar K.V
Dan Williams  writes:

>
>> Now what will be page size used for mapping vmemmap?
>
> That's up to the architecture's vmemmap_populate() implementation.
>
>> Architectures
>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
>> device-dax with struct page in the device will have pfn reserve area aligned
>> to PAGE_SIZE with the above example? We can't map that using
>> PMD_SIZE page size?
>
> IIUC, that's a different alignment. Currently that's handled by
> padding the reservation area up to a section (128MB on x86) boundary,
> but I'm working on patches to allow sub-section sized ranges to be
> mapped.

I am missing something w.r.t. the code. The below code aligns that using nd_pfn->align

if (nd_pfn->mode == PFN_MODE_PMEM) {
unsigned long memmap_size;

/*
 * vmemmap_populate_hugepages() allocates the memmap array in
 * HPAGE_SIZE chunks.
 */
memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
nd_pfn->align) - start;
  }

IIUC that is finding the offset where to put the vmemmap start. And that
has to be aligned to the page size with which we may end up mapping the
vmemmap area, right?
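
For concreteness, a quick sketch of that computation with made-up
numbers (everything below is illustrative, not taken from a real
namespace):

/* Illustrative only: 16G of pmem, 64K kernel pages, 16M HPAGE/align */
#include <stdio.h>
#include <stdint.h>

#define ALIGN(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))

int main(void)
{
	uint64_t start = 0;
	uint64_t npfns = (16ULL << 30) / (64 * 1024);	/* 262144 pfns */
	uint64_t hpage = 16 << 20;			/* HPAGE_SIZE stand-in */
	uint64_t align = 16 << 20;			/* nd_pfn->align */

	/* 64 bytes of struct page per pfn, rounded up to HPAGE_SIZE */
	uint64_t memmap_size = ALIGN(64 * npfns, hpage);
	uint64_t offset = ALIGN(start + 8192 + memmap_size, align) - start;

	printf("memmap_size=0x%llx offset=0x%llx\n",
	       (unsigned long long)memmap_size,
	       (unsigned long long)offset);
	return 0;
}

With these numbers memmap_size is exactly 16M and offset lands at 32M,
i.e. on an nd_pfn->align boundary.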

Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
is to compute how many pfns we should map for this pfn dev, right?

-aneesh



Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-20 Thread Aneesh Kumar K.V
Aneesh Kumar K.V  writes:

> Dan Williams  writes:
>
>>
>>> Now what will be page size used for mapping vmemmap?
>>
>> That's up to the architecture's vmemmap_populate() implementation.
>>
>>> Architectures
>>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
>>> device-dax with struct page in the device will have pfn reserve area aligned
>>> to PAGE_SIZE with the above example? We can't map that using
>>> PMD_SIZE page size?
>>
>> IIUC, that's a different alignment. Currently that's handled by
>> padding the reservation area up to a section (128MB on x86) boundary,
>> but I'm working on patches to allow sub-section sized ranges to be
>> mapped.
>
> I am missing something w.r.t code. The below code align that using 
> nd_pfn->align
>
>   if (nd_pfn->mode == PFN_MODE_PMEM) {
>   unsigned long memmap_size;
>
>   /*
>* vmemmap_populate_hugepages() allocates the memmap array in
>* HPAGE_SIZE chunks.
>*/
>   memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
>   offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
>   nd_pfn->align) - start;
>   }
>
> IIUC that is finding the offset where to put vmemmap start. And that has
> to be aligned to the page size with which we may end up mapping vmemmap
> area right?
>
> Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
> is to compute howmany pfns we should map for this pfn dev right?
>   

Also I guess the 4K assumption there is wrong?

modified   drivers/nvdimm/pfn_devs.c
@@ -783,7 +783,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
return -ENXIO;
}
 
-   npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
+   npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;
pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
pfn_sb->dataoff = cpu_to_le64(offset);
pfn_sb->npfns = cpu_to_le64(npfns);


-aneesh



[PATCH 3/3] powerpc/mm: print hash info in a helper

2019-03-20 Thread Christophe Leroy
Reduce #ifdef mess by defining a helper to print
hash info at startup.

At the same time, remove the display of the hash table address
to avoid leaking unnecessary information.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/setup-common.c | 19 +--
 arch/powerpc/mm/hash_utils_64.c|  8 
 arch/powerpc/mm/mmu_decl.h |  5 -
 arch/powerpc/mm/ppc_mmu_32.c   |  9 -
 4 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2e5dfb6e0823..f24a74f7912d 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -799,12 +799,6 @@ void arch_setup_pdev_archdata(struct platform_device *pdev)
 static __init void print_system_info(void)
 {
pr_info("-\n");
-#ifdef CONFIG_PPC_BOOK3S_64
-   pr_info("ppc64_pft_size= 0x%llx\n", ppc64_pft_size);
-#endif
-#ifdef CONFIG_PPC_BOOK3S_32
-   pr_info("Hash_size = 0x%lx\n", Hash_size);
-#endif
pr_info("phys_mem_size = 0x%llx\n",
(unsigned long long)memblock_phys_mem_size());
 
@@ -826,18 +820,7 @@ static __init void print_system_info(void)
pr_info("firmware_features = 0x%016lx\n", powerpc_firmware_features);
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   if (htab_address)
-   pr_info("htab_address  = 0x%p\n", htab_address);
-   if (htab_hash_mask)
-   pr_info("htab_hash_mask= 0x%lx\n", htab_hash_mask);
-#endif
-#ifdef CONFIG_PPC_BOOK3S_32
-   if (Hash)
-   pr_info("Hash  = 0x%p\n", Hash);
-   if (Hash_mask)
-   pr_info("Hash_mask = 0x%lx\n", Hash_mask);
-#endif
+   print_system_hash_info();
 
if (PHYSICAL_START > 0)
pr_info("physical_start= 0x%llx\n",
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0a4f939a8161..017380b890bb 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1909,3 +1909,11 @@ static int __init hash64_debugfs(void)
 }
 machine_device_initcall(pseries, hash64_debugfs);
 #endif /* CONFIG_DEBUG_FS */
+
+void __init print_system_hash_info(void)
+{
+   pr_info("ppc64_pft_size= 0x%llx\n", ppc64_pft_size);
+
+   if (htab_hash_mask)
+   pr_info("htab_hash_mask= 0x%lx\n", htab_hash_mask);
+}
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index f7f1374ba3ee..dc617ade83ab 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -83,6 +83,8 @@ static inline void _tlbivax_bcast(unsigned long address, unsigned int pid,
 }
 #endif
 
+static inline void print_system_hash_info(void) {}
+
 #else /* CONFIG_PPC_MMU_NOHASH */
 
 extern void hash_preload(struct mm_struct *mm, unsigned long ea,
@@ -92,6 +94,8 @@ extern void hash_preload(struct mm_struct *mm, unsigned long ea,
 extern void _tlbie(unsigned long address);
 extern void _tlbia(void);
 
+void print_system_hash_info(void);
+
 #endif /* CONFIG_PPC_MMU_NOHASH */
 
 #ifdef CONFIG_PPC32
@@ -105,7 +109,6 @@ extern unsigned int rtas_data, rtas_size;
 
 struct hash_pte;
 extern struct hash_pte *Hash;
-extern unsigned long Hash_size, Hash_mask;
 
 #endif /* CONFIG_PPC32 */
 
diff --git a/arch/powerpc/mm/ppc_mmu_32.c b/arch/powerpc/mm/ppc_mmu_32.c
index 088f14d57cce..864096489b6d 100644
--- a/arch/powerpc/mm/ppc_mmu_32.c
+++ b/arch/powerpc/mm/ppc_mmu_32.c
@@ -37,7 +37,7 @@
 #include "mmu_decl.h"
 
 struct hash_pte *Hash;
-unsigned long Hash_size, Hash_mask;
+static unsigned long Hash_size, Hash_mask;
 unsigned long _SDR1;
 
 struct ppc_bat BATS[8][2]; /* 8 pairs of IBAT, DBAT */
@@ -392,3 +392,10 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
else /* Anything else has 256M mapped */
		memblock_set_current_limit(min_t(u64, first_memblock_size, 0x10000000));
 }
+
+void __init print_system_hash_info(void)
+{
+   pr_info("Hash_size = 0x%lx\n", Hash_size);
+   if (Hash_mask)
+   pr_info("Hash_mask = 0x%lx\n", Hash_mask);
+}
-- 
2.13.3


