Re: [PATCH] powerpc/highmem: change BUG_ON() to WARN_ON()
Christophe Leroy writes: > In arch/powerpc/mm/highmem.c, BUG_ON() is called only when > CONFIG_DEBUG_HIGHMEM is selected, this means the BUG_ON() is > not vital and can be replaced by a WARN_ON > > At the same time, use IS_ENABLED() instead of #ifdef to clean a bit. > > Signed-off-by: Christophe Leroy > --- > arch/powerpc/mm/highmem.c | 12 > 1 file changed, 4 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/mm/highmem.c b/arch/powerpc/mm/highmem.c > index 82a0e37557a5..b68c9f20fbdf 100644 > --- a/arch/powerpc/mm/highmem.c > +++ b/arch/powerpc/mm/highmem.c > @@ -56,7 +54,7 @@ EXPORT_SYMBOL(kmap_atomic_prot); > void __kunmap_atomic(void *kvaddr) > { > unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; > - int type __maybe_unused; > + int type; Why don't we move type into the block below? eg: > @@ -66,12 +64,11 @@ void __kunmap_atomic(void *kvaddr) > - type = kmap_atomic_idx(); > > -#ifdef CONFIG_DEBUG_HIGHMEM > - { > + if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM)) { int type = kmap_atomic_idx(); > unsigned int idx; > > idx = type + KM_TYPE_NR * smp_processor_id(); > - BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); > + WARN_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); cheers
Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section
Catalin Marinas writes: > On Thu, Mar 21, 2019 at 12:15:46AM +1100, Michael Ellerman wrote: >> Catalin Marinas writes: >> > On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote: >> >> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void) >> >> >> >> /* data/bss scanning */ >> >> scan_large_block(_sdata, _edata); >> >> - scan_large_block(__bss_start, __bss_stop); >> >> + >> >> + if (bss_hole_start) { >> >> + scan_large_block(__bss_start, bss_hole_start); >> >> + scan_large_block(bss_hole_stop, __bss_stop); >> >> + } else { >> >> + scan_large_block(__bss_start, __bss_stop); >> >> + } >> >> + >> >> scan_large_block(__start_ro_after_init, __end_ro_after_init); >> > >> > I'm not a fan of this approach but I couldn't come up with anything >> > better. I was hoping we could check for PageReserved() in scan_block() >> > but on arm64 it ends up not scanning the .bss at all. >> > >> > Until another user appears, I'm ok with this patch. >> > >> > Acked-by: Catalin Marinas >> >> I actually would like to rework this kvm_tmp thing to not be in bss at >> all. It's a bit of a hack and is incompatible with strict RWX. >> >> If we size it a bit more conservatively we can hopefully just reserve >> some space in the text section for it. >> >> I'm not going to have time to work on that immediately though, so if >> people want this fixed now then this patch could go in as a temporary >> solution. > > I think I have a simpler idea. Kmemleak allows punching holes in > allocated objects, so just turn the data/bss sections into dedicated > kmemleak objects. This happens when kmemleak is initialised, before the > initcalls are invoked. The kvm_free_tmp() would just free the > corresponding part of the bss. > > Patch below, only tested briefly on arm64. Qian, could you give it a try > on powerpc? Thanks. 
> > 8<-- > diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c > index 683b5b3805bd..c4b8cb3c298d 100644 > --- a/arch/powerpc/kernel/kvm.c > +++ b/arch/powerpc/kernel/kvm.c > @@ -712,6 +712,8 @@ static void kvm_use_magic_page(void) > > static __init void kvm_free_tmp(void) > { > + kmemleak_free_part(&kvm_tmp[kvm_tmp_index], > +ARRAY_SIZE(kvm_tmp) - kvm_tmp_index); > free_reserved_area(&kvm_tmp[kvm_tmp_index], > &kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL); > } Fine by me as long as it works (sounds like it does). Acked-by: Michael Ellerman (powerpc) cheers
Re: [RFC PATCH 1/1] KVM: PPC: Report single stepping capability
On 21/03/2019 05:39, Fabiano Rosas wrote: > When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request > the next instruction to be single stepped via the > KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure. > > We currently don't have support for guest single stepping implemented > in Book3S HV. > > This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order > to inform userspace about the state of single stepping support. > > Signed-off-by: Fabiano Rosas > --- > arch/powerpc/kvm/powerpc.c | 5 + > include/uapi/linux/kvm.h | 1 + > 2 files changed, 6 insertions(+) > > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > index 8885377ec3e0..5ba990b0ec74 100644 > --- a/arch/powerpc/kvm/powerpc.c > +++ b/arch/powerpc/kvm/powerpc.c > @@ -538,6 +538,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long > ext) > case KVM_CAP_IMMEDIATE_EXIT: > r = 1; > break; > + case KVM_CAP_PPC_GUEST_DEBUG_SSTEP: > +#ifdef CONFIG_BOOKE In the cover letter (which is not really required for a single patch) you say the capability will be present for BookE and PR KVM (which is Book3s), but here it is BookE only — is that intentional? Also, you need to update Documentation/virtual/kvm/api.txt for the new capability. After reading which I started wondering: couldn't we just use the existing KVM_CAP_GUEST_DEBUG_HW_BPS? > + r = 1; > + break; > +#endif > case KVM_CAP_PPC_PAIRED_SINGLES: > case KVM_CAP_PPC_OSI: > case KVM_CAP_PPC_GET_PVINFO: > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 6d4ea4b6c922..33e8a4db867e 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt { > #define KVM_CAP_ARM_VM_IPA_SIZE 165 > #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166 > #define KVM_CAP_HYPERV_CPUID 167 > +#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 168 > > #ifdef KVM_CAP_IRQ_ROUTING > > -- Alexey
Re: [PATCH 1/2] ibmvscsi: Protect ibmvscsi_head from concurrent modification
Tyrel, > For each ibmvscsi host created during a probe or destroyed during a > remove we either add or remove that host to/from the global > ibmvscsi_head list. This runs the risk of concurrent modification. > > This patch adds a simple spinlock around the list modification calls > to prevent concurrent updates as is done similarly in the ibmvfc > driver and ipr driver. Applied to 5.1/scsi-fixes. -- Martin K. Petersen Oracle Linux Engineering
[PATCH] powerpc/security: Fix spectre_v2 reporting
When I updated the spectre_v2 reporting to handle software count cache flush I got the logic wrong when there's no software count cache enabled at all. The result is that on systems with the software count cache flush disabled we print: Mitigation: Indirect branch cache disabled, Software count cache flush Which correctly indicates that the count cache is disabled, but incorrectly says the software count cache flush is enabled. The root of the problem is that we are trying to handle all combinations of options. But we know now that we only expect to see the software count cache flush enabled if the other options are false. So split the two cases, which simplifies the logic and fixes the bug. We were also missing a space before "(hardware accelerated)". The result is we see one of: Mitigation: Indirect branch serialisation (kernel only) Mitigation: Indirect branch cache disabled Mitigation: Software count cache flush Mitigation: Software count cache flush (hardware accelerated) Fixes: ee13cb249fab ("powerpc/64s: Add support for software count cache flush") Cc: sta...@vger.kernel.org # v4.19+ Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/security.c | 23 --- 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c index 9b8631533e02..b33bafb8fcea 100644 --- a/arch/powerpc/kernel/security.c +++ b/arch/powerpc/kernel/security.c @@ -190,29 +190,22 @@ ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, c bcs = security_ftr_enabled(SEC_FTR_BCCTRL_SERIALISED); ccd = security_ftr_enabled(SEC_FTR_COUNT_CACHE_DISABLED); - if (bcs || ccd || count_cache_flush_type != COUNT_CACHE_FLUSH_NONE) { - bool comma = false; + if (bcs || ccd) { seq_buf_printf(&s, "Mitigation: "); - if (bcs) { + if (bcs) seq_buf_printf(&s, "Indirect branch serialisation (kernel only)"); - comma = true; - } - if (ccd) { - if (comma) - seq_buf_printf(&s, ", "); - seq_buf_printf(&s, "Indirect branch cache disabled"); - comma = true; - } - - if (comma) + if (bcs && ccd) seq_buf_printf(&s, ", "); - seq_buf_printf(&s, "Software count cache flush"); + if (ccd) + seq_buf_printf(&s, "Indirect branch cache disabled"); + } else if (count_cache_flush_type != COUNT_CACHE_FLUSH_NONE) { + seq_buf_printf(&s, "Mitigation: Software count cache flush"); if (count_cache_flush_type == COUNT_CACHE_FLUSH_HW) - seq_buf_printf(&s, "(hardware accelerated)"); + seq_buf_printf(&s, " (hardware accelerated)"); } else if (btb_flush_enabled) { seq_buf_printf(&s, "Mitigation: Branch predictor state flush"); } else { -- 2.20.1
Re: [PATCH 1/4] add generic builtin command line
On Wed, 20 Mar 2019 16:23:28 -0700 Daniel Walker wrote: > On Wed, Mar 20, 2019 at 03:53:19PM -0700, Andrew Morton wrote: > > On Tue, 19 Mar 2019 16:24:45 -0700 Daniel Walker wrote: > > > > > This code allows architectures to use a generic builtin command line. > > > > I wasn't cc'ed on [2/4]. No mailing lists were cc'ed on [0/4] but it > > didn't say anything useful anyway ;) > > > > I'll queue them up for testing and shall await feedback from the > > powerpc developers. > > > > You weren't CC'd, but it was To: you, > > 35 From: Daniel Walker > 36 To: Andrew Morton , > 37 Christophe Leroy , > 38 Michael Ellerman , > 39 Rob Herring , xe-linux-exter...@cisco.com, > 40 linuxppc-dev@lists.ozlabs.org, Frank Rowand > > 41 Cc: devicet...@vger.kernel.org, linux-ker...@vger.kernel.org > 42 Subject: [PATCH 2/4] drivers: of: generic command line support hm. > Thanks for picking it up. The patches (or some version of them) are already in linux-next, which messes me up. I'll disable them for now.
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Wed, Mar 20, 2019 at 8:09 PM Oliver wrote: > > On Thu, Mar 21, 2019 at 7:57 AM Dan Williams wrote: > > > > On Wed, Mar 20, 2019 at 8:34 AM Dan Williams > > wrote: > > > > > > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V > > > wrote: > > > > > > > > Aneesh Kumar K.V writes: > > > > > > > > > Dan Williams writes: > > > > > > > > > >> > > > > >>> Now what will be page size used for mapping vmemmap? > > > > >> > > > > >> That's up to the architecture's vmemmap_populate() implementation. > > > > >> > > > > >>> Architectures > > > > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a > > > > >>> device-dax with struct page in the device will have pfn reserve > > > > >>> area aligned > > > > >>> to PAGE_SIZE with the above example? We can't map that using > > > > >>> PMD_SIZE page size? > > > > >> > > > > >> IIUC, that's a different alignment. Currently that's handled by > > > > >> padding the reservation area up to a section (128MB on x86) boundary, > > > > >> but I'm working on patches to allow sub-section sized ranges to be > > > > >> mapped. > > > > > > > > > > I am missing something w.r.t code. The below code align that using > > > > > nd_pfn->align > > > > > > > > > > if (nd_pfn->mode == PFN_MODE_PMEM) { > > > > > unsigned long memmap_size; > > > > > > > > > > /* > > > > >* vmemmap_populate_hugepages() allocates the memmap > > > > > array in > > > > >* HPAGE_SIZE chunks. > > > > >*/ > > > > > memmap_size = ALIGN(64 * npfns, HPAGE_SIZE); > > > > > offset = ALIGN(start + SZ_8K + memmap_size + > > > > > dax_label_reserve, > > > > > nd_pfn->align) - start; > > > > > } > > > > > > > > > > IIUC that is finding the offset where to put vmemmap start. And that > > > > > has > > > > > to be aligned to the page size with which we may end up mapping > > > > > vmemmap > > > > > area right? > > > > > > Right, that's the physical offset of where the vmemmap ends, and the > > > memory to be mapped begins. 
> > > > > > > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that > > > > > is to compute howmany pfns we should map for this pfn dev right? > > > > > > > > > > > > > Also i guess those 4K assumptions there is wrong? > > > > > > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata > > > needs to be revved and the PAGE_SIZE needs to be recorded in the > > > info-block. > > > > How often does a system change page-size. Is it fixed or do > > environment change it from one boot to the next? I'm thinking through > > the behavior of what do when the recorded PAGE_SIZE in the info-block > > does not match the current system page size. The simplest option is to > > just fail the device and require it to be reconfigured. Is that > > acceptable? > > The kernel page size is set at build time and as far as I know every > distro configures their ppc64(le) kernel for 64K. I've used 4K kernels > a few times in the past to debug PAGE_SIZE dependent problems, but I'd > be surprised if anyone is using 4K in production. Ah, ok. > Anyway, my view is that using 4K here isn't really a problem since > it's just the accounting unit of the pfn superblock format. The kernel > reading form it should understand that and scale it to whatever > accounting unit it wants to use internally. Currently we don't so that > should probably be fixed, but that doesn't seem to cause any real > issues. As far as I can tell the only user of npfns in > __nvdimm_setup_pfn() whih prints the "number of pfns truncated" > message. > > Am I missing something? No, I don't think so. The only time it would break is if a system with 64K page size laid down an info-block with not enough reserved capacity when the page-size is 4K (npfns too small). However, that sounds like an exceptional case which is why no problems have been reported to date.
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Thu, Mar 21, 2019 at 7:57 AM Dan Williams wrote: > > On Wed, Mar 20, 2019 at 8:34 AM Dan Williams wrote: > > > > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V > > wrote: > > > > > > Aneesh Kumar K.V writes: > > > > > > > Dan Williams writes: > > > > > > > >> > > > >>> Now what will be page size used for mapping vmemmap? > > > >> > > > >> That's up to the architecture's vmemmap_populate() implementation. > > > >> > > > >>> Architectures > > > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a > > > >>> device-dax with struct page in the device will have pfn reserve area > > > >>> aligned > > > >>> to PAGE_SIZE with the above example? We can't map that using > > > >>> PMD_SIZE page size? > > > >> > > > >> IIUC, that's a different alignment. Currently that's handled by > > > >> padding the reservation area up to a section (128MB on x86) boundary, > > > >> but I'm working on patches to allow sub-section sized ranges to be > > > >> mapped. > > > > > > > > I am missing something w.r.t code. The below code align that using > > > > nd_pfn->align > > > > > > > > if (nd_pfn->mode == PFN_MODE_PMEM) { > > > > unsigned long memmap_size; > > > > > > > > /* > > > >* vmemmap_populate_hugepages() allocates the memmap > > > > array in > > > >* HPAGE_SIZE chunks. > > > >*/ > > > > memmap_size = ALIGN(64 * npfns, HPAGE_SIZE); > > > > offset = ALIGN(start + SZ_8K + memmap_size + > > > > dax_label_reserve, > > > > nd_pfn->align) - start; > > > > } > > > > > > > > IIUC that is finding the offset where to put vmemmap start. And that has > > > > to be aligned to the page size with which we may end up mapping vmemmap > > > > area right? > > > > Right, that's the physical offset of where the vmemmap ends, and the > > memory to be mapped begins. > > > > > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that > > > > is to compute howmany pfns we should map for this pfn dev right? > > > > > > > > > > Also i guess those 4K assumptions there is wrong? 
> > > > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata > > needs to be revved and the PAGE_SIZE needs to be recorded in the > > info-block. > > How often does a system change page-size. Is it fixed or do > environment change it from one boot to the next? I'm thinking through > the behavior of what do when the recorded PAGE_SIZE in the info-block > does not match the current system page size. The simplest option is to > just fail the device and require it to be reconfigured. Is that > acceptable? The kernel page size is set at build time and as far as I know every distro configures their ppc64(le) kernel for 64K. I've used 4K kernels a few times in the past to debug PAGE_SIZE dependent problems, but I'd be surprised if anyone is using 4K in production. Anyway, my view is that using 4K here isn't really a problem since it's just the accounting unit of the pfn superblock format. The kernel reading from it should understand that and scale it to whatever accounting unit it wants to use internally. Currently we don't, so that should probably be fixed, but that doesn't seem to cause any real issues. As far as I can tell the only user of npfns is __nvdimm_setup_pfn(), which prints the "number of pfns truncated" message. Am I missing something?
Re: [PATCH v2 13/13] syscall_get_arch: add "struct task_struct *" argument
On Sun, Mar 17, 2019 at 7:30 PM Dmitry V. Levin wrote: > > This argument is required to extend the generic ptrace API with > PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going > to be called from ptrace_request() along with syscall_get_nr(), > syscall_get_arguments(), syscall_get_error(), and > syscall_get_return_value() functions with a tracee as their argument. > > The primary intent is that the triple (audit_arch, syscall_nr, arg1..arg6) > should describe what system call is being called and what its arguments > are. > > Reverts: 5e937a9ae913 ("syscall_get_arch: remove useless function arguments") > Reverts: 1002d94d3076 ("syscall.h: fix doc text for syscall_get_arch()") > Reviewed-by: Andy Lutomirski # for x86 > Reviewed-by: Palmer Dabbelt > Acked-by: Paul Moore > Acked-by: Paul Burton # MIPS parts > Acked-by: Michael Ellerman (powerpc) > Acked-by: Kees Cook # seccomp parts > Acked-by: Mark Salter # for the c6x bit > Cc: Elvira Khabirova > Cc: Eugene Syromyatnikov > Cc: Oleg Nesterov > Cc: x...@kernel.org > Cc: linux-al...@vger.kernel.org > Cc: linux-snps-...@lists.infradead.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linux-c6x-...@linux-c6x.org > Cc: uclinux-h8-de...@lists.sourceforge.jp > Cc: linux-hexa...@vger.kernel.org > Cc: linux-i...@vger.kernel.org > Cc: linux-m...@lists.linux-m68k.org > Cc: linux-m...@vger.kernel.org > Cc: nios2-...@lists.rocketboards.org > Cc: openr...@lists.librecores.org > Cc: linux-par...@vger.kernel.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-ri...@lists.infradead.org > Cc: linux-s...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux...@lists.infradead.org > Cc: linux-xte...@linux-xtensa.org > Cc: linux-a...@vger.kernel.org > Cc: linux-au...@redhat.com > Signed-off-by: Dmitry V. 
Levin > --- > > Notes: > v2: unchanged > > arch/alpha/include/asm/syscall.h | 2 +- > arch/arc/include/asm/syscall.h| 2 +- > arch/arm/include/asm/syscall.h| 2 +- > arch/arm64/include/asm/syscall.h | 4 ++-- > arch/c6x/include/asm/syscall.h| 2 +- > arch/csky/include/asm/syscall.h | 2 +- > arch/h8300/include/asm/syscall.h | 2 +- > arch/hexagon/include/asm/syscall.h| 2 +- > arch/ia64/include/asm/syscall.h | 2 +- > arch/m68k/include/asm/syscall.h | 2 +- > arch/microblaze/include/asm/syscall.h | 2 +- > arch/mips/include/asm/syscall.h | 6 +++--- > arch/mips/kernel/ptrace.c | 2 +- > arch/nds32/include/asm/syscall.h | 2 +- > arch/nios2/include/asm/syscall.h | 2 +- > arch/openrisc/include/asm/syscall.h | 2 +- > arch/parisc/include/asm/syscall.h | 4 ++-- > arch/powerpc/include/asm/syscall.h| 10 -- > arch/riscv/include/asm/syscall.h | 2 +- > arch/s390/include/asm/syscall.h | 4 ++-- > arch/sh/include/asm/syscall_32.h | 2 +- > arch/sh/include/asm/syscall_64.h | 2 +- > arch/sparc/include/asm/syscall.h | 5 +++-- > arch/unicore32/include/asm/syscall.h | 2 +- > arch/x86/include/asm/syscall.h| 8 +--- > arch/x86/um/asm/syscall.h | 2 +- > arch/xtensa/include/asm/syscall.h | 2 +- > include/asm-generic/syscall.h | 5 +++-- > kernel/auditsc.c | 4 ++-- > kernel/seccomp.c | 4 ++-- > 30 files changed, 52 insertions(+), 42 deletions(-) Merged into audit/next, thanks everyone. -- paul moore www.paul-moore.com
Re: [PATCH 2/2] ibmvscsi: Fix empty event pool access during host removal
Tyrel, > The event pool used for queueing commands is destroyed fairly early in > the ibmvscsi_remove() code path. Since, this happens prior to the call > so scsi_remove_host() it is possible for further calls to queuecommand > to be processed which manifest as a panic due to a NULL pointer > dereference as seen here: Applied to 5.1/scsi-fixes. Thanks! -- Martin K. Petersen Oracle Linux Engineering
[PATCH 4/4] ibmvfc: Clean up transport events
No change to functionality. Simply make transport event messages a little clearer, and rework CRQ format enums such that we have separate enums for INIT messages and XPORT events. Signed-off-by: Tyrel Datwyler --- drivers/scsi/ibmvscsi/ibmvfc.c | 8 +--- drivers/scsi/ibmvscsi/ibmvfc.h | 7 ++- 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c index 33dda4d32f65..3ad997ac3510 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.c +++ b/drivers/scsi/ibmvscsi/ibmvfc.c @@ -2756,16 +2756,18 @@ static void ibmvfc_handle_crq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost) ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_NONE); if (crq->format == IBMVFC_PARTITION_MIGRATED) { /* We need to re-setup the interpartition connection */ - dev_info(vhost->dev, "Re-enabling adapter\n"); + dev_info(vhost->dev, "Partition migrated, Re-enabling adapter\n"); vhost->client_migrated = 1; ibmvfc_purge_requests(vhost, DID_REQUEUE); ibmvfc_link_down(vhost, IBMVFC_LINK_DOWN); ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_REENABLE); - } else { - dev_err(vhost->dev, "Virtual adapter failed (rc=%d)\n", crq->format); + } else if (crq->format == IBMVFC_PARTNER_FAILED || crq->format == IBMVFC_PARTNER_DEREGISTER) { + dev_err(vhost->dev, "Host partner adapter deregistered or failed (rc=%d)\n", crq->format); ibmvfc_purge_requests(vhost, DID_ERROR); ibmvfc_link_down(vhost, IBMVFC_LINK_DOWN); ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_RESET); + } else { + dev_err(vhost->dev, "Received unknown transport event from partner (rc=%d)\n", crq->format); } return; case IBMVFC_CRQ_CMD_RSP: diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h index b81a53c4a9a8..459cc288ba1d 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.h +++ b/drivers/scsi/ibmvscsi/ibmvfc.h @@ -78,9 +78,14 @@ enum ibmvfc_crq_valid { IBMVFC_CRQ_XPORT_EVENT = 0xFF, }; -enum ibmvfc_crq_format { +enum ibmvfc_crq_init_msg { IBMVFC_CRQ_INIT = 0x01, 
IBMVFC_CRQ_INIT_COMPLETE= 0x02, +}; + +enum ibmvfc_crq_xport_evts { + IBMVFC_PARTNER_FAILED = 0x01, + IBMVFC_PARTNER_DEREGISTER = 0x02, IBMVFC_PARTITION_MIGRATED = 0x06, }; -- 2.12.3
[PATCH 3/4] ibmvfc: Byte swap status and error codes when logging
Status and error codes are returned in big endian from the VIOS. The values are translated into a human readable format when logged, but the raw values are also logged. This patch byte swaps those raw values so that they are consistent between BE and LE platforms. Signed-off-by: Tyrel Datwyler --- drivers/scsi/ibmvscsi/ibmvfc.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c index 18ee2a8ec3d5..33dda4d32f65 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.c +++ b/drivers/scsi/ibmvscsi/ibmvfc.c @@ -1497,7 +1497,7 @@ static void ibmvfc_log_error(struct ibmvfc_event *evt) scmd_printk(KERN_ERR, cmnd, "Command (%02X) : %s (%x:%x) " "flags: %x fcp_rsp: %x, resid=%d, scsi_status: %x\n", - cmnd->cmnd[0], err, vfc_cmd->status, vfc_cmd->error, + cmnd->cmnd[0], err, be16_to_cpu(vfc_cmd->status), be16_to_cpu(vfc_cmd->error), rsp->flags, rsp_code, scsi_get_resid(cmnd), rsp->scsi_status); } @@ -2023,7 +2023,7 @@ static int ibmvfc_reset_device(struct scsi_device *sdev, int type, char *desc) sdev_printk(KERN_ERR, sdev, "%s reset failed: %s (%x:%x) " "flags: %x fcp_rsp: %x, scsi_status: %x\n", desc, ibmvfc_get_cmd_error(be16_to_cpu(rsp_iu.cmd.status), be16_to_cpu(rsp_iu.cmd.error)), - rsp_iu.cmd.status, rsp_iu.cmd.error, fc_rsp->flags, rsp_code, + be16_to_cpu(rsp_iu.cmd.status), be16_to_cpu(rsp_iu.cmd.error), fc_rsp->flags, rsp_code, fc_rsp->scsi_status); rsp_rc = -EIO; } else @@ -2382,7 +2382,7 @@ static int ibmvfc_abort_task_set(struct scsi_device *sdev) sdev_printk(KERN_ERR, sdev, "Abort failed: %s (%x:%x) " "flags: %x fcp_rsp: %x, scsi_status: %x\n", ibmvfc_get_cmd_error(be16_to_cpu(rsp_iu.cmd.status), be16_to_cpu(rsp_iu.cmd.error)), - rsp_iu.cmd.status, rsp_iu.cmd.error, fc_rsp->flags, rsp_code, + be16_to_cpu(rsp_iu.cmd.status), be16_to_cpu(rsp_iu.cmd.error), fc_rsp->flags, rsp_code, fc_rsp->scsi_status); rsp_rc = -EIO; } else @@ -3349,7 +3349,7 @@ static void ibmvfc_tgt_prli_done(struct ibmvfc_event 
*evt) tgt_log(tgt, level, "Process Login failed: %s (%x:%x) rc=0x%02X\n", ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), be16_to_cpu(rsp->error)), - rsp->status, rsp->error, status); + be16_to_cpu(rsp->status), be16_to_cpu(rsp->error), status); break; } @@ -3447,9 +3447,10 @@ static void ibmvfc_tgt_plogi_done(struct ibmvfc_event *evt) ibmvfc_set_tgt_action(tgt, IBMVFC_TGT_ACTION_DEL_RPORT); tgt_log(tgt, level, "Port Login failed: %s (%x:%x) %s (%x) %s (%x) rc=0x%02X\n", - ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), be16_to_cpu(rsp->error)), rsp->status, rsp->error, - ibmvfc_get_fc_type(be16_to_cpu(rsp->fc_type)), rsp->fc_type, - ibmvfc_get_ls_explain(be16_to_cpu(rsp->fc_explain)), rsp->fc_explain, status); + ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), be16_to_cpu(rsp->error)), +be16_to_cpu(rsp->status), be16_to_cpu(rsp->error), + ibmvfc_get_fc_type(be16_to_cpu(rsp->fc_type)), be16_to_cpu(rsp->fc_type), + ibmvfc_get_ls_explain(be16_to_cpu(rsp->fc_explain)), be16_to_cpu(rsp->fc_explain), status); break; } @@ -3620,7 +3621,7 @@ static void ibmvfc_tgt_adisc_done(struct ibmvfc_event *evt) fc_explain = (be32_to_cpu(mad->fc_iu.response[1]) & 0xff00) >> 8; tgt_info(tgt, "ADISC failed: %s (%x:%x) %s (%x) %s (%x) rc=0x%02X\n", ibmvfc_get_cmd_error(be16_to_cpu(mad->iu.status), be16_to_cpu(mad->iu.error)), -mad->iu.status, mad->iu.error, +be16_to_cpu(mad->iu.status), be16_to_cpu(mad->iu.error), ibmvfc_get_fc_type(fc_reason), fc_reason, ibmvfc_get_ls_explain(fc_explain), fc_explain, status); break; @@ -3832,9 +3833,10 @@ static void ibmvfc_tgt_query_target_done(struct ibmvfc_event *evt) tgt_log(tgt, level, "Query Target failed: %s (%x:%x) %s (%x) %s (%x) rc=0x%02X\n", ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), be16_to_cpu(rsp->error)), - rsp->status, rsp->error, ibmvfc_get_fc_type(be16_to_cpu(rsp->fc_type)), - rsp->fc_type,
[PATCH 2/4] ibmvfc: Add failed PRLI to cmd_status lookup array
The VIOS uses the SCSI_ERROR class to report PRLI failures. These errors are indicated with the combination of an IBMVFC_FC_SCSI_ERROR return status and 0x8000 error code. Add these codes to cmd_status[] with an appropriate human readable error message. Signed-off-by: Tyrel Datwyler --- drivers/scsi/ibmvscsi/ibmvfc.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c index c3ce27039552..18ee2a8ec3d5 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.c +++ b/drivers/scsi/ibmvscsi/ibmvfc.c @@ -139,6 +139,7 @@ static const struct { { IBMVFC_FC_FAILURE, IBMVFC_VENDOR_SPECIFIC, DID_ERROR, 1, 1, "vendor specific" }, { IBMVFC_FC_SCSI_ERROR, 0, DID_OK, 1, 0, "SCSI error" }, + { IBMVFC_FC_SCSI_ERROR, IBMVFC_COMMAND_FAILED, DID_ERROR, 0, 1, "PRLI to device failed." }, }; static void ibmvfc_npiv_login(struct ibmvfc_host *); -- 2.12.3
[PATCH 1/4] ibmvfc: Remove "failed" from logged errors
The text of messages logged with ibmvfc_log_error() always contains the term "failed". In the case of cancelled commands during EH they are reported back by the VIOS using error codes. This can be confusing to somebody looking at these log messages as to whether a command was successfully cancelled. In the following real log message, for example, it is unclear whether the transaction was actually cancelled. <6>sd 0:0:1:1: Cancelling outstanding commands. <3>sd 0:0:1:1: [sde] Command (28) failed: transaction cancelled (2:6) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0 Remove the "failed" prefix from all logged error messages. The ibmvfc_log_error() function already translates the returned error/status codes to a human readable message. Signed-off-by: Tyrel Datwyler --- drivers/scsi/ibmvscsi/ibmvfc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c index dbaa4f131433..c3ce27039552 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.c +++ b/drivers/scsi/ibmvscsi/ibmvfc.c @@ -1494,7 +1494,7 @@ static void ibmvfc_log_error(struct ibmvfc_event *evt) if (rsp->flags & FCP_RSP_LEN_VALID) rsp_code = rsp->data.info.rsp_code; - scmd_printk(KERN_ERR, cmnd, "Command (%02X) failed: %s (%x:%x) " + scmd_printk(KERN_ERR, cmnd, "Command (%02X) : %s (%x:%x) " "flags: %x fcp_rsp: %x, resid=%d, scsi_status: %x\n", cmnd->cmnd[0], err, vfc_cmd->status, vfc_cmd->error, rsp->flags, rsp_code, scsi_get_resid(cmnd), rsp->scsi_status); -- 2.12.3
[PATCH] powerpc: vmlinux.lds: Drop Binutils 2.18 workarounds
Segher added some workarounds for GCC 4.2 and binutils 2.18. We now set GCC 4.6 and binutils 2.20 as the minimum, so they can be dropped. This is mostly a revert of c6995fe4 ("powerpc: Fix build bug with binutils < 2.18 and GCC < 4.2"). Signed-off-by: Joel Stanley --- arch/powerpc/kernel/vmlinux.lds.S | 35 --- 1 file changed, 4 insertions(+), 31 deletions(-) diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S index 060a1acd7c6d..0551e9846676 100644 --- a/arch/powerpc/kernel/vmlinux.lds.S +++ b/arch/powerpc/kernel/vmlinux.lds.S @@ -17,25 +17,6 @@ ENTRY(_stext) -PHDRS { - kernel PT_LOAD FLAGS(7); /* RWX */ - notes PT_NOTE FLAGS(0); - dummy PT_NOTE FLAGS(0); - - /* binutils < 2.18 has a bug that makes it misbehave when taking an - ELF file with all segments at load address 0 as input. This - happens when running "strip" on vmlinux, because of the AT() magic - in this linker script. People using GCC >= 4.2 won't run into - this problem, because the "build-id" support will put some data - into the "notes" segment (at a non-zero load address). - - To work around this, we force some data into both the "dummy" - segment and the kernel segment, so the dummy segment will get a - non-zero load address. It's not enough to always create the - "notes" segment, since if nothing gets assigned to it, its load - address will be zero. */ -} - #ifdef CONFIG_PPC64 OUTPUT_ARCH(powerpc:common64) jiffies = jiffies_64; @@ -77,7 +58,7 @@ SECTIONS #else /* !CONFIG_PPC64 */ HEAD_TEXT #endif - } :kernel + } __head_end = .; @@ -126,7 +107,7 @@ SECTIONS __got2_end = .; #endif /* CONFIG_PPC32 */ - } :kernel + } . = ALIGN(ETEXT_ALIGN_SIZE); _etext = .; @@ -177,15 +158,7 @@ SECTIONS #endif EXCEPTION_TABLE(0) - NOTES :kernel :notes - - /* The dummy segment contents for the bug workaround mentioned above - near PHDRS. 
*/ - .dummy : AT(ADDR(.dummy) - LOAD_OFFSET) { - LONG(0) - LONG(0) - LONG(0) - } :kernel :dummy + NOTES /* * Init sections discarded at runtime @@ -200,7 +173,7 @@ SECTIONS #ifdef CONFIG_PPC64 *(.tramp.ftrace.init); #endif - } :kernel + } /* .exit.text is discarded at runtime, not link time, * to deal with references from __bug_table -- 2.20.1
Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation
On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote: > On Wed, 20 Mar 2019 15:38:24 +1100 > David Gibson wrote: > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote: > > > On Fri, 15 Mar 2019 19:18:35 +1100 > > > Alexey Kardashevskiy wrote: > > > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV > > > > platform puts all interconnected GPUs to the same IOMMU group. > > > > > > > > However the user may want to pass individual GPUs to the userspace so > > > > in order to do so we need to put them into separate IOMMU groups and > > > > cut off the interconnects. > > > > > > > > Thankfully V100 GPUs implement an interface to do by programming link > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using > > > > this interface, it cannot be re-enabled until the secondary bus reset is > > > > issued to the GPU. > > > > > > > > This defines a reset_done() handler for V100 NVlink2 device which > > > > determines what links need to be disabled. This relies on presence > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling > > > > which > > > > PCI peers it is connected to (which includes NVLink bridges or peer > > > > GPUs). > > > > > > > > This does not change the existing behaviour and instead adds > > > > a new "isolate_nvlink" kernel parameter to allow such isolation. > > > > > > > > The alternative approaches would be: > > > > > > > > 1. do this in the system firmware (skiboot) but for that we would need > > > > to tell skiboot via an additional OPAL call whether or not we want this > > > > isolation - skiboot is unaware of IOMMU groups. > > > > > > > > 2. do this in the secondary bus reset handler in the POWERNV platform - > > > > the problem with that is at that point the device is not enabled, i.e. 
> > > > config space is not restored so we need to enable the device (i.e. MMIO > > > > bit in CMD register + program valid address to BAR0) in order to disable > > > > links and then perhaps undo all this initialization to bring the device > > > > back to the state where pci_try_reset_function() expects it to be. > > > > > > The trouble seems to be that this approach only maintains the isolation > > > exposed by the IOMMU group when vfio-pci is the active driver for the > > > device. IOMMU groups can be used by any driver and the IOMMU core is > > > incorporating groups in various ways. > > > > I don't think that reasoning is quite right. An IOMMU group doesn't > > necessarily represent devices which *are* isolated, just devices which > > *can be* isolated. There are plenty of instances when we don't need > > to isolate devices in different IOMMU groups: passing both groups to > > the same guest or userspace VFIO driver for example, or indeed when > > both groups are owned by regular host kernel drivers. > > > > In at least some of those cases we also don't want to isolate the > > devices when we don't have to, usually for performance reasons. > > I see IOMMU groups as representing the current isolation of the device, > not just the possible isolation. If there are ways to break down that > isolation then ideally the group would be updated to reflect it. The > ACS disable patches seem to support this, at boot time we can choose to > disable ACS at certain points in the topology to favor peer-to-peer > performance over isolation. This is then reflected in the group > composition, because even though ACS *can be* enabled at the given > isolation points, it's intentionally not with this option. Whether or > not a given user who owns multiple devices needs that isolation is > really beside the point, the user can choose to connect groups via IOMMU > mappings or reconfigure the system to disable ACS and potentially more > direct routing. 
> The IOMMU groups are still accurately reflecting the > topology and IOMMU based isolation. Huh, ok, I think we need to straighten this out. Thinking of iommu groups as possible rather than current isolation was a conscious decision on my part when we were first coming up with them. The rationale was that, that way, iommu groups could be static for the lifetime of boot, with more dynamic isolation state layered on top. Now, that was based on analogy with PAPR's concept of "Partitionable Endpoints" which are decided by firmware before boot. However, I think it makes sense in other contexts too: if iommu groups represent current isolation, then we need some other way to advertise possible isolation - otherwise how will the admin (and/or tools) know how they can configure the iommu groups. VFIO already has the container, which represents explicitly a "group of groups" that we don't care to isolate from each other. I don't actually know what other uses of the iommu group infrastructure we have at
[PATCH 2/2] ibmvscsi: Fix empty event pool access during host removal
The event pool used for queueing commands is destroyed fairly early in the ibmvscsi_remove() code path. Since this happens prior to the call to scsi_remove_host(), it is possible for further calls to queuecommand to be processed, which manifest as a panic due to a NULL pointer dereference, as seen here:

PANIC: "Unable to handle kernel paging request for data at address 0x"
Context process backtrace:
DSISR: 4200 Syscall Result:
 #4 [c2cb3820] memcpy_power7 at c0064204
    [Link Register] [c2cb3820] ibmvscsi_send_srp_event at d3ed14a4
 #5 [c2cb3920] ibmvscsi_send_srp_event at d3ed14a4 [ibmvscsi] (unreliable)
 #6 [c2cb39c0] ibmvscsi_queuecommand at d3ed2388 [ibmvscsi]
 #7 [c2cb3a70] scsi_dispatch_cmd at d395c2d8 [scsi_mod]
 #8 [c2cb3af0] scsi_request_fn at d395ef88 [scsi_mod]
 #9 [c2cb3be0] __blk_run_queue at c0429860
#10 [c2cb3c10] blk_delay_work at c042a0ec
#11 [c2cb3c40] process_one_work at c00dac30
#12 [c2cb3cd0] worker_thread at c00db110
#13 [c2cb3d80] kthread at c00e3378
#14 [c2cb3e30] ret_from_kernel_thread at c000982c

The kernel buffer log is overfilled with this log:

[11261.952732] ibmvscsi: found no event struct in pool!

This patch reorders the operations during host teardown. Start by calling the SRP transport and Scsi_Host remove functions to flush any outstanding work and set the host offline. LLDD teardown follows, including destruction of the event pool, freeing the Command Response Queue (CRQ), and unmapping any persistent buffers. The event pool destruction is protected by the scsi_host lock, and prior to destruction the pool is purged of any requests for which we never received a response. Finally, move the removal of the scsi host from our global list to the end so that the host is easily locatable for debugging purposes during teardown.
Cc: # v2.6.12+
Signed-off-by: Tyrel Datwyler
---
 drivers/scsi/ibmvscsi/ibmvscsi.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 2b22969f3f63..8cec5230fe31 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -2295,17 +2295,27 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 static int ibmvscsi_remove(struct vio_dev *vdev)
 {
 	struct ibmvscsi_host_data *hostdata = dev_get_drvdata(&vdev->dev);
-	spin_lock(&ibmvscsi_driver_lock);
-	list_del(&hostdata->host_list);
-	spin_unlock(&ibmvscsi_driver_lock);
-	unmap_persist_bufs(hostdata);
+	unsigned long flags;
+
+	srp_remove_host(hostdata->host);
+	scsi_remove_host(hostdata->host);
+
+	purge_requests(hostdata, DID_ERROR);
+
+	spin_lock_irqsave(hostdata->host->host_lock, flags);
 	release_event_pool(&hostdata->pool, hostdata);
+	spin_unlock_irqrestore(hostdata->host->host_lock, flags);
+
 	ibmvscsi_release_crq_queue(&hostdata->queue, hostdata, max_events);
 
 	kthread_stop(hostdata->work_thread);
-	srp_remove_host(hostdata->host);
-	scsi_remove_host(hostdata->host);
+	unmap_persist_bufs(hostdata);
+
+	spin_lock(&ibmvscsi_driver_lock);
+	list_del(&hostdata->host_list);
+	spin_unlock(&ibmvscsi_driver_lock);
+
 	scsi_host_put(hostdata->host);
 	return 0;
--
2.12.3
[PATCH 1/2] ibmvscsi: Protect ibmvscsi_head from concurrent modification
For each ibmvscsi host created during a probe or destroyed during a remove we either add or remove that host to/from the global ibmvscsi_head list. This runs the risk of concurrent modification. This patch adds a simple spinlock around the list modification calls to prevent concurrent updates, as is done similarly in the ibmvfc and ipr drivers.

Fixes: 32d6e4b6e4ea ("scsi: ibmvscsi: add vscsi hosts to global list_head")
Cc: # v4.10+
Signed-off-by: Tyrel Datwyler
---
 drivers/scsi/ibmvscsi/ibmvscsi.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 1135e74646e2..2b22969f3f63 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -96,6 +96,7 @@ static int client_reserve = 1;
 static char partition_name[96] = "UNKNOWN";
 static unsigned int partition_number = -1;
 static LIST_HEAD(ibmvscsi_head);
+static DEFINE_SPINLOCK(ibmvscsi_driver_lock);
 static struct scsi_transport_template *ibmvscsi_transport_template;
@@ -2270,7 +2271,9 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 	}
 
 	dev_set_drvdata(&vdev->dev, hostdata);
+	spin_lock(&ibmvscsi_driver_lock);
 	list_add_tail(&hostdata->host_list, &ibmvscsi_head);
+	spin_unlock(&ibmvscsi_driver_lock);
 	return 0;
 
 add_srp_port_failed:
@@ -2292,7 +2295,9 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 static int ibmvscsi_remove(struct vio_dev *vdev)
 {
 	struct ibmvscsi_host_data *hostdata = dev_get_drvdata(&vdev->dev);
+	spin_lock(&ibmvscsi_driver_lock);
 	list_del(&hostdata->host_list);
+	spin_unlock(&ibmvscsi_driver_lock);
 	unmap_persist_bufs(hostdata);
 	release_event_pool(&hostdata->pool, hostdata);
 	ibmvscsi_release_crq_queue(&hostdata->queue, hostdata,
--
2.12.3
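For readers unfamiliar with the pattern, the locking added here — a global list guarded by a static, driver-wide lock around every add/del — can be sketched in a standalone userspace analog. A pthread mutex stands in for the kernel spinlock, and all names below are illustrative, not taken from the driver:

```c
#include <pthread.h>
#include <stddef.h>

/* Userspace analog of the kernel pattern: a global "driver" list
 * guarded by a static lock so probe/remove paths cannot race. */
struct host {
	struct host *next;
};

static struct host *host_head;	/* analog of ibmvscsi_head */
static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

static void host_register(struct host *h)
{
	pthread_mutex_lock(&driver_lock);	/* spin_lock(&ibmvscsi_driver_lock) */
	h->next = host_head;
	host_head = h;
	pthread_mutex_unlock(&driver_lock);
}

static void host_unregister(struct host *h)
{
	pthread_mutex_lock(&driver_lock);
	for (struct host **p = &host_head; *p; p = &(*p)->next) {
		if (*p == h) {
			*p = h->next;
			break;
		}
	}
	pthread_mutex_unlock(&driver_lock);
}

/* Bytes of bookkeeping a debugger would walk: count registered hosts. */
static int host_count(void)
{
	int n = 0;

	pthread_mutex_lock(&driver_lock);
	for (struct host *h = host_head; h; h = h->next)
		n++;
	pthread_mutex_unlock(&driver_lock);
	return n;
}
```

Without the lock, two concurrent `host_register()`/`host_unregister()` calls can interleave their pointer updates and corrupt the list — exactly the race the patch closes.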
Re: [PATCH 1/4] add generic builtin command line
On Wed, Mar 20, 2019 at 03:53:19PM -0700, Andrew Morton wrote: > On Tue, 19 Mar 2019 16:24:45 -0700 Daniel Walker wrote: > > > This code allows architectures to use a generic builtin command line. > > I wasn't cc'ed on [2/4]. No mailing lists were cc'ed on [0/4] but it > didn't say anything useful anyway ;) > > I'll queue them up for testing and shall await feedback from the > powerpc developers. > You weren't CC'd, but it was To: you:

From: Daniel Walker
To: Andrew Morton, Christophe Leroy, Michael Ellerman, Rob Herring, xe-linux-exter...@cisco.com, linuxppc-dev@lists.ozlabs.org, Frank Rowand
Cc: devicet...@vger.kernel.org, linux-ker...@vger.kernel.org
Subject: [PATCH 2/4] drivers: of: generic command line support

and the first one [0/4] should have gone to linuxppc-dev and xe-linux-external. Maybe our git-send-email isn't working with our mail servers. Thanks for picking it up. Daniel
Re: [PATCH v4 06/17] KVM: PPC: Book3S HV: XIVE: add controls for the EQ configuration
On Wed, Mar 20, 2019 at 09:37:40AM +0100, Cédric Le Goater wrote: > These controls will be used by the H_INT_SET_QUEUE_CONFIG and > H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying > Event Queue in the XIVE IC. They will also be used to restore the > configuration of the XIVE EQs and to capture the internal run-time > state of the EQs. Both 'get' and 'set' rely on an OPAL call to access > the EQ toggle bit and EQ index which are updated by the XIVE IC when > event notifications are enqueued in the EQ. > > The value of the guest physical address of the event queue is saved in > the XIVE internal xive_q structure for later use. That is when > migration needs to mark the EQ pages dirty to capture a consistent > memory state of the VM. > > To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra > OPAL call setting the EQ toggle bit and EQ index to configure the EQ, > but restoring the EQ state will. > > Signed-off-by: Cédric Le Goater > --- > > Changes since v3 : > > - fix the test on the initial setting of the EQ toggle bit : 0 -> 1 > - renamed qsize to qshift > - renamed qpage to qaddr > - checked host page size > - limited flags to KVM_XIVE_EQ_ALWAYS_NOTIFY to fit sPAPR specs > > Changes since v2 : > > - fixed comments on the KVM device attribute definitions > - fixed check on supported EQ size to restrict to 64K pages > - checked kvm_eq.flags that need to be zero > - removed the OPAL call when EQ qtoggle bit and index are zero.
> > arch/powerpc/include/asm/xive.h | 2 + > arch/powerpc/include/uapi/asm/kvm.h | 19 ++ > arch/powerpc/kvm/book3s_xive.h | 2 + > arch/powerpc/kvm/book3s_xive.c | 15 +- > arch/powerpc/kvm/book3s_xive_native.c | 242 + > Documentation/virtual/kvm/devices/xive.txt | 34 +++ > 6 files changed, 308 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h > index b579a943407b..c4e88abd3b67 100644 > --- a/arch/powerpc/include/asm/xive.h > +++ b/arch/powerpc/include/asm/xive.h > @@ -73,6 +73,8 @@ struct xive_q { > u32 esc_irq; > atomic_t count; > atomic_t pending_count; > + u64 guest_qaddr; > + u32 guest_qshift; > }; > > /* Global enable flags for the XIVE support */ > diff --git a/arch/powerpc/include/uapi/asm/kvm.h > b/arch/powerpc/include/uapi/asm/kvm.h > index e8161e21629b..85005400fd86 100644 > --- a/arch/powerpc/include/uapi/asm/kvm.h > +++ b/arch/powerpc/include/uapi/asm/kvm.h > @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char { > #define KVM_DEV_XIVE_GRP_CTRL 1 > #define KVM_DEV_XIVE_GRP_SOURCE 2 /* 64-bit source > identifier */ > #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3 /* 64-bit source > identifier */ > +#define KVM_DEV_XIVE_GRP_EQ_CONFIG 4 /* 64-bit EQ identifier */ > > /* Layout of 64-bit XIVE source attribute values */ > #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) > @@ -696,4 +697,22 @@ struct kvm_ppc_cpu_char { > #define KVM_XIVE_SOURCE_EISN_SHIFT 33 > #define KVM_XIVE_SOURCE_EISN_MASK 0xfffeULL > > +/* Layout of 64-bit EQ identifier */ > +#define KVM_XIVE_EQ_PRIORITY_SHIFT 0 > +#define KVM_XIVE_EQ_PRIORITY_MASK 0x7 > +#define KVM_XIVE_EQ_SERVER_SHIFT 3 > +#define KVM_XIVE_EQ_SERVER_MASK 0xfff8ULL > + > +/* Layout of EQ configuration values (64 bytes) */ > +struct kvm_ppc_xive_eq { > + __u32 flags; > + __u32 qshift; > + __u64 qaddr; > + __u32 qtoggle; > + __u32 qindex; > + __u8 pad[40]; > +}; > + > +#define KVM_XIVE_EQ_ALWAYS_NOTIFY 0x0001 > + > #endif /* __LINUX_KVM_POWERPC_H */ > diff --git
a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h > index ae26fe653d98..622f594d93e1 100644 > --- a/arch/powerpc/kvm/book3s_xive.h > +++ b/arch/powerpc/kvm/book3s_xive.h > @@ -272,6 +272,8 @@ struct kvmppc_xive_src_block > *kvmppc_xive_create_src_block( > struct kvmppc_xive *xive, int irq); > void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb); > int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio); > +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio, > + bool single_escalation); > > #endif /* CONFIG_KVM_XICS */ > #endif /* _KVM_PPC_BOOK3S_XICS_H */ > diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c > index e09f3addffe5..c1b7aa7dbc28 100644 > --- a/arch/powerpc/kvm/book3s_xive.c > +++ b/arch/powerpc/kvm/book3s_xive.c > @@ -166,7 +166,8 @@ static irqreturn_t xive_esc_irq(int irq, void *data) > return IRQ_HANDLED; > } > > -static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio) > +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio, > + bool
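As a side note, the 64-bit EQ identifier layout defined in the uapi hunk above (priority in bits 0-2, server number from bit 3 up) can be exercised with a small standalone sketch. The helper names are invented for illustration, and the quoted mask constants appear abbreviated by the archive, so treat the exact mask widths as an assumption:

```c
#include <stdint.h>

/* Illustrative encode/decode of the KVM XIVE EQ identifier:
 * bits 0-2 carry the priority, bits 3 and up the server (vCPU) number,
 * matching KVM_XIVE_EQ_PRIORITY_SHIFT/KVM_XIVE_EQ_SERVER_SHIFT above. */
static uint64_t kvm_xive_eq_id(uint32_t server, uint8_t priority)
{
	return ((uint64_t)server << 3) | (priority & 0x7);
}

static void kvm_xive_eq_decode(uint64_t id, uint32_t *server, uint8_t *priority)
{
	*priority = id & 0x7;
	*server = (uint32_t)(id >> 3);
}
```

QEMU would build such an identifier as the attribute index when calling the `KVM_DEV_XIVE_GRP_EQ_CONFIG` device attribute described in the patch.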
Re: [PATCH 1/4] add generic builtin command line
On Tue, 19 Mar 2019 16:24:45 -0700 Daniel Walker wrote: > This code allows architectures to use a generic builtin command line. I wasn't cc'ed on [2/4]. No mailing lists were cc'ed on [0/4] but it didn't say anything useful anyway ;) I'll queue them up for testing and shall await feedback from the powerpc developers.
Re: [PATCH] hotplug/drc-info: initialize fndit to zero
[+cc Michael B (original author)] On Sat, Mar 16, 2019 at 09:40:16PM +, Colin King wrote: > From: Colin Ian King > > Currently variable fndit is not initialized and contains a > garbage value, later it is set to 1 if a drc entry is found. > Ensure fndit is not containing garbage by initializing it to > zero. Also remove an extraneous space at the end of an > sprintf call. > > Detected by static analysis with cppcheck. > > Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info > property") > Signed-off-by: Colin Ian King Michael E, I assume you'll take this since you took the original? Let me know if you want me to. > --- > drivers/pci/hotplug/rpaphp_core.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/pci/hotplug/rpaphp_core.c > b/drivers/pci/hotplug/rpaphp_core.c > index bcd5d357ca23..28213f44f64a 100644 > --- a/drivers/pci/hotplug/rpaphp_core.c > +++ b/drivers/pci/hotplug/rpaphp_core.c > @@ -230,7 +230,7 @@ static int rpaphp_check_drc_props_v2(struct device_node > *dn, char *drc_name, > struct of_drc_info drc; > const __be32 *value; > char cell_drc_name[MAX_DRC_NAME_LEN]; > - int j, fndit; > + int j, fndit = 0; > > info = of_find_property(dn->parent, "ibm,drc-info", NULL); > if (info == NULL) > @@ -254,7 +254,7 @@ static int rpaphp_check_drc_props_v2(struct device_node > *dn, char *drc_name, > /* Found it */ > > if (fndit) > - sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, > + sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, > my_index); > > if (((drc_name == NULL) || > -- > 2.20.1 >
Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
On Wed, Mar 20, 2019 at 01:13:41PM -0300, Thiago Jung Bauermann wrote: > >> Another way of looking at this issue which also explains our reluctance > >> is that the only difference between a secure guest and a regular guest > >> (at least regarding virtio) is that the former uses swiotlb while the > >> latter doesn't. > > > > But swiotlb is just one implementation. It's a guest internal thing. The > > issue is that memory isn't host accessible. > > From what I understand of the ACCESS_PLATFORM definition, the host will > only ever try to access memory addresses that are supplied to it by the > guest, so all of the secure guest memory that the host cares about is > accessible: > > If this feature bit is set to 0, then the device has same access to > memory addresses supplied to it as the driver has. In particular, > the device will always use physical addresses matching addresses > used by the driver (typically meaning physical addresses used by the > CPU) and not translated further, and can access any address supplied > to it by the driver. When clear, this overrides any > platform-specific description of whether device access is limited or > translated in any way, e.g. whether an IOMMU may be present. > > All of the above is true for POWER guests, whether they are secure > guests or not. > > Or are you saying that a virtio device may want to access memory > addresses that weren't supplied to it by the driver? Your logic would apply to IOMMUs as well. For your mode, there are specific encrypted memory regions that the driver has access to but the device does not. That seems to violate the constraint.
The way not to reflect that in the feature flags is > > to set ACCESS_PLATFORM. Then you say *I don't care let platform device*. > > > > > > Without ACCESS_PLATFORM > > virtio has a very specific opinion about the security of the > > device, and that opinion is that device is part of the guest > > supervisor security domain. > > Sorry for being a bit dense, but not sure what "the device is part of > the guest supervisor security domain" means. In powerpc-speak, > "supervisor" is the operating system so perhaps that explains my > confusion. Are you saying that without ACCESS_PLATFORM, the guest > considers the host to be part of the guest operating system's security > domain? I think so. The spec says "device has same access as driver". > If so, does that have any other implication besides "the host > can access any address supplied to it by the driver"? If that is the > case, perhaps the definition of ACCESS_PLATFORM needs to be amended to > include that information because it's not part of the current > definition. > > >> That said, we still would like to arrive at a proper design for this > >> rather than add yet another hack if we can avoid it. So here's another > >> proposal: considering that the dma-direct code (in kernel/dma/direct.c) > >> automatically uses swiotlb when necessary (thanks to Christoph's recent > >> DMA work), would it be ok to replace virtio's own direct-memory code > >> that is used in the !ACCESS_PLATFORM case with the dma-direct code? That > >> way we'll get swiotlb even with !ACCESS_PLATFORM, and virtio will get a > >> code cleanup (replace open-coded stuff with calls to existing > >> infrastructure). > > > > Let's say I have some doubts that there's an API that > > matches what virtio with its bag of legacy compatibility exactly. > > Ok. 
> > >> > But the name "sev_active" makes me scared because at least AMD guys who > >> > were doing the sensible thing and setting ACCESS_PLATFORM > >> > >> My understanding is, AMD guest-platform knows in advance that their > >> guest will run in secure mode and hence sets the flag at the time of VM > >> instantiation. Unfortunately we dont have that luxury on our platforms. > > > > Well you do have that luxury. It looks like that there are existing > > guests that already acknowledge ACCESS_PLATFORM and you are not happy > > with how that path is slow. So you are trying to optimize for > > them by clearing ACCESS_PLATFORM and then you have lost ability > > to invoke DMA API. > > > > For example if there was another flag just like ACCESS_PLATFORM > > just not yet used by anyone, you would be all fine using that right? > > Yes, a new flag sounds like a great idea. What about the definition > below? > > VIRTIO_F_ACCESS_PLATFORM_NO_IOMMU This feature has the same meaning as > VIRTIO_F_ACCESS_PLATFORM both when set and when not set, with the > exception that the IOMMU is explicitly defined to be off or bypassed > when accessing memory addresses supplied to
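A standalone sketch of the policy this thread is debating — use the DMA API when ACCESS_PLATFORM is negotiated, *or* when guest memory is encrypted and therefore needs swiotlb bouncing even though the device claims direct access. The helper and its boolean arguments are invented for illustration; the real kernel decision lives in `vring_use_dma_api()` in drivers/virtio/virtio_ring.c:

```c
#include <stdbool.h>

/* Sketch of the RFC's idea, not the actual patch: virtio bypasses the
 * DMA API when ACCESS_PLATFORM is unset, *except* when the guest's
 * memory is encrypted and thus not host-accessible as-is. */
static bool vring_use_dma_api(bool access_platform, bool mem_encrypted)
{
	if (access_platform)
		return true;	/* device access is limited/translated */
	if (mem_encrypted)
		return true;	/* bounce through swiotlb via the DMA API */
	return false;		/* legacy: device uses guest physical addresses */
}
```

The controversy above is precisely about the second branch: whether a guest-internal property (memory encryption) may override the device-visible feature negotiation.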
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Wed, Mar 20, 2019 at 8:34 AM Dan Williams wrote: > > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V > wrote: > > > > Aneesh Kumar K.V writes: > > > > > Dan Williams writes: > > > > > >> > > >>> Now what will be page size used for mapping vmemmap? > > >> > > >> That's up to the architecture's vmemmap_populate() implementation. > > >> > > >>> Architectures > > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a > > >>> device-dax with struct page in the device will have pfn reserve area > > >>> aligned > > >>> to PAGE_SIZE with the above example? We can't map that using > > >>> PMD_SIZE page size? > > >> > > >> IIUC, that's a different alignment. Currently that's handled by > > >> padding the reservation area up to a section (128MB on x86) boundary, > > >> but I'm working on patches to allow sub-section sized ranges to be > > >> mapped. > > > > > > I am missing something w.r.t code. The below code align that using > > > nd_pfn->align > > > > > > if (nd_pfn->mode == PFN_MODE_PMEM) { > > > unsigned long memmap_size; > > > > > > /* > > >* vmemmap_populate_hugepages() allocates the memmap array > > > in > > >* HPAGE_SIZE chunks. > > >*/ > > > memmap_size = ALIGN(64 * npfns, HPAGE_SIZE); > > > offset = ALIGN(start + SZ_8K + memmap_size + > > > dax_label_reserve, > > > nd_pfn->align) - start; > > > } > > > > > > IIUC that is finding the offset where to put vmemmap start. And that has > > > to be aligned to the page size with which we may end up mapping vmemmap > > > area right? > > Right, that's the physical offset of where the vmemmap ends, and the > memory to be mapped begins. > > > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that > > > is to compute howmany pfns we should map for this pfn dev right? > > > > > > > Also i guess those 4K assumptions there is wrong? > > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata > needs to be revved and the PAGE_SIZE needs to be recorded in the > info-block. 
How often does a system change page size? Is it fixed, or does the environment change it from one boot to the next? I'm thinking through the behavior of what to do when the recorded PAGE_SIZE in the info-block does not match the current system page size. The simplest option is to just fail the device and require it to be reconfigured. Is that acceptable?
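One way to read Dan's "simplest option" is a validation step like the following sketch. The struct layout, helper name, and return codes are assumptions for illustration only — not the actual nd_pfn info-block format:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical sketch: the 'pfn' info-block records the PAGE_SIZE it was
 * created with, and setup fails when it does not match the running
 * system's page size, forcing a reconfiguration instead of silently
 * mis-mapping the vmemmap area. */
struct pfn_info_block {
	uint32_t page_size;	/* PAGE_SIZE recorded at namespace creation */
};

static int pfn_validate_page_size(const struct pfn_info_block *ib,
				  uint32_t system_page_size)
{
	if (ib->page_size == 0)
		return -EINVAL;		/* unversioned or corrupt info-block */
	if (ib->page_size != system_page_size)
		return -EOPNOTSUPP;	/* created under a different page size */
	return 0;
}
```

A namespace created on a 64K-page powerpc kernel would then be rejected (rather than misinterpreted) when plugged into a 4K-page kernel, and vice versa.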
[RFC PATCH 1/1] KVM: PPC: Report single stepping capability
When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request the next instruction to be single stepped via the KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure. We currently don't have support for guest single stepping implemented in Book3S HV.

This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order to inform userspace about the state of single stepping support.

Signed-off-by: Fabiano Rosas
---
 arch/powerpc/kvm/powerpc.c | 5 +++++
 include/uapi/linux/kvm.h   | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8885377ec3e0..5ba990b0ec74 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -538,6 +538,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_IMMEDIATE_EXIT:
 		r = 1;
 		break;
+	case KVM_CAP_PPC_GUEST_DEBUG_SSTEP:
+#ifdef CONFIG_BOOKE
+		r = 1;
+		break;
+#endif
 	case KVM_CAP_PPC_PAIRED_SINGLES:
 	case KVM_CAP_PPC_OSI:
 	case KVM_CAP_PPC_GET_PVINFO:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..33e8a4db867e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
 #define KVM_CAP_HYPERV_CPUID 167
+#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 168
 
 #ifdef KVM_CAP_IRQ_ROUTING
--
2.20.1
[RFC PATCH 0/1] KVM: PPC: Inform userspace about singlestep support
I am looking for a way to inform userspace about the lack of an implementation in KVM HV for single stepping of instructions (KVM_GUESTDBG_SINGLESTEP bit from the SET_GUEST_DEBUG ioctl). This will be used by QEMU to decide whether to attempt a call to the set_guest_debug ioctl (for BookE, KVM PR) or fall back to a QEMU-only implementation (for KVM HV). QEMU thread: http://patchwork.ozlabs.org/cover/1049811/ My current proposal is to introduce a ppc-specific capability for this. However I'm not sure if this would be better as a cap common to all architectures, or even if it should report on all of the possible set_guest_debug flags to cover for the future. Please comment. Thanks. Fabiano Rosas (1): KVM: PPC: Report single stepping capability arch/powerpc/kvm/powerpc.c | 5 +++++ include/uapi/linux/kvm.h | 1 + 2 files changed, 6 insertions(+) -- 2.20.1
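On the QEMU side, the fallback described above could look roughly like the sketch below. The two flag values match the real KVM uapi; the helper itself is hypothetical — real code would first query the capability with the KVM_CHECK_EXTENSION ioctl and then decide which flags to forward via KVM_SET_GUEST_DEBUG:

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the real include/uapi/linux/kvm.h. */
#define KVM_GUESTDBG_ENABLE	0x00000001u
#define KVM_GUESTDBG_SINGLESTEP	0x00000002u

/* Hypothetical sketch of the userspace decision the cover letter
 * describes: only forward SINGLESTEP to the kernel when the proposed
 * capability is reported; otherwise strip it and handle stepping in
 * a QEMU-only implementation. */
static uint32_t debug_flags_for_kernel(uint32_t requested, bool cap_sstep)
{
	uint32_t flags = requested;

	if ((flags & KVM_GUESTDBG_SINGLESTEP) && !cap_sstep)
		flags &= ~KVM_GUESTDBG_SINGLESTEP;	/* userspace fallback */
	return flags;
}
```

This keeps BookE/PR behaviour unchanged (capability reported, flag passed through) while giving HV a graceful path.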
Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section
On Wed, 2019-03-20 at 18:16 +, Catalin Marinas wrote: > I think I have a simpler idea. Kmemleak allows punching holes in > allocated objects, so just turn the data/bss sections into dedicated > kmemleak objects. This happens when kmemleak is initialised, before the > initcalls are invoked. The kvm_free_tmp() would just free the > corresponding part of the bss. > > Patch below, only tested briefly on arm64. Qian, could you give it a try > on powerpc? Thanks. It works great so far!
Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation
On Wed, 20 Mar 2019 15:38:24 +1100 David Gibson wrote: > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote: > > On Fri, 15 Mar 2019 19:18:35 +1100 > > Alexey Kardashevskiy wrote: > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV > > > platform puts all interconnected GPUs to the same IOMMU group. > > > > > > However the user may want to pass individual GPUs to the userspace so > > > in order to do so we need to put them into separate IOMMU groups and > > > cut off the interconnects. > > > > > > Thankfully V100 GPUs implement an interface to do by programming link > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using > > > this interface, it cannot be re-enabled until the secondary bus reset is > > > issued to the GPU. > > > > > > This defines a reset_done() handler for V100 NVlink2 device which > > > determines what links need to be disabled. This relies on presence > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which > > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs). > > > > > > This does not change the existing behaviour and instead adds > > > a new "isolate_nvlink" kernel parameter to allow such isolation. > > > > > > The alternative approaches would be: > > > > > > 1. do this in the system firmware (skiboot) but for that we would need > > > to tell skiboot via an additional OPAL call whether or not we want this > > > isolation - skiboot is unaware of IOMMU groups. > > > > > > 2. do this in the secondary bus reset handler in the POWERNV platform - > > > the problem with that is at that point the device is not enabled, i.e. > > > config space is not restored so we need to enable the device (i.e. 
MMIO > > > bit in CMD register + program valid address to BAR0) in order to disable > > > links and then perhaps undo all this initialization to bring the device > > > back to the state where pci_try_reset_function() expects it to be. > > > > The trouble seems to be that this approach only maintains the isolation > > exposed by the IOMMU group when vfio-pci is the active driver for the > > device. IOMMU groups can be used by any driver and the IOMMU core is > > incorporating groups in various ways. > > I don't think that reasoning is quite right. An IOMMU group doesn't > necessarily represent devices which *are* isolated, just devices which > *can be* isolated. There are plenty of instances when we don't need > to isolate devices in different IOMMU groups: passing both groups to > the same guest or userspace VFIO driver for example, or indeed when > both groups are owned by regular host kernel drivers. > > In at least some of those cases we also don't want to isolate the > devices when we don't have to, usually for performance reasons. I see IOMMU groups as representing the current isolation of the device, not just the possible isolation. If there are ways to break down that isolation then ideally the group would be updated to reflect it. The ACS disable patches seem to support this, at boot time we can choose to disable ACS at certain points in the topology to favor peer-to-peer performance over isolation. This is then reflected in the group composition, because even though ACS *can be* enabled at the given isolation points, it's intentionally not with this option. Whether or not a given user who owns multiple devices needs that isolation is really beside the point, the user can choose to connect groups via IOMMU mappings or reconfigure the system to disable ACS and potentially more direct routing. The IOMMU groups are still accurately reflecting the topology and IOMMU based isolation. 
> > So, if there's a device specific > > way to configure the isolation reported in the group, which requires > > some sort of active management against things like secondary bus > > resets, then I think we need to manage it above the attached endpoint > > driver. > > The problem is that above the endpoint driver, we don't actually have > enough information about what should be isolated. For VFIO we want to > isolate things if they're in different containers, for most regular > host kernel drivers we don't need to isolate at all (although we might > as well when it doesn't have a cost). This idea that we only want to isolate things if they're in different containers is bogus, imo. There are performance reasons why we might not want things isolated, but there are also address space reasons why we do. If there are direct routes between devices, the user needs to be aware of the IOVA pollution, if we maintain singleton groups, they don't. Granted we don't really account for this well in most userspaces and fumble through it by luck of the address space layout and lack of devices really attempting peer to peer access.
Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section
On Thu, Mar 21, 2019 at 12:15:46AM +1100, Michael Ellerman wrote: > Catalin Marinas writes: > > On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote: > >> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void) > >> > >>/* data/bss scanning */ > >>scan_large_block(_sdata, _edata); > >> - scan_large_block(__bss_start, __bss_stop); > >> + > >> + if (bss_hole_start) { > >> + scan_large_block(__bss_start, bss_hole_start); > >> + scan_large_block(bss_hole_stop, __bss_stop); > >> + } else { > >> + scan_large_block(__bss_start, __bss_stop); > >> + } > >> + > >>scan_large_block(__start_ro_after_init, __end_ro_after_init); > > > > I'm not a fan of this approach but I couldn't come up with anything > > better. I was hoping we could check for PageReserved() in scan_block() > > but on arm64 it ends up not scanning the .bss at all. > > > > Until another user appears, I'm ok with this patch. > > > > Acked-by: Catalin Marinas > > I actually would like to rework this kvm_tmp thing to not be in bss at > all. It's a bit of a hack and is incompatible with strict RWX. > > If we size it a bit more conservatively we can hopefully just reserve > some space in the text section for it. > > I'm not going to have time to work on that immediately though, so if > people want this fixed now then this patch could go in as a temporary > solution. I think I have a simpler idea. Kmemleak allows punching holes in allocated objects, so just turn the data/bss sections into dedicated kmemleak objects. This happens when kmemleak is initialised, before the initcalls are invoked. The kvm_free_tmp() would just free the corresponding part of the bss. Patch below, only tested briefly on arm64. Qian, could you give it a try on powerpc? Thanks. 
8<--
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 683b5b3805bd..c4b8cb3c298d 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -712,6 +712,8 @@ static void kvm_use_magic_page(void)
 static __init void kvm_free_tmp(void)
 {
+	kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
+			   ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
 	free_reserved_area(&kvm_tmp[kvm_tmp_index],
 			   &kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
 }
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 707fa5579f66..0f6adcbfc2c7 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1529,11 +1529,6 @@ static void kmemleak_scan(void)
 	}
 	rcu_read_unlock();
 
-	/* data/bss scanning */
-	scan_large_block(_sdata, _edata);
-	scan_large_block(__bss_start, __bss_stop);
-	scan_large_block(__start_ro_after_init, __end_ro_after_init);
-
 #ifdef CONFIG_SMP
 	/* per-cpu sections scanning */
 	for_each_possible_cpu(i)
@@ -2071,6 +2066,15 @@ void __init kmemleak_init(void)
 	}
 	local_irq_restore(flags);
 
+	/* register the data/bss sections */
+	create_object((unsigned long)_sdata, _edata - _sdata,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+	create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+	create_object((unsigned long)__start_ro_after_init,
+		      __end_ro_after_init - __start_ro_after_init,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+
	/*
	 * This is the point where tracking allocations is safe. Automatic
	 * scanning is started during the late initcall. Add the early logged
Re: [PATCH v3 2/5] ocxl: Clean up printf formats
On Wed, 2019-03-20 at 16:34 +1100, Alastair D'Silva wrote:
> From: Alastair D'Silva
>
> Use %# instead of using a literal '0x'

I do not suggest this as reasonable. There are 10's of thousands of uses of 0x%x in the kernel and converting them to save a byte seems unnecessary.

$ git grep -P '0x%[\*\d\.]*[xX]' | wc -l
26120

And the %#x style is by far the lesser used form

$ git grep -P '%#[\*\d\.]*[xX]' | wc -l
2726

Also, the sized form of %#[size]x is frequently misused where the size does not account for the initial 0x output.

> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
[]
> @@ -178,9 +178,9 @@ static int read_dvsec_vendor(struct pci_dev *dev)
> 	pci_read_config_dword(dev, pos + OCXL_DVSEC_VENDOR_DLX_VERS, &dlx);
>
> 	dev_dbg(&dev->dev, "Vendor specific DVSEC:\n");
> -	dev_dbg(&dev->dev, " CFG version = 0x%x\n", cfg);
> -	dev_dbg(&dev->dev, " TLX version = 0x%x\n", tlx);
> -	dev_dbg(&dev->dev, " DLX version = 0x%x\n", dlx);
> +	dev_dbg(&dev->dev, " CFG version = %#x\n", cfg);
> +	dev_dbg(&dev->dev, " TLX version = %#x\n", tlx);
> +	dev_dbg(&dev->dev, " DLX version = %#x\n", dlx);

etc...
Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
Hello Michael, Sorry for the delay in responding. We had some internal discussions on this. Michael S. Tsirkin writes: > On Mon, Feb 04, 2019 at 04:14:20PM -0200, Thiago Jung Bauermann wrote: >> >> Hello Michael, >> >> Michael S. Tsirkin writes: >> >> > On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote: >> So while ACCESS_PLATFORM solves our problems for secure guests, we can't >> turn it on by default because we can't affect legacy systems. Doing so >> would penalize existing systems that can access all memory. They would >> all have to unnecessarily go through address translations, and take a >> performance hit. > > So as step one, you just give hypervisor admin an option to run legacy > systems faster by blocking secure mode. I don't see why that is > so terrible. There are a few reasons why: 1. It's bad user experience to require people to fiddle with knobs for obscure reasons if it's possible to design things such that they Just Work. 2. "User" in this case can be a human directly calling QEMU, but could also be libvirt or one of its users, or some other framework. This means having to adjust and/or educate an open-ended number of people and software. It's best avoided if possible. 3. The hypervisor admin and the admin of the guest system don't necessarily belong to the same organization (e.g., cloud provider and cloud customer), so there may be some friction when they need to coordinate to get this right. 4. A feature of our design is that the guest may or may not decide to "go secure" at boot time, so it's best not to depend on flags that may or may not have been set at the time QEMU was started. >> The semantics of ACCESS_PLATFORM assume that the hypervisor/QEMU knows >> in advance - right when the VM is instantiated - that it will not have >> access to all guest memory. > > Not quite. It just means that hypervisor can live with not having > access to all memory. 
> If platform wants to give it access
> to all memory that is quite all right.

Except that on powerpc it also means "there's an IOMMU present" and there's no way to say "bypass IOMMU translation". :-/

>> Another way of looking at this issue which also explains our reluctance
>> is that the only difference between a secure guest and a regular guest
>> (at least regarding virtio) is that the former uses swiotlb while the
>> latter doesn't.
>
> But swiotlb is just one implementation. It's a guest internal thing. The
> issue is that memory isn't host accessible.

From what I understand of the ACCESS_PLATFORM definition, the host will only ever try to access memory addresses that are supplied to it by the guest, so all of the secure guest memory that the host cares about is accessible:

    If this feature bit is set to 0, then the device has same access to
    memory addresses supplied to it as the driver has. In particular,
    the device will always use physical addresses matching addresses
    used by the driver (typically meaning physical addresses used by the
    CPU) and not translated further, and can access any address supplied
    to it by the driver. When clear, this overrides any
    platform-specific description of whether device access is limited or
    translated in any way, e.g. whether an IOMMU may be present.

All of the above is true for POWER guests, whether they are secure guests or not.

Or are you saying that a virtio device may want to access memory addresses that weren't supplied to it by the driver?

>> And from the device's point of view they're
>> indistinguishable. It can't tell one guest that is using swiotlb from
>> one that isn't. And that implies that secure guest vs regular guest
>> isn't a virtio interface issue, it's "guest internal affairs". So
>> there's no reason to reflect that in the feature flags.
>
> So don't. The way not to reflect that in the feature flags is
> to set ACCESS_PLATFORM. Then you say *I don't care let platform device*.
> > > Without ACCESS_PLATFORM > virtio has a very specific opinion about the security of the > device, and that opinion is that device is part of the guest > supervisor security domain. Sorry for being a bit dense, but not sure what "the device is part of the guest supervisor security domain" means. In powerpc-speak, "supervisor" is the operating system so perhaps that explains my confusion. Are you saying that without ACCESS_PLATFORM, the guest considers the host to be part of the guest operating system's security domain? If so, does that have any other implication besides "the host can access any address supplied to it by the driver"? If that is the case, perhaps the definition of ACCESS_PLATFORM needs to be amended to include that information because it's not part of the current definition. >> That said, we still would like to arrive at a proper design for this >> rather than add yet another hack if we can avoid it. So here's another >> proposal: considering that the dma-direct code (in kernel/dma/direct.c) >> automatically uses
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V wrote: > > Aneesh Kumar K.V writes: > > > Dan Williams writes: > > > >> > >>> Now what will be page size used for mapping vmemmap? > >> > >> That's up to the architecture's vmemmap_populate() implementation. > >> > >>> Architectures > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a > >>> device-dax with struct page in the device will have pfn reserve area > >>> aligned > >>> to PAGE_SIZE with the above example? We can't map that using > >>> PMD_SIZE page size? > >> > >> IIUC, that's a different alignment. Currently that's handled by > >> padding the reservation area up to a section (128MB on x86) boundary, > >> but I'm working on patches to allow sub-section sized ranges to be > >> mapped. > > > > I am missing something w.r.t code. The below code align that using > > nd_pfn->align > > > > if (nd_pfn->mode == PFN_MODE_PMEM) { > > unsigned long memmap_size; > > > > /* > >* vmemmap_populate_hugepages() allocates the memmap array in > >* HPAGE_SIZE chunks. > >*/ > > memmap_size = ALIGN(64 * npfns, HPAGE_SIZE); > > offset = ALIGN(start + SZ_8K + memmap_size + > > dax_label_reserve, > > nd_pfn->align) - start; > > } > > > > IIUC that is finding the offset where to put vmemmap start. And that has > > to be aligned to the page size with which we may end up mapping vmemmap > > area right? Right, that's the physical offset of where the vmemmap ends, and the memory to be mapped begins. > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that > > is to compute howmany pfns we should map for this pfn dev right? > > > > Also i guess those 4K assumptions there is wrong? Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata needs to be revved and the PAGE_SIZE needs to be recorded in the info-block.
Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING
On Wed, Mar 20, 2019 at 10:41 AM Arnd Bergmann wrote: > > I've added your patch to my randconfig test setup and will let you > know if I see anything noticeable. I'm currently testing clang-arm32, > clang-arm64 and gcc-x86. This is the only additional bug that has come up so far: `.exit.text' referenced in section `.alt.smp.init' of drivers/char/ipmi/ipmi_msghandler.o: defined in discarded section `exit.text' of drivers/char/ipmi/ipmi_msghandler.o diff --git a/arch/arm/kernel/atags.h b/arch/arm/kernel/atags.h index 201100226301..84b12e33104d 100644 --- a/arch/arm/kernel/atags.h +++ b/arch/arm/kernel/atags.h @@ -5,7 +5,7 @@ void convert_to_tag_list(struct tag *tags); const struct machine_desc *setup_machine_tags(phys_addr_t __atags_pointer, unsigned int machine_nr); #else -static inline const struct machine_desc * +static __always_inline const struct machine_desc * setup_machine_tags(phys_addr_t __atags_pointer, unsigned int machine_nr) { early_print("no ATAGS support: can't continue\n");
Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section
Catalin Marinas writes: > Hi Qian, > > On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote: >> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void) >> >> /* data/bss scanning */ >> scan_large_block(_sdata, _edata); >> -scan_large_block(__bss_start, __bss_stop); >> + >> +if (bss_hole_start) { >> +scan_large_block(__bss_start, bss_hole_start); >> +scan_large_block(bss_hole_stop, __bss_stop); >> +} else { >> +scan_large_block(__bss_start, __bss_stop); >> +} >> + >> scan_large_block(__start_ro_after_init, __end_ro_after_init); > > I'm not a fan of this approach but I couldn't come up with anything > better. I was hoping we could check for PageReserved() in scan_block() > but on arm64 it ends up not scanning the .bss at all. > > Until another user appears, I'm ok with this patch. > > Acked-by: Catalin Marinas I actually would like to rework this kvm_tmp thing to not be in bss at all. It's a bit of a hack and is incompatible with strict RWX. If we size it a bit more conservatively we can hopefully just reserve some space in the text section for it. I'm not going to have time to work on that immediately though, so if people want this fixed now then this patch could go in as a temporary solution. cheers
Re: [PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()
On 20/03/2019 13:47, Michael Ellerman wrote: > Laurent Vivier writes: >> Hi Michael, >> >> as it seems good now, could you pick up this patch for merging? > > I'll start picking up patches for next starting after rc2, so next week. > > If you think it's a bug fix I can put it into fixes now, but I don't > think it's a bug fix is it? No, it's only cosmetic. Thanks, Laurent
Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING
On Wed, Mar 20, 2019 at 11:19 AM Masahiro Yamada wrote: > On Wed, Mar 20, 2019 at 6:39 PM Arnd Bergmann wrote: > > > > On Wed, Mar 20, 2019 at 7:41 AM Masahiro Yamada > > wrote: > > > > > It is unclear to me how to fix it. > > > That's why I ended up with "depends on !MIPS". > > > > > > > > > MODPOST vmlinux.o > > > arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2': > > > sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base' > > > sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base' > > > sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base' > > > sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base' > > > sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base' > > > arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined > > > references to `mips_gcr_base' > > > > > > > > > Perhaps, MIPS folks may know how to fix it. > > > > I would guess like this: > > > > diff --git a/arch/mips/include/asm/mips-cm.h > > b/arch/mips/include/asm/mips-cm.h > > index 8bc5df49b0e1..a27483fedb7d 100644 > > --- a/arch/mips/include/asm/mips-cm.h > > +++ b/arch/mips/include/asm/mips-cm.h > > @@ -79,7 +79,7 @@ static inline int mips_cm_probe(void) > > * > > * Returns true if a CM is present in the system, else false. > > */ > > -static inline bool mips_cm_present(void) > > +static __always_inline bool mips_cm_present(void) > > { > > #ifdef CONFIG_MIPS_CM > > return mips_gcr_base != NULL; > > @@ -93,7 +93,7 @@ static inline bool mips_cm_present(void) > > * > > * Returns true if the system implements an L2-only sync region, else > > false. > > */ > > -static inline bool mips_cm_has_l2sync(void) > > +static __always_inline bool mips_cm_has_l2sync(void) > > { > > #ifdef CONFIG_MIPS_CM > > return mips_cm_l2sync_base != NULL; > > > > > Thanks, I applied the above, but I still see > undefined reference to `mips_gcr_base' > > > I attached .config to produce this error. 
>
> I use prebuilt mips-linux-gcc from
> https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/

I got to this patch experimentally, it fixes the problem for me:

diff --git a/arch/mips/mm/sc-mips.c b/arch/mips/mm/sc-mips.c
index 394673991bab..d70d02da038b 100644
--- a/arch/mips/mm/sc-mips.c
+++ b/arch/mips/mm/sc-mips.c
@@ -181,7 +181,7 @@ static int __init mips_sc_probe_cm3(void)
 	return 0;
 }
 
-static inline int __init mips_sc_probe(void)
+static __always_inline int __init mips_sc_probe(void)
 {
 	struct cpuinfo_mips *c = &current_cpu_data;
 	unsigned int config1, config2;
diff --git a/arch/mips/include/asm/bitops.h b/arch/mips/include/asm/bitops.h
index 830c93a010c3..186c28463bf3 100644
--- a/arch/mips/include/asm/bitops.h
+++ b/arch/mips/include/asm/bitops.h
@@ -548,7 +548,7 @@ static inline unsigned long __fls(unsigned long word)
 * Returns 0..SZLONG-1
 * Undefined if no bit exists, so code should check against 0 first.
 */
-static inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __ffs(unsigned long word)
 {
 	return __fls(word & -word);
 }

It does look like a gcc bug though, as at least some of the references are from a function that got split out from an inlined function but that has no remaining call sites.

      Arnd
Re: [PATCH v5 05/10] powerpc: Add a framework for Kernel Userspace Access Protection
Le 20/03/2019 à 13:57, Michael Ellerman a écrit :
Christophe Leroy writes:
Le 08/03/2019 à 02:16, Michael Ellerman a écrit :
From: Christophe Leroy

This patch implements a framework for Kernel Userspace Access Protection.

Then subarches will have the possibility to provide their own implementation by providing setup_kuap() and allow/prevent_user_access().

Some platforms will need to know the area accessed and whether it is accessed from read, write or both. Therefore source, destination and size are handed over to the two functions.

mpe: Rename to allow/prevent rather than unlock/lock, and add read/write wrappers. Drop the 32-bit code for now until we have an implementation for it. Add kuap to pt_regs for 64-bit as well as 32-bit. Don't split strings, use pr_crit_ratelimited().

Signed-off-by: Christophe Leroy
Signed-off-by: Russell Currey
Signed-off-by: Michael Ellerman
---
v5: Futex ops need read/write so use allow_user_access() there.
Use #ifdef CONFIG_PPC64 in kup.h to fix build errors.
Allow subarch to override allow_read/write_from/to_user().

Those little helpers that will just call allow_user_access() when distinct read/write handling is not performed looks overkill to me.

Can't the subarch do it by itself based on the nullity of from/to ?

static inline void allow_user_access(void __user *to, const void __user *from,
				     unsigned long size)
{
	if (to && from)
		set_kuap(0);
	else if (to)
		set_kuap(AMR_KUAP_BLOCK_READ);
	else if (from)
		set_kuap(AMR_KUAP_BLOCK_WRITE);
}

You could implement it that way, but it reads better at the call sites if we have:

  allow_write_to_user(uaddr, sizeof(*uaddr));

vs:

  allow_user_access(uaddr, NULL, sizeof(*uaddr));

So I'm inclined to keep them. It should all end up inlined and generate the same code at the end of the day.

I was not suggesting to completely remove allow_write_to_user(), I fully agree that it reads better at the call sites.
I was just thinking that allow_write_to_user() could remain generic and call the subarch specific allow_user_access() instead of making multiple subarch's allow_write_to_user().

But both solutions are OK for me at the end.

Christophe
Re: powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038
On Wed, 2019-03-13 at 13:14:38 UTC, Michael Ellerman wrote:
> Jakub Drnec reported:
>   Setting the realtime clock can sometimes make the monotonic clock go
>   back by over a hundred years. Decreasing the realtime clock across
>   the y2k38 threshold is one reliable way to reproduce. Allegedly this
>   can also happen just by running ntpd, I have not managed to
>   reproduce that other than booting with rtc at >2038 and then running
>   ntp. When this happens, anything with timers (e.g. openjdk) breaks
>   rather badly.
>
> And included a test case (slightly edited for brevity):
>   #define _POSIX_C_SOURCE 199309L
>   #include <stdio.h>
>   #include <time.h>
>   #include <unistd.h>
>   #include
>
>   long get_time(void) {
>     struct timespec tp;
>     clock_gettime(CLOCK_MONOTONIC, &tp);
>     return tp.tv_sec + tp.tv_nsec / 1000000000;
>   }
>
>   int main(void) {
>     long last = get_time();
>     while(1) {
>       long now = get_time();
>       if (now < last) {
>         printf("clock went backwards by %ld seconds!\n", last - now);
>       }
>       last = now;
>       sleep(1);
>     }
>     return 0;
>   }
>
> Which when run concurrently with:
>   # date -s 2040-1-1
>   # date -s 2037-1-1
>
> Will detect the clock going backward.
>
> The root cause is that wtom_clock_sec in struct vdso_data is only a
> 32-bit signed value, even though we set its value to be equal to
> tk->wall_to_monotonic.tv_sec which is 64-bits.
>
> Because the monotonic clock starts at zero when the system boots the
> wall_to_monotonic.tv_sec offset is negative for current and future
> dates. Currently on a freshly booted system the offset will be in the
> vicinity of negative 1.5 billion seconds.
>
> However if the wall clock is set past the Y2038 boundary, the offset
> from wall to monotonic becomes less than negative 2^31, and no longer
> fits in 32-bits. When that value is assigned to wtom_clock_sec it is
> truncated and becomes positive, causing the VDSO assembly code to
> calculate CLOCK_MONOTONIC incorrectly.
>
> That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
> it is not meant to do.
Worse, if the time is then set back before the > Y2038 boundary CLOCK_MONOTONIC will jump backward. > > We can fix it simply by storing the full 64-bit offset in the > vdso_data, and using that in the VDSO assembly code. We also shuffle > some of the fields in vdso_data to avoid creating a hole. > > The original commit that added the CLOCK_MONOTONIC support to the VDSO > did actually use a 64-bit value for wtom_clock_sec, see commit > a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support to > 32 bits kernel") (Nov 2005). However just 3 days later it was > converted to 32-bits in commit 0c37ec2aa88b ("[PATCH] powerpc: vdso > fixes (take #2)"), and the bug has existed since then AFAICS. > > Fixes: 0c37ec2aa88b ("[PATCH] powerpc: vdso fixes (take #2)") > Cc: sta...@vger.kernel.org # v2.6.15+ > Link: http://lkml.kernel.org/r/hac.zfes.62bwlnvavmp.1st...@seznam.cz > Reported-by: Jakub Drnec > Signed-off-by: Michael Ellerman Applied to powerpc fixes. https://git.kernel.org/powerpc/c/b5b4453e7912f056da1ca7572574cada cheers
Re: [v2, 01/10] powerpc/6xx: fix setup and use of SPRN_SPRG_PGDIR for hash32
On Mon, 2019-03-11 at 08:30:27 UTC, Christophe Leroy wrote: > Not only the 603 but all 6xx need SPRN_SPRG_PGDIR to be initialised at > startup. This patch move it from __setup_cpu_603() to start_here() > and __secondary_start(), close to the initialisation of SPRN_THREAD. > > Previously, virt addr of PGDIR was retrieved from thread struct. > Now that it is the phys addr which is stored in SPRN_SPRG_PGDIR, > hash_page() shall not convert it to phys anymore. > This patch removes the conversion. > > Fixes: 93c4a162b014("powerpc/6xx: Store PGDIR physical address in a SPRG") > Reported-by: Guenter Roeck > Tested-by: Guenter Roeck > Signed-off-by: Christophe Leroy Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/4622a2d43101ea2e3d54a2af090f25a5 cheers
Re: [PATCH v5 02/10] powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle
Akshay Adiga writes: > On Fri, Mar 08, 2019 at 12:16:11PM +1100, Michael Ellerman wrote: >> In order to implement KUAP (Kernel Userspace Access Protection) on >> Power9 we will be using the AMR, and therefore indirectly the >> UAMOR/AMOR. >> >> So save/restore these regs in the idle code. >> >> Signed-off-by: Michael Ellerman >> --- >> v5: Unchanged. >> v4: New. >> >> arch/powerpc/kernel/idle_book3s.S | 27 +++ >> 1 file changed, 23 insertions(+), 4 deletions(-) > > Opps.. i posted a comment on the v4. > > It would be good if we can make AMOR/UAMOR/AMR save-restore > code power9 only. Yes that would be a good optimisation. If you can send an incremental patch against this one I'll squash it in. If not I'll try and get it done at some point before merging. cheers
Re: [PATCH v5 05/10] powerpc: Add a framework for Kernel Userspace Access Protection
Christophe Leroy writes:
> Le 08/03/2019 à 02:16, Michael Ellerman a écrit :
>> From: Christophe Leroy
>>
>> This patch implements a framework for Kernel Userspace Access
>> Protection.
>>
>> Then subarches will have the possibility to provide their own
>> implementation by providing setup_kuap() and
>> allow/prevent_user_access().
>>
>> Some platforms will need to know the area accessed and whether it is
>> accessed from read, write or both. Therefore source, destination and
>> size are handed over to the two functions.
>>
>> mpe: Rename to allow/prevent rather than unlock/lock, and add
>> read/write wrappers. Drop the 32-bit code for now until we have an
>> implementation for it. Add kuap to pt_regs for 64-bit as well as
>> 32-bit. Don't split strings, use pr_crit_ratelimited().
>>
>> Signed-off-by: Christophe Leroy
>> Signed-off-by: Russell Currey
>> Signed-off-by: Michael Ellerman
>> ---
>> v5: Futex ops need read/write so use allow_user_access() there.
>> Use #ifdef CONFIG_PPC64 in kup.h to fix build errors.
>> Allow subarch to override allow_read/write_from/to_user().
>
> Those little helpers that will just call allow_user_access() when
> distinct read/write handling is not performed looks overkill to me.
>
> Can't the subarch do it by itself based on the nullity of from/to ?
>
> static inline void allow_user_access(void __user *to, const void __user *from,
>				      unsigned long size)
> {
>	if (to && from)
>		set_kuap(0);
>	else if (to)
>		set_kuap(AMR_KUAP_BLOCK_READ);
>	else if (from)
>		set_kuap(AMR_KUAP_BLOCK_WRITE);
> }

You could implement it that way, but it reads better at the call sites if we have:

  allow_write_to_user(uaddr, sizeof(*uaddr));

vs:

  allow_user_access(uaddr, NULL, sizeof(*uaddr));

So I'm inclined to keep them. It should all end up inlined and generate the same code at the end of the day.

cheers
[PATCH] powerpc/dts/fsl: add crypto node alias for B4
crypto node alias is needed by U-boot to identify the node and perform fix-ups, like adding "fsl,sec-era" property.

Signed-off-by: Horia Geantă
---
 arch/powerpc/boot/dts/fsl/b4qds.dtsi | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/boot/dts/fsl/b4qds.dtsi b/arch/powerpc/boot/dts/fsl/b4qds.dtsi
index 999efd3bc167..05be919f3545 100644
--- a/arch/powerpc/boot/dts/fsl/b4qds.dtsi
+++ b/arch/powerpc/boot/dts/fsl/b4qds.dtsi
@@ -40,6 +40,7 @@
 	interrupt-parent = <&mpic>;
 
 	aliases {
+		crypto = &crypto;
 		phy_sgmii_10 = &phy_sgmii_10;
 		phy_sgmii_11 = &phy_sgmii_11;
 		phy_sgmii_1c = &phy_sgmii_1c;
-- 
2.17.1
Re: [PATCH] powerpc: Make some functions static
Mathieu Malaterre writes: > On Tue, Mar 12, 2019 at 10:14 PM Christophe Leroy > wrote: >> >> >> >> Le 12/03/2019 à 21:31, Mathieu Malaterre a écrit : >> > In commit cb9e4d10c448 ("[POWERPC] Add support for 750CL Holly board") >> > new functions were added. Since these functions can be made static, >> > make it so. While doing so, it turns out that holly_power_off and >> > holly_halt are unused, so remove them. >> >> I would have said 'since these functions are only used in this C file, >> make them static'. >> >> I think this could be split in two patches: >> 1/ Remove unused functions, ie holly_halt() and holly_power_off(). >> 2/ Make the other ones static. > > Michael do you want two patches ? That would be better if it's not too much trouble. A patch with a title of "Make some functions static" shouldn't really be deleting functions entirely. cheers
Re: Disable kcov for slb routines.
Mahesh Jagannath Salgaonkar writes: > On 3/14/19 5:13 PM, Michael Ellerman wrote: >> On Mon, 2019-03-04 at 08:25:51 UTC, Mahesh J Salgaonkar wrote: >>> From: Mahesh Salgaonkar >>> >>> The kcov instrumentation inside SLB routines causes duplicate SLB entries >>> to be added resulting into SLB multihit machine checks. >>> Disable kcov instrumentation on slb.o >>> >>> Signed-off-by: Mahesh Salgaonkar >>> Acked-by: Andrew Donnellan >>> Tested-by: Satheesh Rajendran >> >> Applied to powerpc next, thanks. >> >> https://git.kernel.org/powerpc/c/19d6907521b04206676741b26e05a152 >> >> cheers >> > > There was a v2 at http://patchwork.ozlabs.org/patch/1051718/, looks like > v1 got picked up. But I see the applied commit does address Andrew's > comments. Sorry not sure how I missed v2. cheers
Re: [PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()
Laurent Vivier writes: > Hi Michael, > > as it seems good now, could you pick up this patch for merging? I'll start picking up patches for next starting after rc2, so next week. If you think it's a bug fix I can put it into fixes now, but I don't think it's a bug fix is it? cheers
Re: Shift overflow warnings in arch/powerpc/boot/addnote.c on 32-bit builds
Mark Cave-Ayland writes: > Hi all, > > Whilst building the latest git master on my G4 I noticed the following shift > overflow > warnings in the build log for arch/powerpc/boot/addnote.c: > > > arch/powerpc/boot/addnote.c: In function ‘main’: > arch/powerpc/boot/addnote.c:75:47: warning: right shift count >= width of type > [-Wshift-count-overflow] > #define PUT_64BE(off, v)((PUT_32BE((off), (v) >> 32L), \ >^~ > arch/powerpc/boot/addnote.c:72:39: note: in definition of macro ‘PUT_16BE’ > #define PUT_16BE(off, v)(buf[off] = ((v) >> 8) & 0xff, \ >^ > arch/powerpc/boot/addnote.c:75:27: note: in expansion of macro ‘PUT_32BE’ > #define PUT_64BE(off, v)((PUT_32BE((off), (v) >> 32L), \ >^~~~ > arch/powerpc/boot/addnote.c:94:50: note: in expansion of macro ‘PUT_64BE’ > #define PUT_64(off, v) (e_data == ELFDATA2MSB ? PUT_64BE(off, v) : \ > ^~~~ > arch/powerpc/boot/addnote.c:183:3: note: in expansion of macro ‘PUT_64’ >PUT_64(ph + PH_OFFSET, ns); >^~ I don't think there's any situation in which a 32-bit addnote will be run against a 64-bit ELF is there? So I don't think there's an actual bug, but it would be good if we could make the warning go away. cheers
Re: [RESEND PATCH v2] powerpc: mute unused-but-set-variable warnings
Qian Cai writes: > On 3/19/19 5:21 AM, Christophe Leroy wrote: >> Is there a reason for resending ? AFAICS, both are identical and still marked >> new in patchwork: >> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=76055 >> > > "RESEND" because of no maintainer response for more than one week. I don't know who told you to RESEND after a week, but especially at this point in the development cycle a week is *way* too short. And for trivial patches like this I may not get to them for several weeks, I have other problems to fix like time going backward :) In future please check patchwork and then if the patch is still new after several weeks just send a ping in reply to that patch. A full RESEND means I now have two identical patches to deal with in patchwork, which makes more work for me. cheers
Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING
Hi Arnd, On Wed, Mar 20, 2019 at 6:39 PM Arnd Bergmann wrote: > > On Wed, Mar 20, 2019 at 7:41 AM Masahiro Yamada > wrote: > > > It is unclear to me how to fix it. > > That's why I ended up with "depends on !MIPS". > > > > > > MODPOST vmlinux.o > > arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2': > > sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base' > > sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base' > > sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base' > > sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base' > > sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base' > > arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined > > references to `mips_gcr_base' > > > > > > Perhaps, MIPS folks may know how to fix it. > > I would guess like this: > > diff --git a/arch/mips/include/asm/mips-cm.h b/arch/mips/include/asm/mips-cm.h > index 8bc5df49b0e1..a27483fedb7d 100644 > --- a/arch/mips/include/asm/mips-cm.h > +++ b/arch/mips/include/asm/mips-cm.h > @@ -79,7 +79,7 @@ static inline int mips_cm_probe(void) > * > * Returns true if a CM is present in the system, else false. > */ > -static inline bool mips_cm_present(void) > +static __always_inline bool mips_cm_present(void) > { > #ifdef CONFIG_MIPS_CM > return mips_gcr_base != NULL; > @@ -93,7 +93,7 @@ static inline bool mips_cm_present(void) > * > * Returns true if the system implements an L2-only sync region, else false. > */ > -static inline bool mips_cm_has_l2sync(void) > +static __always_inline bool mips_cm_has_l2sync(void) > { > #ifdef CONFIG_MIPS_CM > return mips_cm_l2sync_base != NULL; > Thanks, I applied the above, but I still see undefined reference to `mips_gcr_base' I attached .config to produce this error. I use prebuilt mips-linux-gcc from https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/ -- Best Regards Masahiro Yamada config.gz Description: application/gzip
[PATCH v1 27/27] powerpc/mm: flatten function __find_linux_pte() step 3
__find_linux_pte() is full of if/else which is hard to follow although the handling is pretty simple.

Previous patches left a { } block. This patch removes it.

Signed-off-by: Christophe Leroy
---
 arch/powerpc/mm/pgtable.c | 98 +++
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c1c6d0b79baa..db4a6253df92 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -348,59 +348,59 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 		hpdp = (hugepd_t *)&pgd;
 		goto out_huge;
 	}
-	{
-		/*
-		 * Even if we end up with an unmap, the pgtable will not
-		 * be freed, because we do an rcu free and here we are
-		 * irq disabled
-		 */
-		pdshift = PUD_SHIFT;
-		pudp = pud_offset(&pgd, ea);
-		pud = READ_ONCE(*pudp);
-		if (pud_none(pud))
-			return NULL;
+	/*
+	 * Even if we end up with an unmap, the pgtable will not
+	 * be freed, because we do an rcu free and here we are
+	 * irq disabled
+	 */
+	pdshift = PUD_SHIFT;
+	pudp = pud_offset(&pgd, ea);
+	pud = READ_ONCE(*pudp);
-		if (pud_huge(pud)) {
-			ret_pte = (pte_t *) pudp;
-			goto out;
-		}
-		if (is_hugepd(__hugepd(pud_val(pud)))) {
-			hpdp = (hugepd_t *)&pud;
-			goto out_huge;
-		}
-		pdshift = PMD_SHIFT;
-		pmdp = pmd_offset(&pud, ea);
-		pmd = READ_ONCE(*pmdp);
-		/*
-		 * A hugepage collapse is captured by pmd_none, because
-		 * it mark the pmd none and do a hpte invalidate.
-		 */
-		if (pmd_none(pmd))
-			return NULL;
-
-		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
-			if (is_thp)
-				*is_thp = true;
-			ret_pte = (pte_t *)pmdp;
-			goto out;
-		}
-		/*
-		 * pmd_large check below will handle the swap pmd pte
-		 * we need to do both the check because they are config
-		 * dependent.
-*/ - if (pmd_huge(pmd) || pmd_large(pmd)) { - ret_pte = (pte_t *)pmdp; - goto out; - } - if (is_hugepd(__hugepd(pmd_val(pmd)))) { - hpdp = (hugepd_t *)&pmd; - goto out_huge; - } + if (pud_none(pud)) + return NULL; - return pte_offset_kernel(&pmd, ea); + if (pud_huge(pud)) { + ret_pte = (pte_t *)pudp; + goto out; } + if (is_hugepd(__hugepd(pud_val(pud)))) { + hpdp = (hugepd_t *)&pud; + goto out_huge; + } + pdshift = PMD_SHIFT; + pmdp = pmd_offset(&pud, ea); + pmd = READ_ONCE(*pmdp); + /* +* A hugepage collapse is captured by pmd_none, because +* it mark the pmd none and do a hpte invalidate. +*/ + if (pmd_none(pmd)) + return NULL; + + if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) { + if (is_thp) + *is_thp = true; + ret_pte = (pte_t *)pmdp; + goto out; + } + /* +* pmd_large check below will handle the swap pmd pte +* we need to do both the check because they are config +* dependent. +*/ + if (pmd_huge(pmd) || pmd_large(pmd)) { + ret_pte = (pte_t *)pmdp; + goto out; + } + if (is_hugepd(__hugepd(pmd_val(pmd)))) { + hpdp = (hugepd_t *)&pmd; + goto out_huge; + } + + return pte_offset_kernel(&pmd, ea); + out_huge: if (!hpdp) return NULL; -- 2.13.3
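The end state the three flattening steps converge on can be sketched with a toy multi-level walk: every special case exits early (return NULL, goto out_huge), so the main path reads top to bottom with no nesting. Names and types here (l1/l2/l3, the huge flags) are illustrative stand-ins, not the real powerpc page-table types:

```c
#include <assert.h>
#include <stddef.h>

struct l3 { int val; };
struct l2 { struct l3 *next; int huge; int val; };
struct l1 { struct l2 *next; int huge; int val; };

/* Flattened lookup: no else-ladders, each level either bails out,
 * jumps to the shared huge-entry exit, or falls through to the next
 * level -- the same shape __find_linux_pte() ends up with. */
static int *find_entry(struct l1 *l1e, int *is_huge)
{
	struct l2 *l2e;
	int *ret;

	if (is_huge)
		*is_huge = 0;
	if (!l1e)
		return NULL;
	if (l1e->huge) {		/* leaf folded at the top level */
		ret = &l1e->val;
		goto out_huge;
	}

	l2e = l1e->next;
	if (!l2e)
		return NULL;
	if (l2e->huge) {		/* leaf folded at the middle level */
		ret = &l2e->val;
		goto out_huge;
	}

	if (!l2e->next)
		return NULL;
	return &l2e->next->val;		/* regular bottom-level entry */

out_huge:
	if (is_huge)
		*is_huge = 1;
	return ret;
}
```

The early-exit style also makes it obvious which cases share the out_huge handling, which the original nested braces obscured.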
[PATCH v1 26/27] powerpc/mm: flatten function __find_linux_pte() step 2
__find_linux_pte() is full of if/else which is hard to follow although the handling is pretty simple. Previous patch left { } blocks. This patch removes the first one by shifting its content to the left. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/pgtable.c | 62 +++ 1 file changed, 30 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index d332abeedf0a..c1c6d0b79baa 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -369,39 +369,37 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, hpdp = (hugepd_t *)&pud; goto out_huge; } - { - pdshift = PMD_SHIFT; - pmdp = pmd_offset(&pud, ea); - pmd = READ_ONCE(*pmdp); - /* -* A hugepage collapse is captured by pmd_none, because -* it mark the pmd none and do a hpte invalidate. -*/ - if (pmd_none(pmd)) - return NULL; - - if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) { - if (is_thp) - *is_thp = true; - ret_pte = (pte_t *) pmdp; - goto out; - } - /* -* pmd_large check below will handle the swap pmd pte -* we need to do both the check because they are config -* dependent. -*/ - if (pmd_huge(pmd) || pmd_large(pmd)) { - ret_pte = (pte_t *) pmdp; - goto out; - } - if (is_hugepd(__hugepd(pmd_val(pmd)))) { - hpdp = (hugepd_t *)&pmd; - goto out_huge; - } - - return pte_offset_kernel(&pmd, ea); + pdshift = PMD_SHIFT; + pmdp = pmd_offset(&pud, ea); + pmd = READ_ONCE(*pmdp); + /* +* A hugepage collapse is captured by pmd_none, because +* it mark the pmd none and do a hpte invalidate. +*/ + if (pmd_none(pmd)) + return NULL; + + if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) { + if (is_thp) + *is_thp = true; + ret_pte = (pte_t *)pmdp; + goto out; + } + /* +* pmd_large check below will handle the swap pmd pte +* we need to do both the check because they are config +* dependent. 
+*/ + if (pmd_huge(pmd) || pmd_large(pmd)) { + ret_pte = (pte_t *)pmdp; + goto out; } + if (is_hugepd(__hugepd(pmd_val(pmd)))) { + hpdp = (hugepd_t *)&pmd; + goto out_huge; + } + + return pte_offset_kernel(&pmd, ea); } out_huge: if (!hpdp) -- 2.13.3
[PATCH v1 25/27] powerpc/mm: flatten function __find_linux_pte()
__find_linux_pte() is full of if/else which is hard to follow although the handling is pretty simple. This patch flattens the function by getting rid of as much if/else as possible. In order to ease the review, this is done in two steps. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/pgtable.c | 32 ++-- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 9f4ccd15849f..d332abeedf0a 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -339,12 +339,16 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, */ if (pgd_none(pgd)) return NULL; - else if (pgd_huge(pgd)) { - ret_pte = (pte_t *) pgdp; + + if (pgd_huge(pgd)) { + ret_pte = (pte_t *)pgdp; goto out; - } else if (is_hugepd(__hugepd(pgd_val(pgd)))) + } + if (is_hugepd(__hugepd(pgd_val(pgd)))) { hpdp = (hugepd_t *)&pgd; - else { + goto out_huge; + } + { /* * Even if we end up with an unmap, the pgtable will not * be freed, because we do an rcu free and here we are @@ -356,12 +360,16 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, if (pud_none(pud)) return NULL; - else if (pud_huge(pud)) { + + if (pud_huge(pud)) { ret_pte = (pte_t *) pudp; goto out; - } else if (is_hugepd(__hugepd(pud_val(pud)))) + } + if (is_hugepd(__hugepd(pud_val(pud)))) { hpdp = (hugepd_t *)&pud; - else { + goto out_huge; + } + { pdshift = PMD_SHIFT; pmdp = pmd_offset(&pud, ea); pmd = READ_ONCE(*pmdp); @@ -386,12 +394,16 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, if (pmd_huge(pmd) || pmd_large(pmd)) { ret_pte = (pte_t *) pmdp; goto out; - } else if (is_hugepd(__hugepd(pmd_val(pmd)))) + } + if (is_hugepd(__hugepd(pmd_val(pmd)))) { hpdp = (hugepd_t *)&pmd; - else - return pte_offset_kernel(&pmd, ea); + goto out_huge; + } + + return pte_offset_kernel(&pmd, ea); } } +out_huge: if (!hpdp) return NULL; -- 2.13.3
[PATCH v1 24/27] powerpc: define subarch SLB_ADDR_LIMIT_DEFAULT
This patch defines a subarch specific SLB_ADDR_LIMIT_DEFAULT to remove the #ifdefs around the setup of mm->context.slb_addr_limit Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/slice.h | 2 ++ arch/powerpc/include/asm/nohash/32/slice.h | 2 ++ arch/powerpc/kernel/setup-common.c | 8 +--- arch/powerpc/mm/slice.c| 6 +- 4 files changed, 6 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/slice.h b/arch/powerpc/include/asm/book3s/64/slice.h index af498b0da21a..8da15958dcd1 100644 --- a/arch/powerpc/include/asm/book3s/64/slice.h +++ b/arch/powerpc/include/asm/book3s/64/slice.h @@ -13,6 +13,8 @@ #define SLICE_NUM_HIGH (H_PGTABLE_RANGE >> SLICE_HIGH_SHIFT) #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT) +#define SLB_ADDR_LIMIT_DEFAULT DEFAULT_MAP_WINDOW_USER64 + #else /* CONFIG_PPC_MM_SLICES */ #define get_slice_psize(mm, addr) ((mm)->context.user_psize) diff --git a/arch/powerpc/include/asm/nohash/32/slice.h b/arch/powerpc/include/asm/nohash/32/slice.h index 777d62e40ac0..39eb0154ae2d 100644 --- a/arch/powerpc/include/asm/nohash/32/slice.h +++ b/arch/powerpc/include/asm/nohash/32/slice.h @@ -13,6 +13,8 @@ #define SLICE_NUM_HIGH 0ul #define GET_HIGH_SLICE_INDEX(addr) (addr & 0) +#define SLB_ADDR_LIMIT_DEFAULT DEFAULT_MAP_WINDOW + #endif /* CONFIG_PPC_MM_SLICES */ #endif /* _ASM_POWERPC_NOHASH_32_SLICE_H */ diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 2e5dfb6e0823..af2682d052a2 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -948,14 +948,8 @@ void __init setup_arch(char **cmdline_p) init_mm.brk = klimit; #ifdef CONFIG_PPC_MM_SLICES -#ifdef CONFIG_PPC64 if (!radix_enabled()) - init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64; -#elif defined(CONFIG_PPC_8xx) - init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW; -#else -#error "context.addr_limit not initialized." 
-#endif + init_mm.context.slb_addr_limit = SLB_ADDR_LIMIT_DEFAULT; #endif #ifdef CONFIG_SPAPR_TCE_IOMMU diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 50b1a5528384..64513cf47e5b 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -652,11 +652,7 @@ void slice_init_new_context_exec(struct mm_struct *mm) * case of fork it is just inherited from the mm being * duplicated. */ -#ifdef CONFIG_PPC64 - mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64; -#else - mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW; -#endif + mm->context.slb_addr_limit = SLB_ADDR_LIMIT_DEFAULT; mm->context.user_psize = psize; -- 2.13.3
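The pattern this patch applies: each subarch header defines the same macro name, so the user code becomes a single unconditional assignment instead of an #ifdef ladder with an #error fallback. A minimal sketch, with placeholder values standing in for the real DEFAULT_MAP_WINDOW constants:

```c
#include <assert.h>
#include <stdint.h>

/* Each "subarch header" defines the same macro; here SKETCH_PPC64
 * models which header got included.  The values are placeholders. */
#if defined(SKETCH_PPC64)
# define SLB_ADDR_LIMIT_DEFAULT (1ULL << 48)	/* stand-in for DEFAULT_MAP_WINDOW_USER64 */
#else
# define SLB_ADDR_LIMIT_DEFAULT (1ULL << 32)	/* stand-in for DEFAULT_MAP_WINDOW */
#endif

struct mm_context { uint64_t slb_addr_limit; };

static void init_context(struct mm_context *ctx)
{
	/* was: #ifdef CONFIG_PPC64 ... #elif ... #else #error ... #endif */
	ctx->slb_addr_limit = SLB_ADDR_LIMIT_DEFAULT;
}
```

A side benefit over the replaced code is that adding a new subarch cannot silently fall through: a subarch whose header forgets the define fails to compile at the use site.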
[PATCH v1 23/27] powerpc/mm: remove a couple of #ifdef CONFIG_PPC_64K_PAGES in mm/slice.c
This patch replaces a couple of #ifdef CONFIG_PPC_64K_PAGES with IS_ENABLED(CONFIG_PPC_64K_PAGES) to improve code maintainability. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/slice.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 357d64e14757..50b1a5528384 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -558,14 +558,13 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, newaddr = slice_find_area(mm, len, &potential_mask, psize, topdown, high_limit); -#ifdef CONFIG_PPC_64K_PAGES - if (newaddr == -ENOMEM && psize == MMU_PAGE_64K) { + if (IS_ENABLED(CONFIG_PPC_64K_PAGES) && newaddr == -ENOMEM && + psize == MMU_PAGE_64K) { /* retry the search with 4k-page slices included */ slice_or_mask(&potential_mask, &potential_mask, compat_maskp); newaddr = slice_find_area(mm, len, &potential_mask, psize, topdown, high_limit); } -#endif if (newaddr == -ENOMEM) return -ENOMEM; @@ -731,9 +730,9 @@ int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr, VM_BUG_ON(radix_enabled()); maskp = slice_mask_for_size(&mm->context, psize); -#ifdef CONFIG_PPC_64K_PAGES + /* We need to account for 4k slices too */ - if (psize == MMU_PAGE_64K) { + if (IS_ENABLED(CONFIG_PPC_64K_PAGES) && psize == MMU_PAGE_64K) { const struct slice_mask *compat_maskp; struct slice_mask available; @@ -741,7 +740,6 @@ int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr, slice_or_mask(&available, maskp, compat_maskp); return !slice_check_range_fits(mm, &available, addr, len); } -#endif return !slice_check_range_fits(mm, maskp, addr, len); } -- 2.13.3
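The maintainability win of IS_ENABLED() over #ifdef is that the disabled branch is still parsed and type-checked in every configuration, yet folded away as a constant-false condition. A minimal user-space model of the retry-with-4k-slices hunk above (CONFIG_SKETCH_64K_PAGES and the enum values are stand-ins, and the option is "off" here):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the IS_ENABLED() idiom: the option becomes an
 * ordinary compile-time constant 0 or 1. */
#ifdef CONFIG_SKETCH_64K_PAGES
# define ENABLED_64K_PAGES 1
#else
# define ENABLED_64K_PAGES 0
#endif

enum { PAGE_4K, PAGE_64K };

static long find_area(int psize, bool retry_with_4k)
{
	if (psize == PAGE_64K && !retry_with_4k)
		return -1;			/* simulate first-pass failure */
	return 0x1000;
}

static long get_unmapped_area(int psize)
{
	long addr = find_area(psize, false);

	/* was: #ifdef CONFIG_PPC_64K_PAGES ... #endif -- the branch is
	 * compiled (so it cannot bit-rot) but emitted only when enabled */
	if (ENABLED_64K_PAGES && addr == -1 && psize == PAGE_64K)
		addr = find_area(psize, true);	/* retry including 4k slices */

	return addr;
}
```

With the option disabled the compiler deletes the retry branch entirely, so the generated code matches the old #ifdef version while the source stays visible to every build.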
[PATCH v1 22/27] powerpc/mm: move slice_mask_for_size() into mmu.h
Move slice_mask_for_size() into subarch mmu.h Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/mmu.h | 22 + arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 18 ++ arch/powerpc/mm/slice.c | 36 3 files changed, 36 insertions(+), 40 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index 1ceee000c18d..927e3714b0d8 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -123,7 +123,6 @@ typedef struct { /* NPU NMMU context */ struct npu_context *npu_context; -#ifdef CONFIG_PPC_MM_SLICES /* SLB page size encodings*/ unsigned char low_slices_psize[BITS_PER_LONG / BITS_PER_BYTE]; unsigned char high_slices_psize[SLICE_ARRAY_SIZE]; @@ -136,9 +135,6 @@ typedef struct { struct slice_mask mask_16m; struct slice_mask mask_16g; # endif -#else - u16 sllp; /* SLB page size encoding */ -#endif unsigned long vdso_base; #ifdef CONFIG_PPC_SUBPAGE_PROT struct subpage_prot_table spt; @@ -172,6 +168,24 @@ extern int mmu_vmalloc_psize; extern int mmu_vmemmap_psize; extern int mmu_io_psize; +static inline struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize) +{ +#ifdef CONFIG_PPC_64K_PAGES + if (psize == MMU_PAGE_64K) + return >mask_64k; +#endif + if (psize == MMU_PAGE_4K) + return >mask_4k; +#ifdef CONFIG_HUGETLB_PAGE + if (psize == MMU_PAGE_16M) + return >mask_16m; + if (psize == MMU_PAGE_16G) + return >mask_16g; +#endif + WARN_ON(true); + return NULL; +} + /* MMU initialization */ void mmu_early_init_devtree(void); void hash__early_init_devtree(void); diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index 0a1a3fc54e54..4ba92c48b3a5 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -255,4 +255,22 @@ extern s32 patch__itlbmiss_perf, patch__dtlbmiss_perf; #define mmu_linear_psize MMU_PAGE_8M +#ifndef __ASSEMBLY__ +#ifdef 
CONFIG_PPC_MM_SLICES +static inline struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize) +{ + if (psize == mmu_virtual_psize) + return >mask_base_psize; +#ifdef CONFIG_HUGETLB_PAGE + if (psize == MMU_PAGE_512K) + return >mask_512k; + if (psize == MMU_PAGE_8M) + return >mask_8m; +#endif + WARN_ON(true); + return NULL; +} +#endif +#endif /* !__ASSEMBLY__ */ + #endif /* _ASM_POWERPC_MMU_8XX_H_ */ diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 231fd88d97e2..357d64e14757 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -126,42 +126,6 @@ static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret, __set_bit(i, ret->high_slices); } -#ifdef CONFIG_PPC_BOOK3S_64 -static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize) -{ -#ifdef CONFIG_PPC_64K_PAGES - if (psize == MMU_PAGE_64K) - return >mask_64k; -#endif - if (psize == MMU_PAGE_4K) - return >mask_4k; -#ifdef CONFIG_HUGETLB_PAGE - if (psize == MMU_PAGE_16M) - return >mask_16m; - if (psize == MMU_PAGE_16G) - return >mask_16g; -#endif - WARN_ON(true); - return NULL; -} -#elif defined(CONFIG_PPC_8xx) -static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize) -{ - if (psize == mmu_virtual_psize) - return >mask_base_psize; -#ifdef CONFIG_HUGETLB_PAGE - if (psize == MMU_PAGE_512K) - return >mask_512k; - if (psize == MMU_PAGE_8M) - return >mask_8m; -#endif - WARN_ON(true); - return NULL; -} -#else -#error "Must define the slice masks for page sizes supported by the platform" -#endif - static bool slice_check_range_fits(struct mm_struct *mm, const struct slice_mask *available, unsigned long start, unsigned long len) -- 2.13.3
[PATCH v1 21/27] powerpc/mm: hand a context_t over to slice_mask_for_size() instead of mm_struct
slice_mask_for_size() only uses mm->context, so hand directly a pointer to the context. This will help moving the function in subarch mmu.h in the next patch by avoiding having to include the definition of struct mm_struct Signed-off-by: Christophe Leroy --- arch/powerpc/mm/slice.c | 34 +- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index f98b9e812c62..231fd88d97e2 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -127,33 +127,33 @@ static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret, } #ifdef CONFIG_PPC_BOOK3S_64 -static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize) +static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize) { #ifdef CONFIG_PPC_64K_PAGES if (psize == MMU_PAGE_64K) - return >context.mask_64k; + return >mask_64k; #endif if (psize == MMU_PAGE_4K) - return >context.mask_4k; + return >mask_4k; #ifdef CONFIG_HUGETLB_PAGE if (psize == MMU_PAGE_16M) - return >context.mask_16m; + return >mask_16m; if (psize == MMU_PAGE_16G) - return >context.mask_16g; + return >mask_16g; #endif WARN_ON(true); return NULL; } #elif defined(CONFIG_PPC_8xx) -static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize) +static struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize) { if (psize == mmu_virtual_psize) - return >context.mask_base_psize; + return >mask_base_psize; #ifdef CONFIG_HUGETLB_PAGE if (psize == MMU_PAGE_512K) - return >context.mask_512k; + return >mask_512k; if (psize == MMU_PAGE_8M) - return >context.mask_8m; + return >mask_8m; #endif WARN_ON(true); return NULL; @@ -221,7 +221,7 @@ static void slice_convert(struct mm_struct *mm, unsigned long i, flags; int old_psize; - psize_mask = slice_mask_for_size(mm, psize); + psize_mask = slice_mask_for_size(>context, psize); /* We need to use a spinlock here to protect against * concurrent 64k -> 4k demotion ... 
@@ -238,7 +238,7 @@ static void slice_convert(struct mm_struct *mm, /* Update the slice_mask */ old_psize = (lpsizes[index] >> (mask_index * 4)) & 0xf; - old_mask = slice_mask_for_size(mm, old_psize); + old_mask = slice_mask_for_size(>context, old_psize); old_mask->low_slices &= ~(1u << i); psize_mask->low_slices |= 1u << i; @@ -257,7 +257,7 @@ static void slice_convert(struct mm_struct *mm, /* Update the slice_mask */ old_psize = (hpsizes[index] >> (mask_index * 4)) & 0xf; - old_mask = slice_mask_for_size(mm, old_psize); + old_mask = slice_mask_for_size(>context, old_psize); __clear_bit(i, old_mask->high_slices); __set_bit(i, psize_mask->high_slices); @@ -504,7 +504,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, /* First make up a "good" mask of slices that have the right size * already */ - maskp = slice_mask_for_size(mm, psize); + maskp = slice_mask_for_size(>context, psize); /* * Here "good" means slices that are already the right page size, @@ -531,7 +531,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, * a pointer to good mask for the next code to use. */ if (IS_ENABLED(CONFIG_PPC_64K_PAGES) && psize == MMU_PAGE_64K) { - compat_maskp = slice_mask_for_size(mm, MMU_PAGE_4K); + compat_maskp = slice_mask_for_size(>context, MMU_PAGE_4K); if (fixed) slice_or_mask(_mask, maskp, compat_maskp); else @@ -709,7 +709,7 @@ void slice_init_new_context_exec(struct mm_struct *mm) /* * Slice mask cache starts zeroed, fill the default size cache. 
*/ - mask = slice_mask_for_size(mm, psize); + mask = slice_mask_for_size(>context, psize); mask->low_slices = ~0UL; if (SLICE_NUM_HIGH) bitmap_fill(mask->high_slices, SLICE_NUM_HIGH); @@ -766,14 +766,14 @@ int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr, VM_BUG_ON(radix_enabled()); - maskp = slice_mask_for_size(mm, psize); + maskp = slice_mask_for_size(>context, psize); #ifdef CONFIG_PPC_64K_PAGES /* We need to account for 4k slices too */ if (psize == MMU_PAGE_64K) { const struct slice_mask *compat_maskp; struct slice_mask available; - compat_maskp =
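The refactor in this patch is a narrowing of the parameter type: the helper only ever touched mm->context, so handing it the context directly means its declaration can later live in a header that need not know struct mm_struct at all. A toy sketch of the shape (types and field names are illustrative, not the real slice masks):

```c
#include <assert.h>

typedef struct {
	unsigned long mask_4k;
	unsigned long mask_64k;
} mm_context_t;

struct mm_struct {
	int users;		/* heavyweight fields the helper never needs */
	mm_context_t context;
};

/* before: slice_mask_for_size(struct mm_struct *mm, int psize),
 * which forced every header declaring it to know struct mm_struct */
static unsigned long *slice_mask_for_size(mm_context_t *ctx, int psize)
{
	return psize == 64 ? &ctx->mask_64k : &ctx->mask_4k;
}

static unsigned long read_mask(struct mm_struct *mm, int psize)
{
	/* callers simply add &...->context at each call site */
	return *slice_mask_for_size(&mm->context, psize);
}
```

This is the standard preparation step before moving a function into a low-level header: depend on the smallest type that carries the data you use.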
[PATCH v1 19/27] powerpc/mm: drop slice DEBUG
slice is now an improved functionality. Drop the DEBUG stuff. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/slice.c | 62 - 1 file changed, 4 insertions(+), 58 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 011d470ea340..99983dc4e484 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -41,28 +41,6 @@ static DEFINE_SPINLOCK(slice_convert_lock); -#ifdef DEBUG -int _slice_debug = 1; - -static void slice_print_mask(const char *label, const struct slice_mask *mask) -{ - if (!_slice_debug) - return; - pr_devel("%s low_slice: %*pbl\n", label, - (int)SLICE_NUM_LOW, &mask->low_slices); - pr_devel("%s high_slice: %*pbl\n", label, - (int)SLICE_NUM_HIGH, mask->high_slices); -} - -#define slice_dbg(fmt...) do { if (_slice_debug) pr_devel(fmt); } while (0) - -#else - -static void slice_print_mask(const char *label, const struct slice_mask *mask) {} -#define slice_dbg(fmt...) - -#endif - static inline bool slice_addr_is_low(unsigned long addr) { u64 tmp = (u64)addr; @@ -245,9 +223,6 @@ static void slice_convert(struct mm_struct *mm, unsigned long i, flags; int old_psize; - slice_dbg("slice_convert(mm=%p, psize=%d)\n", mm, psize); - slice_print_mask(" mask", mask); - psize_mask = slice_mask_for_size(mm, psize); /* We need to use a spinlock here to protect against @@ -293,10 +268,6 @@ static void slice_convert(struct mm_struct *mm, (((unsigned long)psize) << (mask_index * 4)); } - slice_dbg(" lsps=%lx, hsps=%lx\n", - (unsigned long)mm->context.low_slices_psize, - (unsigned long)mm->context.high_slices_psize); - spin_unlock_irqrestore(&slice_convert_lock, flags); copro_flush_all_slbs(mm); @@ -523,14 +494,9 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, BUG_ON(mm->context.slb_addr_limit == 0); VM_BUG_ON(radix_enabled()); - slice_dbg("slice_get_unmapped_area(mm=%p, psize=%d...\n", mm, psize); - slice_dbg(" addr=%lx, len=%lx, flags=%lx, topdown=%d\n", - addr, len, flags, topdown); - /* If hint, make sure it 
matches our alignment restrictions */ if (!fixed && addr) { addr = _ALIGN_UP(addr, page_size); - slice_dbg(" aligned addr=%lx\n", addr); /* Ignore hint if it's too large or overlaps a VMA */ if (addr > high_limit - len || addr < mmap_min_addr || !slice_area_is_free(mm, addr, len)) @@ -576,17 +542,12 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, slice_copy_mask(_mask, maskp); } - slice_print_mask(" good_mask", _mask); - if (compat_maskp) - slice_print_mask(" compat_mask", compat_maskp); - /* First check hint if it's valid or if we have MAP_FIXED */ if (addr != 0 || fixed) { /* Check if we fit in the good mask. If we do, we just return, * nothing else to do */ if (slice_check_range_fits(mm, _mask, addr, len)) { - slice_dbg(" fits good !\n"); newaddr = addr; goto return_addr; } @@ -596,13 +557,10 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, */ newaddr = slice_find_area(mm, len, _mask, psize, topdown, high_limit); - if (newaddr != -ENOMEM) { - /* Found within the good mask, we don't have to setup, -* we thus return directly -*/ - slice_dbg(" found area at 0x%lx\n", newaddr); + + /* Found within good mask, don't have to setup, thus return directly */ + if (newaddr != -ENOMEM) goto return_addr; - } } /* * We don't fit in the good mask, check what other slices are @@ -610,11 +568,9 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, */ slice_mask_for_free(mm, _mask, high_limit); slice_or_mask(_mask, _mask, _mask); - slice_print_mask(" potential", _mask); if (addr != 0 || fixed) { if (slice_check_range_fits(mm, _mask, addr, len)) { - slice_dbg(" fits potential !\n"); newaddr = addr; goto convert; } @@ -624,18 +580,14 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, if (fixed) return -EBUSY; - slice_dbg(" search...\n"); - /* If we
[PATCH v1 20/27] powerpc/mm: remove unnecessary #ifdef CONFIG_PPC64
For PPC32 that's a noop, but gcc is smart enough to ignore it. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/slice.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 99983dc4e484..f98b9e812c62 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -96,13 +96,11 @@ static int slice_high_has_vma(struct mm_struct *mm, unsigned long slice) unsigned long start = slice << SLICE_HIGH_SHIFT; unsigned long end = start + (1ul << SLICE_HIGH_SHIFT); -#ifdef CONFIG_PPC64 /* Hack, so that each addresses is controlled by exactly one * of the high or low area bitmaps, the first high area starts * at 4GB, not 0 */ if (start == 0) - start = SLICE_LOW_TOP; -#endif + start = (unsigned long)SLICE_LOW_TOP; return !slice_area_is_free(mm, start, end - start); } -- 2.13.3
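Why this is a noop on PPC32: SLICE_LOW_TOP is 4GB, which truncates to 0 when cast to a 32-bit unsigned long, so the guarded statement assigns 0 to a variable already known to be 0 and the compiler deletes it. A sketch of the arithmetic, using fixed-width types to model the two word sizes (the SLICE_LOW_TOP value matches the series; the function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define SLICE_LOW_TOP 0x100000000ull	/* 4GB */

/* uint32_t stands in for a 32-bit unsigned long: the cast yields 0,
 * so `if (start == 0) start = 0;` is dead code gcc removes. */
static uint32_t fixup_start_32(uint32_t start)
{
	if (start == 0)
		start = (uint32_t)SLICE_LOW_TOP;
	return start;
}

/* On 64-bit the same source line does real work: the first high area
 * starts at 4GB, not 0. */
static uint64_t fixup_start_64(uint64_t start)
{
	if (start == 0)
		start = (uint64_t)SLICE_LOW_TOP;
	return start;
}
```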
[PATCH v1 18/27] powerpc/mm: cleanup remaining ifdef mess in hugetlbpage.c
Only 3 subarches support huge pages. So when it is either 2 of them, it is not the third one. And mmu_has_feature() is known by all subarches so IS_ENABLED() can be used instead of #ifdef. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/hugetlbpage.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index dd62006e1243..a463ebf276b6 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -226,7 +226,7 @@ int __init alloc_bootmem_huge_page(struct hstate *h) return __alloc_bootmem_huge_page(h); } -#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx) +#ifndef CONFIG_PPC_BOOK3S_64 #define HUGEPD_FREELIST_SIZE \ ((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t)) @@ -596,10 +596,10 @@ static int __init hugetlbpage_init(void) return 0; } -#if !defined(CONFIG_PPC_FSL_BOOK3E) && !defined(CONFIG_PPC_8xx) - if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE)) + if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled() && + !mmu_has_feature(MMU_FTR_16M_PAGE)) return -ENODEV; -#endif + for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { unsigned shift; unsigned pdshift; @@ -637,10 +637,8 @@ static int __init hugetlbpage_init(void) pgtable_cache_add(PTE_INDEX_SIZE); else if (pdshift > shift) pgtable_cache_add(pdshift - shift); -#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx) - else + else if (IS_ENABLED(CONFIG_PPC_FSL_BOOK3E) || IS_ENABLED(CONFIG_PPC_8xx)) pgtable_cache_add(PTE_T_ORDER); -#endif } if (IS_ENABLED(HUGETLB_PAGE_SIZE_VARIABLE)) -- 2.13.3
[PATCH v1 17/27] powerpc/mm: cleanup HPAGE_SHIFT setup
Only book3s/64 may select the default among several HPAGE_SHIFT values at runtime. 8xx always defines 512K pages as the default; FSL_BOOK3E always defines 4M pages as the default. This patch limits HUGETLB_PAGE_SIZE_VARIABLE to book3s/64 and moves the definitions into subarch files. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/hugetlb.h | 2 ++ arch/powerpc/include/asm/page.h | 11 --- arch/powerpc/mm/hugetlbpage-hash64.c | 16 arch/powerpc/mm/hugetlbpage.c| 23 +++ 5 files changed, 30 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 5d8e692d6470..7815eb0cc2a5 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -390,7 +390,7 @@ source "kernel/Kconfig.hz" config HUGETLB_PAGE_SIZE_VARIABLE bool - depends on HUGETLB_PAGE + depends on HUGETLB_PAGE && PPC_BOOK3S_64 default y config MATH_EMULATION diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h index 84598c6b0959..20a101046cff 100644 --- a/arch/powerpc/include/asm/hugetlb.h +++ b/arch/powerpc/include/asm/hugetlb.h @@ -15,6 +15,8 @@ extern bool hugetlb_disabled; +void hugetlbpage_init_default(void); + void flush_dcache_icache_hugepage(struct page *page); int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr, diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index 0c11a7513919..eef10fe0e06f 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -28,10 +28,15 @@ #define PAGE_SIZE (ASM_CONST(1) << PAGE_SHIFT) #ifndef __ASSEMBLY__ -#ifdef CONFIG_HUGETLB_PAGE -extern unsigned int HPAGE_SHIFT; -#else +#ifndef CONFIG_HUGETLB_PAGE #define HPAGE_SHIFT PAGE_SHIFT +#elif defined(CONFIG_PPC_BOOK3S_64) +extern unsigned int hpage_shift; +#define HPAGE_SHIFT hpage_shift +#elif defined(CONFIG_PPC_8xx) +#define HPAGE_SHIFT 19 /* 512k pages */ +#elif defined(CONFIG_PPC_FSL_BOOK3E) +#define HPAGE_SHIFT 22 /* 4M pages */ #endif #define HPAGE_SIZE ((1UL) << 
HPAGE_SHIFT) #define HPAGE_MASK (~(HPAGE_SIZE - 1)) diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c index b0d9209d9a86..7a58204c3688 100644 --- a/arch/powerpc/mm/hugetlbpage-hash64.c +++ b/arch/powerpc/mm/hugetlbpage-hash64.c @@ -15,6 +15,9 @@ #include #include +unsigned int hpage_shift; +EXPORT_SYMBOL(hpage_shift); + extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn, unsigned long pa, unsigned long rlags, unsigned long vflags, int psize, int ssize); @@ -145,3 +148,16 @@ void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr old_pte, pte); set_huge_pte_at(vma->vm_mm, addr, ptep, pte); } + +void hugetlbpage_init_default(void) +{ + /* Set default large page size. Currently, we pick 16M or 1M +* depending on what is available +*/ + if (mmu_psize_defs[MMU_PAGE_16M].shift) + hpage_shift = mmu_psize_defs[MMU_PAGE_16M].shift; + else if (mmu_psize_defs[MMU_PAGE_1M].shift) + hpage_shift = mmu_psize_defs[MMU_PAGE_1M].shift; + else if (mmu_psize_defs[MMU_PAGE_2M].shift) + hpage_shift = mmu_psize_defs[MMU_PAGE_2M].shift; +} diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 3b449c9d4e47..dd62006e1243 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -28,9 +28,6 @@ bool hugetlb_disabled = false; -unsigned int HPAGE_SHIFT; -EXPORT_SYMBOL(HPAGE_SHIFT); - #define hugepd_none(hpd) (hpd_val(hpd) == 0) #define PTE_T_ORDER(__builtin_ffs(sizeof(pte_t)) - __builtin_ffs(sizeof(void *))) @@ -646,23 +643,9 @@ static int __init hugetlbpage_init(void) #endif } -#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx) - /* Default hpage size = 4M on FSL_BOOK3E and 512k on 8xx */ - if (mmu_psize_defs[MMU_PAGE_4M].shift) - HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift; - else if (mmu_psize_defs[MMU_PAGE_512K].shift) - HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_512K].shift; -#else - /* Set default large page size. 
Currently, we pick 16M or 1M -* depending on what is available -*/ - if (mmu_psize_defs[MMU_PAGE_16M].shift) - HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift; - else if (mmu_psize_defs[MMU_PAGE_1M].shift) - HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift; - else if (mmu_psize_defs[MMU_PAGE_2M].shift) - HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift; -#endif + if (IS_ENABLED(HUGETLB_PAGE_SIZE_VARIABLE)) +
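The runtime half of the split -- the book3s/64 default selection moved into hugetlbpage_init_default() -- can be sketched in isolation. mmu_psize_defs here is a toy table (a zero shift meaning "size unsupported"), not the real powerpc one:

```c
#include <assert.h>

enum { MMU_PAGE_1M, MMU_PAGE_2M, MMU_PAGE_16M, MMU_PAGE_COUNT };

struct mmu_psize_def { unsigned int shift; };	/* 0 => unsupported */

static struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];

static unsigned int hpage_shift;

/* Pick the default huge page size from what the MMU reports, in
 * preference order 16M, 1M, 2M -- mirroring the hunk above. */
static void hugetlbpage_init_default(void)
{
	if (mmu_psize_defs[MMU_PAGE_16M].shift)
		hpage_shift = mmu_psize_defs[MMU_PAGE_16M].shift;
	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
		hpage_shift = mmu_psize_defs[MMU_PAGE_1M].shift;
	else if (mmu_psize_defs[MMU_PAGE_2M].shift)
		hpage_shift = mmu_psize_defs[MMU_PAGE_2M].shift;
}
```

The subarches that do not need this (8xx, FSL_BOOK3E) get a plain compile-time #define instead, which is the point of restricting HUGETLB_PAGE_SIZE_VARIABLE to book3s/64.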
[PATCH v1 16/27] powerpc/mm: move hugetlb_disabled into asm/hugetlb.h
No need to have this in asm/page.h, move it into asm/hugetlb.h Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/hugetlb.h | 2 ++ arch/powerpc/include/asm/page.h| 1 - arch/powerpc/kernel/fadump.c | 1 + arch/powerpc/mm/hash_utils_64.c| 1 + 4 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h index fd5c0873a57d..84598c6b0959 100644 --- a/arch/powerpc/include/asm/hugetlb.h +++ b/arch/powerpc/include/asm/hugetlb.h @@ -13,6 +13,8 @@ #include #endif /* CONFIG_PPC_BOOK3S_64 */ +extern bool hugetlb_disabled; + void flush_dcache_icache_hugepage(struct page *page); int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr, diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index ed870468ef6f..0c11a7513919 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -29,7 +29,6 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_HUGETLB_PAGE -extern bool hugetlb_disabled; extern unsigned int HPAGE_SHIFT; #else #define HPAGE_SHIFT PAGE_SHIFT diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index 45a8d0be1c96..25f063f56ec5 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 0a4f939a8161..16ce13af6b9c 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include -- 2.13.3
[PATCH v1 15/27] powerpc/mm: cleanup ifdef mess in add_huge_page_size()
Introduce a subarch specific helper check_and_get_huge_psize() to check the huge page sizes and cleanup the ifdef mess in add_huge_page_size() Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/hugetlb.h | 27 + arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 5 arch/powerpc/include/asm/nohash/hugetlb-book3e.h | 8 + arch/powerpc/mm/hugetlbpage.c| 37 ++-- 4 files changed, 43 insertions(+), 34 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h index 177c81079209..4522a56a6269 100644 --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h @@ -108,4 +108,31 @@ static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshi void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr); +static inline int check_and_get_huge_psize(int shift) +{ + int mmu_psize; + + if (shift > SLICE_HIGH_SHIFT) + return -EINVAL; + + mmu_psize = shift_to_mmu_psize(shift); + + /* +* We need to make sure that for different page sizes reported by +* firmware we only add hugetlb support for page sizes that can be +* supported by linux page table layout. +* For now we have +* Radix: 2M and 1G +* Hash: 16M and 16G +*/ + if (radix_enabled()) { + if (mmu_psize != MMU_PAGE_2M && mmu_psize != MMU_PAGE_1G) + return -EINVAL; + } else { + if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) + return -EINVAL; + } + return mmu_psize; +} + #endif diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index eb90c2db7601..a442b499d5c8 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -37,4 +37,9 @@ static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshi (pshift == PAGE_SHIFT_8M ? 
_PMD_PAGE_8M : _PMD_PAGE_512K)); } +static inline int check_and_get_huge_psize(int shift) +{ + return shift_to_mmu_psize(shift); +} + #endif /* _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H */ diff --git a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h index 51439bcfe313..ecd8694cb229 100644 --- a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h +++ b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h @@ -34,4 +34,12 @@ static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshi *hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | pshift); } +static inline int check_and_get_huge_psize(int shift) +{ + if (shift & 1) /* Not a power of 4 */ + return -EINVAL; + + return shift_to_mmu_psize(shift); +} + #endif /* _ASM_POWERPC_NOHASH_HUGETLB_BOOK3E_H */ diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 87358b89513e..3b449c9d4e47 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -549,13 +549,6 @@ unsigned long vma_mmu_pagesize(struct vm_area_struct *vma) return vma_kernel_pagesize(vma); } -static inline bool is_power_of_4(unsigned long x) -{ - if (is_power_of_2(x)) - return (__ilog2(x) % 2) ? false : true; - return false; -} - static int __init add_huge_page_size(unsigned long long size) { int shift = __ffs(size); @@ -563,37 +556,13 @@ static int __init add_huge_page_size(unsigned long long size) /* Check that it is a page size supported by the hardware and * that it fits within pagetable and slice limits. 
*/ - if (size <= PAGE_SIZE) - return -EINVAL; -#if defined(CONFIG_PPC_FSL_BOOK3E) - if (!is_power_of_4(size)) + if (size <= PAGE_SIZE || !is_power_of_2(size)) return -EINVAL; -#elif !defined(CONFIG_PPC_8xx) - if (!is_power_of_2(size) || (shift > SLICE_HIGH_SHIFT)) - return -EINVAL; -#endif - if ((mmu_psize = shift_to_mmu_psize(shift)) < 0) + mmu_psize = check_and_get_huge_psize(size); + if (mmu_psize < 0) return -EINVAL; -#ifdef CONFIG_PPC_BOOK3S_64 - /* -* We need to make sure that for different page sizes reported by -* firmware we only add hugetlb support for page sizes that can be -* supported by linux page table layout. -* For now we have -* Radix: 2M and 1G -* Hash: 16M and 16G -*/ - if (radix_enabled()) { - if (mmu_psize != MMU_PAGE_2M && mmu_psize != MMU_PAGE_1G) - return -EINVAL; - } else { - if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) - return -EINVAL;
[PATCH v1 14/27] powerpc/mm: no slice for nohash/64
Only nohash/32 and book3s/64 support mm slices. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/64/slice.h | 7 --- arch/powerpc/include/asm/slice.h | 4 +--- arch/powerpc/platforms/Kconfig.cputype | 4 3 files changed, 5 insertions(+), 10 deletions(-) delete mode 100644 arch/powerpc/include/asm/nohash/64/slice.h diff --git a/arch/powerpc/include/asm/nohash/64/slice.h b/arch/powerpc/include/asm/nohash/64/slice.h deleted file mode 100644 index 30adfdd4afde.. --- a/arch/powerpc/include/asm/nohash/64/slice.h +++ /dev/null @@ -1,7 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _ASM_POWERPC_NOHASH_64_SLICE_H -#define _ASM_POWERPC_NOHASH_64_SLICE_H - -#define get_slice_psize(mm, addr) MMU_PAGE_4K - -#endif /* _ASM_POWERPC_NOHASH_64_SLICE_H */ diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h index d85c85422fdf..49d950a14e25 100644 --- a/arch/powerpc/include/asm/slice.h +++ b/arch/powerpc/include/asm/slice.h @@ -4,9 +4,7 @@ #ifdef CONFIG_PPC_BOOK3S_64 #include -#elif defined(CONFIG_PPC64) -#include -#elif defined(CONFIG_PPC_MMU_NOHASH) +#elif defined(CONFIG_PPC_MMU_NOHASH_32) #include #endif diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 842b2c7e156a..51ceeb046867 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -354,6 +354,10 @@ config PPC_MMU_NOHASH def_bool y depends on !PPC_BOOK3S +config PPC_MMU_NOHASH_32 + def_bool y + depends on PPC_MMU_NOHASH && PPC32 + config PPC_BOOK3E_MMU def_bool y depends on FSL_BOOKE || PPC_BOOK3E -- 2.13.3
[PATCH v1 13/27] powerpc/mm: define get_slice_psize() all the time
get_slice_psize() can be defined regardless of CONFIG_PPC_MM_SLICES to avoid ifdefs Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/slice.h | 4 arch/powerpc/mm/hugetlbpage.c| 4 +--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h index 44816cbc4198..d85c85422fdf 100644 --- a/arch/powerpc/include/asm/slice.h +++ b/arch/powerpc/include/asm/slice.h @@ -38,6 +38,10 @@ void slice_setup_new_exec(void); static inline void slice_init_new_context_exec(struct mm_struct *mm) {} +#ifndef get_slice_psize +#define get_slice_psize(mm, addr) MMU_PAGE_4K +#endif + #endif /* CONFIG_PPC_MM_SLICES */ #endif /* __ASSEMBLY__ */ diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 26a57ebaf5cf..87358b89513e 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -540,14 +540,12 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long vma_mmu_pagesize(struct vm_area_struct *vma) { -#ifdef CONFIG_PPC_MM_SLICES /* With radix we don't use slice, so derive it from vma*/ - if (!radix_enabled()) { + if (IS_ENABLED(CONFIG_PPC_MM_SLICES) && !radix_enabled()) { unsigned int psize = get_slice_psize(vma->vm_mm, vma->vm_start); return 1UL << mmu_psize_to_shift(psize); } -#endif return vma_kernel_pagesize(vma); } -- 2.13.3
[PATCH v1 11/27] powerpc/mm: split asm/hugetlb.h into dedicated subarch files
Three subarches support hugepages: - fsl book3e - book3s/64 - 8xx This patch splits asm/hugetlb.h to reduce the #ifdef mess. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/hugetlb.h | 41 +++ arch/powerpc/include/asm/hugetlb.h | 89 ++-- arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 32 + arch/powerpc/include/asm/nohash/hugetlb-book3e.h | 31 + 4 files changed, 108 insertions(+), 85 deletions(-) create mode 100644 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h create mode 100644 arch/powerpc/include/asm/nohash/hugetlb-book3e.h diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h index ec2a55a553c7..2f9cf2bc601c 100644 --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h @@ -62,4 +62,45 @@ extern pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t new_pte); +/* + * This should work for other subarchs too. 
But right now we use the + * new format only for 64bit book3s + */ +static inline pte_t *hugepd_page(hugepd_t hpd) +{ + if (WARN_ON(!hugepd_ok(hpd))) + return NULL; + /* +* We have only four bits to encode, MMU page size +*/ + BUILD_BUG_ON((MMU_PAGE_COUNT - 1) > 0xf); + return __va(hpd_val(hpd) & HUGEPD_ADDR_MASK); +} + +static inline unsigned int hugepd_mmu_psize(hugepd_t hpd) +{ + return (hpd_val(hpd) & HUGEPD_SHIFT_MASK) >> 2; +} + +static inline unsigned int hugepd_shift(hugepd_t hpd) +{ + return mmu_psize_to_shift(hugepd_mmu_psize(hpd)); +} +static inline void flush_hugetlb_page(struct vm_area_struct *vma, + unsigned long vmaddr) +{ + if (radix_enabled()) + return radix__flush_hugetlb_page(vma, vmaddr); +} + +static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, + unsigned int pdshift) +{ + unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(hpd); + + return hugepd_page(hpd) + idx; +} + +void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr); + #endif diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h index 48c29686c78e..fd5c0873a57d 100644 --- a/arch/powerpc/include/asm/hugetlb.h +++ b/arch/powerpc/include/asm/hugetlb.h @@ -6,85 +6,13 @@ #include #ifdef CONFIG_PPC_BOOK3S_64 - #include -/* - * This should work for other subarchs too. 
But right now we use the - * new format only for 64bit book3s - */ -static inline pte_t *hugepd_page(hugepd_t hpd) -{ - if (WARN_ON(!hugepd_ok(hpd))) - return NULL; - /* -* We have only four bits to encode, MMU page size -*/ - BUILD_BUG_ON((MMU_PAGE_COUNT - 1) > 0xf); - return __va(hpd_val(hpd) & HUGEPD_ADDR_MASK); -} - -static inline unsigned int hugepd_mmu_psize(hugepd_t hpd) -{ - return (hpd_val(hpd) & HUGEPD_SHIFT_MASK) >> 2; -} - -static inline unsigned int hugepd_shift(hugepd_t hpd) -{ - return mmu_psize_to_shift(hugepd_mmu_psize(hpd)); -} -static inline void flush_hugetlb_page(struct vm_area_struct *vma, - unsigned long vmaddr) -{ - if (radix_enabled()) - return radix__flush_hugetlb_page(vma, vmaddr); -} - -#else - -static inline pte_t *hugepd_page(hugepd_t hpd) -{ - if (WARN_ON(!hugepd_ok(hpd))) - return NULL; -#ifdef CONFIG_PPC_8xx - return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK); -#else - return (pte_t *)((hpd_val(hpd) & - ~HUGEPD_SHIFT_MASK) | PD_HUGE); -#endif -} - -static inline unsigned int hugepd_shift(hugepd_t hpd) -{ -#ifdef CONFIG_PPC_8xx - return ((hpd_val(hpd) & _PMD_PAGE_MASK) >> 1) + 17; -#else - return hpd_val(hpd) & HUGEPD_SHIFT_MASK; -#endif -} - +#elif defined(CONFIG_PPC_FSL_BOOK3E) +#include +#elif defined(CONFIG_PPC_8xx) +#include #endif /* CONFIG_PPC_BOOK3S_64 */ - -static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, - unsigned pdshift) -{ - /* -* On FSL BookE, we have multiple higher-level table entries that -* point to the same hugepte. Just use the first one since they're all -* identical. So for that case, idx=0. -*/ - unsigned long idx = 0; - - pte_t *dir = hugepd_page(hpd); -#ifdef CONFIG_PPC_8xx - idx = (addr & ((1UL << pdshift) - 1)) >> PAGE_SHIFT; -#elif !defined(CONFIG_PPC_FSL_BOOK3E) - idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(hpd); -#endif - - return dir + idx; -} - void flush_dcache_icache_hugepage(struct page *page); int
[PATCH v1 12/27] powerpc/mm: add a helper to populate hugepd
This patch adds a subarch helper to populate hugepd. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/hugetlb.h | 5 + arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 8 arch/powerpc/include/asm/nohash/hugetlb-book3e.h | 6 ++ arch/powerpc/mm/hugetlbpage.c| 20 +--- 4 files changed, 20 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h index 2f9cf2bc601c..177c81079209 100644 --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h @@ -101,6 +101,11 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, return hugepd_page(hpd) + idx; } +static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift) +{ + *hpdp = __hugepd(__pa(new) | HUGEPD_VAL_BITS | (shift_to_mmu_psize(pshift) << 2)); +} + void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr); #endif diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index 209e6a219835..eb90c2db7601 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -2,6 +2,8 @@ #ifndef _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H #define _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H +#define PAGE_SHIFT_8M 23 + static inline pte_t *hugepd_page(hugepd_t hpd) { if (WARN_ON(!hugepd_ok(hpd))) @@ -29,4 +31,10 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma, flush_tlb_page(vma, vmaddr); } +static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift) +{ + *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | +(pshift == PAGE_SHIFT_8M ? 
_PMD_PAGE_8M : _PMD_PAGE_512K)); +} + #endif /* _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H */ diff --git a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h index e94f1cd048ee..51439bcfe313 100644 --- a/arch/powerpc/include/asm/nohash/hugetlb-book3e.h +++ b/arch/powerpc/include/asm/nohash/hugetlb-book3e.h @@ -28,4 +28,10 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr); +static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift) +{ + /* We use the old format for PPC_FSL_BOOK3E */ + *hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | pshift); +} + #endif /* _ASM_POWERPC_NOHASH_HUGETLB_BOOK3E_H */ diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 29d1568c7775..26a57ebaf5cf 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -26,12 +26,6 @@ #include #include -#define PAGE_SHIFT_64K 16 -#define PAGE_SHIFT_512K19 -#define PAGE_SHIFT_8M 23 -#define PAGE_SHIFT_16M 24 -#define PAGE_SHIFT_16G 34 - bool hugetlb_disabled = false; unsigned int HPAGE_SHIFT; @@ -95,19 +89,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp, for (i = 0; i < num_hugepd; i++, hpdp++) { if (unlikely(!hugepd_none(*hpdp))) break; - else { -#ifdef CONFIG_PPC_BOOK3S_64 - *hpdp = __hugepd(__pa(new) | HUGEPD_VAL_BITS | -(shift_to_mmu_psize(pshift) << 2)); -#elif defined(CONFIG_PPC_8xx) - *hpdp = __hugepd(__pa(new) | _PMD_USER | -(pshift == PAGE_SHIFT_8M ? _PMD_PAGE_8M : - _PMD_PAGE_512K) | _PMD_PRESENT); -#else - /* We use the old format for PPC_FSL_BOOK3E */ - *hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | pshift); -#endif - } + hugepd_populate(hpdp, new, pshift); } /* If we bailed from the for loop early, an error occurred, clean up */ if (i < num_hugepd) { -- 2.13.3
[PATCH v1 10/27] powerpc/mm: make gup_hugepte() static
gup_huge_pd() is the only user of gup_hugepte() and it is located in the same file. This patch moves gup_huge_pd() after gup_hugepte() and makes gup_hugepte() static. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/pgtable.h | 3 --- arch/powerpc/mm/hugetlbpage.c | 38 +++--- 2 files changed, 19 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 505550fb2935..c51846da41a7 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -89,9 +89,6 @@ extern void paging_init(void); */ extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t *); -extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, - unsigned long end, int write, - struct page **pages, int *nr); #ifndef CONFIG_TRANSPARENT_HUGEPAGE #define pmd_large(pmd) 0 #endif diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 6d9751b188c1..29d1568c7775 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -539,23 +539,6 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, return (__boundary - 1 < end - 1) ? 
__boundary : end; } -int gup_huge_pd(hugepd_t hugepd, unsigned long addr, unsigned pdshift, - unsigned long end, int write, struct page **pages, int *nr) -{ - pte_t *ptep; - unsigned long sz = 1UL << hugepd_shift(hugepd); - unsigned long next; - - ptep = hugepte_offset(hugepd, addr, pdshift); - do { - next = hugepte_addr_end(addr, end, sz); - if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr)) - return 0; - } while (ptep++, addr = next, addr != end); - - return 1; -} - #ifdef CONFIG_PPC_MM_SLICES unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, @@ -754,8 +737,8 @@ void flush_dcache_icache_hugepage(struct page *page) } } -int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) +static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, + unsigned long end, int write, struct page **pages, int *nr) { unsigned long pte_end; struct page *head, *page; @@ -801,3 +784,20 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, return 1; } + +int gup_huge_pd(hugepd_t hugepd, unsigned long addr, unsigned int pdshift, + unsigned long end, int write, struct page **pages, int *nr) +{ + pte_t *ptep; + unsigned long sz = 1UL << hugepd_shift(hugepd); + unsigned long next; + + ptep = hugepte_offset(hugepd, addr, pdshift); + do { + next = hugepte_addr_end(addr, end, sz); + if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr)) + return 0; + } while (ptep++, addr = next, addr != end); + + return 1; +} -- 2.13.3
[PATCH v1 09/27] powerpc/mm: make hugetlbpage.c depend on CONFIG_HUGETLB_PAGE
The only function in hugetlbpage.c which doesn't depend on CONFIG_HUGETLB_PAGE is gup_hugepte(), and this function is only called from gup_huge_pd() which depends on CONFIG_HUGETLB_PAGE so all the content of hugetlbpage.c depends on CONFIG_HUGETLB_PAGE. This patch modifies Makefile to only compile hugetlbpage.c when CONFIG_HUGETLB_PAGE is set. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/Makefile | 2 +- arch/powerpc/mm/hugetlbpage.c | 5 - 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 2c23d1ece034..20b900537fc9 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -33,7 +33,7 @@ obj-$(CONFIG_PPC_FSL_BOOK3E) += fsl_booke_mmu.o obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o obj-$(CONFIG_PPC_SPLPAR) += vphn.o obj-$(CONFIG_PPC_MM_SLICES)+= slice.o -obj-y += hugetlbpage.o +obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o ifdef CONFIG_HUGETLB_PAGE obj-$(CONFIG_PPC_BOOK3S_64)+= hugetlbpage-hash64.o obj-$(CONFIG_PPC_RADIX_MMU)+= hugetlbpage-radix.o diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 202ae006aa39..6d9751b188c1 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -26,9 +26,6 @@ #include #include - -#ifdef CONFIG_HUGETLB_PAGE - #define PAGE_SHIFT_64K 16 #define PAGE_SHIFT_512K19 #define PAGE_SHIFT_8M 23 @@ -757,8 +754,6 @@ void flush_dcache_icache_hugepage(struct page *page) } } -#endif /* CONFIG_HUGETLB_PAGE */ - int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { -- 2.13.3
[PATCH v1 08/27] powerpc/mm: move __find_linux_pte() out of hugetlbpage.c
__find_linux_pte() is the only function in hugetlbpage.c which is compiled in regardless on CONFIG_HUGETLBPAGE This patch moves it in pgtable.c. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/hugetlbpage.c | 103 - arch/powerpc/mm/pgtable.c | 104 ++ 2 files changed, 104 insertions(+), 103 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index cf2978e235f3..202ae006aa39 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -759,109 +759,6 @@ void flush_dcache_icache_hugepage(struct page *page) #endif /* CONFIG_HUGETLB_PAGE */ -/* - * We have 4 cases for pgds and pmds: - * (1) invalid (all zeroes) - * (2) pointer to next table, as normal; bottom 6 bits == 0 - * (3) leaf pte for huge page _PAGE_PTE set - * (4) hugepd pointer, _PAGE_PTE = 0 and bits [2..6] indicate size of table - * - * So long as we atomically load page table pointers we are safe against teardown, - * we can follow the address down to the the page and take a ref on it. - * This function need to be called with interrupts disabled. We use this variant - * when we have MSR[EE] = 0 but the paca->irq_soft_mask = IRQS_ENABLED - */ -pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, - bool *is_thp, unsigned *hpage_shift) -{ - pgd_t pgd, *pgdp; - pud_t pud, *pudp; - pmd_t pmd, *pmdp; - pte_t *ret_pte; - hugepd_t *hpdp = NULL; - unsigned pdshift = PGDIR_SHIFT; - - if (hpage_shift) - *hpage_shift = 0; - - if (is_thp) - *is_thp = false; - - pgdp = pgdir + pgd_index(ea); - pgd = READ_ONCE(*pgdp); - /* -* Always operate on the local stack value. This make sure the -* value don't get updated by a parallel THP split/collapse, -* page fault or a page unmap. The return pte_t * is still not -* stable. So should be checked there for above conditions. 
-*/ - if (pgd_none(pgd)) - return NULL; - else if (pgd_huge(pgd)) { - ret_pte = (pte_t *) pgdp; - goto out; - } else if (is_hugepd(__hugepd(pgd_val(pgd - hpdp = (hugepd_t *) - else { - /* -* Even if we end up with an unmap, the pgtable will not -* be freed, because we do an rcu free and here we are -* irq disabled -*/ - pdshift = PUD_SHIFT; - pudp = pud_offset(, ea); - pud = READ_ONCE(*pudp); - - if (pud_none(pud)) - return NULL; - else if (pud_huge(pud)) { - ret_pte = (pte_t *) pudp; - goto out; - } else if (is_hugepd(__hugepd(pud_val(pud - hpdp = (hugepd_t *) - else { - pdshift = PMD_SHIFT; - pmdp = pmd_offset(, ea); - pmd = READ_ONCE(*pmdp); - /* -* A hugepage collapse is captured by pmd_none, because -* it mark the pmd none and do a hpte invalidate. -*/ - if (pmd_none(pmd)) - return NULL; - - if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) { - if (is_thp) - *is_thp = true; - ret_pte = (pte_t *) pmdp; - goto out; - } - /* -* pmd_large check below will handle the swap pmd pte -* we need to do both the check because they are config -* dependent. -*/ - if (pmd_huge(pmd) || pmd_large(pmd)) { - ret_pte = (pte_t *) pmdp; - goto out; - } else if (is_hugepd(__hugepd(pmd_val(pmd - hpdp = (hugepd_t *) - else - return pte_offset_kernel(, ea); - } - } - if (!hpdp) - return NULL; - - ret_pte = hugepte_offset(*hpdp, ea, pdshift); - pdshift = hugepd_shift(*hpdp); -out: - if (hpage_shift) - *hpage_shift = pdshift; - return ret_pte; -} -EXPORT_SYMBOL_GPL(__find_linux_pte); - int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index d3d61d29b4f1..9f4ccd15849f 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -30,6 +30,7 @@
[PATCH v1 07/27] powerpc/book3e: hugetlbpage is only for CONFIG_PPC_FSL_BOOK3E
As per Kconfig.cputype, only CONFIG_PPC_FSL_BOOK3E gets to select SYS_SUPPORTS_HUGETLBFS so simplify accordingly. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/Makefile | 2 +- arch/powerpc/mm/hugetlbpage-book3e.c | 47 +++- 2 files changed, 20 insertions(+), 29 deletions(-) diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 3c1bd9fa23cd..2c23d1ece034 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -37,7 +37,7 @@ obj-y += hugetlbpage.o ifdef CONFIG_HUGETLB_PAGE obj-$(CONFIG_PPC_BOOK3S_64)+= hugetlbpage-hash64.o obj-$(CONFIG_PPC_RADIX_MMU)+= hugetlbpage-radix.o -obj-$(CONFIG_PPC_BOOK3E_MMU) += hugetlbpage-book3e.o +obj-$(CONFIG_PPC_FSL_BOOK3E) += hugetlbpage-book3e.o endif obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += hugepage-hash64.o obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o diff --git a/arch/powerpc/mm/hugetlbpage-book3e.c b/arch/powerpc/mm/hugetlbpage-book3e.c index c911fe9bfa0e..61915f4d3c7f 100644 --- a/arch/powerpc/mm/hugetlbpage-book3e.c +++ b/arch/powerpc/mm/hugetlbpage-book3e.c @@ -11,8 +11,9 @@ #include -#ifdef CONFIG_PPC_FSL_BOOK3E #ifdef CONFIG_PPC64 +#include + static inline int tlb1_next(void) { struct paca_struct *paca = get_paca(); @@ -29,28 +30,6 @@ static inline int tlb1_next(void) tcd->esel_next = next; return this; } -#else -static inline int tlb1_next(void) -{ - int index, ncams; - - ncams = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY; - - index = this_cpu_read(next_tlbcam_idx); - - /* Just round-robin the entries and wrap when we hit the end */ - if (unlikely(index == ncams - 1)) - __this_cpu_write(next_tlbcam_idx, tlbcam_index); - else - __this_cpu_inc(next_tlbcam_idx); - - return index; -} -#endif /* !PPC64 */ -#endif /* FSL */ - -#if defined(CONFIG_PPC_FSL_BOOK3E) && defined(CONFIG_PPC64) -#include static inline void book3e_tlb_lock(void) { @@ -93,6 +72,23 @@ static inline void book3e_tlb_unlock(void) paca->tcd_ptr->lock = 0; } #else +static inline int tlb1_next(void) +{ + int index, ncams; + + ncams 
= mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY; + + index = this_cpu_read(next_tlbcam_idx); + + /* Just round-robin the entries and wrap when we hit the end */ + if (unlikely(index == ncams - 1)) + __this_cpu_write(next_tlbcam_idx, tlbcam_index); + else + __this_cpu_inc(next_tlbcam_idx); + + return index; +} + static inline void book3e_tlb_lock(void) { } @@ -134,10 +130,7 @@ void book3e_hugetlb_preload(struct vm_area_struct *vma, unsigned long ea, unsigned long psize, tsize, shift; unsigned long flags; struct mm_struct *mm; - -#ifdef CONFIG_PPC_FSL_BOOK3E int index; -#endif if (unlikely(is_kernel_addr(ea))) return; @@ -161,11 +154,9 @@ void book3e_hugetlb_preload(struct vm_area_struct *vma, unsigned long ea, return; } -#ifdef CONFIG_PPC_FSL_BOOK3E /* We have to use the CAM(TLB1) on FSL parts for hugepages */ index = tlb1_next(); mtspr(SPRN_MAS0, MAS0_ESEL(index) | MAS0_TLBSEL(1)); -#endif mas1 = MAS1_VALID | MAS1_TID(mm->context.id) | MAS1_TSIZE(tsize); mas2 = ea & ~((1UL << shift) - 1); -- 2.13.3
[PATCH v1 06/27] powerpc/64: only book3s/64 supports CONFIG_PPC_64K_PAGES
CONFIG_PPC_64K_PAGES cannot be selected by nohash/64 Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 1 - arch/powerpc/include/asm/nohash/64/pgalloc.h | 3 --- arch/powerpc/include/asm/nohash/64/pgtable.h | 4 arch/powerpc/include/asm/nohash/64/slice.h | 4 arch/powerpc/include/asm/nohash/pte-book3e.h | 5 - arch/powerpc/include/asm/pgtable-be-types.h | 7 ++- arch/powerpc/include/asm/pgtable-types.h | 7 ++- arch/powerpc/include/asm/task_size_64.h | 2 +- arch/powerpc/mm/tlb_low_64e.S| 31 arch/powerpc/mm/tlb_nohash.c | 13 10 files changed, 5 insertions(+), 72 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 2d0be82c3061..5d8e692d6470 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -375,7 +375,6 @@ config ZONE_DMA config PGTABLE_LEVELS int default 2 if !PPC64 - default 3 if PPC_64K_PAGES && !PPC_BOOK3S_64 default 4 source "arch/powerpc/sysdev/Kconfig" diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h b/arch/powerpc/include/asm/nohash/64/pgalloc.h index 66d086f85bd5..ded453f9b5a8 100644 --- a/arch/powerpc/include/asm/nohash/64/pgalloc.h +++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h @@ -171,12 +171,9 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table, #define __pmd_free_tlb(tlb, pmd, addr) \ pgtable_free_tlb(tlb, pmd, PMD_CACHE_INDEX) -#ifndef CONFIG_PPC_64K_PAGES #define __pud_free_tlb(tlb, pud, addr) \ pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE) -#endif /* CONFIG_PPC_64K_PAGES */ - #define check_pgt_cache() do { } while (0) #endif /* _ASM_POWERPC_PGALLOC_64_H */ diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h index e77ed9761632..3efbd8a1720a 100644 --- a/arch/powerpc/include/asm/nohash/64/pgtable.h +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h @@ -10,10 +10,6 @@ #include #include -#ifdef CONFIG_PPC_64K_PAGES -#error "Page size not supported" -#endif - #define FIRST_USER_ADDRESS 0UL /* diff --git 
a/arch/powerpc/include/asm/nohash/64/slice.h b/arch/powerpc/include/asm/nohash/64/slice.h index 1a32d1fae6af..30adfdd4afde 100644 --- a/arch/powerpc/include/asm/nohash/64/slice.h +++ b/arch/powerpc/include/asm/nohash/64/slice.h @@ -2,10 +2,6 @@ #ifndef _ASM_POWERPC_NOHASH_64_SLICE_H #define _ASM_POWERPC_NOHASH_64_SLICE_H -#ifdef CONFIG_PPC_64K_PAGES -#define get_slice_psize(mm, addr) MMU_PAGE_64K -#else /* CONFIG_PPC_64K_PAGES */ #define get_slice_psize(mm, addr) MMU_PAGE_4K -#endif /* !CONFIG_PPC_64K_PAGES */ #endif /* _ASM_POWERPC_NOHASH_64_SLICE_H */ diff --git a/arch/powerpc/include/asm/nohash/pte-book3e.h b/arch/powerpc/include/asm/nohash/pte-book3e.h index dd40d200f274..813918f40765 100644 --- a/arch/powerpc/include/asm/nohash/pte-book3e.h +++ b/arch/powerpc/include/asm/nohash/pte-book3e.h @@ -60,13 +60,8 @@ #define _PAGE_SPECIAL _PAGE_SW0 /* Base page size */ -#ifdef CONFIG_PPC_64K_PAGES -#define _PAGE_PSIZE_PAGE_PSIZE_64K -#define PTE_RPN_SHIFT (28) -#else #define _PAGE_PSIZE_PAGE_PSIZE_4K #definePTE_RPN_SHIFT (24) -#endif #define PTE_WIMGE_SHIFT (19) #define PTE_BAP_SHIFT (2) diff --git a/arch/powerpc/include/asm/pgtable-be-types.h b/arch/powerpc/include/asm/pgtable-be-types.h index a89c67b62680..5932a9883eb7 100644 --- a/arch/powerpc/include/asm/pgtable-be-types.h +++ b/arch/powerpc/include/asm/pgtable-be-types.h @@ -34,10 +34,8 @@ static inline __be64 pmd_raw(pmd_t x) } /* - * 64 bit hash always use 4 level table. Everybody else use 4 level - * only for 4K page size. 
+ * 64 bit always use 4 level table */ -#if defined(CONFIG_PPC_BOOK3S_64) || !defined(CONFIG_PPC_64K_PAGES) typedef struct { __be64 pud; } pud_t; #define __pud(x) ((pud_t) { cpu_to_be64(x) }) #define __pud_raw(x) ((pud_t) { (x) }) @@ -51,7 +49,6 @@ static inline __be64 pud_raw(pud_t x) return x.pud; } -#endif /* CONFIG_PPC_BOOK3S_64 || !CONFIG_PPC_64K_PAGES */ #endif /* CONFIG_PPC64 */ /* PGD level */ @@ -77,7 +74,7 @@ typedef struct { unsigned long pgprot; } pgprot_t; * With hash config 64k pages additionally define a bigger "real PTE" type that * gathers the "second half" part of the PTE for pseudo 64k pages */ -#if defined(CONFIG_PPC_64K_PAGES) && defined(CONFIG_PPC_BOOK3S_64) +#ifdef CONFIG_PPC_64K_PAGES typedef struct { pte_t pte; unsigned long hidx; } real_pte_t; #else typedef struct { pte_t pte; } real_pte_t; diff --git a/arch/powerpc/include/asm/pgtable-types.h b/arch/powerpc/include/asm/pgtable-types.h index 3b0edf041b2e..02e75e89c93e 100644 --- a/arch/powerpc/include/asm/pgtable-types.h +++ b/arch/powerpc/include/asm/pgtable-types.h @@ -24,17 +24,14 @@ static inline unsigned long pmd_val(pmd_t x) } /* - * 64 bit hash always use 4
[PATCH v1 05/27] powerpc/mm: drop slice_set_user_psize()
slice_set_user_psize() is not used anymore, drop it. Fixes: 1753dd183036 ("powerpc/mm/slice: Simplify and optimise slice context initialisation") Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/slice.h | 5 - arch/powerpc/include/asm/nohash/64/slice.h | 1 - 2 files changed, 6 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/slice.h b/arch/powerpc/include/asm/book3s/64/slice.h index db0dedab65ee..af498b0da21a 100644 --- a/arch/powerpc/include/asm/book3s/64/slice.h +++ b/arch/powerpc/include/asm/book3s/64/slice.h @@ -16,11 +16,6 @@ #else /* CONFIG_PPC_MM_SLICES */ #define get_slice_psize(mm, addr) ((mm)->context.user_psize) -#define slice_set_user_psize(mm, psize)\ -do { \ - (mm)->context.user_psize = (psize); \ - (mm)->context.sllp = SLB_VSID_USER | mmu_psize_defs[(psize)].sllp; \ -} while (0) #endif /* CONFIG_PPC_MM_SLICES */ diff --git a/arch/powerpc/include/asm/nohash/64/slice.h b/arch/powerpc/include/asm/nohash/64/slice.h index ad0d6e3cc1c5..1a32d1fae6af 100644 --- a/arch/powerpc/include/asm/nohash/64/slice.h +++ b/arch/powerpc/include/asm/nohash/64/slice.h @@ -7,6 +7,5 @@ #else /* CONFIG_PPC_64K_PAGES */ #define get_slice_psize(mm, addr) MMU_PAGE_4K #endif /* !CONFIG_PPC_64K_PAGES */ -#define slice_set_user_psize(mm, psize)do { BUG(); } while (0) #endif /* _ASM_POWERPC_NOHASH_64_SLICE_H */ -- 2.13.3
[PATCH v1 02/27] powerpc/mm: don't BUG in add_huge_page_size()
No reason to BUG() in add_huge_page_size(). Just WARN and reject the add. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/hugetlbpage.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 9e732bb2c84a..cf2978e235f3 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -634,7 +634,8 @@ static int __init add_huge_page_size(unsigned long long size) } #endif - BUG_ON(mmu_psize_defs[mmu_psize].shift != shift); + if (WARN_ON(mmu_psize_defs[mmu_psize].shift != shift)) + return -EINVAL; /* Return if huge page size has already been setup */ if (size_to_hstate(size)) -- 2.13.3
[PATCH v1 00/27] Reduce ifdef mess in hugetlbpage.c and slice.c
The main purpose of this series is to reduce the amount of #ifdefs in hugetlbpage.c and slice.c. At the same time, it does some cleanup by reducing the number of BUG_ON() and dropping unused functions. It also removes 64k pages related code in nohash/64 as 64k pages can only be selected on book3s/64. Christophe Leroy (27): powerpc/mm: Don't BUG() in hugepd_page() powerpc/mm: don't BUG in add_huge_page_size() powerpc/mm: don't BUG() in slice_mask_for_size() powerpc/book3e: drop mmu_get_tsize() powerpc/mm: drop slice_set_user_psize() powerpc/64: only book3s/64 supports CONFIG_PPC_64K_PAGES powerpc/book3e: hugetlbpage is only for CONFIG_PPC_FSL_BOOK3E powerpc/mm: move __find_linux_pte() out of hugetlbpage.c powerpc/mm: make hugetlbpage.c depend on CONFIG_HUGETLB_PAGE powerpc/mm: make gup_hugepte() static powerpc/mm: split asm/hugetlb.h into dedicated subarch files powerpc/mm: add a helper to populate hugepd powerpc/mm: define get_slice_psize() all the time powerpc/mm: no slice for nohash/64 powerpc/mm: cleanup ifdef mess in add_huge_page_size() powerpc/mm: move hugetlb_disabled into asm/hugetlb.h powerpc/mm: cleanup HPAGE_SHIFT setup powerpc/mm: cleanup remaining ifdef mess in hugetlbpage.c powerpc/mm: drop slice DEBUG powerpc/mm: remove unnecessary #ifdef CONFIG_PPC64 powerpc/mm: hand a context_t over to slice_mask_for_size() instead of mm_struct powerpc/mm: move slice_mask_for_size() into mmu.h powerpc/mm: remove a couple of #ifdef CONFIG_PPC_64K_PAGES in mm/slice.c powerpc: define subarch SLB_ADDR_LIMIT_DEFAULT powerpc/mm: flatten function __find_linux_pte() powerpc/mm: flatten function __find_linux_pte() step 2 powerpc/mm: flatten function __find_linux_pte() step 3 arch/powerpc/Kconfig | 3 +- arch/powerpc/include/asm/book3s/64/hugetlb.h | 73 +++ arch/powerpc/include/asm/book3s/64/mmu.h | 22 +- arch/powerpc/include/asm/book3s/64/slice.h | 7 +- arch/powerpc/include/asm/hugetlb.h | 87 +--- arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 45 + 
arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 18 ++ arch/powerpc/include/asm/nohash/32/slice.h | 2 + arch/powerpc/include/asm/nohash/64/pgalloc.h | 3 - arch/powerpc/include/asm/nohash/64/pgtable.h | 4 - arch/powerpc/include/asm/nohash/64/slice.h | 12 -- arch/powerpc/include/asm/nohash/hugetlb-book3e.h | 45 + arch/powerpc/include/asm/nohash/pte-book3e.h | 5 - arch/powerpc/include/asm/page.h | 12 +- arch/powerpc/include/asm/pgtable-be-types.h | 7 +- arch/powerpc/include/asm/pgtable-types.h | 7 +- arch/powerpc/include/asm/pgtable.h | 3 - arch/powerpc/include/asm/slice.h | 8 +- arch/powerpc/include/asm/task_size_64.h | 2 +- arch/powerpc/kernel/fadump.c | 1 + arch/powerpc/kernel/setup-common.c | 8 +- arch/powerpc/mm/Makefile | 4 +- arch/powerpc/mm/hash_utils_64.c | 1 + arch/powerpc/mm/hugetlbpage-book3e.c | 52 ++--- arch/powerpc/mm/hugetlbpage-hash64.c | 16 ++ arch/powerpc/mm/hugetlbpage.c| 245 --- arch/powerpc/mm/pgtable.c| 114 +++ arch/powerpc/mm/slice.c | 132 ++-- arch/powerpc/mm/tlb_low_64e.S| 31 --- arch/powerpc/mm/tlb_nohash.c | 13 -- arch/powerpc/platforms/Kconfig.cputype | 4 + 31 files changed, 438 insertions(+), 548 deletions(-) create mode 100644 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h delete mode 100644 arch/powerpc/include/asm/nohash/64/slice.h create mode 100644 arch/powerpc/include/asm/nohash/hugetlb-book3e.h -- 2.13.3
[PATCH v1 01/27] powerpc/mm: Don't BUG() in hugepd_page()
Don't BUG(), just warn and return NULL. If the NULL value is not handled, it will be caught anyway.

Signed-off-by: Christophe Leroy
---
 arch/powerpc/include/asm/hugetlb.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 8d40565ad0c3..48c29686c78e 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -14,7 +14,8 @@
  */
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-	BUG_ON(!hugepd_ok(hpd));
+	if (WARN_ON(!hugepd_ok(hpd)))
+		return NULL;
 	/*
 	 * We have only four bits to encode, MMU page size
 	 */
@@ -42,7 +43,8 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma,
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-	BUG_ON(!hugepd_ok(hpd));
+	if (WARN_ON(!hugepd_ok(hpd)))
+		return NULL;
 #ifdef CONFIG_PPC_8xx
 	return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
 #else
--
2.13.3
[PATCH v1 04/27] powerpc/book3e: drop mmu_get_tsize()
This function is not used anymore, drop it. Fixes: b42279f0165c ("powerpc/mm/nohash: MM_SLICE is only used by book3s 64") Signed-off-by: Christophe Leroy --- arch/powerpc/mm/hugetlbpage-book3e.c | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage-book3e.c b/arch/powerpc/mm/hugetlbpage-book3e.c index f84ec46cdb26..c911fe9bfa0e 100644 --- a/arch/powerpc/mm/hugetlbpage-book3e.c +++ b/arch/powerpc/mm/hugetlbpage-book3e.c @@ -49,11 +49,6 @@ static inline int tlb1_next(void) #endif /* !PPC64 */ #endif /* FSL */ -static inline int mmu_get_tsize(int psize) -{ - return mmu_psize_defs[psize].enc; -} - #if defined(CONFIG_PPC_FSL_BOOK3E) && defined(CONFIG_PPC64) #include -- 2.13.3
[PATCH v1 03/27] powerpc/mm: don't BUG() in slice_mask_for_size()
When no mask is found for the page size, WARN and return NULL instead of BUG()ing.

Signed-off-by: Christophe Leroy
---
 arch/powerpc/mm/slice.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index aec91dbcdc0b..011d470ea340 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -165,7 +165,8 @@ static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
 	if (psize == MMU_PAGE_16G)
 		return &mm->context.mask_16g;
 #endif
-	BUG();
+	WARN_ON(true);
+	return NULL;
 }
 #elif defined(CONFIG_PPC_8xx)
 static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
@@ -178,7 +179,8 @@ static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
 	if (psize == MMU_PAGE_8M)
 		return &mm->context.mask_8m;
 #endif
-	BUG();
+	WARN_ON(true);
+	return NULL;
 }
 #else
 #error "Must define the slice masks for page sizes supported by the platform"
--
2.13.3
Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING
On Wed, Mar 20, 2019 at 7:21 AM Masahiro Yamada wrote:
>
> Commit 60a3cdd06394 ("x86: add optimized inlining") introduced
> CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.
>
> The idea is obviously arch-agnostic although we need some code fixups.
> This commit moves the config entry from arch/x86/Kconfig.debug to
> lib/Kconfig.debug so that all architectures (except MIPS for now) can
> benefit from it.
>
> At this moment, I added "depends on !MIPS" because fixing 0day bot reports
> for MIPS was complex to me.
>
> I tested this patch on my arm/arm64 boards.
>
> This can make a huge difference in kernel image size especially when
> CONFIG_OPTIMIZE_FOR_SIZE is enabled.
>
> For example, I got 3.5% smaller arm64 kernel image for v5.1-rc1.
>
>       dec       file
>  18983424      arch/arm64/boot/Image.before
>  18321920      arch/arm64/boot/Image.after
>
> This also slightly improves the "Kernel hacking" Kconfig menu.
> Commit e61aca5158a8 ("Merge branch 'kconfig-diet' from Dave Hansen")
> mentioned this config option would be a good fit in the "compiler option"
> menu. I did so.

I think this is a good idea in general, but it is likely to cause a lot of new warnings. Especially the -Wmaybe-uninitialized warnings get new false positives every time we get substantially different inlining decisions.

I've added your patch to my randconfig test setup and will let you know if I see anything noticeable. I'm currently testing clang-arm32, clang-arm64 and gcc-x86.

      Arnd
Re: [PATCH] compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING
On Wed, Mar 20, 2019 at 7:41 AM Masahiro Yamada wrote: > It is unclear to me how to fix it. > That's why I ended up with "depends on !MIPS". > > > MODPOST vmlinux.o > arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2': > sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base' > sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base' > sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base' > sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base' > sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base' > arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined > references to `mips_gcr_base' > > > Perhaps, MIPS folks may know how to fix it. I would guess like this: diff --git a/arch/mips/include/asm/mips-cm.h b/arch/mips/include/asm/mips-cm.h index 8bc5df49b0e1..a27483fedb7d 100644 --- a/arch/mips/include/asm/mips-cm.h +++ b/arch/mips/include/asm/mips-cm.h @@ -79,7 +79,7 @@ static inline int mips_cm_probe(void) * * Returns true if a CM is present in the system, else false. */ -static inline bool mips_cm_present(void) +static __always_inline bool mips_cm_present(void) { #ifdef CONFIG_MIPS_CM return mips_gcr_base != NULL; @@ -93,7 +93,7 @@ static inline bool mips_cm_present(void) * * Returns true if the system implements an L2-only sync region, else false. */ -static inline bool mips_cm_has_l2sync(void) +static __always_inline bool mips_cm_has_l2sync(void) { #ifdef CONFIG_MIPS_CM return mips_cm_l2sync_base != NULL;
[PATCH v4 11/17] KVM: introduce a 'mmap' method for KVM devices
Some KVM devices will want to handle special mappings related to the underlying HW. For instance, the XIVE interrupt controller of the POWER9 processor has MMIO pages for thread interrupt management and for interrupt source control that need to be exposed to the guest when the OS has the required support. Cc: Paolo Bonzini Signed-off-by: Cédric Le Goater Reviewed-by: David Gibson --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 11 +++ 2 files changed, 12 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 9d55c63db09b..831d963451d8 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1245,6 +1245,7 @@ struct kvm_device_ops { int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr); long (*ioctl)(struct kvm_device *dev, unsigned int ioctl, unsigned long arg); + int (*mmap)(struct kvm_device *dev, struct vm_area_struct *vma); }; void kvm_device_get(struct kvm_device *dev); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f25aa98a94df..5e2fa5c7dd1a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2884,6 +2884,16 @@ static long kvm_vcpu_compat_ioctl(struct file *filp, } #endif +static int kvm_device_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct kvm_device *dev = filp->private_data; + + if (dev->ops->mmap) + return dev->ops->mmap(dev, vma); + + return -ENODEV; +} + static int kvm_device_ioctl_attr(struct kvm_device *dev, int (*accessor)(struct kvm_device *dev, struct kvm_device_attr *attr), @@ -2933,6 +2943,7 @@ static const struct file_operations kvm_device_fops = { .unlocked_ioctl = kvm_device_ioctl, .release = kvm_device_release, KVM_COMPAT(kvm_device_ioctl), + .mmap = kvm_device_mmap, }; struct kvm_device *kvm_device_from_filp(struct file *filp) -- 2.20.1
[PATCH v4 13/17] KVM: PPC: Book3S HV: XIVE: add a mapping for the source ESB pages
Each source is associated with an Event State Buffer (ESB) with an even/odd pair of pages which provides commands to manage the source: to trigger, to EOI, to turn off the source for instance.

The custom VM fault handler will deduce the guest IRQ number from the offset of the fault, and the ESB page of the associated XIVE interrupt will be inserted into the VMA using the internal structure caching information on the interrupts.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---
 arch/powerpc/include/uapi/asm/kvm.h        |  1 +
 arch/powerpc/kvm/book3s_xive_native.c      | 57 ++++++++++++
 Documentation/virtual/kvm/devices/xive.txt |  7 +++
 3 files changed, 65 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0998e8edc91a..b0f72dea8b11 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -721,5 +721,6 @@ struct kvm_ppc_xive_eq {
 #define KVM_XIVE_EQ_ALWAYS_NOTIFY	0x00000001
 
 #define KVM_XIVE_TIMA_PAGE_OFFSET	0
+#define KVM_XIVE_ESB_PAGE_OFFSET	4
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 0cfad45d8b75..d0a055030efd 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -165,6 +165,59 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	return rc;
 }
 
+static vm_fault_t xive_native_esb_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct kvm_device *dev = vma->vm_file->private_data;
+	struct kvmppc_xive *xive = dev->private;
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	struct xive_irq_data *xd;
+	u32 hw_num;
+	u16 src;
+	u64 page;
+	unsigned long irq;
+	u64 page_offset;
+
+	/*
+	 * Linux/KVM uses a two pages ESB setting, one for trigger and
+	 * one for EOI
+	 */
+	page_offset = vmf->pgoff - vma->vm_pgoff;
+	irq = page_offset / 2;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb) {
+		pr_devel("%s: source %lx not found !\n", __func__, irq);
+		return VM_FAULT_SIGBUS;
+	}
+
+	state = &sb->irq_state[src];
+	kvmppc_xive_select_irq(state, &hw_num, &xd);
+
+	arch_spin_lock(&sb->lock);
+
+	/*
+	 * first/even page is for trigger
+	 * second/odd page is for EOI and management.
+	 */
+	page = page_offset % 2 ? xd->eoi_page : xd->trig_page;
+	arch_spin_unlock(&sb->lock);
+
+	if (WARN_ON(!page)) {
+		pr_err("%s: accessing invalid ESB page for source %lx !\n",
+		       __func__, irq);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct xive_native_esb_vmops = {
+	.fault = xive_native_esb_fault,
+};
+
 static vm_fault_t xive_native_tima_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -194,6 +247,10 @@ static int kvmppc_xive_native_mmap(struct kvm_device *dev,
 		if (vma_pages(vma) > 4)
 			return -EINVAL;
 		vma->vm_ops = &xive_native_tima_vmops;
+	} else if (vma->vm_pgoff == KVM_XIVE_ESB_PAGE_OFFSET) {
+		if (vma_pages(vma) > KVMPPC_XIVE_NR_IRQS * 2)
+			return -EINVAL;
+		vma->vm_ops = &xive_native_esb_vmops;
 	} else {
 		return -EINVAL;
 	}
diff --git a/Documentation/virtual/kvm/devices/xive.txt b/Documentation/virtual/kvm/devices/xive.txt
index 944fd0971b13..2d795805b39e 100644
--- a/Documentation/virtual/kvm/devices/xive.txt
+++ b/Documentation/virtual/kvm/devices/xive.txt
@@ -36,6 +36,13 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
   third (operating system) and the fourth (user level) are exposed to
   the guest.
 
+  2. Event State Buffer (ESB)
+
+  Each source is associated with an Event State Buffer (ESB) with an
+  even/odd pair of pages which provides commands to manage the source:
+  to trigger, to EOI, to turn off the source for instance.
+
 * Groups:
 
   1. KVM_DEV_XIVE_GRP_CTRL
--
2.20.1
[PATCH v4 10/17] KVM: PPC: Book3S HV: XIVE: add get/set accessors for the VP XIVE state
The state of the thread interrupt management registers needs to be collected for migration. These registers are cached under the 'xive_saved_state.w01' field of the VCPU when the VCPU context is pulled from the HW thread. An OPAL call retrieves the backup of the IPB register in the underlying XIVE NVT structure and merges it in the KVM state.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---
Changes since v3:
 - fixed xive_timaval description in documentation

Changes since v2:
 - reduced the size of kvmppc_one_reg timaval attribute to two u64s
 - stopped returning the OS CAM line value

 arch/powerpc/include/asm/kvm_ppc.h         | 11 ++++
 arch/powerpc/include/uapi/asm/kvm.h        |  2 +
 arch/powerpc/kvm/book3s.c                  | 24 +++++++
 arch/powerpc/kvm/book3s_xive_native.c      | 76 ++++++++++++
 Documentation/virtual/kvm/devices/xive.txt | 17 +++++
 5 files changed, 130 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 6928a35ac3c7..0579c9b253db 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -273,6 +273,7 @@ union kvmppc_one_reg {
 		u64	addr;
 		u64	length;
 	} vpaval;
+	u64	xive_timaval[2];
 };
 
 struct kvmppc_ops {
@@ -605,6 +606,10 @@ extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_native_init_module(void);
 extern void kvmppc_xive_native_exit_module(void);
+extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu,
+				     union kvmppc_one_reg *val);
+extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu,
+				     union kvmppc_one_reg *val);
 
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
@@ -637,6 +642,12 @@ static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
 static inline void kvmppc_xive_native_init_module(void) { }
 static inline void kvmppc_xive_native_exit_module(void) { }
+static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu,
+					    union kvmppc_one_reg *val)
+{ return 0; }
+static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu,
+					    union kvmppc_one_reg *val)
+{ return -ENOENT; }
 
 #endif /* CONFIG_KVM_XIVE */
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 12744608a61c..cd3f16b70a2e 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -482,6 +482,8 @@ struct kvm_ppc_cpu_char {
 #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
 #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
 
+#define KVM_REG_PPC_VP_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x8d)
+
 /* Device control API: PPC-specific devices */
 #define KVM_DEV_MPIC_GRP_MISC		1
 #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 7c3348fa27e1..efd15101eef0 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -651,6 +651,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
 		*val = get_reg_val(id, kvmppc_xics_get_icp(vcpu));
 		break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+	case KVM_REG_PPC_VP_STATE:
+		if (!vcpu->arch.xive_vcpu) {
+			r = -ENXIO;
+			break;
+		}
+		if (xive_enabled())
+			r = kvmppc_xive_native_get_vp(vcpu, val);
+		else
+			r = -ENXIO;
+		break;
+#endif /* CONFIG_KVM_XIVE */
 	case KVM_REG_PPC_FSCR:
 		*val = get_reg_val(id, vcpu->arch.fscr);
 		break;
@@ -724,6 +736,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
 		r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val));
 		break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+	case KVM_REG_PPC_VP_STATE:
+		if (!vcpu->arch.xive_vcpu) {
+			r = -ENXIO;
+			break;
+		}
+		if (xive_enabled())
+			r = kvmppc_xive_native_set_vp(vcpu, val);
+		else
+			r = -ENXIO;
+		break;
+#endif /* CONFIG_KVM_XIVE */
 	case KVM_REG_PPC_FSCR:
 		vcpu->arch.fscr =
[PATCH v4 12/17] KVM: PPC: Book3S HV: XIVE: add a TIMA mapping
Each thread has an associated Thread Interrupt Management context composed of a set of registers. These registers let the thread handle priority management and interrupt acknowledgment. The most important are:

  - Interrupt Pending Buffer (IPB)
  - Current Processor Priority (CPPR)
  - Notification Source Register (NSR)

They are exposed to software in four different pages each proposing a view with a different privilege. The first page is for the physical thread context and the second for the hypervisor. Only the third (operating system) and the fourth (user level) are exposed to the guest.

A custom VM fault handler will populate the VMA with the appropriate pages, which should only be the OS page for now.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---
 arch/powerpc/include/asm/xive.h            |  1 +
 arch/powerpc/include/uapi/asm/kvm.h        |  2 ++
 arch/powerpc/kvm/book3s_xive_native.c      | 39 ++++++
 arch/powerpc/sysdev/xive/native.c          | 11 ++
 Documentation/virtual/kvm/devices/xive.txt | 23 +++++
 5 files changed, 76 insertions(+)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index c4e88abd3b67..eaf76f57023a 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -23,6 +23,7 @@
  * same offset regardless of where the code is executing
  */
 extern void __iomem *xive_tima;
+extern unsigned long xive_tima_os;
 
 /*
  * Offset in the TM area of our current execution level (provided by
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index cd3f16b70a2e..0998e8edc91a 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -720,4 +720,6 @@ struct kvm_ppc_xive_eq {
 #define KVM_XIVE_EQ_ALWAYS_NOTIFY	0x00000001
 
+#define KVM_XIVE_TIMA_PAGE_OFFSET	0
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index a8c62e07ebee..0cfad45d8b75 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++
b/arch/powerpc/kvm/book3s_xive_native.c
@@ -165,6 +165,44 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	return rc;
 }
 
+static vm_fault_t xive_native_tima_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+
+	switch (vmf->pgoff - vma->vm_pgoff) {
+	case 0: /* HW - forbid access */
+	case 1: /* HV - forbid access */
+		return VM_FAULT_SIGBUS;
+	case 2: /* OS */
+		vmf_insert_pfn(vma, vmf->address, xive_tima_os >> PAGE_SHIFT);
+		return VM_FAULT_NOPAGE;
+	case 3: /* USER - TODO */
+	default:
+		return VM_FAULT_SIGBUS;
+	}
+}
+
+static const struct vm_operations_struct xive_native_tima_vmops = {
+	.fault = xive_native_tima_fault,
+};
+
+static int kvmppc_xive_native_mmap(struct kvm_device *dev,
+				   struct vm_area_struct *vma)
+{
+	/* We only allow mappings at fixed offset for now */
+	if (vma->vm_pgoff == KVM_XIVE_TIMA_PAGE_OFFSET) {
+		if (vma_pages(vma) > 4)
+			return -EINVAL;
+		vma->vm_ops = &xive_native_tima_vmops;
+	} else {
+		return -EINVAL;
+	}
+
+	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
+	return 0;
+}
+
 static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
 					 u64 addr)
 {
@@ -1043,6 +1081,7 @@ struct kvm_device_ops kvm_xive_native_ops = {
 	.set_attr = kvmppc_xive_native_set_attr,
 	.get_attr = kvmppc_xive_native_get_attr,
 	.has_attr = kvmppc_xive_native_has_attr,
+	.mmap = kvmppc_xive_native_mmap,
 };
 
 void kvmppc_xive_native_init_module(void)
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 0c037e933e55..7782201e5fe8 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -521,6 +521,9 @@ u32 xive_native_default_eq_shift(void)
 }
 EXPORT_SYMBOL_GPL(xive_native_default_eq_shift);
 
+unsigned long xive_tima_os;
+EXPORT_SYMBOL_GPL(xive_tima_os);
+
 bool __init xive_native_init(void)
 {
 	struct device_node *np;
@@ -573,6 +576,14 @@ bool __init xive_native_init(void)
 	for_each_possible_cpu(cpu)
 		kvmppc_set_xive_tima(cpu, r.start, tima);
 
+	/* Resource 2 is OS window */
+	if (of_address_to_resource(np, 2, &r)) {
+		pr_err("Failed to get thread mgmnt area resource\n");
+		return false;
+	}
+
+	xive_tima_os = r.start;
+
 	/* Grab size of provisionning pages */
 	xive_parse_provisioning(np);
 
diff --git a/Documentation/virtual/kvm/devices/xive.txt b/Documentation/virtual/kvm/devices/xive.txt
index 702836d5ad7a..944fd0971b13 100644
---
[PATCH v4 15/17] KVM: PPC: Book3S HV: XIVE: activate XIVE exploitation mode
Full support for the XIVE native exploitation mode is now available, so advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on PowerNV KVM hypervisors only. Support for nested guests (pseries KVM hypervisor) is not yet available. XIVE must also be enabled, which is the default setting on POWER9 systems running a recent Linux kernel.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---
 arch/powerpc/kvm/powerpc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0858ee61460..f54926c78320 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -573,10 +573,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_KVM_XIVE
 	case KVM_CAP_PPC_IRQ_XIVE:
 		/*
-		 * Return false until all the XIVE infrastructure is
-		 * in place including support for migration.
+		 * We need XIVE to be enabled on the platform (implies
+		 * a POWER9 processor) and the PowerNV platform, as
+		 * nested is not yet supported.
 		 */
-		r = 0;
+		r = xive_enabled() && !!cpu_has_feature(CPU_FTR_HVMODE);
 		break;
 #endif
--
2.20.1
[PATCH v4 05/17] KVM: PPC: Book3S HV: XIVE: add a control to configure a source
This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from QEMU to configure the target of a source and also to restore the configuration of a source when migrating the VM.

The XIVE source interrupt structure is extended with the value of the Effective Interrupt Source Number. The EISN is the interrupt number pushed in the event queue that the guest OS will use to dispatch events internally. Caching the EISN value in KVM eases the test when checking if a reconfiguration is indeed needed.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---
Changes since v2:
 - fixed comments on the KVM device attribute definitions
 - handled MASKED EAS configuration
 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h        | 11 +++
 arch/powerpc/kvm/book3s_xive.h             |  4 +
 arch/powerpc/kvm/book3s_xive.c             |  5 +-
 arch/powerpc/kvm/book3s_xive_native.c      | 97 ++++++++++++
 Documentation/virtual/kvm/devices/xive.txt | 21 +++++
 5 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index d468294c2a67..e8161e21629b 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -680,9 +680,20 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL		1
 #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
+#define KVM_DEV_XIVE_GRP_SOURCE_CONFIG	3	/* 64-bit source identifier */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
 #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
 
+/* Layout of 64-bit XIVE source configuration attribute values */
+#define KVM_XIVE_SOURCE_PRIORITY_SHIFT	0
+#define KVM_XIVE_SOURCE_PRIORITY_MASK	0x7
+#define KVM_XIVE_SOURCE_SERVER_SHIFT	3
+#define KVM_XIVE_SOURCE_SERVER_MASK	0xfffffff8ULL
+#define KVM_XIVE_SOURCE_MASKED_SHIFT	32
+#define KVM_XIVE_SOURCE_MASKED_MASK	0x100000000ULL
+#define KVM_XIVE_SOURCE_EISN_SHIFT	33
+#define KVM_XIVE_SOURCE_EISN_MASK	0xfffffffe00000000ULL
+
 #endif
/* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 1be921cb5dcb..ae26fe653d98 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -61,6 +61,9 @@ struct kvmppc_xive_irq_state {
 	bool saved_p;
 	bool saved_q;
 	u8 saved_scan_prio;
+
+	/* Xive native */
+	u32 eisn;			/* Guest Effective IRQ number */
 };
 
 /* Select the "right" interrupt (IPI vs. passthrough) */
@@ -268,6 +271,7 @@ int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
 struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
 	struct kvmppc_xive *xive, int irq);
 void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 6c9f9fd0855f..e09f3addffe5 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -342,7 +342,7 @@ static int xive_try_pick_queue(struct kvm_vcpu *vcpu, u8 prio)
 	return atomic_add_unless(&q->count, 1, max) ? 0 : -EBUSY;
 }
 
-static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
 {
 	struct kvm_vcpu *vcpu;
 	int i, rc;
@@ -530,7 +530,7 @@ static int xive_target_interrupt(struct kvm *kvm,
 	 * priority. The count for that new target will have
 	 * already been incremented.
 	 */
-	rc = xive_select_target(kvm, &server, prio);
+	rc = kvmppc_xive_select_target(kvm, &server, prio);
 
 	/*
 	 * We failed to find a target ?
Not much we can do @@ -1504,6 +1504,7 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) { sb->irq_state[i].number = (bid << KVMPPC_XICS_ICS_SHIFT) | i; + sb->irq_state[i].eisn = 0; sb->irq_state[i].guest_priority = MASKED; sb->irq_state[i].saved_priority = MASKED; sb->irq_state[i].act_priority = MASKED; diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 5f2bd6c137b7..492825a35958 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -242,6 +242,99 @@ static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq, return rc; } +static int kvmppc_xive_native_update_source_config(struct kvmppc_xive *xive, + struct kvmppc_xive_src_block *sb, + struct
[PATCH v4 09/17] KVM: PPC: Book3S HV: XIVE: add a control to dirty the XIVE EQ pages
When migration of a VM is initiated, a first copy of the RAM is transferred to the destination before the VM is stopped, but there is no guarantee that the EQ pages in which the event notifications are queued have not been modified.

To make sure migration will capture a consistent memory state, the XIVE device should perform a XIVE quiesce sequence to stop the flow of event notifications and stabilize the EQs. This is the purpose of the KVM_DEV_XIVE_EQ_SYNC control, which will also mark the EQ pages dirty to force their transfer.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---
Changes since v2:
 - extra comments
 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h        |  1 +
 arch/powerpc/kvm/book3s_xive_native.c      | 85 ++++++++++++
 Documentation/virtual/kvm/devices/xive.txt | 29 ++++
 3 files changed, 115 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index e4abe30f6fc6..12744608a61c 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL		1
 #define   KVM_DEV_XIVE_RESET		1
+#define   KVM_DEV_XIVE_EQ_SYNC		2
 #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG	3	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG	4	/* 64-bit EQ identifier */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index d45dc2ec0557..44ce74086550 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -674,6 +674,88 @@ static int kvmppc_xive_reset(struct kvmppc_xive *xive)
 	return 0;
 }
 
+static void kvmppc_xive_native_sync_sources(struct kvmppc_xive_src_block *sb)
+{
+	int j;
+
+	for (j = 0; j < KVMPPC_XICS_IRQ_PER_ICS; j++) {
+		struct kvmppc_xive_irq_state *state = &sb->irq_state[j];
+		struct xive_irq_data *xd;
+		u32 hw_num;
+
+		if (!state->valid)
+			continue;
+
+		/*
+		 * The struct kvmppc_xive_irq_state reflects the state
+		 * of the EAS configuration and not the state of the
+		 * source. The source is masked setting the PQ bits to
+		 * '-Q', which is what is being done before calling
+		 * the KVM_DEV_XIVE_EQ_SYNC control.
+		 *
+		 * If a source EAS is configured, OPAL syncs the XIVE
+		 * IC of the source and the XIVE IC of the previous
+		 * target if any.
+		 *
+		 * So it should be fine ignoring MASKED sources as
+		 * they have been synced already.
+		 */
+		if (state->act_priority == MASKED)
+			continue;
+
+		kvmppc_xive_select_irq(state, &hw_num, &xd);
+		xive_native_sync_source(hw_num);
+		xive_native_sync_queue(hw_num);
+	}
+}
+
+static int kvmppc_xive_native_vcpu_eq_sync(struct kvm_vcpu *vcpu)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	unsigned int prio;
+
+	if (!xc)
+		return -ENOENT;
+
+	for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+		struct xive_q *q = &xc->queues[prio];
+
+		if (!q->qpage)
+			continue;
+
+		/* Mark EQ page dirty for migration */
+		mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qaddr));
+	}
+	return 0;
+}
+
+static int kvmppc_xive_native_eq_sync(struct kvmppc_xive *xive)
+{
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	unsigned int i;
+
+	pr_devel("%s\n", __func__);
+
+	mutex_lock(&xive->lock);
+	for (i = 0; i <= xive->max_sbid; i++) {
+		struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+		if (sb) {
+			arch_spin_lock(&sb->lock);
+			kvmppc_xive_native_sync_sources(sb);
+			arch_spin_unlock(&sb->lock);
+		}
+	}
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvmppc_xive_native_vcpu_eq_sync(vcpu);
+	}
+	mutex_unlock(&xive->lock);
+
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -684,6 +766,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 	switch (attr->attr) {
 	case KVM_DEV_XIVE_RESET:
 		return kvmppc_xive_reset(xive);
+	case KVM_DEV_XIVE_EQ_SYNC:
+		return kvmppc_xive_native_eq_sync(xive);
 	}
 	break;
 case
Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode
Hi Daniel,

On Fri, Mar 15, 2019 at 3:09 AM Daniel Axtens wrote:
> The original assembly imported from OpenSSL has two copy-paste
> errors in handling CTR mode. When dealing with a 2 or 3 block tail,
> the code branches to the CBC decryption exit path, rather than to
> the CTR exit path.
>
> This leads to corruption of the IV, which leads to subsequent blocks
> being corrupted.
>
> This can be detected with libkcapi test suite, which is available at
> https://github.com/smuellerDD/libkcapi
>
> Reported-by: Ondrej Mosnáček
> Fixes: 5c380d623ed3 ("crypto: vmx - Add support for VMS instructions by ASM")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Daniel Axtens

Thank you for looking into this and for posting the patch(es)! I tested the patch yesterday and I can confirm that it makes the libkcapi tests/reproducer pass. Assuming you will want to cover the other failures from the new testmgr tests by a separate patch:

Tested-by: Ondrej Mosnacek

> ---
>  drivers/crypto/vmx/aesp8-ppc.pl | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/crypto/vmx/aesp8-ppc.pl b/drivers/crypto/vmx/aesp8-ppc.pl
> index d6a9f63d65ba..de78282b8f44 100644
> --- a/drivers/crypto/vmx/aesp8-ppc.pl
> +++ b/drivers/crypto/vmx/aesp8-ppc.pl
> @@ -1854,7 +1854,7 @@ Lctr32_enc8x_three:
>  	stvx_u		$out1,$x10,$out
>  	stvx_u		$out2,$x20,$out
>  	addi		$out,$out,0x30
> -	b		Lcbc_dec8x_done
> +	b		Lctr32_enc8x_done
>
>  .align	5
>  Lctr32_enc8x_two:
> @@ -1866,7 +1866,7 @@ Lctr32_enc8x_two:
>  	stvx_u		$out0,$x00,$out
>  	stvx_u		$out1,$x10,$out
>  	addi		$out,$out,0x20
> -	b		Lcbc_dec8x_done
> +	b		Lctr32_enc8x_done
>
>  .align	5
>  Lctr32_enc8x_one:
> --
> 2.19.1
>
[PATCH v4 17/17] KVM: PPC: Book3S HV: XIVE: clear the vCPU interrupt presenters
When the VM boots, the CAS negotiation process determines which
interrupt mode to use and invokes a machine reset. At that time, the
previous KVM interrupt device is 'destroyed' before the chosen one is
created. Upon destruction, the vCPU interrupt presenters using the KVM
device should be cleared first; the machine will reconnect them later
to the new device after it is created.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---

Changes since v2 :

 - removed comments on possible race in kvmppc_native_connect_vcpu()
   for the XIVE KVM device. This is still an issue in the
   XICS-over-XIVE device.

 arch/powerpc/kvm/book3s_xics.c        | 19 +
 arch/powerpc/kvm/book3s_xive.c        | 39 +--
 arch/powerpc/kvm/book3s_xive_native.c | 12 +
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index f27ee57ab46e..81cdabf4295f 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1342,6 +1342,25 @@ static void kvmppc_xics_free(struct kvm_device *dev)
	struct kvmppc_xics *xics = dev->private;
	int i;
	struct kvm *kvm = xics->kvm;
+	struct kvm_vcpu *vcpu;
+
+	/*
+	 * When destroying the VM, the vCPUs are destroyed first and
+	 * the vCPU list should be empty. If this is not the case,
+	 * then we are simply destroying the device and we should
+	 * clean up the vCPU interrupt presenters first.
+	 */
+	if (atomic_read(&kvm->online_vcpus) != 0) {
+		/*
+		 * call kick_all_cpus_sync() to ensure that all CPUs
+		 * have executed any pending interrupts
+		 */
+		if (is_kvmppc_hv_enabled(kvm))
+			kick_all_cpus_sync();
+
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			kvmppc_xics_free_icp(vcpu);
+	}

	debugfs_remove(xics->dentry);

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 480a3fc6b9fd..cf6a4c6c5a28 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1100,11 +1100,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
 void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
-	struct kvmppc_xive *xive = xc->xive;
+	struct kvmppc_xive *xive;
	int i;

+	if (!kvmppc_xics_enabled(vcpu))
+		return;
+
+	if (!xc)
+		return;
+
	pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num);

+	xive = xc->xive;
+
	/* Ensure no interrupt is still routed to that VP */
	xc->valid = false;
	kvmppc_xive_disable_vcpu_interrupts(vcpu);
@@ -1141,6 +1149,10 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
	}
	/* Free the VP */
	kfree(xc);
+
+	/* Cleanup the vcpu */
+	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+	vcpu->arch.xive_vcpu = NULL;
 }

 int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
@@ -1158,7 +1170,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
	}
	if (xive->kvm != vcpu->kvm)
		return -EPERM;
-	if (vcpu->arch.irq_type)
+	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
		return -EBUSY;
	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
		pr_devel("Duplicate !\n");
@@ -1828,8 +1840,31 @@ static void kvmppc_xive_free(struct kvm_device *dev)
 {
	struct kvmppc_xive *xive = dev->private;
	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
	int i;

+	/*
+	 * When destroying the VM, the vCPUs are destroyed first and
+	 * the vCPU list should be empty. If this is not the case,
+	 * then we are simply destroying the device and we should
+	 * clean up the vCPU interrupt presenters first.
+	 */
+	if (atomic_read(&kvm->online_vcpus) != 0) {
+		/*
+		 * call kick_all_cpus_sync() to ensure that all CPUs
+		 * have executed any pending interrupts
+		 */
+		if (is_kvmppc_hv_enabled(kvm))
+			kick_all_cpus_sync();
+
+		/*
+		 * TODO: There is still a race window with the early
+		 * checks in kvmppc_native_connect_vcpu()
+		 */
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			kvmppc_xive_cleanup_vcpu(vcpu);
+	}
+
	debugfs_remove(xive->dentry);

	if (kvm)

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 6a502eee6744..96e6b5c50eb3 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -961,8 +961,20 @@ static void kvmppc_xive_native_free(struct kvm_device *dev)
 {
	struct kvmppc_xive *xive =
[PATCH v4 16/17] KVM: introduce a KVM_DESTROY_DEVICE ioctl
The 'destroy' method is currently used to destroy all devices when the
VM is destroyed after the vCPUs have been freed.

This new KVM ioctl exposes the same KVM device method. It acts as a
software reset of the VM to 'destroy' selected devices when necessary
and perform the required cleanups on the vCPUs. Called with the
kvm->lock.

The 'destroy' method could be improved by returning an error code.

Cc: Paolo Bonzini
Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---

Changes since v3 :

 - Removed temporary TODO comment in kvm_ioctl_destroy_device()
   regarding kvm_put_kvm()

Changes since v2 :

 - checked that device is owned by VM

 include/uapi/linux/kvm.h          |  7 ++
 virt/kvm/kvm_main.c               | 41 +++
 Documentation/virtual/kvm/api.txt | 20 +++
 3 files changed, 68 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52bf74a1616e..d78fafa54274 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1183,6 +1183,11 @@ struct kvm_create_device {
	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
 };

+struct kvm_destroy_device {
+	__u32	fd;	/* in: device handle */
+	__u32	flags;	/* in: unused */
+};
+
 struct kvm_device_attr {
	__u32	flags;		/* no flags currently defined */
	__u32	group;		/* device-defined */
@@ -1331,6 +1336,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)

+#define KVM_DESTROY_DEVICE	  _IOWR(KVMIO, 0xf0, struct kvm_destroy_device)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e2fa5c7dd1a..9601c2ddecc5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3032,6 +3032,33 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
	return 0;
 }

+static int kvm_ioctl_destroy_device(struct kvm *kvm,
+				    struct kvm_destroy_device *dd)
+{
+	struct fd f;
+	struct kvm_device *dev;
+
+	f = fdget(dd->fd);
+	if (!f.file)
+		return -EBADF;
+
+	dev = kvm_device_from_filp(f.file);
+	fdput(f);
+
+	if (!dev)
+		return -ENODEV;
+
+	if (dev->kvm != kvm)
+		return -EPERM;
+
+	mutex_lock(&kvm->lock);
+	list_del(&dev->vm_node);
+	dev->ops->destroy(dev);
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
+
 static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
	switch (arg) {
@@ -3276,6 +3303,20 @@ static long kvm_vm_ioctl(struct file *filp,
		r = 0;
		break;
	}
+	case KVM_DESTROY_DEVICE: {
+		struct kvm_destroy_device dd;
+
+		r = -EFAULT;
+		if (copy_from_user(&dd, argp, sizeof(dd)))
+			goto out;
+
+		r = kvm_ioctl_destroy_device(kvm, &dd);
+		if (r)
+			goto out;
+
+		r = 0;
+		break;
+	}
	case KVM_CHECK_EXTENSION:
		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
		break;
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 8022ecce2c47..abe8433adf4f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3874,6 +3874,26 @@ number of valid entries in the 'entries' array, which is then filled.
 'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
 userspace should not expect to get any particular value there.

+4.119 KVM_DESTROY_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_destroy_device (in)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EPERM:  The device does not belong to the VM
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Destroys an emulated device in the kernel.
+
+struct kvm_destroy_device {
+	__u32	fd;	/* in: device handle */
+	__u32	flags;	/* unused */
+};
+
 5. The kvm_run structure
--
2.20.1
[PATCH v4 14/17] KVM: PPC: Book3S HV: XIVE: add passthrough support
The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
implement an IRQ space for the guest using the generic IPI interrupts
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range
0-0x1FFF. Interrupt management is performed in the XIVE way: using
loads and stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information
on the interrupts, among which the xive_irq_data struct containing the
addresses of the IPI ESB pages and an extra one in case of
pass-through. The latter contains the addresses of the ESB pages of
the underlying HW controller interrupts, PHB4 in all cases for now.

A guest, when running in the XICS legacy interrupt mode, lets the KVM
XICS-over-XIVE device "handle" interrupt management, that is to
perform the loads and stores on the addresses of the ESB pages of the
guest interrupts. However, when running in XIVE native exploitation
mode, the KVM XIVE native device exposes the interrupt ESB pages to
the guest and lets the guest perform directly the loads and stores.

The VMA exposing the ESB pages makes use of a custom VM fault handler
whose role is to populate the VMA with appropriate pages. When a fault
occurs, the guest IRQ number is deduced from the offset, and the ESB
pages of the associated XIVE IPI interrupt are inserted in the VMA
(using the internal structure caching information on the interrupts).

Supporting device passthrough in the guest running in XIVE native
exploitation mode adds some extra refinements because the ESB pages
of a different HW controller (PHB4) need to be exposed to the guest
along with the initial IPI ESB pages of the XIVE IC controller. But
the overall mechanics are the same.
When the device HW irqs are mapped into or unmapped from the guest IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped(), are called to record or clear the passthrough interrupt information and to perform the switch. The approach taken by this patch is to clear the ESB pages of the guest IRQ number being mapped and let the VM fault handler repopulate. The handler will insert the ESB page corresponding to the HW interrupt of the device being passed-through or the initial IPI ESB page if the device is being removed. Signed-off-by: Cédric Le Goater Reviewed-by: David Gibson --- Changes since v2 : - extra comment in documentation arch/powerpc/kvm/book3s_xive.h | 9 + arch/powerpc/kvm/book3s_xive.c | 15 arch/powerpc/kvm/book3s_xive_native.c | 41 ++ Documentation/virtual/kvm/devices/xive.txt | 19 ++ 4 files changed, 84 insertions(+) diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 622f594d93e1..e011622dc038 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -94,6 +94,11 @@ struct kvmppc_xive_src_block { struct kvmppc_xive_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS]; }; +struct kvmppc_xive; + +struct kvmppc_xive_ops { + int (*reset_mapped)(struct kvm *kvm, unsigned long guest_irq); +}; struct kvmppc_xive { struct kvm *kvm; @@ -132,6 +137,10 @@ struct kvmppc_xive { /* Flags */ u8 single_escalation; + + struct kvmppc_xive_ops *ops; + struct address_space *mapping; + struct mutex mapping_lock; }; #define KVMPPC_XIVE_Q_COUNT8 diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index c1b7aa7dbc28..480a3fc6b9fd 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -937,6 +937,13 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq, /* Turn the IPI hard off */ xive_vm_esb_load(>ipi_data, XIVE_ESB_SET_PQ_01); + /* +* Reset ESB guest mapping. 
Needed when ESB pages are exposed +* to the guest in XIVE native mode +*/ + if (xive->ops && xive->ops->reset_mapped) + xive->ops->reset_mapped(kvm, guest_irq); + /* Grab info about irq */ state->pt_number = hw_irq; state->pt_data = irq_data_get_irq_handler_data(host_data); @@ -1022,6 +1029,14 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq, state->pt_number = 0; state->pt_data = NULL; + /* +* Reset ESB guest mapping. Needed when ESB pages are exposed +* to the guest in XIVE native mode +*/ + if (xive->ops && xive->ops->reset_mapped) { + xive->ops->reset_mapped(kvm, guest_irq); + } + /* Reconfigure the IPI */ xive_native_configure_irq(state->ipi_number, kvmppc_xive_vp(xive, state->act_server), diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index d0a055030efd..6a502eee6744 100644 ---
[PATCH v4 08/17] KVM: PPC: Book3S HV: XIVE: add a control to sync the sources
This control will be used by the H_INT_SYNC hcall from QEMU to flush
event notifications on the XIVE IC owning the source.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---

Changes since v2 :

 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h        |  1 +
 arch/powerpc/kvm/book3s_xive_native.c      | 36 ++
 Documentation/virtual/kvm/devices/xive.txt |  8 +
 3 files changed, 45 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index f045f9dee42e..e4abe30f6fc6 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -683,6 +683,7 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG	3	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG	4	/* 64-bit EQ identifier */
+#define KVM_DEV_XIVE_GRP_SOURCE_SYNC	5	/* 64-bit source identifier */

 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index b54d6fa978fe..d45dc2ec0557 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -335,6 +335,38 @@ static int kvmppc_xive_native_set_source_config(struct kvmppc_xive *xive,
					priority, masked, eisn);
 }

+static int kvmppc_xive_native_sync_source(struct kvmppc_xive *xive,
+					  long irq, u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	struct xive_irq_data *xd;
+	u32 hw_num;
+	u16 src;
+	int rc = 0;
+
+	pr_devel("%s irq=0x%lx", __func__, irq);
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	rc = -EINVAL;
+
+	arch_spin_lock(&sb->lock);
+
+	if (state->valid) {
+		kvmppc_xive_select_irq(state, &hw_num, &xd);
+		xive_native_sync_source(hw_num);
+		rc = 0;
+	}
+
+	arch_spin_unlock(&sb->lock);
+	return rc;
+}
+
 static int xive_native_validate_queue_size(u32 qshift)
 {
	/*
@@ -663,6 +695,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
	case KVM_DEV_XIVE_GRP_EQ_CONFIG:
		return kvmppc_xive_native_set_queue_config(xive, attr->attr,
							   attr->addr);
+	case KVM_DEV_XIVE_GRP_SOURCE_SYNC:
+		return kvmppc_xive_native_sync_source(xive, attr->attr,
+						      attr->addr);
	}
	return -ENXIO;
 }
@@ -692,6 +727,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
		break;
	case KVM_DEV_XIVE_GRP_SOURCE:
	case KVM_DEV_XIVE_GRP_SOURCE_CONFIG:
+	case KVM_DEV_XIVE_GRP_SOURCE_SYNC:
		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
		    attr->attr < KVMPPC_XIVE_NR_IRQS)
			return 0;
diff --git a/Documentation/virtual/kvm/devices/xive.txt b/Documentation/virtual/kvm/devices/xive.txt
index acd5cb9d1339..26fc918b02fb 100644
--- a/Documentation/virtual/kvm/devices/xive.txt
+++ b/Documentation/virtual/kvm/devices/xive.txt
@@ -92,3 +92,11 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
     -EINVAL: Invalid queue address
     -EFAULT: Invalid user pointer for attr->addr.
     -EIO:    Configuration of the underlying HW failed
+
+  5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
+  Synchronize the source to flush event notifications
+  Attributes:
+    Interrupt source number (64-bit)
+  Errors:
+    -ENOENT: Unknown source number
+    -EINVAL: Not initialized source number
--
2.20.1
[PATCH v4 07/17] KVM: PPC: Book3S HV: XIVE: add a global reset control
This control is to be used by the H_INT_RESET hcall from QEMU. Its
purpose is to clear all configuration of the sources and EQs. This is
necessary in case of a kexec (for a kdump kernel for instance) to make
sure that no remaining configuration is left from the previous boot
setup so that the new kernel can start safely from a clean state.

The queue 7 is ignored when the XIVE device is configured to run in
single escalation mode. Prio 7 is used by escalations.

The XIVE VP is kept enabled as the vCPU is still active and connected
to the XIVE device.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---

Changes since v2 :

 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h        |  1 +
 arch/powerpc/kvm/book3s_xive_native.c      | 85 ++
 Documentation/virtual/kvm/devices/xive.txt |  5 ++
 3 files changed, 91 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 85005400fd86..f045f9dee42e 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -679,6 +679,7 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL		1
+#define KVM_DEV_XIVE_RESET		1
 #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG	3	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG	4	/* 64-bit EQ identifier */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 2c335454da72..b54d6fa978fe 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -565,6 +565,83 @@ static int kvmppc_xive_native_get_queue_config(struct kvmppc_xive *xive,
	return 0;
 }

+static void kvmppc_xive_reset_sources(struct kvmppc_xive_src_block *sb)
+{
+	int i;
+
+	for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+		struct kvmppc_xive_irq_state *state = &sb->irq_state[i];
+
+		if (!state->valid)
+			continue;
+
+		if (state->act_priority == MASKED)
+			continue;
+
+		state->eisn = 0;
+		state->act_server = 0;
+		state->act_priority = MASKED;
+		xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
+		xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
+		if (state->pt_number) {
+			xive_vm_esb_load(state->pt_data, XIVE_ESB_SET_PQ_01);
+			xive_native_configure_irq(state->pt_number,
+						  0, MASKED, 0);
+		}
+	}
+}
+
+static int kvmppc_xive_reset(struct kvmppc_xive *xive)
+{
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	unsigned int i;
+
+	pr_devel("%s\n", __func__);
+
+	mutex_lock(&kvm->lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+		unsigned int prio;
+
+		if (!xc)
+			continue;
+
+		kvmppc_xive_disable_vcpu_interrupts(vcpu);
+
+		for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+
+			/* Single escalation, no queue 7 */
+			if (prio == 7 && xive->single_escalation)
+				break;
+
+			if (xc->esc_virq[prio]) {
+				free_irq(xc->esc_virq[prio], vcpu);
+				irq_dispose_mapping(xc->esc_virq[prio]);
+				kfree(xc->esc_virq_names[prio]);
+				xc->esc_virq[prio] = 0;
+			}
+
+			kvmppc_xive_native_cleanup_queue(vcpu, prio);
+		}
+	}
+
+	for (i = 0; i <= xive->max_sbid; i++) {
+		struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+		if (sb) {
+			arch_spin_lock(&sb->lock);
+			kvmppc_xive_reset_sources(sb);
+			arch_spin_unlock(&sb->lock);
+		}
+	}
+
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
				       struct kvm_device_attr *attr)
 {
@@ -572,6 +649,10 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,

	switch (attr->group) {
	case KVM_DEV_XIVE_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_XIVE_RESET:
+			return kvmppc_xive_reset(xive);
+		}
		break;
	case KVM_DEV_XIVE_GRP_SOURCE:
		return kvmppc_xive_native_set_source(xive, attr->attr,
@@ -604,6 +685,10 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 {
[PATCH v4 06/17] KVM: PPC: Book3S HV: XIVE: add controls for the EQ configuration
These controls will be used by the H_INT_SET_QUEUE_CONFIG and
H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
Event Queue in the XIVE IC. They will also be used to restore the
configuration of the XIVE EQs and to capture the internal run-time
state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
the EQ toggle bit and EQ index which are updated by the XIVE IC when
event notifications are enqueued in the EQ.

The value of the guest physical address of the event queue is saved in
the XIVE internal xive_q structure for later use. That is when
migration needs to mark the EQ pages dirty to capture a consistent
memory state of the VM.

To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra
OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
but restoring the EQ state will.

Signed-off-by: Cédric Le Goater
---

Changes since v3 :

 - fixed the test on the initial setting of the EQ toggle bit : 0 -> 1
 - renamed qsize to qshift
 - renamed qpage to qaddr
 - checked host page size
 - limited flags to KVM_XIVE_EQ_ALWAYS_NOTIFY to fit sPAPR specs

Changes since v2 :

 - fixed comments on the KVM device attribute definitions
 - fixed check on supported EQ size to restrict to 64K pages
 - checked kvm_eq.flags that need to be zero
 - removed the OPAL call when EQ qtoggle bit and index are zero.
arch/powerpc/include/asm/xive.h| 2 + arch/powerpc/include/uapi/asm/kvm.h| 19 ++ arch/powerpc/kvm/book3s_xive.h | 2 + arch/powerpc/kvm/book3s_xive.c | 15 +- arch/powerpc/kvm/book3s_xive_native.c | 242 + Documentation/virtual/kvm/devices/xive.txt | 34 +++ 6 files changed, 308 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index b579a943407b..c4e88abd3b67 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -73,6 +73,8 @@ struct xive_q { u32 esc_irq; atomic_tcount; atomic_tpending_count; + u64 guest_qaddr; + u32 guest_qshift; }; /* Global enable flags for the XIVE support */ diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index e8161e21629b..85005400fd86 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GRP_CTRL 1 #define KVM_DEV_XIVE_GRP_SOURCE2 /* 64-bit source identifier */ #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3 /* 64-bit source identifier */ +#define KVM_DEV_XIVE_GRP_EQ_CONFIG 4 /* 64-bit EQ identifier */ /* Layout of 64-bit XIVE source attribute values */ #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) @@ -696,4 +697,22 @@ struct kvm_ppc_cpu_char { #define KVM_XIVE_SOURCE_EISN_SHIFT 33 #define KVM_XIVE_SOURCE_EISN_MASK 0xfffeULL +/* Layout of 64-bit EQ identifier */ +#define KVM_XIVE_EQ_PRIORITY_SHIFT 0 +#define KVM_XIVE_EQ_PRIORITY_MASK 0x7 +#define KVM_XIVE_EQ_SERVER_SHIFT 3 +#define KVM_XIVE_EQ_SERVER_MASK0xfff8ULL + +/* Layout of EQ configuration values (64 bytes) */ +struct kvm_ppc_xive_eq { + __u32 flags; + __u32 qshift; + __u64 qaddr; + __u32 qtoggle; + __u32 qindex; + __u8 pad[40]; +}; + +#define KVM_XIVE_EQ_ALWAYS_NOTIFY 0x0001 + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index ae26fe653d98..622f594d93e1 100644 --- a/arch/powerpc/kvm/book3s_xive.h 
+++ b/arch/powerpc/kvm/book3s_xive.h @@ -272,6 +272,8 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( struct kvmppc_xive *xive, int irq); void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb); int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio); +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio, + bool single_escalation); #endif /* CONFIG_KVM_XICS */ #endif /* _KVM_PPC_BOOK3S_XICS_H */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index e09f3addffe5..c1b7aa7dbc28 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -166,7 +166,8 @@ static irqreturn_t xive_esc_irq(int irq, void *data) return IRQ_HANDLED; } -static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio) +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio, + bool single_escalation) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; struct xive_q *q = >queues[prio]; @@ -185,7 +186,7 @@ static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio) return -EIO; } - if
[PATCH v4 04/17] KVM: PPC: Book3S HV: XIVE: add a control to initialize a source
The XIVE KVM device maintains a list of interrupt sources for the VM which are allocated in the pool of generic interrupts (IPIs) of the main XIVE IC controller. These are used for the CPU IPIs as well as for virtual device interrupts. The IRQ number space is defined by QEMU. The XIVE device reuses the source structures of the XICS-on-XIVE device for the source blocks (2-level tree) and for the source interrupts. Under XIVE native, the source interrupt caches mostly configuration information and is less used than under the XICS-on-XIVE device in which hcalls are still necessary at run-time. When a source is initialized in KVM, an IPI interrupt source is simply allocated at the OPAL level and then MASKED. KVM only needs to know about its type: LSI or MSI. Signed-off-by: Cédric Le Goater Reviewed-by: David Gibson --- Changes since v2: - extra documentation in commit log - fixed comments on XIVE IRQ number space - removed usage of the __x_* macros - fixed locking on source block arch/powerpc/include/uapi/asm/kvm.h| 5 + arch/powerpc/kvm/book3s_xive.h | 10 ++ arch/powerpc/kvm/book3s_xive.c | 8 +- arch/powerpc/kvm/book3s_xive_native.c | 106 + Documentation/virtual/kvm/devices/xive.txt | 15 +++ 5 files changed, 140 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index be0ce1f17625..d468294c2a67 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -679,5 +679,10 @@ struct kvm_ppc_cpu_char { /* POWER9 XIVE Native Interrupt Controller */ #define KVM_DEV_XIVE_GRP_CTRL 1 +#define KVM_DEV_XIVE_GRP_SOURCE2 /* 64-bit source identifier */ + +/* Layout of 64-bit XIVE source attribute values */ +#define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) +#define KVM_XIVE_LEVEL_ASSERTED(1ULL << 1) #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index d366df69b9cb..1be921cb5dcb 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ 
b/arch/powerpc/kvm/book3s_xive.h @@ -12,6 +12,13 @@ #ifdef CONFIG_KVM_XICS #include "book3s_xics.h" +/* + * The XIVE Interrupt source numbers are within the range 0 to + * KVMPPC_XICS_NR_IRQS. + */ +#define KVMPPC_XIVE_FIRST_IRQ 0 +#define KVMPPC_XIVE_NR_IRQSKVMPPC_XICS_NR_IRQS + /* * State for one guest irq source. * @@ -258,6 +265,9 @@ extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr); */ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu); int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu); +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( + struct kvmppc_xive *xive, int irq); +void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb); #endif /* CONFIG_KVM_XICS */ #endif /* _KVM_PPC_BOOK3S_XICS_H */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index e7f1ada1c3de..6c9f9fd0855f 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -1480,8 +1480,8 @@ static int xive_get_source(struct kvmppc_xive *xive, long irq, u64 addr) return 0; } -static struct kvmppc_xive_src_block *xive_create_src_block(struct kvmppc_xive *xive, - int irq) +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( + struct kvmppc_xive *xive, int irq) { struct kvm *kvm = xive->kvm; struct kvmppc_xive_src_block *sb; @@ -1560,7 +1560,7 @@ static int xive_set_source(struct kvmppc_xive *xive, long irq, u64 addr) sb = kvmppc_xive_find_source(xive, irq, ); if (!sb) { pr_devel("No source, creating source block...\n"); - sb = xive_create_src_block(xive, irq); + sb = kvmppc_xive_create_src_block(xive, irq); if (!sb) { pr_devel("Failed to create block...\n"); return -ENOMEM; @@ -1784,7 +1784,7 @@ static void kvmppc_xive_cleanup_irq(u32 hw_num, struct xive_irq_data *xd) xive_cleanup_irq_data(xd); } -static void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb) +void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb) { int i; diff --git 
a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 6fa73cfd9d9c..5f2bd6c137b7 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -26,6 +26,17 @@ #include "book3s_xive.h" +static u8 xive_vm_esb_load(struct xive_irq_data *xd, u32 offset) +{ + u64 val; + + if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG) + offset |= offset << 4; + + val = in_be64(xd->eoi_mmio + offset); + return (u8)val; +} + static void kvmppc_xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio) {
[PATCH v4 03/17] KVM: PPC: Book3S HV: XIVE: introduce a new capability KVM_CAP_PPC_IRQ_XIVE
The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to let QEMU connect the vCPU presenters to the XIVE KVM device if required. The capability is not advertised for now as the full support for the XIVE native exploitation mode is not yet available. When this is case, the capability will be advertised on PowerNV Hypervisors only. Nested guests (pseries KVM Hypervisor) are not supported. Internally, the interface to the new KVM device is protected with a new interrupt mode: KVMPPC_IRQ_XIVE. Signed-off-by: Cédric Le Goater Reviewed-by: David Gibson --- Changes since v2: - made use of the xive_vp() macro to compute VP identifiers - reworked locking in kvmppc_xive_native_connect_vcpu() to fix races - stop advertising KVM_CAP_PPC_IRQ_XIVE as support is not fully available yet arch/powerpc/include/asm/kvm_host.h | 1 + arch/powerpc/include/asm/kvm_ppc.h| 13 +++ arch/powerpc/kvm/book3s_xive.h| 11 ++ include/uapi/linux/kvm.h | 1 + arch/powerpc/kvm/book3s_xive.c| 88 --- arch/powerpc/kvm/book3s_xive_native.c | 150 ++ arch/powerpc/kvm/powerpc.c| 36 +++ Documentation/virtual/kvm/api.txt | 9 ++ 8 files changed, 268 insertions(+), 41 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 008523224e7a..9cc6abdce1b9 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -450,6 +450,7 @@ struct kvmppc_passthru_irqmap { #define KVMPPC_IRQ_DEFAULT 0 #define KVMPPC_IRQ_MPIC1 #define KVMPPC_IRQ_XICS2 /* Includes a XIVE option */ +#define KVMPPC_IRQ_XIVE3 /* XIVE native exploitation mode */ #define MMIO_HPTE_CACHE_SIZE 4 diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index f3383e76017a..6928a35ac3c7 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -595,6 +595,14 @@ extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, bool line_status); extern void 
kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu); +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE; +} + +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, + struct kvm_vcpu *vcpu, u32 cpu); +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu); extern void kvmppc_xive_native_init_module(void); extern void kvmppc_xive_native_exit_module(void); @@ -622,6 +630,11 @@ static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 ir int level, bool line_status) { return -ENODEV; } static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { } +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu) + { return 0; } +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, + struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; } +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { } static inline void kvmppc_xive_native_init_module(void) { } static inline void kvmppc_xive_native_exit_module(void) { } diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index a08ae6fd4c51..d366df69b9cb 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -198,6 +198,11 @@ static inline struct kvmppc_xive_src_block *kvmppc_xive_find_source(struct kvmpp return xive->src_blocks[bid]; } +static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server) +{ + return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server); +} + /* * Mapping between guest priorities and host priorities * is as follow. 
@@ -248,5 +253,11 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server, extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr); extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr); +/* + * Common Xive routines for XICS-over-XIVE and XIVE native + */ +void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu); +int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu); + #endif /* CONFIG_KVM_XICS */ #endif /* _KVM_PPC_BOOK3S_XICS_H */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index e6368163d3a0..52bf74a1616e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_ARM_VM_IPA_SIZE 165 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166 #define KVM_CAP_HYPERV_CPUID 167 +#define KVM_CAP_PPC_IRQ_XIVE 168 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index f78d002f0fe0..e7f1ada1c3de
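For context, the connect side of this capability is driven from userspace: the VMM enables KVM_CAP_PPC_IRQ_XIVE on each vCPU, passing the XIVE device fd and the vCPU's server number, much like the existing KVM_CAP_IRQ_XICS flow. A hedged sketch of how a VMM might fill the request follows — the struct layout mirrors struct kvm_enable_cap from <linux/kvm.h>, and the args[] convention (device fd in args[0], server number in args[1]) is an assumption modeled on the XICS equivalent, not taken from this patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct kvm_enable_cap from <linux/kvm.h>. */
struct kvm_enable_cap_sketch {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t  pad[64];
};

#define KVM_CAP_PPC_IRQ_XIVE 168	/* value from the uapi hunk above */

/*
 * Build the KVM_ENABLE_CAP request connecting a vCPU to the XIVE device.
 * xive_fd is the fd returned by KVM_CREATE_DEVICE; server is the vCPU id
 * as seen by the interrupt controller. Both are placeholders here.
 */
static struct kvm_enable_cap_sketch xive_connect_req(int xive_fd, uint32_t server)
{
	struct kvm_enable_cap_sketch cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_PPC_IRQ_XIVE;
	cap.args[0] = (uint64_t)xive_fd;	/* assumption: device fd in args[0] */
	cap.args[1] = server;			/* assumption: server in args[1] */
	return cap;
}
```

The actual call would then be ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap); the kernel side routes it to kvmppc_xive_native_connect_vcpu().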
[PATCH v4 02/17] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
This is the basic framework for the new KVM device supporting the XIVE native exploitation mode. The user interface exposes a new KVM device to be created by QEMU, only available when running on an L0 hypervisor. Support for nested guests is not available yet.

The XIVE device reuses the device structure of the XICS-on-XIVE device as they have a lot in common. That could possibly change in the future if the need arises.

Signed-off-by: Cédric Le Goater
Reviewed-by: David Gibson
---

Changes since v3:

 - removed a couple of useless includes

Changes since v2:

 - removed ->q_order setting. Only useful in the XICS-on-XIVE KVM
   device which allocates the EQs on behalf of the guest.
 - returned -ENXIO when VP base is invalid

 arch/powerpc/include/asm/kvm_host.h        |   1 +
 arch/powerpc/include/asm/kvm_ppc.h         |   8 +
 arch/powerpc/include/uapi/asm/kvm.h        |   3 +
 include/uapi/linux/kvm.h                   |   2 +
 arch/powerpc/kvm/book3s.c                  |   7 +-
 arch/powerpc/kvm/book3s_xive_native.c      | 179 +
 Documentation/virtual/kvm/devices/xive.txt |  19 +++
 arch/powerpc/kvm/Makefile                  |   2 +-
 8 files changed, 219 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
 create mode 100644 Documentation/virtual/kvm/devices/xive.txt

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e6b5bb012ccb..008523224e7a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -222,6 +222,7 @@ extern struct kvm_device_ops kvm_xics_ops;
 struct kvmppc_xive;
 struct kvmppc_xive_vcpu;
 extern struct kvm_device_ops kvm_xive_ops;
+extern struct kvm_device_ops kvm_xive_native_ops;

 struct kvmppc_passthru_irqmap;

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index ac22b28ae78d..f3383e76017a 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -594,6 +594,10 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int
irq_source_id, u32 irq, int level, bool line_status); extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu); + +extern void kvmppc_xive_native_init_module(void); +extern void kvmppc_xive_native_exit_module(void); + #else static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority) { return -1; } @@ -617,6 +621,10 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, bool line_status) { return -ENODEV; } static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { } + +static inline void kvmppc_xive_native_init_module(void) { } +static inline void kvmppc_xive_native_exit_module(void) { } + #endif /* CONFIG_KVM_XIVE */ #if defined(CONFIG_PPC_POWERNV) && defined(CONFIG_KVM_BOOK3S_64_HANDLER) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 26ca425f4c2c..be0ce1f17625 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -677,4 +677,7 @@ struct kvm_ppc_cpu_char { #define KVM_XICS_PRESENTED(1ULL << 43) #define KVM_XICS_QUEUED (1ULL << 44) +/* POWER9 XIVE Native Interrupt Controller */ +#define KVM_DEV_XIVE_GRP_CTRL 1 + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 6d4ea4b6c922..e6368163d3a0 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1211,6 +1211,8 @@ enum kvm_device_type { #define KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_ITS, #define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS + KVM_DEV_TYPE_XIVE, +#define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_MAX, }; diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 10c5579d20ce..7c3348fa27e1 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -1050,6 +1050,9 @@ static int kvmppc_book3s_init(void) if 
(xics_on_xive()) { kvmppc_xive_init_module(); kvm_register_device_ops(_xive_ops, KVM_DEV_TYPE_XICS); + kvmppc_xive_native_init_module(); + kvm_register_device_ops(_xive_native_ops, + KVM_DEV_TYPE_XIVE); } else #endif kvm_register_device_ops(_xics_ops, KVM_DEV_TYPE_XICS); @@ -1060,8 +1063,10 @@ static int kvmppc_book3s_init(void) static void kvmppc_book3s_exit(void) { #ifdef CONFIG_KVM_XICS - if (xics_on_xive()) + if (xics_on_xive()) { kvmppc_xive_exit_module(); +
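Once kvm_xive_native_ops is registered as shown above, userspace instantiates the device with KVM_CREATE_DEVICE on the VM fd. A small sketch of the request — the struct mirrors struct kvm_create_device from <linux/kvm.h>, and the enum fragment follows the ordering in the uapi hunk above, where KVM_DEV_TYPE_XIVE is appended right after KVM_DEV_TYPE_ARM_VGIC_ITS (the absolute value 8 for ARM_VGIC_ITS is an assumption for this sketch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the tail of enum kvm_device_type after the hunk above. */
enum kvm_device_type_sketch {
	KVM_DEV_TYPE_ARM_VGIC_ITS = 8,	/* assumed value, for illustration */
	KVM_DEV_TYPE_XIVE,		/* new entry added by this patch */
	KVM_DEV_TYPE_MAX,
};

/* Mirrors struct kvm_create_device from <linux/kvm.h>. */
struct kvm_create_device_sketch {
	uint32_t type;	/* in: device type to create */
	uint32_t fd;	/* out: device fd, filled in by KVM */
	uint32_t flags;	/* in: e.g. KVM_CREATE_DEVICE_TEST */
};

/* Build the KVM_CREATE_DEVICE request for a XIVE native device. */
static struct kvm_create_device_sketch xive_create_req(void)
{
	struct kvm_create_device_sketch cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = KVM_DEV_TYPE_XIVE;
	return cd;
}
```

The VMM would issue ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) and then drive the device through its group/attribute interface (KVM_DEV_XIVE_GRP_CTRL above).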
[PATCH v4 01/17] powerpc/xive: add OPAL extensions for the XIVE native exploitation support
The support for XIVE native exploitation mode in Linux/KVM needs a couple more OPAL calls to get and set the state of the XIVE internal structures being used by a sPAPR guest. Signed-off-by: Cédric Le Goater Reviewed-by: David Gibson --- Changes since v3: - rebased on 5.1-rc1 Changes since v2: - remove extra OPAL call definitions arch/powerpc/include/asm/opal-api.h| 7 +- arch/powerpc/include/asm/opal.h| 7 ++ arch/powerpc/include/asm/xive.h| 14 +++ arch/powerpc/platforms/powernv/opal-call.c | 3 + arch/powerpc/sysdev/xive/native.c | 99 ++ 5 files changed, 127 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 870fb7b239ea..e1d118ac61dc 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -186,8 +186,8 @@ #define OPAL_XIVE_FREE_IRQ 140 #define OPAL_XIVE_SYNC 141 #define OPAL_XIVE_DUMP 142 -#define OPAL_XIVE_RESERVED3143 -#define OPAL_XIVE_RESERVED4144 +#define OPAL_XIVE_GET_QUEUE_STATE 143 +#define OPAL_XIVE_SET_QUEUE_STATE 144 #define OPAL_SIGNAL_SYSTEM_RESET 145 #define OPAL_NPU_INIT_CONTEXT 146 #define OPAL_NPU_DESTROY_CONTEXT 147 @@ -210,7 +210,8 @@ #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165 #defineOPAL_NX_COPROC_INIT 167 -#define OPAL_LAST 167 +#define OPAL_XIVE_GET_VP_STATE 170 +#define OPAL_LAST 170 #define QUIESCE_HOLD 1 /* Spin all calls at entry */ #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index a55b01c90bb1..4e978d4dea5c 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -279,6 +279,13 @@ int64_t opal_xive_allocate_irq(uint32_t chip_id); int64_t opal_xive_free_irq(uint32_t girq); int64_t opal_xive_sync(uint32_t type, uint32_t id); int64_t opal_xive_dump(uint32_t type, uint32_t id); +int64_t opal_xive_get_queue_state(uint64_t vp, uint32_t prio, + __be32 *out_qtoggle, + __be32 
*out_qindex); +int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t prio, + uint32_t qtoggle, + uint32_t qindex); +int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01); int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target, uint64_t desc, uint16_t pe_number); diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 3c704f5dd3ae..b579a943407b 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -109,12 +109,26 @@ extern int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio, extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio); extern void xive_native_sync_source(u32 hw_irq); +extern void xive_native_sync_queue(u32 hw_irq); extern bool is_xive_irq(struct irq_chip *chip); extern int xive_native_enable_vp(u32 vp_id, bool single_escalation); extern int xive_native_disable_vp(u32 vp_id); extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id); extern bool xive_native_has_single_escalation(void); +extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio, + u64 *out_qpage, + u64 *out_qsize, + u64 *out_qeoi_page, + u32 *out_escalate_irq, + u64 *out_qflags); + +extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle, + u32 *qindex); +extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle, + u32 qindex); +extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state); + #else static inline bool xive_enabled(void) { return false; } diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c index daad8c45c8e7..7472244e7f30 100644 --- a/arch/powerpc/platforms/powernv/opal-call.c +++ b/arch/powerpc/platforms/powernv/opal-call.c @@ -260,6 +260,9 @@ OPAL_CALL(opal_xive_get_vp_info, OPAL_XIVE_GET_VP_INFO); OPAL_CALL(opal_xive_set_vp_info, OPAL_XIVE_SET_VP_INFO); OPAL_CALL(opal_xive_sync, OPAL_XIVE_SYNC); OPAL_CALL(opal_xive_dump, OPAL_XIVE_DUMP); 
+OPAL_CALL(opal_xive_get_queue_state,
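The qtoggle/qindex pair handled by opal_xive_get_queue_state() and opal_xive_set_queue_state() describes the consumer position in an event queue: the index wraps at the queue size, and the toggle (generation) bit flips on each wrap so that valid entries can be told apart from stale ones. The following is a small illustrative model of that wrap/toggle rule only — it is not the OPAL or kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

struct eq_state {
	uint32_t qtoggle;	/* flips on each wrap of the queue */
	uint32_t qindex;	/* next slot to consume */
};

/*
 * Advance the consumer position by one entry in a queue of
 * (1 << qshift) entries, flipping the toggle bit on wrap-around.
 */
static void eq_advance(struct eq_state *s, uint32_t qshift)
{
	s->qindex = (s->qindex + 1) & ((1u << qshift) - 1);
	if (s->qindex == 0)
		s->qtoggle ^= 1;
}
```

Capturing both values at migration time is what lets the destination resume consuming the queue at the right slot with the right generation.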
[PATCH v4 00/17] KVM: PPC: Book3S HV: add XIVE native exploitation mode
Hello,

On the POWER9 processor, the XIVE interrupt controller can control interrupt sources using MMIOs to trigger events, to EOI or to turn off the sources. Priority management and interrupt acknowledgment are also controlled by MMIO in the CPU presenter sub-engine.

PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need special support from the hypervisor to do the same. This is called the XIVE native exploitation mode and today, it can be activated under the PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support and still offers the old interrupt mode interface using a KVM device implementing the XICS hcalls over XIVE.

The following series is a proposal to add the same support under KVM.

A new KVM device is introduced for the XIVE native exploitation mode. It reuses most of the XICS-over-XIVE glue implementation structures which are internal to KVM but has a completely different interface. A set of KVM device ioctls provide support for the hypervisor calls, all handled in QEMU, to configure the sources and the event queues. From there, all interrupt control is transferred to the guest which can use MMIOs.

These MMIO regions (ESB and TIMA) are exposed to guests in QEMU, similarly to VFIO, and the associated VMAs are populated dynamically with the appropriate pages using a fault handler. These are now implemented using mmap()s of the KVM device fd.

Migration has its own specific needs regarding memory. The patchset provides a specific control to quiesce XIVE before capturing the memory. The save and restore of the internal state is based on the same ioctls used for the hcalls.

On a POWER9 sPAPR machine, the Client Architecture Support (CAS) negotiation process determines whether the guest operates with an interrupt controller using the XICS legacy model, as found on POWER8, or in XIVE exploitation mode, which means that the KVM interrupt device should be created at run time, after the machine has started. This requires extra support from KVM to destroy KVM devices. It is introduced at the end of the patchset as it still requires some attention, and a XIVE-only VM would not need it.

This is based on 5.1-rc1 and should be a candidate for 5.2 now. The OPAL patches have not yet been merged.

GitHub trees available here:

 QEMU sPAPR: https://github.com/legoater/qemu/commits/xive-next
 Linux/KVM:  https://github.com/legoater/linux/commits/xive-5.1
 OPAL:       https://github.com/legoater/skiboot/commits/xive

Thanks,

C.

Caveats:

 - We should introduce a set of definitions common to XIVE and XICS
 - The XICS-over-XIVE device file book3s_xive.c could be renamed to
   book3s_xics_on_xive.c or book3s_xics_p9.c
 - The XICS-over-XIVE device has locking issues in the setup

Changes since v3:

 - removed a couple of useless includes
 - fixed the test on the initial setting of the EQ toggle bit: 0 -> 1
 - renamed qsize to qshift
 - renamed qpage to qaddr
 - checked host page size
 - limited flags to KVM_XIVE_EQ_ALWAYS_NOTIFY to fit sPAPR specs
 - fixed xive_timaval description in documentation

Changes since v2:

 - removed extra OPAL call definitions
 - removed ->q_order setting. Only useful in the XICS-on-XIVE KVM
   device which allocates the EQs on behalf of the guest.
 - returned -ENXIO when VP base is invalid
 - made use of the xive_vp() macro to compute VP identifiers
 - reworked locking in kvmppc_xive_native_connect_vcpu() to fix races
 - stop advertising KVM_CAP_PPC_IRQ_XIVE as support is not fully
   available yet
 - fixed comment on XIVE IRQ number space
 - removed usage of the __x_* macros
 - fixed locking on source block
 - fixed comments on the KVM device attribute definitions
 - handled MASKED EAS configuration
 - fixed check on supported EQ size to restrict to 64K pages
 - checked kvm_eq.flags that need to be zero
 - removed the OPAL call when EQ qtoggle bit and index are zero
 - reduced the size of the kvmppc_one_reg timaval attribute to two u64s
 - stopped returning the OS CAM line value

Changes since v1:

 - Better documentation (was missing)
 - Nested support: XIVE is not advertised on non-PowerNV platforms.
   This is a good way to test the fallback on QEMU emulated devices.
 - ESB and TIMA special mapping done using the KVM device fd
 - All hcalls moved to QEMU. Dropped the patch moving the hcall flags.
 - Reworked the KVM device ioctl controls to support hcalls and
   migration needs to capture/save states
 - Merged the control syncing XIVE and marking the EQ page dirty
 - Fixed passthrough support using the KVM device file address_space
   to clear the ESB pages from the mapping
 - Misc enhancements and fixes

Cédric Le Goater (17):
 powerpc/xive: add OPAL extensions for the XIVE native exploitation
   support
 KVM: PPC: Book3S HV: add a new KVM device for the XIVE native
   exploitation mode
 KVM: PPC: Book3S HV: XIVE: introduce a new capability
   KVM_CAP_PPC_IRQ_XIVE
 KVM: PPC: Book3S HV: XIVE: add a control to initialize a source
 KVM:
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
Dan Williams writes:
>
>> Now what will be page size used for mapping vmemmap?
>
> That's up to the architecture's vmemmap_populate() implementation.
>
>> Architectures
>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
>> device-dax with struct page in the device will have pfn reserve area aligned
>> to PAGE_SIZE with the above example? We can't map that using
>> PMD_SIZE page size?
>
> IIUC, that's a different alignment. Currently that's handled by
> padding the reservation area up to a section (128MB on x86) boundary,
> but I'm working on patches to allow sub-section sized ranges to be
> mapped.

I am missing something w.r.t. the code. The code below aligns that using nd_pfn->align:

	if (nd_pfn->mode == PFN_MODE_PMEM) {
		unsigned long memmap_size;

		/*
		 * vmemmap_populate_hugepages() allocates the memmap array in
		 * HPAGE_SIZE chunks.
		 */
		memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
		offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
			       nd_pfn->align) - start;
	}

IIUC that is finding the offset where to put the vmemmap start, and that has to be aligned to the page size with which we may end up mapping the vmemmap area, right?

Yes, we find the npfns by aligning up using PAGES_PER_SECTION. But that is to compute how many pfns we should map for this pfn dev, right?

-aneesh
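The arithmetic in the quoted snippet can be checked in isolation: with the usual ALIGN() definition, start + offset lands exactly on an nd_pfn->align boundary, which is the vmemmap/data alignment being discussed. A standalone model follows — the 2M HPAGE_SIZE and the sample sizes are assumptions, and dax_label_reserve from the original snippet is omitted for simplicity:

```c
#include <stdint.h>

#define ALIGN(x, a)	(((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))
#define SZ_8K		(8ULL << 10)
#define HPAGE_SIZE	(2ULL << 20)	/* assumption: x86 2M huge page */

/*
 * Model of the nd_pfn_init() offset computation quoted above:
 * reserve room for the struct page array (64 bytes per pfn, allocated
 * in HPAGE_SIZE chunks) and align the data start to 'align'.
 */
static uint64_t pfn_data_offset(uint64_t start, uint64_t npfns, uint64_t align)
{
	uint64_t memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);

	return ALIGN(start + SZ_8K + memmap_size, align) - start;
}
```

Whatever start is, (start + offset) % align == 0 holds, so the question reduces to whether 'align' matches the page size the architecture will use for the vmemmap mapping.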
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
Aneesh Kumar K.V writes:
> Dan Williams writes:
>
>>
>>> Now what will be page size used for mapping vmemmap?
>>
>> That's up to the architecture's vmemmap_populate() implementation.
>>
>>> Architectures
>>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
>>> device-dax with struct page in the device will have pfn reserve area aligned
>>> to PAGE_SIZE with the above example? We can't map that using
>>> PMD_SIZE page size?
>>
>> IIUC, that's a different alignment. Currently that's handled by
>> padding the reservation area up to a section (128MB on x86) boundary,
>> but I'm working on patches to allow sub-section sized ranges to be
>> mapped.
>
> I am missing something w.r.t. the code. The below code aligns that using
> nd_pfn->align
>
> 	if (nd_pfn->mode == PFN_MODE_PMEM) {
> 		unsigned long memmap_size;
>
> 		/*
> 		 * vmemmap_populate_hugepages() allocates the memmap array in
> 		 * HPAGE_SIZE chunks.
> 		 */
> 		memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> 		offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> 			       nd_pfn->align) - start;
> 	}
>
> IIUC that is finding the offset where to put vmemmap start. And that has
> to be aligned to the page size with which we may end up mapping vmemmap
> area right?
>
> Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
> is to compute how many pfns we should map for this pfn dev right?
>

Also I guess those 4K assumptions there are wrong?

modified   drivers/nvdimm/pfn_devs.c
@@ -783,7 +783,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		return -ENXIO;
 	}

-	npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
+	npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;
 	pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
 	pfn_sb->dataoff = cpu_to_le64(offset);
 	pfn_sb->npfns = cpu_to_le64(npfns);

-aneesh
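The proposed change matters on architectures where PAGE_SIZE != 4K: dividing by SZ_4K over-counts pfns by a factor of PAGE_SIZE/4K on 64K-page powerpc. A quick model of both computations (sample sizes are illustrative):

```c
#include <stdint.h>

#define SZ_4K 4096ULL

/* npfns as computed before the proposed change: fixed 4K divisor. */
static uint64_t npfns_old(uint64_t bytes)
{
	return bytes / SZ_4K;
}

/* npfns as computed after the change: divide by the actual page size. */
static uint64_t npfns_new(uint64_t bytes, uint64_t page_size)
{
	return bytes / page_size;
}
```

On a 64K-page kernel the old formula yields 16x too many pfns, which in turn inflates the memmap reservation sized from npfns.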
[PATCH 3/3] powerpc/mm: print hash info in a helper
Reduce the #ifdef mess by defining a helper to print hash info at startup. At the same time, stop displaying the hash table address to avoid leaking unnecessary information.

Signed-off-by: Christophe Leroy
---
 arch/powerpc/kernel/setup-common.c | 19 +--
 arch/powerpc/mm/hash_utils_64.c    |  8
 arch/powerpc/mm/mmu_decl.h         |  5 -
 arch/powerpc/mm/ppc_mmu_32.c       |  9 -
 4 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2e5dfb6e0823..f24a74f7912d 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -799,12 +799,6 @@ void arch_setup_pdev_archdata(struct platform_device *pdev)
 static __init void print_system_info(void)
 {
 	pr_info("-\n");
-#ifdef CONFIG_PPC_BOOK3S_64
-	pr_info("ppc64_pft_size    = 0x%llx\n", ppc64_pft_size);
-#endif
-#ifdef CONFIG_PPC_BOOK3S_32
-	pr_info("Hash_size         = 0x%lx\n", Hash_size);
-#endif
 	pr_info("phys_mem_size     = 0x%llx\n",
 		(unsigned long long)memblock_phys_mem_size());
@@ -826,18 +820,7 @@ static __init void print_system_info(void)
 	pr_info("firmware_features = 0x%016lx\n", powerpc_firmware_features);
 #endif
-#ifdef CONFIG_PPC_BOOK3S_64
-	if (htab_address)
-		pr_info("htab_address      = 0x%p\n", htab_address);
-	if (htab_hash_mask)
-		pr_info("htab_hash_mask    = 0x%lx\n", htab_hash_mask);
-#endif
-#ifdef CONFIG_PPC_BOOK3S_32
-	if (Hash)
-		pr_info("Hash              = 0x%p\n", Hash);
-	if (Hash_mask)
-		pr_info("Hash_mask         = 0x%lx\n", Hash_mask);
-#endif
+	print_system_hash_info();

 	if (PHYSICAL_START > 0)
 		pr_info("physical_start    = 0x%llx\n",
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0a4f939a8161..017380b890bb 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1909,3 +1909,11 @@ static int __init hash64_debugfs(void)
 }
 machine_device_initcall(pseries, hash64_debugfs);
 #endif /* CONFIG_DEBUG_FS */
+
+void __init print_system_hash_info(void)
+{
+	pr_info("ppc64_pft_size    = 0x%llx\n", ppc64_pft_size);
+
+	if (htab_hash_mask)
+		pr_info("htab_hash_mask    = 0x%lx\n", htab_hash_mask);
+}
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index f7f1374ba3ee..dc617ade83ab 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -83,6 +83,8 @@ static inline void _tlbivax_bcast(unsigned long address, unsigned int pid,
 }
 #endif

+static inline void print_system_hash_info(void) {}
+
 #else /* CONFIG_PPC_MMU_NOHASH */

 extern void hash_preload(struct mm_struct *mm, unsigned long ea,
@@ -92,6 +94,8 @@ extern void hash_preload(struct mm_struct *mm, unsigned long ea,
 extern void _tlbie(unsigned long address);
 extern void _tlbia(void);

+void print_system_hash_info(void);
+
 #endif /* CONFIG_PPC_MMU_NOHASH */

 #ifdef CONFIG_PPC32
@@ -105,7 +109,6 @@ extern unsigned int rtas_data, rtas_size;

 struct hash_pte;
 extern struct hash_pte *Hash;
-extern unsigned long Hash_size, Hash_mask;

 #endif /* CONFIG_PPC32 */
diff --git a/arch/powerpc/mm/ppc_mmu_32.c b/arch/powerpc/mm/ppc_mmu_32.c
index 088f14d57cce..864096489b6d 100644
--- a/arch/powerpc/mm/ppc_mmu_32.c
+++ b/arch/powerpc/mm/ppc_mmu_32.c
@@ -37,7 +37,7 @@
 #include "mmu_decl.h"

 struct hash_pte *Hash;
-unsigned long Hash_size, Hash_mask;
+static unsigned long Hash_size, Hash_mask;
 unsigned long _SDR1;

 struct ppc_bat BATS[8][2];	/* 8 pairs of IBAT, DBAT */
@@ -392,3 +392,10 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	else /* Anything else has 256M mapped */
 		memblock_set_current_limit(min_t(u64, first_memblock_size, 0x1000));
 }
+
+void __init print_system_hash_info(void)
+{
+	pr_info("Hash_size         = 0x%lx\n", Hash_size);
+	if (Hash_mask)
+		pr_info("Hash_mask         = 0x%lx\n", Hash_mask);
+}
--
2.13.3
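The pattern used in this patch — an empty static inline stub on one side of the config divide and a real implementation on the other — is what lets the caller drop its #ifdef blocks entirely. A generic, self-contained sketch of the same idiom (all names here are illustrative, not from the patch):

```c
#include <stdio.h>

/* Stand-in for a Kconfig symbol; flip to 0 to exercise the stub side. */
#define CONFIG_HAS_HASH 1

static int hash_info_printed;	/* lets us observe which variant ran */

#if CONFIG_HAS_HASH
/* "Real" implementation, compiled only when the feature exists. */
static void print_system_hash_info_sketch(void)
{
	hash_info_printed = 1;	/* a kernel would pr_info() details here */
}
#else
/* Empty stub: the call compiles away on configs without the feature. */
static inline void print_system_hash_info_sketch(void) {}
#endif

/* The caller stays #ifdef-free, exactly as print_system_info() does. */
static void print_system_info_sketch(void)
{
	print_system_hash_info_sketch();
}
```

The same effect can also be had with IS_ENABLED() inside a single function body; the header-stub form is preferred when the two implementations live in different translation units, as here.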