[PATCH] powerpc/nvdimm: Add support for multibyte read/write for metadata
SCM_READ/WRITE_MEATADATA hcall supports multibyte read/write. This patch updates the metadata read/write to use 1, 2, 4 or 8 byte read/write as mentioned in PAPR document. READ/WRITE_METADATA hcall supports the 1, 2, 4, or 8 bytes read/write. For other values hcall results H_P3. Hypervisor stores the metadata contents in big-endian format and in-order to enable read/write in different granularity, we need to switch the contents to big-endian before calling HCALL. Based on an patch from Oliver O'Halloran Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/platforms/pseries/papr_scm.c | 104 +- 1 file changed, 82 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 0176ce66673f..e33cebb8ee6c 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -97,42 +97,102 @@ static int drc_pmem_unbind(struct papr_scm_priv *p) } static int papr_scm_meta_get(struct papr_scm_priv *p, - struct nd_cmd_get_config_data_hdr *hdr) +struct nd_cmd_get_config_data_hdr *hdr) { unsigned long data[PLPAR_HCALL_BUFSIZE]; + unsigned long offset, data_offset; + int len, read; int64_t ret; - if (hdr->in_offset >= p->metadata_size || hdr->in_length != 1) + if ((hdr->in_offset + hdr->in_length) >= p->metadata_size) return -EINVAL; - ret = plpar_hcall(H_SCM_READ_METADATA, data, p->drc_index, - hdr->in_offset, 1); - - if (ret == H_PARAMETER) /* bad DRC index */ - return -ENODEV; - if (ret) - return -EINVAL; /* other invalid parameter */ - - hdr->out_buf[0] = data[0] & 0xff; - + for (len = hdr->in_length; len; len -= read) { + + data_offset = hdr->in_length - len; + offset = hdr->in_offset + data_offset; + + if (len >= 8) + read = 8; + else if (len >= 4) + read = 4; + else if ( len >= 2) + read = 2; + else + read = 1; + + ret = plpar_hcall(H_SCM_READ_METADATA, data, p->drc_index, + offset, read); + + if (ret == H_PARAMETER) /* bad DRC index */ + return -ENODEV; + if (ret) + return -EINVAL; /* other invalid parameter */ + + switch (read) { + case 8: + *(uint64_t *)(hdr->out_buf + data_offset) = be64_to_cpu(data[0]); + break; + case 4: + *(uint32_t *)(hdr->out_buf + data_offset) = be32_to_cpu(data[0] & 0x); + break; + + case 2: + *(uint16_t *)(hdr->out_buf + data_offset) = be16_to_cpu(data[0] & 0x); + break; + + case 1: + *(uint32_t *)(hdr->out_buf + data_offset) = (data[0] & 0xff); + break; + } + } return 0; } static int papr_scm_meta_set(struct papr_scm_priv *p, - struct nd_cmd_set_config_hdr *hdr) +struct nd_cmd_set_config_hdr *hdr) { + unsigned long offset, data_offset; + int len, wrote; + unsigned long data; + __be64 data_be; int64_t ret; - if (hdr->in_offset >= p->metadata_size || hdr->in_length != 1) + if ((hdr->in_offset + hdr->in_length) >= p->metadata_size) return -EINVAL; - ret = plpar_hcall_norets(H_SCM_WRITE_METADATA, - p->drc_index, hdr->in_offset, hdr->in_buf[0], 1); - - if (ret == H_PARAMETER) /* bad DRC index */ - return -ENODEV; - if (ret) - return -EINVAL; /* other invalid parameter */ + for (len = hdr->in_length; len; len -= wrote) { + + data_offset = hdr->in_length - len; + offset = hdr->in_offset + data_offset; + + if (len >= 8) { + data = *(uint64_t *)(hdr->in_buf + data_offset); + data_be = cpu_to_be64(data); + wrote = 8; + } else if (len >= 4) { + data = *(uint32_t *)(hdr->in_buf + data_offset); + data &= 0x; + data_be = cpu_to_be32(data); + wrote = 4; + } else if (len >= 2) { + data = *(uint16_t *)(hdr->in_buf + data_offset); + data &= 0x; + data_be = cpu_to_be16(data); + wrote = 2; + } else { + data_be = *(uint8_t *)(hdr->in_buf + data_offset); +
Re: RFC: switch the remaining architectures to use generic GUP v2
From: Christoph Hellwig Date: Sat, 1 Jun 2019 09:49:43 +0200 > below is a series to switch mips, sh and sparc64 to use the generic > GUP code so that we only have one codebase to touch for further > improvements to this code. I don't have hardware for any of these > architectures, and generally no clue about their page table > management, so handle with care. > > Changes since v1: > - fix various issues found by the build bot > - cherry pick and use the untagged_addr helper form Andrey > - add various refactoring patches to share more code over architectures > - move the powerpc hugepd code to mm/gup.c and sync it with the generic >hup semantics I will today look seriously at the sparc64 stuff wrt. tagged pointers.
Re: [PATCH 00/12] Secure Virtual Machine Enablement
Hello, Thiago Jung Bauermann writes: > This series enables Secure Virtual Machines (SVMs) on powerpc. SVMs use the > Protected Execution Facility (PEF) and request to be migrated to secure > memory during prom_init() so by default all of their memory is inaccessible > to the hypervisor. There is an Ultravisor call that the VM can use to > request certain pages to be made accessible to (or shared with) the > hypervisor. > > The objective of these patches is to have the guest perform this request > for buffers that need to be accessed by the hypervisor such as the LPPACAs, > the SWIOTLB memory and the Debug Trace Log. Ping? Any more comments on these patches? Or even acks? :-) -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [PATCH 08/16] sparc64: add the missing pgd_page definition
Both sparc64 and sh had this pattern, but now that I look at it more closely, I think your version is wrong, or at least nonoptimal. On Sat, Jun 1, 2019 at 12:50 AM Christoph Hellwig wrote: > > +#define pgd_page(pgd) virt_to_page(__va(pgd_val(pgd))) Going through the virtual address is potentially very inefficient, and might in some cases just be wrong (ie it's definitely wrong for HIGHMEM style setups). It would likely be much better to go through the physical address and use "pfn_to_page()". I realize that we don't have a "pgd to physical", but neither do we really have a "pgd to virtual", and your "__va(pgd_val(x))" thing is not at allguaranteed to work. You're basically assuming that "pgd_val(x)" is the physical address, which is likely not entirely incorrect, but it should be checked by the architecture people. The pgd value could easily have high bits with meaning, which would also potentially screw up the __va(x) model. So I thgink this would be better done with #define pgd_page(pgd)pfn_to_page(pgd_pfn(pgd)) where that "pgd_pfn()" would need to be a new (but likely very trivial) function. That's what we do for pte_pfn(). IOW, it would likely end up something like #define pgd_to_pfn(pgd) (pgd_val(x) >> PFN_PGD_SHIFT) David? Linus
Re: [PATCH 03/16] mm: simplify gup_fast_permitted
On Sat, Jun 1, 2019 at 12:50 AM Christoph Hellwig wrote: > > Pass in the already calculated end value instead of recomputing it, and > leave the end > start check in the callers instead of duplicating them > in the arch code. Good cleanup, except it's wrong. > - if (nr_pages <= 0) > + if (end < start) > return 0; You moved the overflow test to generic code - good. You removed the sign and zero test on nr_pages - bad. The zero test in particular is _important_ - the GUP range operators know and depend on the fact that they are passed a non-empty range. The sign test it less so, but is definitely appropriate. It might be even better to check that the "<< PAGE_SHIFT" doesn't overflow in "long", of course, but with callers being supposed to be trusted, the sign test at least checks for stupid underflow issues. So at the very least that "(end < start)" needs to be "(end <= start)", but honestly, I think the sign of the nr_pages should be continued to be checked. Linus
[PATCH AUTOSEL 4.4 39/56] PCI: rpadlpar: Fix leaked device_node references in add/remove paths
From: Tyrel Datwyler [ Upstream commit fb26228bfc4ce3951544848555c0278e2832e618 ] The find_dlpar_node() helper returns a device node with its reference incremented. Both the add and remove paths use this helper for find the appropriate node, but fail to release the reference when done. Annotate the find_dlpar_node() helper with a comment about the incremented reference count and call of_node_put() on the obtained device_node in the add and remove paths. Also, fixup a reference leak in the find_vio_slot() helper where we fail to call of_node_put() on the vdevice node after we iterate over its children. Signed-off-by: Tyrel Datwyler Signed-off-by: Bjorn Helgaas Signed-off-by: Sasha Levin --- drivers/pci/hotplug/rpadlpar_core.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c index f2fcbe944d940..aae295708ea7a 100644 --- a/drivers/pci/hotplug/rpadlpar_core.c +++ b/drivers/pci/hotplug/rpadlpar_core.c @@ -55,6 +55,7 @@ static struct device_node *find_vio_slot_node(char *drc_name) if ((rc == 0) && (!strcmp(drc_name, name))) break; } + of_node_put(parent); return dn; } @@ -78,6 +79,7 @@ static struct device_node *find_php_slot_pci_node(char *drc_name, return np; } +/* Returns a device_node with its reference count incremented */ static struct device_node *find_dlpar_node(char *drc_name, int *node_type) { struct device_node *dn; @@ -314,6 +316,7 @@ int dlpar_add_slot(char *drc_name) rc = dlpar_add_phb(drc_name, dn); break; } + of_node_put(dn); printk(KERN_INFO "%s: slot %s added\n", DLPAR_MODULE_NAME, drc_name); exit: @@ -447,6 +450,7 @@ int dlpar_remove_slot(char *drc_name) rc = dlpar_remove_pci_slot(drc_name, dn); break; } + of_node_put(dn); vm_unmap_aliases(); printk(KERN_INFO "%s: slot %s removed\n", DLPAR_MODULE_NAME, drc_name); -- 2.20.1
[PATCH AUTOSEL 4.14 23/99] EDAC/mpc85xx: Prevent building as a module
From: Michael Ellerman [ Upstream commit 2b8358a951b1e2a534a54924cd8245e58a1c5fb8 ] The mpc85xx EDAC driver can be configured as a module but then fails to build because it uses two unexported symbols: ERROR: ".pci_find_hose_for_OF_device" [drivers/edac/mpc85xx_edac_mod.ko] undefined! ERROR: ".early_find_capability" [drivers/edac/mpc85xx_edac_mod.ko] undefined! We don't want to export those symbols just for this driver, so make the driver only configurable as a built-in. This seems to have been broken since at least c92132f59806 ("edac/85xx: Add PCIe error interrupt edac support") (Nov 2013). [ bp: make it depend on EDAC=y so that the EDAC core doesn't get built as a module. ] Signed-off-by: Michael Ellerman Signed-off-by: Borislav Petkov Acked-by: Johannes Thumshirn Cc: James Morse Cc: Mauro Carvalho Chehab Cc: linux-edac Cc: linuxppc-...@ozlabs.org Cc: morbid...@gmail.com Link: https://lkml.kernel.org/r/20190502141941.12927-1-...@ellerman.id.au Signed-off-by: Sasha Levin --- drivers/edac/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index 96afb2aeed18a..8ce8d3fdd 100644 --- a/drivers/edac/Kconfig +++ b/drivers/edac/Kconfig @@ -246,8 +246,8 @@ config EDAC_PND2 micro-server but may appear on others in the future. config EDAC_MPC85XX - tristate "Freescale MPC83xx / MPC85xx" - depends on FSL_SOC + bool "Freescale MPC83xx / MPC85xx" + depends on FSL_SOC && EDAC=y help Support for error detection and correction on the Freescale MPC8349, MPC8560, MPC8540, MPC8548, T4240 -- 2.20.1
[PATCH AUTOSEL 4.19 029/141] EDAC/mpc85xx: Prevent building as a module
From: Michael Ellerman [ Upstream commit 2b8358a951b1e2a534a54924cd8245e58a1c5fb8 ] The mpc85xx EDAC driver can be configured as a module but then fails to build because it uses two unexported symbols: ERROR: ".pci_find_hose_for_OF_device" [drivers/edac/mpc85xx_edac_mod.ko] undefined! ERROR: ".early_find_capability" [drivers/edac/mpc85xx_edac_mod.ko] undefined! We don't want to export those symbols just for this driver, so make the driver only configurable as a built-in. This seems to have been broken since at least c92132f59806 ("edac/85xx: Add PCIe error interrupt edac support") (Nov 2013). [ bp: make it depend on EDAC=y so that the EDAC core doesn't get built as a module. ] Signed-off-by: Michael Ellerman Signed-off-by: Borislav Petkov Acked-by: Johannes Thumshirn Cc: James Morse Cc: Mauro Carvalho Chehab Cc: linux-edac Cc: linuxppc-...@ozlabs.org Cc: morbid...@gmail.com Link: https://lkml.kernel.org/r/20190502141941.12927-1-...@ellerman.id.au Signed-off-by: Sasha Levin --- drivers/edac/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index 57304b2e989f2..b00cc03ad6b67 100644 --- a/drivers/edac/Kconfig +++ b/drivers/edac/Kconfig @@ -250,8 +250,8 @@ config EDAC_PND2 micro-server but may appear on others in the future. config EDAC_MPC85XX - tristate "Freescale MPC83xx / MPC85xx" - depends on FSL_SOC + bool "Freescale MPC83xx / MPC85xx" + depends on FSL_SOC && EDAC=y help Support for error detection and correction on the Freescale MPC8349, MPC8560, MPC8540, MPC8548, T4240 -- 2.20.1
[PATCH AUTOSEL 5.0 035/173] EDAC/mpc85xx: Prevent building as a module
From: Michael Ellerman [ Upstream commit 2b8358a951b1e2a534a54924cd8245e58a1c5fb8 ] The mpc85xx EDAC driver can be configured as a module but then fails to build because it uses two unexported symbols: ERROR: ".pci_find_hose_for_OF_device" [drivers/edac/mpc85xx_edac_mod.ko] undefined! ERROR: ".early_find_capability" [drivers/edac/mpc85xx_edac_mod.ko] undefined! We don't want to export those symbols just for this driver, so make the driver only configurable as a built-in. This seems to have been broken since at least c92132f59806 ("edac/85xx: Add PCIe error interrupt edac support") (Nov 2013). [ bp: make it depend on EDAC=y so that the EDAC core doesn't get built as a module. ] Signed-off-by: Michael Ellerman Signed-off-by: Borislav Petkov Acked-by: Johannes Thumshirn Cc: James Morse Cc: Mauro Carvalho Chehab Cc: linux-edac Cc: linuxppc-...@ozlabs.org Cc: morbid...@gmail.com Link: https://lkml.kernel.org/r/20190502141941.12927-1-...@ellerman.id.au Signed-off-by: Sasha Levin --- drivers/edac/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index e286b5b990035..a3e6750393380 100644 --- a/drivers/edac/Kconfig +++ b/drivers/edac/Kconfig @@ -251,8 +251,8 @@ config EDAC_PND2 micro-server but may appear on others in the future. config EDAC_MPC85XX - tristate "Freescale MPC83xx / MPC85xx" - depends on FSL_SOC + bool "Freescale MPC83xx / MPC85xx" + depends on FSL_SOC && EDAC=y help Support for error detection and correction on the Freescale MPC8349, MPC8560, MPC8540, MPC8548, T4240 -- 2.20.1
[PATCH AUTOSEL 5.1 038/186] EDAC/mpc85xx: Prevent building as a module
From: Michael Ellerman [ Upstream commit 2b8358a951b1e2a534a54924cd8245e58a1c5fb8 ] The mpc85xx EDAC driver can be configured as a module but then fails to build because it uses two unexported symbols: ERROR: ".pci_find_hose_for_OF_device" [drivers/edac/mpc85xx_edac_mod.ko] undefined! ERROR: ".early_find_capability" [drivers/edac/mpc85xx_edac_mod.ko] undefined! We don't want to export those symbols just for this driver, so make the driver only configurable as a built-in. This seems to have been broken since at least c92132f59806 ("edac/85xx: Add PCIe error interrupt edac support") (Nov 2013). [ bp: make it depend on EDAC=y so that the EDAC core doesn't get built as a module. ] Signed-off-by: Michael Ellerman Signed-off-by: Borislav Petkov Acked-by: Johannes Thumshirn Cc: James Morse Cc: Mauro Carvalho Chehab Cc: linux-edac Cc: linuxppc-...@ozlabs.org Cc: morbid...@gmail.com Link: https://lkml.kernel.org/r/20190502141941.12927-1-...@ellerman.id.au Signed-off-by: Sasha Levin --- drivers/edac/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index 47eb4d13ed5f8..5e2e0348d460f 100644 --- a/drivers/edac/Kconfig +++ b/drivers/edac/Kconfig @@ -263,8 +263,8 @@ config EDAC_PND2 micro-server but may appear on others in the future. config EDAC_MPC85XX - tristate "Freescale MPC83xx / MPC85xx" - depends on FSL_SOC + bool "Freescale MPC83xx / MPC85xx" + depends on FSL_SOC && EDAC=y help Support for error detection and correction on the Freescale MPC8349, MPC8560, MPC8540, MPC8548, T4240 -- 2.20.1
[PATCH 16/16] mm: mark the page referenced in gup_hugepte
All other get_user_page_fast cases mark the page referenced, so do this here as well. Signed-off-by: Christoph Hellwig --- mm/gup.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/gup.c b/mm/gup.c index 6090044227f1..d1fc008de292 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2020,6 +2020,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, return 0; } + SetPageReferenced(head); return 1; } -- 2.20.1
[PATCH 15/16] mm: switch gup_hugepte to use try_get_compound_head
This applies the overflow fixes from 8fde12ca79aff ("mm: prevent get_user_pages() from overflowing page refcount") to the powerpc hugepd code and brings it back in sync with the other GUP cases. Signed-off-by: Christoph Hellwig --- mm/gup.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index e03c7e6b1422..6090044227f1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2006,7 +2006,8 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, refs++; } while (addr += PAGE_SIZE, addr != end); - if (!page_cache_add_speculative(head, refs)) { + head = try_get_compound_head(head, refs); + if (!head) { *nr -= refs; return 0; } -- 2.20.1
[PATCH 14/16] mm: move the powerpc hugepd code to mm/gup.c
While only powerpc supports the hugepd case, the code is pretty generic and I'd like to keep all GUP internals in one place. Signed-off-by: Christoph Hellwig --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/hugetlbpage.c | 72 -- include/linux/hugetlb.h | 18 mm/Kconfig| 10 + mm/gup.c | 82 +++ 5 files changed, 93 insertions(+), 90 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 992a04796e56..4f1b00979cde 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -125,6 +125,7 @@ config PPC select ARCH_HAS_FORTIFY_SOURCE select ARCH_HAS_GCOV_PROFILE_ALL select ARCH_HAS_KCOV + select ARCH_HAS_HUGEPD if HUGETLB_PAGE select ARCH_HAS_MMIOWB if PPC64 select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_PMEM_APIif PPC64 diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index b5d92dc32844..51716c11d0fb 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -511,13 +511,6 @@ struct page *follow_huge_pd(struct vm_area_struct *vma, return page; } -static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, - unsigned long sz) -{ - unsigned long __boundary = (addr + sz) & ~(sz-1); - return (__boundary - 1 < end - 1) ? __boundary : end; -} - #ifdef CONFIG_PPC_MM_SLICES unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, @@ -665,68 +658,3 @@ void flush_dcache_icache_hugepage(struct page *page) } } } - -static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) -{ - unsigned long pte_end; - struct page *head, *page; - pte_t pte; - int refs; - - pte_end = (addr + sz) & ~(sz-1); - if (pte_end < end) - end = pte_end; - - pte = READ_ONCE(*ptep); - - if (!pte_access_permitted(pte, write)) - return 0; - - /* hugepages are never "special" */ - VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - - refs = 0; - head = pte_page(pte); - - page = head + ((addr & (sz-1)) >> PAGE_SHIFT); - do { - VM_BUG_ON(compound_head(page) != head); - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); - - if (!page_cache_add_speculative(head, refs)) { - *nr -= refs; - return 0; - } - - if (unlikely(pte_val(pte) != pte_val(*ptep))) { - /* Could be optimized better */ - *nr -= refs; - while (refs--) - put_page(head); - return 0; - } - - return 1; -} - -int gup_huge_pd(hugepd_t hugepd, unsigned long addr, unsigned int pdshift, - unsigned long end, int write, struct page **pages, int *nr) -{ - pte_t *ptep; - unsigned long sz = 1UL << hugepd_shift(hugepd); - unsigned long next; - - ptep = hugepte_offset(hugepd, addr, pdshift); - do { - next = hugepte_addr_end(addr, end, sz); - if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr)) - return 0; - } while (ptep++, addr = next, addr != end); - - return 1; -} diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index edf476c8cfb9..0f91761e2c53 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -16,29 +16,11 @@ struct user_struct; struct mmu_gather; #ifndef is_hugepd -/* - * Some architectures requires a hugepage directory format that is - * required to support multiple hugepage sizes. For example - * a4fe3ce76 "powerpc/mm: Allow more flexible layouts for hugepage pagetables" - * introduced the same on powerpc. This allows for a more flexible hugepage - * pagetable layout. - */ typedef struct { unsigned long pd; } hugepd_t; #define is_hugepd(hugepd) (0) #define __hugepd(x) ((hugepd_t) { (x) }) -static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr, - unsigned pdshift, unsigned long end, - int write, struct page **pages, int *nr) -{ - return 0; -} -#else -extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr, - unsigned pdshift, unsigned long end, - int write, struct page **pages, int *nr); #endif - #ifdef CONFIG_HUGETLB_PAGE #include diff --git a/mm/Kconfig b/mm/Kconfig index 5c41409557da..44be3f01a2b2 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -769,4 +769,14 @@ config GUP_GET_PTE_L
[PATCH 13/16] mm: validate get_user_pages_fast flags
We can only deal with FOLL_WRITE and/or FOLL_LONGTERM in get_user_pages_fast, so reject all other flags. Signed-off-by: Christoph Hellwig --- mm/gup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index c8da7764de9c..53b50c63ba51 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2316,6 +2316,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, unsigned long addr, len, end; int nr = 0, ret = 0; + if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM))) + return -EINVAL; + start = untagged_addr(start) & PAGE_MASK; addr = start; len = (unsigned long) nr_pages << PAGE_SHIFT; -- 2.20.1
[PATCH 12/16] mm: consolidate the get_user_pages* implementations
Always build mm/gup.c, and move the nommu versions and replace the separate stubs for various functions by the default ones, with the _fast version always falling back to the slow path because gup_fast_permitted always returns false now if HAVE_FAST_GUP is not set, and we use the nommu version of __get_user_pages while keeping all the wrappers common. This also ensures the new put_user_pages* helpers are available for nommu, as those are currently missing, which would create a problem as soon as we actually grew users for it. Signed-off-by: Christoph Hellwig --- mm/Kconfig | 1 + mm/Makefile | 4 +- mm/gup.c| 476 +--- mm/nommu.c | 88 -- mm/util.c | 47 -- 5 files changed, 269 insertions(+), 347 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 98dffb0f2447..5c41409557da 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -133,6 +133,7 @@ config HAVE_MEMBLOCK_PHYS_MAP bool config HAVE_FAST_GUP + depends on MMU bool config ARCH_KEEP_MEMBLOCK diff --git a/mm/Makefile b/mm/Makefile index ac5e5ba78874..dc0746ca1109 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -22,7 +22,7 @@ KCOV_INSTRUMENT_mmzone.o := n KCOV_INSTRUMENT_vmstat.o := n mmu-y := nommu.o -mmu-$(CONFIG_MMU) := gup.o highmem.o memory.o mincore.o \ +mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ msync.o page_vma_mapped.o pagewalk.o \ pgtable-generic.o rmap.o vmalloc.o @@ -39,7 +39,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ mm_init.o mmu_context.o percpu.o slab_common.o \ compaction.o vmacache.o \ interval_tree.o list_lru.o workingset.o \ - debug.o $(mmu-y) + debug.o gup.o $(mmu-y) # Give 'page_alloc' its own module-parameter namespace page-alloc-y := page_alloc.o diff --git a/mm/gup.c b/mm/gup.c index a24f52292c7f..c8da7764de9c 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -134,6 +134,7 @@ void put_user_pages(struct page **pages, unsigned long npages) } EXPORT_SYMBOL(put_user_pages); +#ifdef CONFIG_MMU static struct page *no_page_table(struct vm_area_struct *vma, unsigned int flags) { @@ -1099,86 +1100,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, return pages_done; } -/* - * We can leverage the VM_FAULT_RETRY functionality in the page fault - * paths better by using either get_user_pages_locked() or - * get_user_pages_unlocked(). - * - * get_user_pages_locked() is suitable to replace the form: - * - * down_read(&mm->mmap_sem); - * do_something() - * get_user_pages(tsk, mm, ..., pages, NULL); - * up_read(&mm->mmap_sem); - * - * to: - * - * int locked = 1; - * down_read(&mm->mmap_sem); - * do_something() - * get_user_pages_locked(tsk, mm, ..., pages, &locked); - * if (locked) - * up_read(&mm->mmap_sem); - */ -long get_user_pages_locked(unsigned long start, unsigned long nr_pages, - unsigned int gup_flags, struct page **pages, - int *locked) -{ - /* -* FIXME: Current FOLL_LONGTERM behavior is incompatible with -* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on -* vmas. As there are no users of this flag in this call we simply -* disallow this option for now. -*/ - if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM)) - return -EINVAL; - - return __get_user_pages_locked(current, current->mm, start, nr_pages, - pages, NULL, locked, - gup_flags | FOLL_TOUCH); -} -EXPORT_SYMBOL(get_user_pages_locked); - -/* - * get_user_pages_unlocked() is suitable to replace the form: - * - * down_read(&mm->mmap_sem); - * get_user_pages(tsk, mm, ..., pages, NULL); - * up_read(&mm->mmap_sem); - * - * with: - * - * get_user_pages_unlocked(tsk, mm, ..., pages); - * - * It is functionally equivalent to get_user_pages_fast so - * get_user_pages_fast should be used instead if specific gup_flags - * (e.g. FOLL_FORCE) are not required. - */ -long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, -struct page **pages, unsigned int gup_flags) -{ - struct mm_struct *mm = current->mm; - int locked = 1; - long ret; - - /* -* FIXME: Current FOLL_LONGTERM behavior is incompatible with -* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on -* vmas. As there are no users of this flag in this call we simply -* disallow this option for now. -*/ - if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
[PATCH 11/16] mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
We only support the generic GUP now, so rename the config option to be more clear, and always use the mm/Kconfig definition of the symbol and select it from the arch Kconfigs. Signed-off-by: Christoph Hellwig --- arch/arm/Kconfig | 5 + arch/arm64/Kconfig | 4 +--- arch/mips/Kconfig| 2 +- arch/powerpc/Kconfig | 2 +- arch/s390/Kconfig| 2 +- arch/sh/Kconfig | 2 +- arch/sparc/Kconfig | 2 +- arch/x86/Kconfig | 4 +--- mm/Kconfig | 2 +- mm/gup.c | 4 ++-- 10 files changed, 11 insertions(+), 18 deletions(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 8869742a85df..3879a3e2c511 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -73,6 +73,7 @@ config ARM select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE select HAVE_EFFICIENT_UNALIGNED_ACCESS if (CPU_V6 || CPU_V6K || CPU_V7) && MMU select HAVE_EXIT_THREAD + select HAVE_FAST_GUP if ARM_LPAE select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL && !CC_IS_CLANG select HAVE_FUNCTION_TRACER if !XIP_KERNEL @@ -1596,10 +1597,6 @@ config ARCH_SELECT_MEMORY_MODEL config HAVE_ARCH_PFN_VALID def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM -config HAVE_GENERIC_GUP - def_bool y - depends on ARM_LPAE - config HIGHMEM bool "High Memory Support" depends on MMU diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 697ea0510729..4a6ee3e92757 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -140,6 +140,7 @@ config ARM64 select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE select HAVE_EFFICIENT_UNALIGNED_ACCESS + select HAVE_FAST_GUP select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_GRAPH_TRACER @@ -262,9 +263,6 @@ config GENERIC_CALIBRATE_DELAY config ZONE_DMA32 def_bool y -config HAVE_GENERIC_GUP - def_bool y - config ARCH_ENABLE_MEMORY_HOTPLUG def_bool y diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 64108a2a16d4..b1e42f0e4ed0 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -54,10 +54,10 @@ config MIPS select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE select HAVE_EXIT_THREAD + select HAVE_FAST_GUP select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER - select HAVE_GENERIC_GUP select HAVE_IDE select HAVE_IOREMAP_PROT select HAVE_IRQ_EXIT_ON_IRQ_STACK diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 8c1c636308c8..992a04796e56 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -185,12 +185,12 @@ config PPC select HAVE_DYNAMIC_FTRACE_WITH_REGSif MPROFILE_KERNEL select HAVE_EBPF_JITif PPC64 select HAVE_EFFICIENT_UNALIGNED_ACCESS if !(CPU_LITTLE_ENDIAN && POWER7_CPU) + select HAVE_FAST_GUP select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_ERROR_INJECTION select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER select HAVE_GCC_PLUGINS if GCC_VERSION >= 50200 # plugin support on gcc <= 5.1 is buggy on PPC - select HAVE_GENERIC_GUP select HAVE_HW_BREAKPOINT if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx) select HAVE_IDE select HAVE_IOREMAP_PROT diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 109243fdb6ec..aaff0376bf53 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -137,6 +137,7 @@ config S390 select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE select HAVE_DYNAMIC_FTRACE_WITH_REGS + select HAVE_FAST_GUP select HAVE_EFFICIENT_UNALIGNED_ACCESS select HAVE_FENTRY select HAVE_FTRACE_MCOUNT_RECORD @@ -144,7 +145,6 @@ config S390 select HAVE_FUNCTION_TRACER select HAVE_FUTEX_CMPXCHG if FUTEX select HAVE_GCC_PLUGINS - select HAVE_GENERIC_GUP select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZ4 diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index 6fddfc3c9710..56712f3c9838 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -14,7 +14,7 @@ config SUPERH select HAVE_ARCH_TRACEHOOK select HAVE_PERF_EVENTS select HAVE_DEBUG_BUGVERBOSE - select HAVE_GENERIC_GUP + select HAVE_FAST_GUP select ARCH_HAVE_CUSTOM_GPIO_H select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A) select ARCH_HAS_GCOV_PROFILE_ALL diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 22435471f942..659232b760e1 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -28,7 +28,7 @@ config SPARC select RTC_DRV_M48T59 select RTC_SYSTOHC select HAVE_ARCH_JUMP_LABEL if SPARC64 - select HAVE_GENERIC_GUP if
[PATCH 10/16] sparc64: use the generic get_user_pages_fast code
The sparc64 code is mostly equivalent to the generic one, minus various bugfixes and two arch overrides that this patch adds to pgtable.h. Signed-off-by: Christoph Hellwig --- arch/sparc/Kconfig | 1 + arch/sparc/include/asm/pgtable_64.h | 18 ++ arch/sparc/mm/Makefile | 2 +- arch/sparc/mm/gup.c | 340 4 files changed, 20 insertions(+), 341 deletions(-) delete mode 100644 arch/sparc/mm/gup.c diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 26ab6f5bbaaf..22435471f942 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -28,6 +28,7 @@ config SPARC select RTC_DRV_M48T59 select RTC_SYSTOHC select HAVE_ARCH_JUMP_LABEL if SPARC64 + select HAVE_GENERIC_GUP if SPARC64 select GENERIC_IRQ_SHOW select ARCH_WANT_IPC_PARSE_VERSION select GENERIC_PCI_IOMAP diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index a93eca29e85a..2301ab5250e4 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -1098,6 +1098,24 @@ static inline unsigned long untagged_addr(unsigned long start) } #define untagged_addr untagged_addr +static inline bool pte_access_permitted(pte_t pte, bool write) +{ + u64 prot; + + if (tlb_type == hypervisor) { + prot = _PAGE_PRESENT_4V | _PAGE_P_4V; + if (prot) + prot |= _PAGE_WRITE_4V; + } else { + prot = _PAGE_PRESENT_4U | _PAGE_P_4U; + if (write) + prot |= _PAGE_WRITE_4U; + } + + return (pte_val(pte) & (prot | _PAGE_SPECIAL)) == prot; +} +#define pte_access_permitted pte_access_permitted + #include #include diff --git a/arch/sparc/mm/Makefile b/arch/sparc/mm/Makefile index d39075b1e3b7..b078205b70e0 100644 --- a/arch/sparc/mm/Makefile +++ b/arch/sparc/mm/Makefile @@ -5,7 +5,7 @@ asflags-y := -ansi ccflags-y := -Werror -obj-$(CONFIG_SPARC64) += ultra.o tlb.o tsb.o gup.o +obj-$(CONFIG_SPARC64) += ultra.o tlb.o tsb.o obj-y += fault_$(BITS).o obj-y += init_$(BITS).o obj-$(CONFIG_SPARC32) += extable.o srmmu.o iommu.o io-unit.o diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c deleted file mode 100644 index 1e770a517d4a.. --- a/arch/sparc/mm/gup.c +++ /dev/null @@ -1,340 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * Lockless get_user_pages_fast for sparc, cribbed from powerpc - * - * Copyright (C) 2008 Nick Piggin - * Copyright (C) 2008 Novell Inc. - */ - -#include -#include -#include -#include -#include -#include -#include - -/* - * The performance critical leaf functions are made noinline otherwise gcc - * inlines everything into a single function which results in too much - * register pressure. - */ -static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) -{ - unsigned long mask, result; - pte_t *ptep; - - if (tlb_type == hypervisor) { - result = _PAGE_PRESENT_4V|_PAGE_P_4V; - if (write) - result |= _PAGE_WRITE_4V; - } else { - result = _PAGE_PRESENT_4U|_PAGE_P_4U; - if (write) - result |= _PAGE_WRITE_4U; - } - mask = result | _PAGE_SPECIAL; - - ptep = pte_offset_kernel(&pmd, addr); - do { - struct page *page, *head; - pte_t pte = *ptep; - - if ((pte_val(pte) & mask) != result) - return 0; - VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - - /* The hugepage case is simplified on sparc64 because -* we encode the sub-page pfn offsets into the -* hugepage PTEs. We could optimize this in the future -* use page_cache_add_speculative() for the hugepage case. -*/ - page = pte_page(pte); - head = compound_head(page); - if (!page_cache_get_speculative(head)) - return 0; - if (unlikely(pte_val(pte) != pte_val(*ptep))) { - put_page(head); - return 0; - } - - pages[*nr] = page; - (*nr)++; - } while (ptep++, addr += PAGE_SIZE, addr != end); - - return 1; -} - -static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, - unsigned long end, int write, struct page **pages, - int *nr) -{ - struct page *head, *page; - int refs; - - if (!(pmd_val(pmd) & _PAGE_VALID)) - return 0; - - if (write && !pmd_write(pmd)) - return 0; - - refs = 0; - page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - head = compound_
[PATCH 09/16] sparc64: define untagged_addr()
Add a helper to untag a user pointer. This is needed for ADI support in get_user_pages_fast. Signed-off-by: Christoph Hellwig --- arch/sparc/include/asm/pgtable_64.h | 22 ++ 1 file changed, 22 insertions(+) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index dcf970e82262..a93eca29e85a 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -1076,6 +1076,28 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma, } #define io_remap_pfn_range io_remap_pfn_range +static inline unsigned long untagged_addr(unsigned long start) +{ + if (adi_capable()) { + long addr = start; + + /* If userspace has passed a versioned address, kernel +* will not find it in the VMAs since it does not store +* the version tags in the list of VMAs. Storing version +* tags in list of VMAs is impractical since they can be +* changed any time from userspace without dropping into +* kernel. Any address search in VMAs will be done with +* non-versioned addresses. Ensure the ADI version bits +* are dropped here by sign extending the last bit before +* ADI bits. IOMMU does not implement version tags. +*/ + return (addr << (long)adi_nbits()) >> (long)adi_nbits(); + } + + return start; +} +#define untagged_addr untagged_addr + #include #include -- 2.20.1
[PATCH 04/16] mm: lift the x86_32 PAE version of gup_get_pte to common code
The split low/high access is the only non-READ_ONCE version of gup_get_pte that did show up in the various arch implemenations. Lift it to common code and drop the ifdef based arch override. Signed-off-by: Christoph Hellwig --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable-3level.h | 47 arch/x86/kvm/mmu.c| 2 +- mm/Kconfig| 3 ++ mm/gup.c | 51 --- 5 files changed, 52 insertions(+), 52 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2bbbd4d1ba31..7cd53cc59f0f 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -121,6 +121,7 @@ config X86 select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL + select GUP_GET_PTE_LOW_HIGH if X86_PAE select HARDLOCKUP_CHECK_TIMESTAMP if X86_64 select HAVE_ACPI_APEI if ACPI select HAVE_ACPI_APEI_NMI if ACPI diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h index f8b1ad2c3828..e3633795fb22 100644 --- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -285,53 +285,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp) #define __pte_to_swp_entry(pte)(__swp_entry(__pteval_swp_type(pte), \ __pteval_swp_offset(pte))) -#define gup_get_pte gup_get_pte -/* - * WARNING: only to be used in the get_user_pages_fast() implementation. - * - * With get_user_pages_fast(), we walk down the pagetables without taking - * any locks. For this we would like to load the pointers atomically, - * but that is not possible (without expensive cmpxchg8b) on PAE. What - * we do have is the guarantee that a PTE will only either go from not - * present to present, or present to not present or both -- it will not - * switch to a completely different present page without a TLB flush in - * between; something that we are blocking by holding interrupts off. - * - * Setting ptes from not present to present goes: - * - * ptep->pte_high = h; - * smp_wmb(); - * ptep->pte_low = l; - * - * And present to not present goes: - * - * ptep->pte_low = 0; - * smp_wmb(); - * ptep->pte_high = 0; - * - * We must ensure here that the load of pte_low sees 'l' iff pte_high - * sees 'h'. We load pte_high *after* loading pte_low, which ensures we - * don't see an older value of pte_high. *Then* we recheck pte_low, - * which ensures that we haven't picked up a changed pte high. We might - * have gotten rubbish values from pte_low and pte_high, but we are - * guaranteed that pte_low will not have the present bit set *unless* - * it is 'l'. Because get_user_pages_fast() only operates on present ptes - * we're safe. - */ -static inline pte_t gup_get_pte(pte_t *ptep) -{ - pte_t pte; - - do { - pte.pte_low = ptep->pte_low; - smp_rmb(); - pte.pte_high = ptep->pte_high; - smp_rmb(); - } while (unlikely(pte.pte_low != ptep->pte_low)); - - return pte; -} - #include #endif /* _ASM_X86_PGTABLE_3LEVEL_H */ diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 1e9ba81accba..3f7cd11168f9 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -653,7 +653,7 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) /* * The idea using the light way get the spte on x86_32 guest is from - * gup_get_pte(arch/x86/mm/gup.c). + * gup_get_pte (mm/gup.c). * * An spte tlb flush may be pending, because kvm_set_pte_rmapp * coalesces them and we are running out of the MMU lock. Therefore diff --git a/mm/Kconfig b/mm/Kconfig index f0c76ba47695..fe51f104a9e0 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -762,6 +762,9 @@ config GUP_BENCHMARK See tools/testing/selftests/vm/gup_benchmark.c +config GUP_GET_PTE_LOW_HIGH + bool + config ARCH_HAS_PTE_SPECIAL bool diff --git a/mm/gup.c b/mm/gup.c index e7566f5ff9cf..a86d65cd7051 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1683,17 +1683,60 @@ struct page *get_dump_page(unsigned long addr) * This code is based heavily on the PowerPC implementation by Nick Piggin. */ #ifdef CONFIG_HAVE_GENERIC_GUP +#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH +/* + * WARNING: only to be used in the get_user_pages_fast() implementation. + * + * With get_user_pages_fast(), we walk down the pagetables without taking any + * locks. For this we would like to load the pointers atomically, but sometimes + * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What + * we do have is the guarantee that a PTE will only either go from not present + * to present, or present to not present or both -- it will not switch to a + * completely different present page without a TLB flush in between; something + * that we are blocking by h
[PATCH 07/16] sh: use the generic get_user_pages_fast code
The sh code is mostly equivalent to the generic one, minus various bugfixes and two arch overrides that this patch adds to pgtable.h. Signed-off-by: Christoph Hellwig --- arch/sh/Kconfig | 2 + arch/sh/include/asm/pgtable.h | 37 + arch/sh/mm/Makefile | 2 +- arch/sh/mm/gup.c | 277 -- 4 files changed, 40 insertions(+), 278 deletions(-) delete mode 100644 arch/sh/mm/gup.c diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index b77f512bb176..6fddfc3c9710 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -14,6 +14,7 @@ config SUPERH select HAVE_ARCH_TRACEHOOK select HAVE_PERF_EVENTS select HAVE_DEBUG_BUGVERBOSE + select HAVE_GENERIC_GUP select ARCH_HAVE_CUSTOM_GPIO_H select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A) select ARCH_HAS_GCOV_PROFILE_ALL @@ -63,6 +64,7 @@ config SUPERH config SUPERH32 def_bool "$(ARCH)" = "sh" select ARCH_32BIT_OFF_T + select GUP_GET_PTE_LOW_HIGH if X2TLB select HAVE_KPROBES select HAVE_KRETPROBES select HAVE_IOREMAP_PROT if MMU && !X2TLB diff --git a/arch/sh/include/asm/pgtable.h b/arch/sh/include/asm/pgtable.h index 3587103afe59..9085d1142fa3 100644 --- a/arch/sh/include/asm/pgtable.h +++ b/arch/sh/include/asm/pgtable.h @@ -149,6 +149,43 @@ extern void paging_init(void); extern void page_table_range_init(unsigned long start, unsigned long end, pgd_t *pgd); +static inline bool __pte_access_permitted(pte_t pte, u64 prot) +{ + return (pte_val(pte) & (prot | _PAGE_SPECIAL)) == prot; +} + +#ifdef CONFIG_X2TLB +static inline bool pte_access_permitted(pte_t pte, bool write) +{ + u64 prot = _PAGE_PRESENT; + + prot |= _PAGE_EXT(_PAGE_EXT_KERN_READ | _PAGE_EXT_USER_READ); + if (write) + prot |= _PAGE_EXT(_PAGE_EXT_KERN_WRITE | _PAGE_EXT_USER_WRITE); + return __pte_access_permitted(pte, prot); +} +#elif defined(CONFIG_SUPERH64) +static inline bool pte_access_permitted(pte_t pte, bool write) +{ + u64 prot = _PAGE_PRESENT | _PAGE_USER | _PAGE_READ; + + if (write) + prot |= _PAGE_WRITE; + return __pte_access_permitted(pte, prot); +} +#else +static inline bool pte_access_permitted(pte_t pte, bool write) +{ + u64 prot = _PAGE_PRESENT | _PAGE_USER; + + if (write) + prot |= _PAGE_RW; + return __pte_access_permitted(pte, prot); +} +#endif + +#define pte_access_permitted pte_access_permitted + /* arch/sh/mm/mmap.c */ #define HAVE_ARCH_UNMAPPED_AREA #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN diff --git a/arch/sh/mm/Makefile b/arch/sh/mm/Makefile index fbe5e79751b3..5051b38fd5b6 100644 --- a/arch/sh/mm/Makefile +++ b/arch/sh/mm/Makefile @@ -17,7 +17,7 @@ cacheops-$(CONFIG_CPU_SHX3) += cache-shx3.o obj-y += $(cacheops-y) mmu-y := nommu.o extable_32.o -mmu-$(CONFIG_MMU) := extable_$(BITS).o fault.o gup.o ioremap.o kmap.o \ +mmu-$(CONFIG_MMU) := extable_$(BITS).o fault.o ioremap.o kmap.o \ pgtable.o tlbex_$(BITS).o tlbflush_$(BITS).o obj-y += $(mmu-y) diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c deleted file mode 100644 index 277c882f7489.. --- a/arch/sh/mm/gup.c +++ /dev/null @@ -1,277 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * Lockless get_user_pages_fast for SuperH - * - * Copyright (C) 2009 - 2010 Paul Mundt - * - * Cloned from the x86 and PowerPC versions, by: - * - * Copyright (C) 2008 Nick Piggin - * Copyright (C) 2008 Novell Inc. - */ -#include -#include -#include -#include -#include - -static inline pte_t gup_get_pte(pte_t *ptep) -{ -#ifndef CONFIG_X2TLB - return READ_ONCE(*ptep); -#else - /* -* With get_user_pages_fast, we walk down the pagetables without -* taking any locks. For this we would like to load the pointers -* atomically, but that is not possible with 64-bit PTEs. What -* we do have is the guarantee that a pte will only either go -* from not present to present, or present to not present or both -* -- it will not switch to a completely different present page -* without a TLB flush in between; something that we are blocking -* by holding interrupts off. -* -* Setting ptes from not present to present goes: -* ptep->pte_high = h; -* smp_wmb(); -* ptep->pte_low = l; -* -* And present to not present goes: -* ptep->pte_low = 0; -* smp_wmb(); -* ptep->pte_high = 0; -* -* We must ensure here that the load of pte_low sees l iff pte_high -* sees h. We load pte_high *after* loading pte_low, which ensures we -* don't see an older value of pte_high. *Then* we recheck pte_low, -* which ensures that we haven't picked up a chang
[PATCH 08/16] sparc64: add the missing pgd_page definition
sparc64 only had pgd_page_vaddr, but not pgd_page. Signed-off-by: Christoph Hellwig --- arch/sparc/include/asm/pgtable_64.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 22500c3be7a9..dcf970e82262 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -861,6 +861,7 @@ static inline unsigned long pud_page_vaddr(pud_t pud) #define pud_clear(pudp)(pud_val(*(pudp)) = 0UL) #define pgd_page_vaddr(pgd)\ ((unsigned long) __va(pgd_val(pgd))) +#define pgd_page(pgd) virt_to_page(__va(pgd_val(pgd))) #define pgd_present(pgd) (pgd_val(pgd) != 0U) #define pgd_clear(pgdp)(pgd_val(*(pgdp)) = 0UL) -- 2.20.1
[PATCH 05/16] MIPS: use the generic get_user_pages_fast code
The mips code is mostly equivalent to the generic one, minus various bugfixes and an arch override for gup_fast_permitted. Note that this defines ARCH_HAS_PTE_SPECIAL for mips as mips has pte_special and pte_mkspecial implemented and used in the existing gup code. They are no-op stubs, though which makes me a little unsure if this is really right thing to do. Signed-off-by: Christoph Hellwig --- arch/mips/Kconfig | 3 + arch/mips/include/asm/pgtable.h | 3 + arch/mips/mm/Makefile | 1 - arch/mips/mm/gup.c | 303 4 files changed, 6 insertions(+), 304 deletions(-) delete mode 100644 arch/mips/mm/gup.c diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 70d3200476bf..64108a2a16d4 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -6,6 +6,7 @@ config MIPS select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT select ARCH_CLOCKSOURCE_DATA select ARCH_HAS_ELF_RANDOMIZE + select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_SUPPORTS_UPROBES @@ -34,6 +35,7 @@ config MIPS select GENERIC_SCHED_CLOCK if !CAVIUM_OCTEON_SOC select GENERIC_SMP_IDLE_THREAD select GENERIC_TIME_VSYSCALL + select GUP_GET_PTE_LOW_HIGH if CPU_MIPS32 && PHYS_ADDR_T_64BIT select HANDLE_DOMAIN_IRQ select HAVE_ARCH_COMPILER_H select HAVE_ARCH_JUMP_LABEL @@ -55,6 +57,7 @@ config MIPS select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER + select HAVE_GENERIC_GUP select HAVE_IDE select HAVE_IOREMAP_PROT select HAVE_IRQ_EXIT_ON_IRQ_STACK diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index 4ccb465ef3f2..7d27194e3b45 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -20,6 +20,7 @@ #include #include #include +#include struct mm_struct; struct vm_area_struct; @@ -626,6 +627,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define gup_fast_permitted(start, end) (!cpu_has_dc_aliases) + #include /* diff --git a/arch/mips/mm/Makefile b/arch/mips/mm/Makefile index f34d7ff5eb60..1e8d335025d7 100644 --- a/arch/mips/mm/Makefile +++ b/arch/mips/mm/Makefile @@ -7,7 +7,6 @@ obj-y += cache.o obj-y += context.o obj-y += extable.o obj-y += fault.o -obj-y += gup.o obj-y += init.o obj-y += mmap.o obj-y += page.o diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c deleted file mode 100644 index 4c2b4483683c.. --- a/arch/mips/mm/gup.c +++ /dev/null @@ -1,303 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * Lockless get_user_pages_fast for MIPS - * - * Copyright (C) 2008 Nick Piggin - * Copyright (C) 2008 Novell Inc. - * Copyright (C) 2011 Ralf Baechle - */ -#include -#include -#include -#include -#include -#include - -#include -#include - -static inline pte_t gup_get_pte(pte_t *ptep) -{ -#if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32) - pte_t pte; - -retry: - pte.pte_low = ptep->pte_low; - smp_rmb(); - pte.pte_high = ptep->pte_high; - smp_rmb(); - if (unlikely(pte.pte_low != ptep->pte_low)) - goto retry; - - return pte; -#else - return READ_ONCE(*ptep); -#endif -} - -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, - int write, struct page **pages, int *nr) -{ - pte_t *ptep = pte_offset_map(&pmd, addr); - do { - pte_t pte = gup_get_pte(ptep); - struct page *page; - - if (!pte_present(pte) || - pte_special(pte) || (write && !pte_write(pte))) { - pte_unmap(ptep); - return 0; - } - VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - page = pte_page(pte); - get_page(page); - SetPageReferenced(page); - pages[*nr] = page; - (*nr)++; - - } while (ptep++, addr += PAGE_SIZE, addr != end); - - pte_unmap(ptep - 1); - return 1; -} - -static inline void get_head_page_multiple(struct page *page, int nr) -{ - VM_BUG_ON(page != compound_head(page)); - VM_BUG_ON(page_count(page) == 0); - page_ref_add(page, nr); - SetPageReferenced(page); -} - -static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, - int write, struct page **pages, int *nr) -{ - pte_t pte = *(pte_t *)&pmd; - struct page *head, *page; - int refs; - -
[PATCH 03/16] mm: simplify gup_fast_permitted
Pass in the already calculated end value instead of recomputing it, and leave the end > start check in the callers instead of duplicating them in the arch code. Signed-off-by: Christoph Hellwig --- arch/s390/include/asm/pgtable.h | 8 +--- arch/x86/include/asm/pgtable_64.h | 8 +--- mm/gup.c | 17 +++-- 3 files changed, 9 insertions(+), 24 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 9f0195d5fa16..9b274fcaacb6 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1270,14 +1270,8 @@ static inline pte_t *pte_offset(pmd_t *pmd, unsigned long address) #define pte_offset_map(pmd, address) pte_offset_kernel(pmd, address) #define pte_unmap(pte) do { } while (0) -static inline bool gup_fast_permitted(unsigned long start, int nr_pages) +static inline bool gup_fast_permitted(unsigned long start, unsigned long end) { - unsigned long len, end; - - len = (unsigned long) nr_pages << PAGE_SHIFT; - end = start + len; - if (end < start) - return false; return end <= current->mm->context.asce_limit; } #define gup_fast_permitted gup_fast_permitted diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 0bb566315621..4990d26dfc73 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -259,14 +259,8 @@ extern void init_extra_mapping_uc(unsigned long phys, unsigned long size); extern void init_extra_mapping_wb(unsigned long phys, unsigned long size); #define gup_fast_permitted gup_fast_permitted -static inline bool gup_fast_permitted(unsigned long start, int nr_pages) +static inline bool gup_fast_permitted(unsigned long start, unsigned long end) { - unsigned long len, end; - - len = (unsigned long)nr_pages << PAGE_SHIFT; - end = start + len; - if (end < start) - return false; if (end >> __VIRTUAL_MASK_SHIFT) return false; return true; diff --git a/mm/gup.c b/mm/gup.c index 9775f7675653..e7566f5ff9cf 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2122,13 +2122,9 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, * Check if it's allowed to use __get_user_pages_fast() for the range, or * we need to fall back to the slow version: */ -bool gup_fast_permitted(unsigned long start, int nr_pages) +static bool gup_fast_permitted(unsigned long start, unsigned long end) { - unsigned long len, end; - - len = (unsigned long) nr_pages << PAGE_SHIFT; - end = start + len; - return end >= start; + return true; } #endif @@ -2149,6 +2145,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len; + if (end < start) + return 0; if (unlikely(!access_ok((void __user *)start, len))) return 0; @@ -2164,7 +2162,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, * block IPIs that come from THPs splitting. */ - if (gup_fast_permitted(start, nr_pages)) { + if (gup_fast_permitted(start, end)) { local_irq_save(flags); gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr); local_irq_restore(flags); @@ -2223,13 +2221,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages, len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len; - if (nr_pages <= 0) + if (end < start) return 0; - if (unlikely(!access_ok((void __user *)start, len))) return -EFAULT; - if (gup_fast_permitted(start, nr_pages)) { + if (gup_fast_permitted(start, end)) { local_irq_disable(); gup_pgd_range(addr, end, gup_flags, pages, &nr); local_irq_enable(); -- 2.20.1
[PATCH 06/16] sh: add the missing pud_page definition
sh only had pud_page_vaddr, but not pud_page. Signed-off-by: Christoph Hellwig --- arch/sh/include/asm/pgtable-3level.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/sh/include/asm/pgtable-3level.h b/arch/sh/include/asm/pgtable-3level.h index 7d8587eb65ff..8ff6fb6b4d19 100644 --- a/arch/sh/include/asm/pgtable-3level.h +++ b/arch/sh/include/asm/pgtable-3level.h @@ -37,6 +37,7 @@ static inline unsigned long pud_page_vaddr(pud_t pud) { return pud_val(pud); } +#define pud_page(pud) virt_to_page((void *)pud_page_vaddr(pud)) #define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1)) static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) -- 2.20.1
RFC: switch the remaining architectures to use generic GUP v2
Hi Linus and maintainers, below is a series to switch mips, sh and sparc64 to use the generic GUP code so that we only have one codebase to touch for further improvements to this code. I don't have hardware for any of these architectures, and generally no clue about their page table management, so handle with care. Changes since v1: - fix various issues found by the build bot - cherry pick and use the untagged_addr helper form Andrey - add various refactoring patches to share more code over architectures - move the powerpc hugepd code to mm/gup.c and sync it with the generic hup semantics
[PATCH 01/16] uaccess: add untagged_addr definition for other arches
From: Andrey Konovalov To allow arm64 syscalls to accept tagged pointers from userspace, we must untag them when they are passed to the kernel. Since untagging is done in generic parts of the kernel, the untagged_addr macro needs to be defined for all architectures. Define it as a noop for architectures other than arm64. Acked-by: Catalin Marinas Signed-off-by: Andrey Konovalov Signed-off-by: Christoph Hellwig --- include/linux/mm.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0e8834ac32b7..949d43e9c0b6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly; #include #include +#ifndef untagged_addr +#define untagged_addr(addr) (addr) +#endif + #ifndef __pa_symbol #define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0)) #endif -- 2.20.1
[PATCH 02/16] mm: use untagged_addr() for get_user_pages_fast addresses
This will allow sparc64 to override its ADI tags for get_user_pages and get_user_pages_fast. Signed-off-by: Christoph Hellwig --- mm/gup.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index f173fcbaf1b2..9775f7675653 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2145,7 +2145,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, unsigned long flags; int nr = 0; - start &= PAGE_MASK; + start = untagged_addr(start) & PAGE_MASK; len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len; @@ -2218,7 +2218,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, unsigned long addr, len, end; int nr = 0, ret = 0; - start &= PAGE_MASK; + start = untagged_addr(start) & PAGE_MASK; addr = start; len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len; -- 2.20.1