Re: [PATCH] powerpc/fault: kernel can extend a user process's stack

2019-12-10 Thread Michal Suchánek
On Wed, Dec 11, 2019 at 12:43:37PM +1100, Daniel Axtens wrote:
> If a process page-faults trying to write beyond the end of its
> stack, we attempt to grow the stack.
> 
> However, if the kernel attempts to write beyond the end of a
> process's stack, it takes a bad fault. This can occur when the
> kernel is trying to set up a signal frame.
> 
> Permit the kernel to grow a process's stack. The same general
> limits as to how and when the stack can be grown apply: the kernel
> code still passes through expand_stack(), so anything that goes
> beyond e.g. the rlimit should still be blocked.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=205183
Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.")
> Reported-by: Tom Lane 
> Cc: Daniel Black 
> Signed-off-by: Daniel Axtens 
> ---
>  arch/powerpc/mm/fault.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index b5047f9b5dec..00183731ea22 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -287,7 +287,17 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
> unsigned long address,
>   if (!res)
>   return !store_updates_sp(inst);
>   *must_retry = true;
> + } else if ((flags & FAULT_FLAG_WRITE) &&
> +!(flags & FAULT_FLAG_USER)) {
> + /*
> +  * the kernel can also attempt to write beyond the end
> +  * of a process's stack - for example setting up a
> +  * signal frame. We assume this is valid, subject to
> +  * the checks in expand_stack() later.
> +  */
> + return false;
>   }
> +
>   return true;
>   }
>   return false;
> -- 
> 2.20.1
> 


Call for report - G5/PPC970 status (was: Re: Found the commit for: 5.3.7 64-bits kernel doesn't boot on G5 Quad [regression])

2019-12-10 Thread Romain Dolbeau
On Wed, Dec 11, 2019 at 03:20, Aneesh Kumar K.V wrote:
> The PowerMac system we have internally was not able to recreate this.

To narrow down the issue - is that a PCI/PCI-X (7,3 [1]) or PCIe G5 (11,2 [1])?
Single, dual, or quad?

Same question to anyone else with a G5 / PPC970 - what is it, and does
it boot recent PPC64 Linux kernels?

Christian from the original report has a quad, like me (so powermac11,2).

There was also a report of a PowerMac7,3 working in the original discussion;
whether it was a single or dual was not specified.

So this might be a Quad thing, or a more general 11,2 thing...

> At this point, I am not sure what would cause the Machine check with
> that patch series because we have not changed the VA bits in that patch.

Is there any test I could run that would help you track down the bug?

Cordially,

Romain

[1] 


--
Romain Dolbeau


Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-10 Thread Daniel Axtens
Hi Balbir,

>> +Discontiguous memory can occur when you have a machine with memory spread
>> +across multiple nodes. For example, on a Talos II with 64GB of RAM:
>> +
>> + - 32GB runs from 0x0 to 0x0000_0008_0000_0000,
>> + - then there's a gap,
>> + - then the final 32GB runs from 0x0000_2000_0000_0000 to 0x0000_2008_0000_0000
>> +
>> +This can create _significant_ issues:
>> +
>> + - If we try to treat the machine as having 64GB of _contiguous_ RAM, we would
>> +   assume that ran from 0x0 to 0x0000_0010_0000_0000. We'd then reserve the
>> +   last 1/8th - 0x0000_000e_0000_0000 to 0x0000_0010_0000_0000 as the shadow
>> +   region. But when we try to access any of that, we'll try to access pages
>> +   that are not physically present.
>> +
>
> If we reserved memory for KASAN from each node (discontig region), we might
> survive this, no? Maybe we need NUMA-aware KASAN? That might be a generic
> change, just thinking out loud.

The challenge is that - AIUI - with inline instrumentation, the compiler
doesn't generate calls to things like __asan_loadN and __asan_storeN.
Instead it uses -fasan-shadow-offset to compute the shadow addresses and
emit the checks inline, and only calls the __asan_report* family of
functions if it detects an issue. This also matches what I can observe
with objdump across outline and inline instrumentation settings.

This means that for this sort of thing to work we would need to either
drop back to out-of-line calls, or teach the compiler how to use a
nonlinear, NUMA aware mem-to-shadow mapping.
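
To make that concrete, here is a small stand-alone toy (explicitly not the
kernel's KASAN code; the pool/shadow arrays and the offset computation are
purely illustrative) that models the linear shadow arithmetic the compiler
bakes in. Because the mapping is a single shift-and-add, it can be emitted
inline at every access; a NUMA-aware, non-linear mapping would need a table
lookup instead, which inline instrumentation has no way to express today.

/* Runnable userspace toy, not kernel code: models the linear
 * mem-to-shadow arithmetic that inline KASAN instrumentation hard-codes.
 * "pool" stands in for kernel memory and "shadow" for the shadow region;
 * the offset is derived so that  shadow = (addr >> 3) + offset  lands
 * inside the toy shadow array. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SCALE_SHIFT 3			/* one shadow byte covers 8 bytes */

static uint64_t pool[64];		/* stand-in for a chunk of memory */
static uint8_t shadow[sizeof(pool) >> SCALE_SHIFT];
static uintptr_t shadow_offset;		/* computed at startup, see main() */

static uint8_t *mem_to_shadow(const void *addr)
{
	/* The whole point: pure arithmetic, so the compiler can emit it
	 * inline at every load/store. */
	return (uint8_t *)(((uintptr_t)addr >> SCALE_SHIFT) + shadow_offset);
}

static void checked_store(uint64_t *p, uint64_t val)
{
	if (*mem_to_shadow(p))		/* non-zero shadow byte => poisoned */
		fprintf(stderr, "bad 8-byte store at %p\n", (void *)p);
	else
		*p = val;
}

int main(void)
{
	shadow_offset = (uintptr_t)shadow - ((uintptr_t)pool >> SCALE_SHIFT);
	memset(shadow, 0, sizeof(shadow));
	shadow[8] = 0xff;		/* poison the granule covering pool[8] */

	checked_store(&pool[0], 1);	/* fine */
	checked_store(&pool[8], 2);	/* reports a bad store */
	return 0;
}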

I'll document this a bit better in the next spin.

>> +if (IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_PPC_BOOK3S_64)) {
>> +kasan_memory_size =
>> +((phys_addr_t)CONFIG_PHYS_MEM_SIZE_FOR_KASAN << 20);
>> +
>> +if (top_phys_addr < kasan_memory_size) {
>> +/*
>> + * We are doomed. Attempts to call e.g. panic() are
>> + * likely to fail because they call out into
>> + * instrumented code, which will almost certainly
>> + * access memory beyond the end of physical
>> + * memory. Hang here so that at least the NIP points
>> + * somewhere that will help you debug it if you look at
>> + * it in qemu.
>> + */
>> +while (true)
>> +;
>
> Again, with the right hooks in check_memory_region_inline() these are
> recoverable, or so I think.

So unless I misunderstand the circumstances in which
check_memory_region_inline is used, this isn't going to help with inline
instrumentation.

>> +void __init kasan_init(void)
>> +{
>> +int i;
>> +void *k_start = kasan_mem_to_shadow((void *)RADIX_KERN_VIRT_START);
>> +void *k_end = kasan_mem_to_shadow((void *)RADIX_VMEMMAP_END);
>> +
>> +pte_t pte = __pte(__pa(kasan_early_shadow_page) |
>> +  pgprot_val(PAGE_KERNEL) | _PAGE_PTE);
>> +
>> +if (!early_radix_enabled())
>> +panic("KASAN requires radix!");
>> +
>
> I think this is avoidable, we could use a static key for disabling kasan in
> the generic code. I wonder what happens if someone tries to boot this
> image on a Power8 box and keeps panic'ing with no easy way of recovering.

Again, assuming I understand correctly that the compiler generates raw
IR->asm for these checks rather than calling out to a function, I don't
think we get a way to intercept those checks. It's too late to do
anything at the __asan_report* stage, because that will already have
accessed memory that's not set up properly.

If you try to boot this on a Power8 box it will panic and you'll have to
boot into another kernel from the bootloader. I don't think it's
avoidable without disabling inline instrumentation, but I'd love to be
proven wrong.

>
> NOTE: I can't test any of these, well, maybe with qemu. Let me see if I can
> spin the series and provide more feedback.

It's actually super easy to do simple boot tests with qemu; it works fine in
TCG. Michael's wiki page at
https://github.com/linuxppc/wiki/wiki/Booting-with-Qemu is very helpful.

I did this a lot in development.

My full commandline, fwiw, is:

qemu-system-ppc64  -m 8G -M pseries -cpu power9  -kernel 
../out-3s-radix/vmlinux  -nographic -chardev stdio,id=charserial0,mux=on 
-device spapr-vty,chardev=charserial0,reg=0x3000 -initrd 
./rootfs-le.cpio.xz -mon chardev=charserial0,mode=readline -nodefaults -smp 4

Regards,
Daniel



Re: [PATCH 2/2] powerpc/64s/exception: Add missing comma to INT_KVM_HANDLER macro for system_reset

2019-12-10 Thread Andrew Donnellan

On 11/12/19 1:37 pm, Jordan Niethe wrote:

The INT_KVM_HANDLER macro for system_reset is missing a comma so add it
to be consistent.

Signed-off-by: Jordan Niethe 


I see another one for machine_check as well which you might want to fix

Reviewed-by: Andrew Donnellan 


--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



Re: [PATCH 1/2] powerpc/64s/exception: Remove unused parameters from KVMTEST macro

2019-12-10 Thread Andrew Donnellan

On 11/12/19 1:37 pm, Jordan Niethe wrote:

The hsrr and n parameters are never used by the KVMTEST macro so remove
them.

Signed-off-by: Jordan Niethe 


Reviewed-by: Andrew Donnellan 

--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



Re: [PATCH] powerpc/64: Use {SAVE,REST}_NVGPRS macros

2019-12-10 Thread Andrew Donnellan

On 11/12/19 1:35 pm, Jordan Niethe wrote:

In entry_64.S there are places that open code saving and restoring the
non-volatile registers. There are already macros for doing this so use
them.

Signed-off-by: Jordan Niethe 


Reviewed-by: Andrew Donnellan 


--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



[PATCH v9 24/25] mm/gup_benchmark: support pin_user_pages() and related calls

2019-12-10 Thread John Hubbard
Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages():  via the '-b' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new page_dma_pinned() routine.
Those commands are:

PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_BENCHMARK  : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.
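
For reference, here is a minimal user-space sketch of driving the new
PIN_FAST_BENCHMARK command directly (modelled on the selftest changes in
this patch; the ioctl number is taken from the hunk below, the struct
layout mirrors the in-tree selftest and should be treated as illustrative,
and the 128 MB buffer size is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/types.h>

struct gup_benchmark {
	__u64 get_delta_usec;
	__u64 put_delta_usec;
	__u64 addr;
	__u64 size;
	__u32 nr_pages_per_call;
	__u32 flags;
	__u64 expansion[10];		/* For future use */
};

#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)

int main(void)
{
	struct gup_benchmark gup = { 0 };
	unsigned long size = 128UL << 20;	/* arbitrary */
	void *buf;
	int fd;

	fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	gup.addr = (unsigned long)buf;
	gup.size = size;
	gup.nr_pages_per_call = 1024;

	/* The kernel side calls pin_user_pages_fast(), checks that every
	 * page is reported as dma-pinned, then unpins the pages. */
	if (ioctl(fd, PIN_FAST_BENCHMARK, &gup)) {
		perror("ioctl");
		return 1;
	}
	printf("pin: %llu us, unpin: %llu us\n",
	       (unsigned long long)gup.get_delta_usec,
	       (unsigned long long)gup.put_delta_usec);
	return 0;
}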

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 65 --
 tools/testing/selftests/vm/gup_benchmark.c | 15 -
 2 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7fc44d25eca7..76d32db48af8 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK  _IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -19,6 +21,42 @@ struct gup_benchmark {
__u64 expansion[10];/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long 
nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case GUP_FAST_BENCHMARK:
+   case GUP_LONGTERM_BENCHMARK:
+   case GUP_BENCHMARK:
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   break;
+
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   unpin_user_pages(pages, nr_pages);
+   break;
+   }
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   for (i = 0; i < nr_pages; i++) {
+   if (WARN(!page_dma_pinned(pages[i]),
+"pages[%d] is NOT dma-pinned\n", i))
+   break;
+   }
+   break;
+   }
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
struct gup_benchmark *gup)
 {
@@ -65,6 +103,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
+   case PIN_FAST_BENCHMARK:
+   nr = pin_user_pages_fast(addr, nr, gup->flags,
+pages + i);
+   break;
+   case PIN_BENCHMARK:
+   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   NULL);
+   break;
default:
return -1;
}
@@ -75,15 +121,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();
 
+   /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+   nr_pages = i;
+
gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;
 
+   /*
+* Take an un-benchmark-timed moment to verify DMA pinned
+* state: print a warning if any non-dma-pinned pages are found:
+*/
+   verify_dma_pinned(cmd, pages, nr_pages);
+
start_time = ktime_get();
-   for (i = 0; i < nr_pages; i++) {
-   if (!pages[i])
-   break;
-   put_page(pages[i]);
-   }
+
+   put_back_pages(cmd, pages, nr_pages);
+
end_time = ktime_get();
gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -101,6 +154,8 @@ static long gup_benchmark_ioctl(struct file *filep, 
unsigned int cmd,
case GUP_FAST_BENCHMARK:
case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
break;
default:
return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 

[PATCH v9 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

2019-12-10 Thread John Hubbard
It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 tools/testing/selftests/vm/run_vmtests | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests 
b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..df6a6bf3f238 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,28 @@ echo "NOTE: The above hugetlb tests provide minimal 
coverage.  Use"
 echo "  https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "  hugetlb regression testing."
 
+echo ""
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo ""
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
+echo "--"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "--"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
 echo "---"
 echo "running userfaultfd"
 echo "---"
-- 
2.24.0



[PATCH v9 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*()

2019-12-10 Thread John Hubbard
Rename put_user_page*() to unpin_user_page*(), in order to provide a
clearer, more symmetric API for pinning and unpinning DMA pages. This way,
pin_user_pages*() calls match up with unpin_user_pages*() calls, and the
API is a lot closer to being self-explanatory.

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst   |  2 +-
 arch/powerpc/mm/book3s64/iommu_api.c|  4 +--
 drivers/gpu/drm/via/via_dmablit.c   |  4 +--
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 ++--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  4 +--
 drivers/platform/goldfish/goldfish_pipe.c   |  4 +--
 drivers/vfio/vfio_iommu_type1.c |  2 +-
 fs/io_uring.c   |  4 +--
 include/linux/mm.h  | 26 -
 mm/gup.c| 32 ++---
 mm/process_vm_access.c  |  4 +--
 net/xdp/xdp_umem.c  |  2 +-
 18 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index a7a261d869f1..fd2a19c96189 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -219,7 +219,7 @@ since the system was booted, via two new /proc/vmstat 
entries: ::
 /proc/vmstat/nr_foll_pin_requested
 
 Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in put_user_page(), when they
+because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
 References
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index a86547822034..eba73ebd8ae5 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -168,7 +168,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
 
 free_exit:
/* free the references taken */
-   put_user_pages(mem->hpages, pinned);
+   unpin_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -214,7 +214,7 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_user_page(page);
+   unpin_user_page(page);
 
mem->hpas[i] = 0;
}
diff --git a/drivers/gpu/drm/via/via_dmablit.c 
b/drivers/gpu/drm/via/via_dmablit.c
index 37c5e572993a..719d036c9384 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t 
*vsg)
kfree(vsg->desc_pages);
/* fall through */
case dr_via_pages_locked:
-   put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
- (vsg->direction == DMA_FROM_DEVICE));
+   unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+  (vsg->direction == DMA_FROM_DEVICE));
/* fall through */
case dr_via_pages_alloc:
vfree(vsg->pages);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 55daefaa9b88..a6094766b6f5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
 
for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
page = sg_page_iter_page(&sg_iter);
-   put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
+   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 9a94761765c0..3b505006c0a6 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   put_user_pages_dirty_lock(p, npages, dirty);
+   unpin_user_pages_dirty_lock(p, npages, dirty);
 
if (mm) { /* during close after signal, mm can be NULL */
atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index 

[PATCH v9 23/25] mm/gup: track FOLL_PIN pages

2019-12-10 Thread John Hubbard
Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via unpin_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], and [3].

This also changes a BUG_ON() to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
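
As a toy illustration of the counting scheme behind this (explicitly not
the kernel implementation: struct page, compound pages and the atomic
refcount handling are all elided, and 1024 is used here purely as an
example power-of-two bias), FOLL_PIN adds a large bias to the refcount
instead of +1, so "is this page dma-pinned?" becomes a cheap threshold
test. False positives are possible if enough ordinary references pile up,
which is why the check is best-effort:

#include <stdbool.h>
#include <stdio.h>

#define TOY_PIN_COUNTING_BIAS	(1 << 10)	/* illustrative value */

struct toy_page { int refcount; };

static void toy_get_page(struct toy_page *p)   { p->refcount += 1; }
static void toy_put_page(struct toy_page *p)   { p->refcount -= 1; }
static void toy_pin_page(struct toy_page *p)   { p->refcount += TOY_PIN_COUNTING_BIAS; }
static void toy_unpin_page(struct toy_page *p) { p->refcount -= TOY_PIN_COUNTING_BIAS; }

static bool toy_page_dma_pinned(const struct toy_page *p)
{
	return p->refcount >= TOY_PIN_COUNTING_BIAS;
}

int main(void)
{
	struct toy_page p = { .refcount = 1 };	/* the original reference */

	toy_get_page(&p);			/* an ordinary FOLL_GET reference */
	printf("pinned? %d\n", toy_page_dma_pinned(&p));	/* 0 */

	toy_pin_page(&p);			/* a FOLL_PIN reference */
	printf("pinned? %d\n", toy_page_dma_pinned(&p));	/* 1 */

	toy_unpin_page(&p);
	toy_put_page(&p);
	return 0;
}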

Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Cc: Kirill A. Shutemov 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst |   2 +-
 include/linux/mm.h|  77 -
 include/linux/mmzone.h|   2 +
 include/linux/page_ref.h  |  10 +
 mm/gup.c  | 369 --
 mm/huge_memory.c  |  26 +-
 mm/hugetlb.c  |  25 +-
 mm/vmstat.c   |   2 +
 8 files changed, 397 insertions(+), 116 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index fd2a19c96189..2640d122de29 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -53,7 +53,7 @@ Which flags are set by each wrapper
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+value: GUP_PIN_COUNTING_BIAS.::
 
  Function
  
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a1a357e7d86..1765332f27e8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1005,6 +1005,10 @@ static inline bool is_pci_p2pdma_page(const struct page 
*page)
 #define page_ref_zero_or_close_to_overflow(page) \
((unsigned int) page_ref_count(page) + 127u <= 127u)
 
+#define page_ref_zero_or_close_to_bias_overflow(page) \
+   ((unsigned int) page_ref_count(page) + \
+   GUP_PIN_COUNTING_BIAS <= GUP_PIN_COUNTING_BIAS)
+
 static inline void get_page(struct page *page)
 {
page = compound_head(page);
@@ -1016,6 +1020,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
page = compound_head(page);
@@ -1044,29 +1050,70 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original 
page
+ * reference count, and also a new count of how many pin_user_pages() calls 
were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. 
In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Choice of value:
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
+ *
+ * Of course, once the 

[PATCH v9 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-12-10 Thread John Hubbard
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().

Acked-by: Daniel Vetter 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c 
b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  
drm_via_dmablit_t *xfer)
vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
if (NULL == vsg->pages)
return -ENOMEM;
-   ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+   ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
vsg->pages);
-- 
2.24.0



[PATCH v9 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-12-10 Thread John Hubbard
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Reviewed-by: Christoph Hellwig 
Acked-by: Hans Verkuil 
Cc: Mauro Carvalho Chehab 
Cc: 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++)
+   for (i = 0; i < dma->nr_pages; i++) {
+   if (dma->direction == DMA_FROM_DEVICE)
+   set_page_dirty_lock(dma->pages[i]);
put_page(dma->pages[i]);
+   }
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.0



[PATCH v9 16/25] net/xdp: set FOLL_PIN via pin_user_pages()

2019-12-10 Thread John Hubbard
Convert net/xdp to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Acked-by: Björn Töpel 
Signed-off-by: John Hubbard 
---
 net/xdp/xdp_umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..d071003b5e76 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
 
down_read(&current->mm->mmap_sem);
-   npgs = get_user_pages(umem->address, umem->npgs,
+   npgs = pin_user_pages(umem->address, umem->npgs,
  gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
up_read(&current->mm->mmap_sem);
 
-- 
2.24.0



[PATCH v9 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-12-10 Thread John Hubbard
1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Cc: Jan Kara 
Signed-off-by: John Hubbard 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..a86547822034 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
 
-   ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+   ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
mem->hpages + entry, NULL);
if (ret == n) {
@@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
return 0;
 
 free_exit:
-   /* free the reference taken */
-   for (i = 0; i < pinned; i++)
-   put_page(mem->hpages[i]);
+   /* free the references taken */
+   put_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_page(page);
+   put_user_page(page);
+
mem->hpas[i] = 0;
}
 }
-- 
2.24.0



[PATCH v9 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2019-12-10 Thread John Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.
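
As a call-pattern sketch (not part of this patch; it mirrors the VFIO
conversion in patch 07 of this series, and the helper name is made up), a
caller that wants a long-term remote pin can now do something like:

#include <linux/mm.h>

/* Sketch only: a long-term remote pin is accepted as long as the "locked"
 * argument is NULL, so FAULT_FLAG_ALLOW_RETRY never comes into play. */
static long pin_one_remote_page(struct mm_struct *mm, unsigned long vaddr,
				struct page **page)
{
	long ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
				    FOLL_WRITE | FOLL_LONGTERM,
				    page, NULL,
				    NULL /* locked must be NULL here */);
	up_read(&mm->mmap_sem);
	return ret;
}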

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3ecce297a47f..c0c56888e7cc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,13 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int flags);
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -1179,13 +1186,23 @@ long get_user_pages_remote(struct task_struct *tsk, 
struct mm_struct *mm,
struct vm_area_struct **vmas, int *locked)
 {
/*
-* FIXME: Current FOLL_LONGTERM behavior is incompatible with
+* Parts of FOLL_LONGTERM behavior are incompatible with
 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-* vmas.  As there are no users of this flag in this call we simply
-* disallow this option for now.
+* vmas. However, this only comes up if locked is set, and there are
+* callers that do request FOLL_LONGTERM, but do not set locked. So,
+* allow what we can.
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-   return -EINVAL;
+   if (gup_flags & FOLL_LONGTERM) {
+   if (WARN_ON_ONCE(locked))
+   return -EINVAL;
+   /*
+* This will check the vmas (even if our vmas arg is NULL)
+* and return -ENOTSUPP if DAX isn't allowed in this case:
+*/
+   return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+vmas, gup_flags | FOLL_TOUCH |
+FOLL_REMOTE);
+   }
 
return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
   locked,
-- 
2.24.0



[PATCH v9 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-12-10 Thread John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/process_vm_access.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
if (copy > len)
copy = len;
 
-   if (vm_write) {
+   if (vm_write)
copied = copy_page_from_iter(page, offset, copy, iter);
-   set_page_dirty_lock(page);
-   } else {
+   else
copied = copy_page_to_iter(page, offset, copy, iter);
-   }
+
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
flags |= FOLL_WRITE;
 
while (!rc && nr_pages && iov_iter_count(iter)) {
-   int pages = min(nr_pages, max_pages_per_loop);
+   int pinned_pages = min(nr_pages, max_pages_per_loop);
int locked = 1;
size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 * current/current->mm
 */
down_read(&mm->mmap_sem);
-   pages = get_user_pages_remote(task, mm, pa, pages, flags,
- process_pages, NULL, &locked);
+   pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+flags, process_pages,
+NULL, &locked);
if (locked)
up_read(&mm->mmap_sem);
-   if (pages <= 0)
+   if (pinned_pages <= 0)
return -EFAULT;
 
-   bytes = pages * PAGE_SIZE - start_offset;
+   bytes = pinned_pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 vm_write);
len -= bytes;
start_offset = 0;
-   nr_pages -= pages;
-   pa += pages * PAGE_SIZE;
-   while (pages)
-   put_page(process_pages[--pages]);
+   nr_pages -= pinned_pages;
+   pa += pinned_pages * PAGE_SIZE;
+
+   /* If vm_write is set, the pages need to be made dirty: */
+   put_user_pages_dirty_lock(process_pages, pinned_pages,
+ vm_write);
}
 
return rc;
-- 
2.24.0



[PATCH v9 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

2019-12-10 Thread John Hubbard
Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
instead of a hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing
it once before issuing any of the get_user_pages*() calls. This
makes it harder to overlook, instead of having little "gup_flags & 1"
phrases in the function calls.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 9 ++---
 tools/testing/selftests/vm/gup_benchmark.c | 6 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..7fc44d25eca7 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -48,18 +48,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
+   /* Filter out most gup flags: only allow a tiny subset here: */
+   gup->flags &= FOLL_WRITE;
+
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+   nr = get_user_pages_fast(addr, nr, gup->flags,
 pages + i);
break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages(addr, nr,
-   (gup->flags & 1) | FOLL_LONGTERM,
+   gup->flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case GUP_BENCHMARK:
-   nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+   nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
default:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..389327e9b30a 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01/* check pte is writable */
+
 struct gup_benchmark {
__u64 get_delta_usec;
__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
}
 
gup.nr_pages_per_call = nr_pages;
-   gup.flags = write;
+   if (write)
+   gup.flags |= FOLL_WRITE;
 
fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
if (fd == -1)
-- 
2.24.0



[PATCH v9 19/25] vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-10 Thread John Hubbard
1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b800fc9a0251..18bfc2fc8e6d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
if (!is_invalid_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
-   if (prot & IOMMU_WRITE)
-   SetPageDirty(page);
-   put_page(page);
+
put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
return 1;
}
return 0;
@@ -329,7 +328,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(&mm->mmap_sem);
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-- 
2.24.0



[PATCH v9 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-10 Thread John Hubbard
1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..162a2633b1e3 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = pin_user_pages(data & PAGE_MASK, dma->nr_pages,
 flags | FOLL_LONGTERM, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+   dprintk(1, "pin_user_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++) {
-   if (dma->direction == DMA_FROM_DEVICE)
-   set_page_dirty_lock(dma->pages[i]);
-   put_page(dma->pages[i]);
-   }
+   put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.0



[PATCH v9 11/25] goldish_pipe: convert to pin_user_pages() and put_user_page()

2019-12-10 Thread John Hubbard
1. Call the new global pin_user_pages_fast(), from goldfish_pin_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index ef50c264db71..2a5901efecde 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int goldfish_pin_pages(unsigned long first_page,
*iter_last_page_size = last_page_size;
}
 
-   ret = get_user_pages_fast(first_page, requested_pages,
+   ret = pin_user_pages_fast(first_page, requested_pages,
  !is_write ? FOLL_WRITE : 0,
  pages);
if (ret <= 0)
@@ -285,18 +285,6 @@ static int goldfish_pin_pages(unsigned long first_page,
return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-  int is_write, s32 consumed_size)
-{
-   int i;
-
-   for (i = 0; i < pages_count; i++) {
-   if (!is_write && consumed_size > 0)
-   set_page_dirty(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
   int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+   put_user_pages_dirty_lock(pipe->pages, pages_count,
+ !is_write && *consumed_size > 0);
 
mutex_unlock(&pipe->lock);
return 0;
-- 
2.24.0



[PATCH v9 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-12-10 Thread John Hubbard
And get rid of the mmap_sem calls, as part of that. Note
that get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Reviewed-by: Leon Romanovsky 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7a3b99597ead..214e87aa609d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,16 +266,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(cur_base,
-min_t(unsigned long, npages,
-  PAGE_SIZE / sizeof (struct page *)),
-gup_flags | FOLL_LONGTERM,
-page_list, NULL);
-   if (ret < 0) {
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_fast(cur_base,
+ min_t(unsigned long, npages,
+   PAGE_SIZE /
+   sizeof(struct page *)),
+ gup_flags | FOLL_LONGTERM, page_list);
+   if (ret < 0)
goto umem_release;
-   }
 
cur_base += ret * PAGE_SIZE;
npages   -= ret;
@@ -283,8 +280,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = ib_umem_add_sg_table(sg, page_list, ret,
dma_get_max_seg_size(context->device->dma_device),
&umem->sg_nents);
-
-   up_read(&mm->mmap_sem);
}
 
sg_mark_end(sg);
-- 
2.24.0



[PATCH v9 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

2019-12-10 Thread John Hubbard
Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnecessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.
So remove it.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 30 +-
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..b800fc9a0251 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -322,7 +322,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
 {
struct page *page[1];
struct vm_area_struct *vma;
-   struct vm_area_struct *vmas[1];
unsigned int flags = 0;
int ret;
 
@@ -330,33 +329,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(&mm->mmap_sem);
-   if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
-   } else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
-   /*
-* The lifetime of a vaddr_get_pfn() page pin is
-* userspace-controlled. In the fs-dax case this could
-* lead to indefinite stalls in filesystem operations.
-* Disallow attempts to pin fs-dax pages via this
-* interface.
-*/
-   if (ret > 0 && vma_is_fsdax(vmas[0])) {
-   ret = -EOPNOTSUPP;
-   put_page(page[0]);
-   }
-   }
-   up_read(&mm->mmap_sem);
-
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-   return 0;
+   ret = 0;
+   goto done;
}
 
-   down_read(&mm->mmap_sem);
-
vaddr = untagged_addr(vaddr);
 
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
@@ -366,7 +346,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
if (is_invalid_reserved_pfn(*pfn))
ret = 0;
}
-
+done:
up_read(&mm->mmap_sem);
return ret;
 }
-- 
2.24.0



[PATCH v9 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-12-10 Thread John Hubbard
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Reviewed-by: Jens Axboe 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 405be10da73d..9639ebc21e8a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4521,7 +4521,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx 
*ctx, void __user *arg,
 
ret = 0;
down_read(&current->mm->mmap_sem);
-   pret = get_user_pages(ubuf, nr_pages,
+   pret = pin_user_pages(ubuf, nr_pages,
  FOLL_WRITE | FOLL_LONGTERM,
  pages, vmas);
if (pret == nr_pages) {
-- 
2.24.0



[PATCH v9 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-12-10 Thread John Hubbard
Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

For now, these are placeholder calls, until the various call sites
are converted to use the correct get_user_pages*() or
pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

All pages that are pinned via the above calls must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
* Start with "pin_user_pages" instead of "get_user_pages". That
  makes it easy to find and audit the call sites.
* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page(). (A call-pattern sketch follows below.)
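
To make the pairing rule above concrete, here is a call-pattern sketch
(kernel context; not from this patch, the function and parameter names are
made up, and error handling and the actual I/O are elided). It mirrors the
driver conversions elsewhere in this series:

#include <linux/errno.h>
#include <linux/mm.h>

static int example_dio_rw(unsigned long uaddr, int nr_pages,
			  struct page **pages, bool device_wrote_to_pages)
{
	int pinned;

	/* FOLL_PIN is added internally by the pin_*() wrapper; callers
	 * never pass it directly. */
	pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (pinned <= 0)
		return pinned ? pinned : -EFAULT;

	/* ... set up the DMA transfer and wait for it to complete ... */

	/* Release with a put_user_page*() variant, never put_page():
	 * dirty the pages only if the device wrote to them. */
	put_user_pages_dirty_lock(pages, pinned, device_wrote_to_pages);
	return 0;
}

A later patch in this series renames the release side to
unpin_user_page*()/unpin_user_pages_dirty_lock(), but the pairing
requirement stays the same.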

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded upon it.)

Reviewed-by: Jan Kara 
Reviewed-by: Mike Rapoport   # Documentation
Reviewed-by: Jérôme Glisse 
Cc: Jonathan Corbet 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/pin_user_pages.rst | 232 ++
 include/linux/mm.h|  63 --
 mm/gup.c  | 161 +--
 4 files changed, 423 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+   pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index ..a7a261d869f1
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+pin_user_pages() and related calls
+
+
+.. contents:: :local:
+
+Overview
+
+
+This document describes the following functions: ::
+
+ pin_user_pages
+ pin_user_pages_fast
+ pin_user_pages_remote
+
+Basic description of FOLL_PIN
+=
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions  (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for 
problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper 
calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pin pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ 
+ pin_user_pages  FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast FOLL_PIN is always set internally by this function.
+ 

[PATCH v9 12/25] IB/{core, hw, umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

2019-12-10 Thread John Hubbard
Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  | 13 ++---
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 214e87aa609d..55daefaa9b88 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,7 +266,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   ret = get_user_pages_fast(cur_base,
+   ret = pin_user_pages_fast(cur_base,
  min_t(unsigned long, npages,
PAGE_SIZE /
sizeof(struct page *)),
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index e42d44e501fd..abc3bb6578cc 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -308,9 +308,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For 
on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
@@ -363,7 +362,7 @@ static int ib_umem_odp_map_dma_single_page(
}
 
 out:
-   put_user_page(page);
+   put_page(page);
return ret;
 }
 
@@ -473,7 +472,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
ret = -EFAULT;
break;
}
-   put_user_page(local_page_list[j]);
+   put_page(local_page_list[j]);
continue;
}
 
@@ -500,8 +499,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
 * ib_umem_odp_map_dma_single_page().
 */
if (npages - (j + 1) > 0)
-   put_user_pages(&local_page_list[j+1],
-  npages - (j + 1));
+   release_pages(&local_page_list[j+1],
+ npages - (j + 1));
break;
}
}
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9a94761765c0 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
int ret;
unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
 
-   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+   ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
if (ret < 0)
return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..8269ab040c21 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
goto out;
}
 
-   ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
+   ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1,
  FOLL_WRITE | FOLL_LONGTERM, pages);
if (ret < 0)
goto out;
diff --git 

[PATCH v9 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast()

2019-12-10 Thread John Hubbard
Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().

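As a purely hypothetical illustration of what this unlocks (not part of the
patch), a caller that needs to force write access -- e.g. to break COW on a
read-only private mapping -- can now stay on the fast path:

  /* Hypothetical caller: pin one page for writing, forcing COW if the
   * VMA is mapped read-only, without dropping to the slow gup path. */
  static int demo_force_pin(unsigned long addr, struct page **page)
  {
          int ret;

          ret = get_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_FORCE, page);
          if (ret < 1)
                  return ret < 0 ? ret : -EFAULT;

          /* ... use *page ... */
          put_page(*page);
          return 0;
  }
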
Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Cc: Christoph Hellwig 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index c0c56888e7cc..958ab0757389 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2414,7 +2414,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned long addr, len, end;
int nr = 0, ret = 0;
 
-   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+  FOLL_FORCE)))
return -EINVAL;
 
start = untagged_addr(start) & PAGE_MASK;
-- 
2.24.0



[PATCH v9 05/25] goldish_pipe: rename local pin_user_pages() routine

2019-12-10 Thread John Hubbard
1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..ef50c264db71 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
}
 }
 
-static int pin_user_pages(unsigned long first_page,
- unsigned long last_page,
- unsigned int last_page_size,
- int is_write,
- struct page *pages[MAX_BUFFERS_PER_COMMAND],
- unsigned int *iter_last_page_size)
+static int goldfish_pin_pages(unsigned long first_page,
+ unsigned long last_page,
+ unsigned int last_page_size,
+ int is_write,
+ struct page *pages[MAX_BUFFERS_PER_COMMAND],
+ unsigned int *iter_last_page_size)
 {
int ret;
int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
if (mutex_lock_interruptible(&pipe->lock))
return -ERESTARTSYS;
 
-   pages_count = pin_user_pages(first_page, last_page,
-last_page_size, is_write,
-pipe->pages, &iter_last_page_size);
+   pages_count = goldfish_pin_pages(first_page, last_page,
+last_page_size, is_write,
+pipe->pages, &iter_last_page_size);
if (pages_count < 0) {
mutex_unlock(&pipe->lock);
return pages_count;
-- 
2.24.0



[PATCH v9 01/25] mm/gup: factor out duplicate code from four routines

2019-12-10 Thread John Hubbard
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Jan Kara 
Cc: Ira Weiny 
Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 91 ++--
 1 file changed, 36 insertions(+), 55 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..f764432914c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1978,6 +1978,25 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, 
unsigned long addr,
 }
 #endif
 
+static int record_subpages(struct page *page, unsigned long addr,
+  unsigned long end, struct page **pages)
+{
+   int nr;
+
+   for (nr = 0; addr != end; addr += PAGE_SIZE)
+   pages[nr++] = page++;
+
+   return nr;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+   /* Do a get_page() first, in case refs == page->_refcount */
+   get_page(page);
+   page_ref_sub(page, refs);
+   put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
@@ -2007,32 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   refs = 0;
head = pte_page(pte);
-
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-   do {
-   VM_BUG_ON(compound_head(page) != head);
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(head, refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-   /* Could be optimized better */
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2079,28 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
unsigned long addr,
return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pmd_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2120,28 +2118,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
unsigned long addr,
return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pud_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2157,28 +2146,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, 
unsigned long 

[PATCH v9 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-12-10 Thread John Hubbard
An upcoming patch uses try_get_compound_head() more widely,
so move it to the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f764432914c4..3ecce297a47f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+   struct page *head = compound_head(page);
+
+   if (WARN_ON_ONCE(page_ref_count(head) < 0))
+   return NULL;
+   if (unlikely(!page_cache_add_speculative(head, refs)))
+   return NULL;
+   return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1807,20 +1822,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int 
nr_start,
}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-   struct page *head = compound_head(page);
-   if (WARN_ON_ONCE(page_ref_count(head) < 0))
-   return NULL;
-   if (unlikely(!page_cache_add_speculative(head, refs)))
-   return NULL;
-   return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
-- 
2.24.0



[PATCH v9 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-10 Thread John Hubbard
Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.

Christoph Hellwig, a point of interest:

a) I've moved the bulk of the code out of the inline functions, as
   requested, for the devmap changes (patch 4: "mm: devmap: refactor
   1-based refcounting for ZONE_DEVICE pages").

Changes since v8:

* Merged the "mm/gup: pass flags arg to __gup_device_* functions" patch
  into the "mm/gup: track FOLL_PIN pages" patch, as requested by
  Christoph and Jan.

* Changed void grab_page() to bool try_grab_page(), and handled errors
  at the call sites. (From Jan's review comments.) try_grab_page()
  attempts to avoid page refcount overflows, even when counting up with
  GUP_PIN_COUNTING_BIAS increments.

* Fixed a bug that I'd introduced, when changing a BUG() to a WARN().

* Added Jan's reviewed-by tag to the " mm/gup: allow FOLL_FORCE for
  get_user_pages_fast()" patch.

* Documentation: pin_user_pages.rst: fixed an incorrect gup_benchmark
  invocation, left over from the pin_longterm days, spotted while preparing
  this version.

* Rebased onto today's linux.git (-rc1), and re-tested.

Changes since v7:

* Rebased onto Linux 5.5-rc1

* Reworked the grab_page() and try_grab_compound_head(), for API
  consistency and fewer diffs (thanks to Jan Kara's reviews).

* Added Leon Romanovsky's reviewed-by tags for two of the IB-related
  patches.

* patch 4 refactoring changes, as mentioned above.

There is a git repo and branch, for convenience:

g...@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8

For the remaining list of "changes since version N", those are all in
v7, which is here:

  https://lore.kernel.org/r/20191121071354.456618-1-jhubb...@nvidia.com


Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()

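For a driver, the conversion looks roughly like the hypothetical snippet
below (using the _fast() variants; the non-fast calls follow the same
pattern):

  /* Hypothetical driver code, before the conversion: */
  static int demo_before(unsigned long uaddr, struct page **pages, int n)
  {
          int got = get_user_pages_fast(uaddr, n, FOLL_WRITE, pages);

          if (got < 0)
                  return got;
          /* ... program DMA to/from the pages ... */
          while (got--)
                  put_page(pages[got]);
          return 0;
  }

  /* ... and after, using the pin/unpin API introduced by this series: */
  static int demo_after(unsigned long uaddr, struct page **pages, int n)
  {
          int got = pin_user_pages_fast(uaddr, n, FOLL_WRITE, pages);

          if (got < 0)
                  return got;
          /* ... program DMA to/from the pages ... */
          unpin_user_pages(pages, got);
          return 0;
  }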

Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've
  been able to runtime test the core get_user_pages() and
  pin_user_pages() and related routines, but not so much on several of
  the call sites--but those are generally just a couple of lines
  changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Runtime testing for the call sites so far is pretty light:

* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
 passes.
* infiniband: ran "ib_write_bw", which exercises the umem.c changes,
  but not the other changes.
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
  not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...

[1] Some slow progress on get_user_pages() (Apr 2, 2019): 
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): 
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): 
https://lwn.net/Articles/753027/

Dan Williams (1):
  mm: Cleanup __put_devmap_managed_page() vs ->page_free()

John Hubbard (24):
  mm/gup: factor out duplicate code from four routines
  mm/gup: move try_get_compound_head() to top, fix minor issues
  mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  goldish_pipe: rename local pin_user_pages() routine

[PATCH v9 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-10 Thread John Hubbard
An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
  what kind of page it is, and what kind of refcount handling it
  requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.
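
Taken together, the release path ends up looking roughly like this (a
condensed sketch of the hunks below, not a literal copy of the code):

  static inline void put_page_sketch(struct page *page)
  {
          page = compound_head(page);

          /* Read-only classification, include/linux/mm.h: */
          if (page_is_devmap_managed(page)) {
                  /* Drops the 1-based refcount and, when it reaches 1,
                   * calls free_devmap_managed_page() (mm/swap.c). */
                  put_devmap_managed_page(page);
                  return;
          }

          if (put_page_testzero(page))
                  __put_page(page);
  }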

Cc: Christoph Hellwig 
Suggested-by: Jérôme Glisse 
Reviewed-by: Dan Williams 
Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 17 +
 mm/memremap.c  | 16 ++--
 mm/swap.c  | 24 
 3 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c97ea3b694e6..77a4df06c8a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page 
*page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page 
*page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page 
*page)
return false;
 }
 
+bool put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool page_is_devmap_managed(struct page *page)
+{
+   return false;
+}
+
 static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
@@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
 * need to inform the device driver through callback. See
 * include/linux/memremap.h and HMM for details.
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
return;
+   }
 
if (put_page_testzero(page))
__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index e899fa876a62..2ba773859031 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-   int count = page_ref_dec_return(page);
-
-   /* still busy */
-   if (count > 1)
-   return;
-
-   /* only triggered by the dev_pagemap shutdown path */
-   if (count == 0) {
-   __put_page(page);
-   return;
-   }
-
/* notify page idle for dax */
if (!is_device_private_page(page)) {
wake_up_var(&page->_refcount);
@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..49f7c2eea0ba 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1102,3 +1102,27 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+bool put_devmap_managed_page(struct page *page)
+{
+   bool is_devmap = page_is_devmap_managed(page);
+
+   if (is_devmap) {
+   int count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+   }
+
+   return is_devmap;
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
+#endif
-- 
2.24.0



[PATCH v9 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free()

2019-12-10 Thread John Hubbard
From: Dan Williams 

After the removal of the device-public infrastructure there are only 2
->page_free() call backs in the kernel. One of those is a device-private
callback in the nouveau driver, the other is a generic wakeup needed in
the DAX case. In the hopes that all ->page_free() callbacks can be
migrated to common core kernel functionality, move the device-private
specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Cc: Jan Kara 
Cc: Ira Weiny 
Signed-off-by: Dan Williams 
Signed-off-by: John Hubbard 
---
 drivers/nvdimm/pmem.c |  6 
 mm/memremap.c | 80 ---
 2 files changed, 44 insertions(+), 42 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)
-{
-   wake_up_var(&page->_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-   .page_free  = pmem_pagemap_page_free,
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
 };
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..e899fa876a62 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(void)
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-   if (!pgmap->ops || !pgmap->ops->page_free) {
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+   (!pgmap->ops || !pgmap->ops->page_free)) {
WARN(1, "Missing page_free method\n");
return -EINVAL;
}
@@ -414,44 +415,51 @@ void __put_devmap_managed_page(struct page *page)
 {
int count = page_ref_dec_return(page);
 
-   /*
-* If refcount is 1 then page is freed and refcount is stable as nobody
-* holds a reference on the page.
-*/
-   if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
+   /* still busy */
+   if (count > 1)
+   return;
 
-   mem_cgroup_uncharge(page);
+   /* only triggered by the dev_pagemap shutdown path */
+   if (count == 0) {
+   __put_page(page);
+   return;
+   }
 
-   /*
-* When a device_private page is freed, the page->mapping field
-* may still contain a (stale) mapping value. For example, the
-* lower bits of page->mapping may still identify the page as
-* an anonymous page. Ultimately, this entire field is just
-* stale and wrong, and it will cause errors if not cleared.
-* One example is:
-*
-*  migrate_vma_pages()
-*migrate_vma_insert_page()
-*  page_add_new_anon_rmap()
-*__page_set_anon_rmap()
-*  ...checks page->mapping, via PageAnon(page) call,
-*and incorrectly concludes that the page is an
-*anonymous page. Therefore, it incorrectly,
-*silently fails to set up the new anon rmap.
-*
-* For other types of ZONE_DEVICE pages, migration is either
-* handled differently or not done at all, so there is no need
-* to clear page->mapping.
-*/
-   if (is_device_private_page(page))
-   page->mapping = NULL;
+   /* notify page idle for dax */
+   if (!is_device_private_page(page)) {
+   wake_up_var(&page->_refcount);
+   return;
+   }
 
-   page->pgmap->ops->page_free(page);
-   } else if (!count)
-   __put_page(page);
+   /* Clear Active bit in case of parallel mark_page_accessed */
+   __ClearPageActive(page);
+   __ClearPageWaiters(page);
+
+   mem_cgroup_uncharge(page);
+
+   /*
+* When a device_private page is freed, the page->mapping field
+* may still contain a (stale) mapping value. For example, the
+* lower bits of page->mapping may still identify the page as an
+* anonymous page. Ultimately, this entire field is just stale
+* and wrong, and it will cause errors if not cleared.  One
+* example is:
+*
+*  

[PATCH 2/2] powerpc/64s/exception: Add missing comma to INT_KVM_HANDLER macro for system_reset

2019-12-10 Thread Jordan Niethe
The INT_KVM_HANDLER macro for system_reset is missing a comma so add it
to be consistent.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 8bcf562242a2..528c893deefd 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -828,7 +828,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 */
 EXC_REAL_END(system_reset, 0x100, 0x100)
 EXC_VIRT_NONE(0x4100, 0x100)
-INT_KVM_HANDLER system_reset 0x100, EXC_STD, PACA_EXNMI, 0
+INT_KVM_HANDLER system_reset, 0x100, EXC_STD, PACA_EXNMI, 0
 
 #ifdef CONFIG_PPC_P7_NAP
 TRAMP_REAL_BEGIN(system_reset_idle_wake)
-- 
2.17.1



[PATCH 1/2] powerpc/64s/exception: Remove unused parameters from KVMTEST macro

2019-12-10 Thread Jordan Niethe
The hsrr and n parameters are never used by the KVMTEST macro so remove
them.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kernel/exceptions-64s.S | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index d0018dd17e0a..8bcf562242a2 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -210,7 +210,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define kvmppc_interrupt kvmppc_interrupt_pr
 #endif
 
-.macro KVMTEST name, hsrr, n
+.macro KVMTEST name
lbz r10,HSTATE_IN_GUEST(r13)
cmpwi   r10,0
bne \name\()_kvm
@@ -284,7 +284,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 .endm
 
 #else
-.macro KVMTEST name, hsrr, n
+.macro KVMTEST name
 .endm
 .macro KVM_HANDLER name, vec, hsrr, area, skip
 .endm
@@ -431,7 +431,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
SAVE_CTR(r10, \area\())
mfcr    r9
.if \kvm
-   KVMTEST \name \hsrr \vec
+   KVMTEST \name
.endif
.if \bitmask
lbz r10,PACAIRQSOFTMASK(r13)
@@ -1444,7 +1444,7 @@ EXC_VIRT_NONE(0x4b00, 0x100)
GET_PACA(r13)
std r10,PACA_EXGEN+EX_R10(r13)
INTERRUPT_TO_KERNEL
-   KVMTEST system_call EXC_STD 0xc00 /* uses r10, branch to 
system_call_kvm */
+   KVMTEST system_call /* uses r10, branch to system_call_kvm */
mfctr   r9
 #else
mr  r9,r13
@@ -1811,7 +1811,7 @@ EXC_REAL_BEGIN(denorm_exception_hv, 0x1500, 0x100)
andis.  r10,r10,(HSRR1_DENORM)@h /* denorm? */
bne+    denorm_assist
 #endif
-   KVMTEST denorm_exception_hv, EXC_HV 0x1500
+   KVMTEST denorm_exception_hv
INT_SAVE_SRR_AND_JUMP denorm_common, EXC_HV, 1
 EXC_REAL_END(denorm_exception_hv, 0x1500, 0x100)
 
-- 
2.17.1



[PATCH] powerpc/64: Use {SAVE,REST}_NVGPRS macros

2019-12-10 Thread Jordan Niethe
In entry_64.S there are places that open code saving and restoring the
non-volatile registers. There are already macros for doing this so use
them.
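
For reference, the macros being used here come from
arch/powerpc/include/asm/ppc_asm.h and (quoting from memory, check the
header for the authoritative definitions) simply wrap the bulk
save/restore helpers for the non-volatile range r14-r31:

  #define SAVE_NVGPRS(base)      SAVE_8GPRS(14, base); SAVE_10GPRS(22, base)
  #define REST_NVGPRS(base)      REST_8GPRS(14, base); REST_10GPRS(22, base)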

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kernel/entry_64.S | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6467bdab8d40..457b86c13c19 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -591,8 +591,7 @@ _GLOBAL(_switch)
std r0,16(r1)
stdu    r1,-SWITCH_FRAME_SIZE(r1)
/* r3-r13 are caller saved -- Cort */
-   SAVE_8GPRS(14, r1)
-   SAVE_10GPRS(22, r1)
+   SAVE_NVGPRS(r1)
std r0,_NIP(r1) /* Return to switch caller */
mfcr    r23
std r23,_CCR(r1)
@@ -716,8 +715,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
mtcrf   0xFF,r6
 
/* r3-r13 are destroyed -- Cort */
-   REST_8GPRS(14, r1)
-   REST_10GPRS(22, r1)
+   REST_NVGPRS(r1)
 
/* convert old thread to its task_struct for return value */
addi    r3,r3,-THREAD
@@ -1149,8 +1147,7 @@ _GLOBAL(enter_rtas)
 */
SAVE_GPR(2, r1) /* Save the TOC */
SAVE_GPR(13, r1)/* Save paca */
-   SAVE_8GPRS(14, r1)  /* Save the non-volatiles */
-   SAVE_10GPRS(22, r1) /* ditto */
+   SAVE_NVGPRS(r1) /* Save the non-volatiles */
 
mfcr    r4
std r4,_CCR(r1)
@@ -1257,8 +1254,7 @@ rtas_restore_regs:
/* relocation is on at this point */
REST_GPR(2, r1) /* Restore the TOC */
REST_GPR(13, r1)/* Restore paca */
-   REST_8GPRS(14, r1)  /* Restore the non-volatiles */
-   REST_10GPRS(22, r1) /* ditto */
+   REST_NVGPRS(r1) /* Restore the non-volatiles */
 
GET_PACA(r13)
 
@@ -1292,8 +1288,7 @@ _GLOBAL(enter_prom)
 */
SAVE_GPR(2, r1)
SAVE_GPR(13, r1)
-   SAVE_8GPRS(14, r1)
-   SAVE_10GPRS(22, r1)
+   SAVE_NVGPRS(r1)
mfcr    r10
mfmsr   r11
std r10,_CCR(r1)
@@ -1337,8 +1332,7 @@ _GLOBAL(enter_prom)
/* Restore other registers */
REST_GPR(2, r1)
REST_GPR(13, r1)
-   REST_8GPRS(14, r1)
-   REST_10GPRS(22, r1)
+   REST_NVGPRS(r1)
ld  r4,_CCR(r1)
mtcr    r4
 
-- 
2.17.1



Re: Found the commit for: 5.3.7 64-bits kernel doesn't boot on G5 Quad [regression]

2019-12-10 Thread Aneesh Kumar K.V
John Paul Adrian Glaubitz  writes:

> Hi!
>
> On 12/10/19 9:35 AM, Romain Dolbeau wrote:
>> On Sat, Nov 16, 2019 at 17:34, Romain Dolbeau wrote:
>>> So it seems to me that 0034d395f89d9c092bb15adbabdca5283e258b41
>>> introduced the bug that crashes the PowerMac G5
>> 
>> There's been some commits in that subsystem, so I tried again; as of
>> 6794862a16ef41f753abd75c03a152836e4c8028, the kernel still crashes
>> when trying to boot my PowerMac G5.
>
> If Aneesh is currently unable to look at the problem, I would suggest 
> reverting
> the commit in question since I don't think it's acceptable that users are 
> unable
> to boot their machines anymore after a kernel upgrade.
>

The PowerMac system we have internally was not able to recreate this.
Hence we have not been able to make progress on this.

At this point, I am not sure what would cause the Machine check with
that patch series because we have not changed the VA bits in that patch.

-aneesh


[Bug 205183] PPC64: Signal delivery fails with SIGSEGV if between about 1KB and 4KB bytes of stack remain

2019-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205183

--- Comment #3 from Daniel Axtens (d...@axtens.net) ---
I have a proposed patch at
https://lore.kernel.org/linuxppc-dev/20191211014337.28128-1-...@axtens.net/T/#u

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH] powerpc/fault: kernel can extend a user process's stack

2019-12-10 Thread Daniel Axtens
If a process page-faults trying to write beyond the end of its
stack, we attempt to grow the stack.

However, if the kernel attempts to write beyond the end of a
process's stack, it takes a bad fault. This can occur when the
kernel is trying to set up a signal frame.

Permit the kernel to grow a process's stack. The same general
limits as to how and when the stack can be grown apply: the kernel
code still passes through expand_stack(), so anything that goes
beyond e.g. the rlimit should still be blocked.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205183
Reported-by: Tom Lane 
Cc: Daniel Black 
Signed-off-by: Daniel Axtens 
---
 arch/powerpc/mm/fault.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index b5047f9b5dec..00183731ea22 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -287,7 +287,17 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
unsigned long address,
if (!res)
return !store_updates_sp(inst);
*must_retry = true;
+   } else if ((flags & FAULT_FLAG_WRITE) &&
+  !(flags & FAULT_FLAG_USER)) {
+   /*
+* the kernel can also attempt to write beyond the end
+* of a process's stack - for example setting up a
+* signal frame. We assume this is valid, subject to
+* the checks in expand_stack() later.
+*/
+   return false;
}
+
return true;
}
return false;
-- 
2.20.1



Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-10 Thread Michael Roth
Quoting Ram Pai (2019-12-06 19:12:39)
> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> secure guests")
> disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> path for secure VMs, helped enable dma_direct path.  This enabled
> support for bounce-buffering through SWIOTLB.  However it fails to
> operate when IOMMU is enabled, since I/O pages are not TCE mapped.
> 
> Renable dma_iommu_ops path for pseries Secure VMs.  It handles all
> cases including, TCE mapping I/O pages, in the presence of a
> IOMMU.

Wasn't clear to me at first, but I guess the main gist of this series is
that we want to continue to use SWIOTLB, but also need to create mappings
of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
checks in DMA API functions to do the same.

That makes sense, but one issue I see with that is that
dma_iommu_map_bypass() only tests true if all the following are true:

1) the device requests a 64-bit DMA mask via
   dma_set_mask/dma_set_coherent_mask
2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)

dma_is_direct() checks don't have this limitation, so I think for
other cases, such as devices that use a smaller DMA mask, we'll
end up falling back to the non-bypass functions in dma_iommu_ops, which
will likely break for things like dma_alloc_coherent/dma_map_single
since they won't use SWIOTLB pages and won't do the necessary calls to
set_memory_unencrypted() to share those non-SWIOTLB buffers with
hypervisor.

Maybe that's ok, but I think we should be clearer about how to
fail/handle these cases.

Though I also agree with some concerns Alexey stated earlier: it seems
wasteful to map the entire DDW window just so these bounce buffers can be
mapped.  Especially if you consider the lack of a mapping to be an additional
safe-guard against things like buggy device implementations on the QEMU
side. E.g. if we leaked pages to the hypervisor by accident, those pages
wouldn't be immediately accessible to a device, and would still require
additional work to get past the IOMMU.

What would it look like if we try to make all this work with disable_ddw passed
to kernel command-line (or forced for is_secure_guest())?

  1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* ops,
 but an additional case or hook that considers is_secure_guest() might do
 it.
 
  2) We'd also need to set up an IOMMU mapping for the bounce buffers via
 io_tlb_start/io_tlb_end. We could do it once, on-demand via
 dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
 maybe in some init function.

That also has the benefit of not requiring devices to support 64-bit DMA.

Alternatively, we could continue to rely on the 64-bit DDW window, but
modify call to enable_ddw() to only map the io_tlb_start/end range in
the case of is_secure_guest(). This is a little cleaner implementation-wise
since we can rely on the existing dma_iommu_{alloc,map}_bypass() hooks, but
devices that don't support 64-bit will fail back to not using dma_direct_* ops
and fail miserably. We'd probably want to handle that more gracefully.

Or we handle both cases gracefully. To me it makes more sense to enable
the non-DDW case first, then consider adding the DDW case later if there's some reason
why 64-bit DMA is needed. But would be good to hear if there are other
opinions.
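
To make option (1) above a little more concrete, the hook being described
might look roughly like the sketch below. This is purely illustrative, not
a proposed patch; it assumes is_secure_guest() from asm/svm.h and the
existing dma_iommu_map_bypass() helper mentioned earlier:

  /* Illustrative only: steer secure guests to the direct/SWIOTLB path
   * regardless of the device's DMA mask or the DDW window state. */
  static bool dma_iommu_want_bypass(struct device *dev, unsigned long attrs)
  {
          if (is_secure_guest())
                  return true;

          return dma_iommu_map_bypass(dev, attrs);
  }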

> 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 11 +--
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 67b5009..4e27d66 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -36,7 +36,6 @@
>  #include 
>  #include 
>  #include 
> -#include <asm/svm.h>
>  #include 
> 
>  #include "pseries.h"
> @@ -1346,15 +1345,7 @@ void iommu_init_early_pSeries(void)
> of_reconfig_notifier_register(&iommu_reconfig_nb);
> register_memory_notifier(&iommu_mem_nb);
> 
> -   /*
> -* Secure guest memory is inacessible to devices so regular DMA isn't
> -* possible.
> -*
> -* In that case keep devices' dma_map_ops as NULL so that the generic
> -* DMA code path will use SWIOTLB to bounce buffers for DMA.
> -*/
> -   if (!is_secure_guest())
> -   set_pci_dma_ops(_iommu_ops);
> +   set_pci_dma_ops(&dma_iommu_ops);
>  }
> 
>  static int __init disable_multitce(char *str)
> -- 
> 1.8.3.1
> 


Re: [PATCH v8 20/26] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-12-10 Thread Andrew Morton
On Mon, 9 Dec 2019 21:53:00 -0800 John Hubbard  wrote:

> > Correction: this is somehow missing the fixes that resulted from Jan Kara's 
> > review (he
> > noted that we can't take a page lock in this context). I must have picked 
> > up the
> > wrong version of it, when I rebased for -rc1.
> > 
> 
> Andrew, given that the series is now in -mm, what's the preferred way for me 
> to fix this?
> Send a v9 version of the whole series? Or something else?

I think a full resend is warranted at this time - it's only been in
there a day and there seem to be quite a number of changes to be made.



Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops)

2019-12-10 Thread Michael Ellerman
Peter Zijlstra  writes:
> On Tue, Dec 10, 2019 at 04:38:54PM +1100, Michael Ellerman wrote:
>
>> Good question, I'll have a look.
>> 
>> There seems to be confusion about what the type of the bit number is,
>> which is leading to sign extension in some cases and not others.
>
> Shiny.
>
>> It looks like the type should be unsigned long?
>
> I'm thinking unsigned makes most sense, I mean, negative bit offsets
> should 'work' but that's almost always guaranteed to be an out-of-bound
> operation.

Yeah I agree.

> As to 'long' vs 'int', I'm not sure, 4G bits is a long bitmap. But I
> suppose since the bitmap itself is 'unsigned long', we might as well use
> 'unsigned long' for the bitnr too.

4G is a lot of bits, but it's not *that* many.

eg. If we had a bit per 4K page on a 32T machine that would be 8G bits.

So unsigned long seems best.
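
A toy example of why the signed variants are a trap (hypothetical code,
assuming 64-bit longs, not taken from the kernel):

  /* With a signed 'int nr', the arithmetic shift lets a negative bit
   * number silently index the word *before* the bitmap, i.e. an
   * out-of-bounds store rather than an obvious failure. */
  static void toy_set_bit(int nr, unsigned long *addr)
  {
          unsigned long mask = 1UL << (nr & 63);
          unsigned long *word = addr + (nr >> 6);   /* -1 >> 6 == -1 */

          *word |= mask;
  }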

>>   Documentation/core-api/atomic_ops.rst:  void __clear_bit_unlock(unsigned 
>> long nr, unsigned long *addr);
>>   arch/mips/include/asm/bitops.h:static inline void 
>> __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
>>   arch/powerpc/include/asm/bitops.h:static inline void 
>> arch___clear_bit_unlock(int nr, volatile unsigned long *addr)
>>   arch/riscv/include/asm/bitops.h:static inline void 
>> __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
>>   arch/s390/include/asm/bitops.h:static inline void 
>> arch___clear_bit_unlock(unsigned long nr,
>>   include/asm-generic/bitops/instrumented-lock.h:static inline void 
>> __clear_bit_unlock(long nr, volatile unsigned long *addr)
>>   include/asm-generic/bitops/lock.h:static inline void 
>> __clear_bit_unlock(unsigned int nr,
>> 
>> So I guess step one is to convert our versions to use unsigned long, so
>> we're at least not tripping over that difference when comparing the
>> assembly.
>
> Yeah, I'll look at fixing the generic code, bitops/atomic.h and
> bitops/non-atomic.h don't even agree on the type of bitnr.

Thanks.

cheers


Re: [PATCH v8 23/26] mm/gup: pass flags arg to __gup_device_* functions

2019-12-10 Thread John Hubbard
On 12/10/19 4:49 AM, Jan Kara wrote:
> On Mon 09-12-19 14:53:41, John Hubbard wrote:
>> A subsequent patch requires access to gup flags, so pass the flags
>> argument through to the __gup_device_* functions.
>>
>> Also placate checkpatch.pl by shortening a nearby line.
>>
>> TODO: Christoph Hellwig requested folding this into the patch the uses
>> the gup flags arguments.
> 
> You should probably implement this TODO? :)
> 
>   Honza

Yes. Done for v9.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v8 24/26] mm/gup: track FOLL_PIN pages

2019-12-10 Thread John Hubbard
On 12/10/19 5:39 AM, Jan Kara wrote:
...
>> +void grab_page(struct page *page, unsigned int flags)
>> +{
>> +if (flags & FOLL_GET)
>> +get_page(page);
>> +else if (flags & FOLL_PIN) {
>> +get_page(page);
>> +WARN_ON_ONCE(flags & FOLL_GET);
>> +/*
>> + * Use get_page(), above, to do the refcount error
>> + * checking. Then just add in the remaining references:
>> + */
>> +page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1);
> 
> This is wrong for two reasons:
> 
> 1) You miss compound_head() indirection from get_page() for this
> page_ref_add().

whoops, yes that is missing.

> 
> 2) page_ref_add() could overflow the counter without noticing.
> 
> Especially with GUP_PIN_COUNTING_BIAS being non-trivial, it is realistic
> that an attacker might try to overflow the page refcount and we have to
> protect the kernel against that. So I think that all the places that would
> use grab_page() actually need to use try_grab_page() and then gracefully
> deal with the failure.
> 

OK, I've replaced grab_page() everywhere with try_grab_page(), with the
above issues fixed. The v7 patchset had error handling for grab_page() failures,
that had been reviewed, so relevant parts of that have reappeared.

I had initially hesitated to do this, but now I've gone ahead and added:

#define page_ref_zero_or_close_to_bias_overflow(page) \
((unsigned int) page_ref_count(page) + \
GUP_PIN_COUNTING_BIAS <= GUP_PIN_COUNTING_BIAS)

...which is used in the new try_grab_page() for protection.


>> @@ -278,11 +425,23 @@ static struct page *follow_page_pte(struct 
>> vm_area_struct *vma,
>>  goto retry;
>>  }
>>  
>> -if (flags & FOLL_GET) {
>> +if (flags & (FOLL_PIN | FOLL_GET)) {
>> +/*
>> + * Allow try_get_page() to take care of error handling, for
>> + * both cases: FOLL_GET or FOLL_PIN:
>> + */
>>  if (unlikely(!try_get_page(page))) {
>>  page = ERR_PTR(-ENOMEM);
>>  goto out;
>>  }
>> +
>> +if (flags & FOLL_PIN) {
>> +WARN_ON_ONCE(flags & FOLL_GET);
>> +
>> +/* We got a +1 refcount from try_get_page(), above. */
>> +page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1);
>> +__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
>> +}
>>  }
> 
> The same problem here as above, plus this place should use the same
> try_grab..() helper, shouldn't it?


Yes, now that the new try_grab_page() has behavior that matches what
this call site needs. Done.


> 
>> @@ -544,8 +703,8 @@ static struct page *follow_page_mask(struct 
>> vm_area_struct *vma,
>>  /* make this handle hugepd */
>>  page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
>>  if (!IS_ERR(page)) {
>> -BUG_ON(flags & FOLL_GET);
>> -return page;
>> +WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
>> +return NULL;
> 
> I agree with the change to WARN_ON_ONCE but why is correct the change of
> the return value? Note that this is actually a "success branch".
> 

Good catch, thanks! I worked through the logic...correctly at first, but then I 
must 
have become temporarily dazed by the raw destructive power of the pre-existing 
BUG_ON() statement, and screwed it up after all. :)


thanks,
-- 
John Hubbard
NVIDIA



[Bug 205099] KASAN hit at raid6_pq: BUG: Unable to handle kernel data access at 0x00f0fd0d

2019-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205099

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Kernel Version|5.4-rc1 |5.5-rc1
   See Also|https://bugzilla.kernel.org |
   |/show_bug.cgi?id=204479 |

--- Comment #5 from Erhard F. (erhar...@mailbox.org) ---
Unchanged in 5.5-rc1.

[..]
[   19.425679] BUG: Unable to handle kernel data access on read at 0x00f0fd0d
[   19.425693] Faulting instruction address: 0xf165a560
[   19.426731] Oops: Kernel access of bad area, sig: 11 [#1]
[   19.426745] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 DEBUG_PAGEALLOC PowerMac
[   19.426757] Modules linked in: raid6_pq(+) zlib_inflate ehci_pci(+) ohci_hcd
hwmon ehci_hcd i2c_algo_bit drm_kms_helper sungem syscopyarea sungem_phy
sysfillrect firewire_ohci firewire_core sysimgblt crc_itu_t sr_mod fb_sys_fops
usbcore ttm cdrom snd_aoa_i2sbus snd_aoa_soundbus usb_common snd_pcm snd_timer
snd soundcore drm ssb pcmcia drm_panel_orientation_quirks pcmcia_core
uninorth_agp agpgart
[   19.426986] CPU: 1 PID: 127 Comm: modprobe Tainted: GW
5.5.0-rc1-PowerMacG4 #3
[   19.426997] NIP:  f165a560 LR: f165a4e8 CTR: c0247f54
[   19.427009] REGS: e86d9740 TRAP: 0300   Tainted: GW 
(5.5.0-rc1-PowerMacG4)
[   19.427014] MSR:  02009032   CR: 8828  XER: 
[   19.427048] DAR: 00f0fd0d DSISR: 4000 
   GPR00: e86d9998 e86d97f8 ee26c720 f166d070 0010 
f165a4e8  
   GPR08:  00f0fd0d e7171aea fff0 c0247f54 00b25ff4
0060 0050 
   GPR16: f166d000 ec077000 ec076000 0070 0060 0050
0040 0030 
   GPR24: 0020 0010 0012 e86d9a84 e86d9a48 ec077010
ec076010  
[   19.427198] NIP [f165a560] raid6_altivec8_gen_syndrome_real+0x3c0/0x480
[raid6_pq]
[   19.427228] LR [f165a4e8] raid6_altivec8_gen_syndrome_real+0x348/0x480
[raid6_pq]
[   19.427234] Call Trace:
[   19.427242] [e86d97f8] [000a] 0xa (unreliable)
[   19.427277] [e86d99e8] [f165a654] raid6_altivec8_gen_syndrome+0x34/0x58
[raid6_pq]
[   19.427309] [e86d9a08] [f15d83d8] init_module+0x3d8/0x528 [raid6_pq]
[   19.427334] [e86d9b18] [c0005828] do_one_initcall+0xb8/0x36c
[   19.427355] [e86d9be8] [c010e5e0] do_init_module+0xa8/0x2c4
[   19.427369] [e86d9c18] [c01114c0] load_module+0x2be4/0x2d68
[   19.427383] [e86d9e18] [c01118b8] sys_finit_module+0x100/0x138
[   19.427397] [e86d9f38] [c001a274] ret_from_syscall+0x0/0x34
[   19.427410] --- interrupt: c01 at 0x96ff78
   LR = 0xafda14
[   19.427416] Instruction dump:
[   19.427429] 1304c4c4 7d2048ce 39210090 1325ccc4 7d6048ce 1346d4c4 81210088
1367dcc4 
[   19.427463] 7d1098ce 115f5b06 116b5800 8141008c <7c0048ce> 392100a0 7d8048ce
392100b0 
[   19.427505] ---[ end trace 161d8d283bc9b7b8 ]---

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH 5/6] mm, memory_hotplug: Provide argument for the pgprot_t in arch_add_memory()

2019-12-10 Thread Logan Gunthorpe



On 2019-12-10 4:25 a.m., David Hildenbrand wrote:
> On 10.12.19 11:34, Michal Hocko wrote:
>> On Tue 10-12-19 11:09:46, David Hildenbrand wrote:
>>> On 10.12.19 11:04, Michal Hocko wrote:
 On Mon 09-12-19 12:43:40, Dan Williams wrote:
> On Mon, Dec 9, 2019 at 12:24 PM Logan Gunthorpe  
> wrote:
>>
>>
>>
>> On 2019-12-09 12:23 p.m., David Hildenbrand wrote:
>>> On 09.12.19 20:13, Logan Gunthorpe wrote:
 [...]
  #ifdef CONFIG_MEMORY_HOTPLUG
 -int arch_add_memory(int nid, u64 start, u64 size,
 +int arch_add_memory(int nid, u64 start, u64 size, pgprot_t prot,
  struct mhp_restrictions *restrictions)
>>>
>>> Can we fiddle that into "struct mhp_restrictions" instead?
>>
>> Yes, if that's what people want, it's pretty trivial to do. I chose not
>> to do it that way because it doesn't get passed down to add_pages() and
>> it's not really a "restriction". If I don't hear any objections, I will
>> do that for v2.
>
> +1 to storing this information alongside the altmap in that structure.
> However, I agree struct mhp_restrictions, with the MHP_MEMBLOCK_API
> flag now gone, has lost all of its "restrictions". How about dropping
> the 'flags' property and renaming the struct to 'struct
> mhp_modifiers'?

 Hmm, this email somehow didn't end up in my inbox so I have missed it
 before replying.

 Well, mhp_modifiers makes some sense and it would reduce the API
 proliferation but how do you expect the prot part to be handled?
 I really do not want people to think about PAGE_KERNEL or which
 protection to use because my experience tells that this will get copied
 without much thinking or simply will break with some odd usecases.
 So how exactly this would be used?
>>>
>>> I was thinking about exactly the same "issue".
>>>
>>> 1. default initialization via a function
>>>
>>> memhp_modifier_default_init();
>>>
>>> 2. a flag that unlocks the prot field (default:0). Without the flag, it
>>> is ignored. We can keep the current initialization then.
>>>
>>> Other ideas?
>>
>> 3. a prot mask to apply on top of PAGE_KERNEL? Or would that be
>> insufficient/clumsy?
>>
> 
> If it works for the given use case, I guess this would be simple and ok.

I don't see how we can do that without a ton of work. The pgprot_t is
architecture specific so we'd need to add mask functions to every
architecture for every page cache type we need to use. I don't think
that's a good idea.

I think I slightly prefer option 2 over the above.  But I'd actually
prefer that callers have to think about the caching type: when we grew
the second user (memremap_pages()), which was paired with
track_pfn_remap(), it was actually subtly wrong because it could create
a mapping that track_pfn_remap() disagreed with. In fact, PAGE_KERNEL
has already been specified in memremap_pages() for ages, it was just
ignored when it got to the arch_add_memory() step which is were this
issue comes from.

In my opinion, having a coder and reviewer see PAGE_KERNEL and ask if
that makes sense is a benefit. Having it hidden because we don't want
people to think about it is worse, harder to understand and results in
bugs that are more difficult to spot.
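
For concreteness, the shape being discussed upthread would be something
like the following (hypothetical, not merged code):

  /* Hypothetical: fold the protection into the hotplug argument struct
   * rather than growing the arch_add_memory() parameter list. */
  struct mhp_modifiers {
          struct vmem_altmap *altmap;
          pgprot_t pgprot;        /* callers pick e.g. PAGE_KERNEL explicitly */
  };

  int arch_add_memory(int nid, u64 start, u64 size,
                      struct mhp_modifiers *modifiers);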

Though, we may be overthinking this: arch_add_memory() is a low level
non-exported API that's currently used in exactly two places. I don't
think there's going to be many, if any, valid new use cases coming up
for it in the future. That's more what memremap_pages() is for.

Logan


[Bug 205099] KASAN hit at raid6_pq: BUG: Unable to handle kernel data access at 0x00f0fd0d

2019-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205099

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #285367|0   |1
is obsolete||

--- Comment #4 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 286251
  --> https://bugzilla.kernel.org/attachment.cgi?id=286251=edit
dmesg (kernel 5.5-rc1, PowerMac G4 DP)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 205099] KASAN hit at raid6_pq: BUG: Unable to handle kernel data access at 0x00f0fd0d

2019-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205099

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #285369|0   |1
is obsolete||

--- Comment #3 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 286249
  --> https://bugzilla.kernel.org/attachment.cgi?id=286249=edit
kernel .config (5.5-rc1, PowerMac G4 DP)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH AUTOSEL 4.4 67/71] crypto: vmx - Avoid weird build failures

2019-12-10 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 4ee812f6143d78d8ba1399671d78c8d78bf2817c ]

In the vmx crypto Makefile we assign to a variable called TARGET and
pass that to the aesp8-ppc.pl and ghashp8-ppc.pl scripts.

The variable is meant to describe what flavour of powerpc we're
building for, eg. either 32 or 64-bit, and big or little endian.

Unfortunately TARGET is a fairly common name for a make variable, and
if it happens that TARGET is specified as a command line parameter to
make, the value specified on the command line will override our value.

In particular this can happen if the kernel Makefile is driven by an
external Makefile that uses TARGET for something.

This leads to weird build failures, eg:
  nonsense  at /build/linux/drivers/crypto/vmx/ghashp8-ppc.pl line 45.
  /linux/drivers/crypto/vmx/Makefile:20: recipe for target 
'drivers/crypto/vmx/ghashp8-ppc.S' failed

Which shows that we passed an empty value for $(TARGET) to the perl
script, confirmed with make V=1:

  perl /linux/drivers/crypto/vmx/ghashp8-ppc.pl  > 
drivers/crypto/vmx/ghashp8-ppc.S

We can avoid this confusion by using override, to tell make that we
don't want anything to override our variable, even a value specified
on the command line. We can also use a less common name, given the
script calls it "flavour", let's use that.

Signed-off-by: Michael Ellerman 
Signed-off-by: Herbert Xu 
Signed-off-by: Sasha Levin 
---
 drivers/crypto/vmx/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/vmx/Makefile b/drivers/crypto/vmx/Makefile
index d28ab96a24759..7663494809a0c 100644
--- a/drivers/crypto/vmx/Makefile
+++ b/drivers/crypto/vmx/Makefile
@@ -2,13 +2,13 @@ obj-$(CONFIG_CRYPTO_DEV_VMX_ENCRYPT) += vmx-crypto.o
 vmx-crypto-objs := vmx.o aesp8-ppc.o ghashp8-ppc.o aes.o aes_cbc.o aes_ctr.o 
ghash.o
 
 ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),y)
-TARGET := linux-ppc64le
+override flavour := linux-ppc64le
 else
-TARGET := linux-ppc64
+override flavour := linux-ppc64
 endif
 
 quiet_cmd_perl = PERL $@
-  cmd_perl = $(PERL) $(<) $(TARGET) > $(@)
+  cmd_perl = $(PERL) $(<) $(flavour) > $(@)
 
 $(src)/aesp8-ppc.S: $(src)/aesp8-ppc.pl
$(call cmd,perl)
-- 
2.20.1



[PATCH AUTOSEL 4.9 87/91] crypto: vmx - Avoid weird build failures

2019-12-10 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 4ee812f6143d78d8ba1399671d78c8d78bf2817c ]

In the vmx crypto Makefile we assign to a variable called TARGET and
pass that to the aesp8-ppc.pl and ghashp8-ppc.pl scripts.

The variable is meant to describe what flavour of powerpc we're
building for, eg. either 32 or 64-bit, and big or little endian.

Unfortunately TARGET is a fairly common name for a make variable, and
if it happens that TARGET is specified as a command line parameter to
make, the value specified on the command line will override our value.

In particular this can happen if the kernel Makefile is driven by an
external Makefile that uses TARGET for something.

This leads to weird build failures, eg:
  nonsense  at /build/linux/drivers/crypto/vmx/ghashp8-ppc.pl line 45.
  /linux/drivers/crypto/vmx/Makefile:20: recipe for target 
'drivers/crypto/vmx/ghashp8-ppc.S' failed

Which shows that we passed an empty value for $(TARGET) to the perl
script, confirmed with make V=1:

  perl /linux/drivers/crypto/vmx/ghashp8-ppc.pl  > 
drivers/crypto/vmx/ghashp8-ppc.S

We can avoid this confusion by using override, to tell make that we
don't want anything to override our variable, even a value specified
on the command line. We can also use a less common name, given the
script calls it "flavour", let's use that.

Signed-off-by: Michael Ellerman 
Signed-off-by: Herbert Xu 
Signed-off-by: Sasha Levin 
---
 drivers/crypto/vmx/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/vmx/Makefile b/drivers/crypto/vmx/Makefile
index de6e241b08666..957377c309a91 100644
--- a/drivers/crypto/vmx/Makefile
+++ b/drivers/crypto/vmx/Makefile
@@ -2,13 +2,13 @@ obj-$(CONFIG_CRYPTO_DEV_VMX_ENCRYPT) += vmx-crypto.o
 vmx-crypto-objs := vmx.o aesp8-ppc.o ghashp8-ppc.o aes.o aes_cbc.o aes_ctr.o 
aes_xts.o ghash.o
 
 ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),y)
-TARGET := linux-ppc64le
+override flavour := linux-ppc64le
 else
-TARGET := linux-ppc64
+override flavour := linux-ppc64
 endif
 
 quiet_cmd_perl = PERL $@
-  cmd_perl = $(PERL) $(<) $(TARGET) > $(@)
+  cmd_perl = $(PERL) $(<) $(flavour) > $(@)
 
 $(src)/aesp8-ppc.S: $(src)/aesp8-ppc.pl
$(call cmd,perl)
-- 
2.20.1



Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-10 Thread Thadeu Lima de Souza Cascardo
On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
> On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  wrote:
> >
> > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > > Aurelien Jarno  writes:
> > > > On powerpc with recent versions of binutils, readelf outputs an extra
> > > > field when dumping the symbols of an object file. For example:
> > > >
> > > > 35: 083896 FUNCLOCAL  DEFAULT [: 8] 
> > > > 1 btf_is_struct
> > > >
> > > > The extra "[: 8]" prevents the GLOBAL_SYM_COUNT variable to
> > > > be computed correctly and causes the checkabi target to fail.
> > > >
> > > > Fix that by looking for the symbol name in the last field instead of the
> > > > 8th one. This way it should also cope with future extra fields.
> > > >
> > > > Signed-off-by: Aurelien Jarno 
> > > > ---
> > > >  tools/lib/bpf/Makefile | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > Thanks for fixing that, it's been on my very long list of test failures
> > > for a while.
> > >
> > > Tested-by: Michael Ellerman 
> >
> > Looks good & also continues to work on x86. Applied, thanks!
> 
> This actually seems to break horribly on PPC64le with binutils 2.33.1
> resulting in:
> Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
> match with num of versioned symbols in libbpf.so (184). Please make
> sure all LIBBPF_API symbols are versioned in libbpf.map.
> 
> This is the only arch that fails, with x86/arm/aarch64/s390 all
> building fine.  Reverting this patch allows successful build across
> all arches.
> 
> Justin

Well, I ended up debugging this same issue and had the same fix as Jarno's when
I noticed his fix was already applied.

I just installed a system with the latest binutils, 2.33.1, and it still breaks
without such a fix. Can you tell me what the output of the following command is on
your system?

readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | sed 
's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $0}' 

Cascardo.


Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-10 Thread Thiago Jung Bauermann


Hello Ram,

Ram Pai  writes:

> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
>   secure guests")
> disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> path for secure VMs, helped enable dma_direct path.  This enabled
> support for bounce-buffering through SWIOTLB.  However it fails to
> operate when IOMMU is enabled, since I/O pages are not TCE mapped.
>
> Renable dma_iommu_ops path for pseries Secure VMs.  It handles all
> cases including, TCE mapping I/O pages, in the presence of a
> IOMMU.
>
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 11 +--
>  1 file changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 67b5009..4e27d66 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -36,7 +36,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>
>  #include "pseries.h"

You still need to keep , otherwise there won't be a
definition of is_secure_guest() when CONFIG_PPC_SVM=n.
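
For reference, the usual shape of that header is a static inline that
collapses to a constant false when the option is off; a rough sketch
(the MSR_S test in the "on" case is an assumption about the real
header, check arch/powerpc/include/asm/svm.h for the details):

/* Sketch: with CONFIG_PPC_SVM=n the header still provides
 * is_secure_guest(), just as a stub returning false, so callers
 * build unchanged. */
#ifdef CONFIG_PPC_SVM
static inline bool is_secure_guest(void)
{
	return mfmsr() & MSR_S;
}
#else
static inline bool is_secure_guest(void)
{
	return false;
}
#endif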

--
Thiago Jung Bauermann
IBM Linux Technology Center


[PATCH AUTOSEL 4.14 125/130] crypto: vmx - Avoid weird build failures

2019-12-10 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 4ee812f6143d78d8ba1399671d78c8d78bf2817c ]

In the vmx crypto Makefile we assign to a variable called TARGET and
pass that to the aesp8-ppc.pl and ghashp8-ppc.pl scripts.

The variable is meant to describe what flavour of powerpc we're
building for, eg. either 32 or 64-bit, and big or little endian.

Unfortunately TARGET is a fairly common name for a make variable, and
if it happens that TARGET is specified as a command line parameter to
make, the value specified on the command line will override our value.

In particular this can happen if the kernel Makefile is driven by an
external Makefile that uses TARGET for something.

This leads to weird build failures, eg:
  nonsense  at /build/linux/drivers/crypto/vmx/ghashp8-ppc.pl line 45.
  /linux/drivers/crypto/vmx/Makefile:20: recipe for target 
'drivers/crypto/vmx/ghashp8-ppc.S' failed

Which shows that we passed an empty value for $(TARGET) to the perl
script, confirmed with make V=1:

  perl /linux/drivers/crypto/vmx/ghashp8-ppc.pl  > 
drivers/crypto/vmx/ghashp8-ppc.S

We can avoid this confusion by using override, to tell make that we
don't want anything to override our variable, even a value specified
on the command line. We can also use a less common name, given the
script calls it "flavour", let's use that.

Signed-off-by: Michael Ellerman 
Signed-off-by: Herbert Xu 
Signed-off-by: Sasha Levin 
---
 drivers/crypto/vmx/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/vmx/Makefile b/drivers/crypto/vmx/Makefile
index cab32cfec9c45..709670d2b553a 100644
--- a/drivers/crypto/vmx/Makefile
+++ b/drivers/crypto/vmx/Makefile
@@ -3,13 +3,13 @@ obj-$(CONFIG_CRYPTO_DEV_VMX_ENCRYPT) += vmx-crypto.o
 vmx-crypto-objs := vmx.o aesp8-ppc.o ghashp8-ppc.o aes.o aes_cbc.o aes_ctr.o 
aes_xts.o ghash.o
 
 ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),y)
-TARGET := linux-ppc64le
+override flavour := linux-ppc64le
 else
-TARGET := linux-ppc64
+override flavour := linux-ppc64
 endif
 
 quiet_cmd_perl = PERL $@
-  cmd_perl = $(PERL) $(<) $(TARGET) > $(@)
+  cmd_perl = $(PERL) $(<) $(flavour) > $(@)
 
 targets += aesp8-ppc.S ghashp8-ppc.S
 
-- 
2.20.1



[PATCH AUTOSEL 4.19 171/177] crypto: vmx - Avoid weird build failures

2019-12-10 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 4ee812f6143d78d8ba1399671d78c8d78bf2817c ]

In the vmx crypto Makefile we assign to a variable called TARGET and
pass that to the aesp8-ppc.pl and ghashp8-ppc.pl scripts.

The variable is meant to describe what flavour of powerpc we're
building for, eg. either 32 or 64-bit, and big or little endian.

Unfortunately TARGET is a fairly common name for a make variable, and
if it happens that TARGET is specified as a command line parameter to
make, the value specified on the command line will override our value.

In particular this can happen if the kernel Makefile is driven by an
external Makefile that uses TARGET for something.

This leads to weird build failures, eg:
  nonsense  at /build/linux/drivers/crypto/vmx/ghashp8-ppc.pl line 45.
  /linux/drivers/crypto/vmx/Makefile:20: recipe for target 
'drivers/crypto/vmx/ghashp8-ppc.S' failed

Which shows that we passed an empty value for $(TARGET) to the perl
script, confirmed with make V=1:

  perl /linux/drivers/crypto/vmx/ghashp8-ppc.pl  > 
drivers/crypto/vmx/ghashp8-ppc.S

We can avoid this confusion by using override, to tell make that we
don't want anything to override our variable, even a value specified
on the command line. We can also use a less common name, given the
script calls it "flavour", let's use that.

Signed-off-by: Michael Ellerman 
Signed-off-by: Herbert Xu 
Signed-off-by: Sasha Levin 
---
 drivers/crypto/vmx/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/vmx/Makefile b/drivers/crypto/vmx/Makefile
index cab32cfec9c45..709670d2b553a 100644
--- a/drivers/crypto/vmx/Makefile
+++ b/drivers/crypto/vmx/Makefile
@@ -3,13 +3,13 @@ obj-$(CONFIG_CRYPTO_DEV_VMX_ENCRYPT) += vmx-crypto.o
 vmx-crypto-objs := vmx.o aesp8-ppc.o ghashp8-ppc.o aes.o aes_cbc.o aes_ctr.o 
aes_xts.o ghash.o
 
 ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),y)
-TARGET := linux-ppc64le
+override flavour := linux-ppc64le
 else
-TARGET := linux-ppc64
+override flavour := linux-ppc64
 endif
 
 quiet_cmd_perl = PERL $@
-  cmd_perl = $(PERL) $(<) $(TARGET) > $(@)
+  cmd_perl = $(PERL) $(<) $(flavour) > $(@)
 
 targets += aesp8-ppc.S ghashp8-ppc.S
 
-- 
2.20.1



[PATCH AUTOSEL 5.4 348/350] ibmvnic: Fix completion structure initialization

2019-12-10 Thread Sasha Levin
From: Thomas Falcon 

[ Upstream commit 070eca955c4af1248cb78a216463ff757a5dc511 ]

Fix multiple calls to init_completion for device completion
structures. Instead, initialize them during device probe and
reinitialize them later as needed.

Signed-off-by: Thomas Falcon 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 0686ded7ad3a2..e1ab2feeae53d 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -176,7 +176,7 @@ static int alloc_long_term_buff(struct ibmvnic_adapter 
*adapter,
ltb->map_id = adapter->map_id;
adapter->map_id++;
 
-   init_completion(&adapter->fw_done);
+   reinit_completion(&adapter->fw_done);
rc = send_request_map(adapter, ltb->addr,
  ltb->size, ltb->map_id);
if (rc) {
@@ -215,7 +215,7 @@ static int reset_long_term_buff(struct ibmvnic_adapter 
*adapter,
 
memset(ltb->buff, 0, ltb->size);
 
-   init_completion(&adapter->fw_done);
+   reinit_completion(&adapter->fw_done);
rc = send_request_map(adapter, ltb->addr, ltb->size, ltb->map_id);
if (rc)
return rc;
@@ -943,7 +943,7 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
if (adapter->vpd->buff)
len = adapter->vpd->len;
 
-   init_completion(&adapter->fw_done);
+   reinit_completion(&adapter->fw_done);
crq.get_vpd_size.first = IBMVNIC_CRQ_CMD;
crq.get_vpd_size.cmd = GET_VPD_SIZE;
rc = ibmvnic_send_crq(adapter, &crq);
@@ -1689,7 +1689,7 @@ static int __ibmvnic_set_mac(struct net_device *netdev, 
u8 *dev_addr)
crq.change_mac_addr.cmd = CHANGE_MAC_ADDR;
ether_addr_copy(&crq.change_mac_addr.mac_addr[0], dev_addr);
 
-   init_completion(&adapter->fw_done);
+   reinit_completion(&adapter->fw_done);
rc = ibmvnic_send_crq(adapter, &crq);
if (rc) {
rc = -EIO;
@@ -2316,7 +2316,7 @@ static int wait_for_reset(struct ibmvnic_adapter *adapter)
adapter->fallback.rx_entries = adapter->req_rx_add_entries_per_subcrq;
adapter->fallback.tx_entries = adapter->req_tx_entries_per_subcrq;
 
-   init_completion(&adapter->reset_done);
+   reinit_completion(&adapter->reset_done);
adapter->wait_for_reset = true;
rc = ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
if (rc)
@@ -2332,7 +2332,7 @@ static int wait_for_reset(struct ibmvnic_adapter *adapter)
adapter->desired.rx_entries = adapter->fallback.rx_entries;
adapter->desired.tx_entries = adapter->fallback.tx_entries;
 
-   init_completion(&adapter->reset_done);
+   reinit_completion(&adapter->reset_done);
adapter->wait_for_reset = true;
rc = ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
if (rc)
@@ -2603,7 +2603,7 @@ static void ibmvnic_get_ethtool_stats(struct net_device 
*dev,
cpu_to_be32(sizeof(struct ibmvnic_statistics));
 
/* Wait for data to be written */
-   init_completion(&adapter->stats_done);
+   reinit_completion(&adapter->stats_done);
rc = ibmvnic_send_crq(adapter, &crq);
if (rc)
return;
@@ -4408,7 +4408,7 @@ static int send_query_phys_parms(struct ibmvnic_adapter 
*adapter)
memset(&crq, 0, sizeof(crq));
crq.query_phys_parms.first = IBMVNIC_CRQ_CMD;
crq.query_phys_parms.cmd = QUERY_PHYS_PARMS;
-   init_completion(&adapter->fw_done);
+   reinit_completion(&adapter->fw_done);
rc = ibmvnic_send_crq(adapter, &crq);
if (rc)
return rc;
@@ -4960,6 +4960,9 @@ static int ibmvnic_probe(struct vio_dev *dev, const 
struct vio_device_id *id)
INIT_LIST_HEAD(&adapter->rwi_list);
spin_lock_init(&adapter->rwi_lock);
init_completion(&adapter->init_done);
+   init_completion(&adapter->fw_done);
+   init_completion(&adapter->reset_done);
+   init_completion(&adapter->stats_done);
clear_bit(0, &adapter->resetting);
 
do {
-- 
2.20.1



[PATCH AUTOSEL 5.4 333/350] crypto: vmx - Avoid weird build failures

2019-12-10 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 4ee812f6143d78d8ba1399671d78c8d78bf2817c ]

In the vmx crypto Makefile we assign to a variable called TARGET and
pass that to the aesp8-ppc.pl and ghashp8-ppc.pl scripts.

The variable is meant to describe what flavour of powerpc we're
building for, eg. either 32 or 64-bit, and big or little endian.

Unfortunately TARGET is a fairly common name for a make variable, and
if it happens that TARGET is specified as a command line parameter to
make, the value specified on the command line will override our value.

In particular this can happen if the kernel Makefile is driven by an
external Makefile that uses TARGET for something.

This leads to weird build failures, eg:
  nonsense  at /build/linux/drivers/crypto/vmx/ghashp8-ppc.pl line 45.
  /linux/drivers/crypto/vmx/Makefile:20: recipe for target 
'drivers/crypto/vmx/ghashp8-ppc.S' failed

Which shows that we passed an empty value for $(TARGET) to the perl
script, confirmed with make V=1:

  perl /linux/drivers/crypto/vmx/ghashp8-ppc.pl  > 
drivers/crypto/vmx/ghashp8-ppc.S

We can avoid this confusion by using override, to tell make that we
don't want anything to override our variable, even a value specified
on the command line. We can also use a less common name, given the
script calls it "flavour", let's use that.

Signed-off-by: Michael Ellerman 
Signed-off-by: Herbert Xu 
Signed-off-by: Sasha Levin 
---
 drivers/crypto/vmx/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/vmx/Makefile b/drivers/crypto/vmx/Makefile
index cab32cfec9c45..709670d2b553a 100644
--- a/drivers/crypto/vmx/Makefile
+++ b/drivers/crypto/vmx/Makefile
@@ -3,13 +3,13 @@ obj-$(CONFIG_CRYPTO_DEV_VMX_ENCRYPT) += vmx-crypto.o
 vmx-crypto-objs := vmx.o aesp8-ppc.o ghashp8-ppc.o aes.o aes_cbc.o aes_ctr.o 
aes_xts.o ghash.o
 
 ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),y)
-TARGET := linux-ppc64le
+override flavour := linux-ppc64le
 else
-TARGET := linux-ppc64
+override flavour := linux-ppc64
 endif
 
 quiet_cmd_perl = PERL $@
-  cmd_perl = $(PERL) $(<) $(TARGET) > $(@)
+  cmd_perl = $(PERL) $(<) $(flavour) > $(@)
 
 targets += aesp8-ppc.S ghashp8-ppc.S
 
-- 
2.20.1



Re: [PATCH v2 1/2] spi: fsl: don't map irq during probe

2019-12-10 Thread Mark Brown
On Tue, Dec 10, 2019 at 03:17:15PM +, Christophe Leroy wrote:
> With lastest kernel, the following warning is observed at startup:

Please do not submit new versions of already applied patches; please
submit incremental updates to the existing code.  Modifying existing
commits creates problems for other users building on top of those
commits, so it's best practice to only change published git commits if
absolutely essential.


signature.asc
Description: PGP signature


Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-10 Thread Justin Forbes
On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  wrote:
>
> On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > Aurelien Jarno  writes:
> > > On powerpc with recent versions of binutils, readelf outputs an extra
> > > field when dumping the symbols of an object file. For example:
> > >
> > > 35: 083896 FUNCLOCAL  DEFAULT [: 8]   
> > >   1 btf_is_struct
> > >
> > > The extra "[: 8]" prevents the GLOBAL_SYM_COUNT variable to
> > > be computed correctly and causes the checkabi target to fail.
> > >
> > > Fix that by looking for the symbol name in the last field instead of the
> > > 8th one. This way it should also cope with future extra fields.
> > >
> > > Signed-off-by: Aurelien Jarno 
> > > ---
> > >  tools/lib/bpf/Makefile | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > Thanks for fixing that, it's been on my very long list of test failures
> > for a while.
> >
> > Tested-by: Michael Ellerman 
>
> Looks good & also continues to work on x86. Applied, thanks!

This actually seems to break horribly on PPC64le with binutils 2.33.1
resulting in:
Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
match with num of versioned symbols in libbpf.so (184). Please make
sure all LIBBPF_API symbols are versioned in libbpf.map.

This is the only arch that fails, with x86/arm/aarch64/s390 all
building fine.  Reverting this patch allows successful build across
all arches.

Justin


RE: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-10 Thread Ram Pai
On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 10/12/2019 16:12, Ram Pai wrote:
> > On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 07/12/2019 12:12, Ram Pai wrote:
> >>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> >>> its parameters.  On secure VMs, hypervisor cannot access the contents of
> >>> this page since it gets encrypted.  Hence share the page with the
> >>> hypervisor, and unshare when done.
> >>
> >>
> >> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
> >> pages. There is small problem that when DDW is enabled,
> >> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complains
> >> about the performance on slack but this is caused by initial cleanup of
> >> the default TCE window (which we do not use anyway) and to battle this
> >> we can simply reduce its size by adding
> > 
> > something that takes hardly any time with H_PUT_TCE_INDIRECT,  takes
> > 13secs per device for H_PUT_TCE approach, during boot. This is with a
> > 30GB guest. With larger guest, the time will further detoriate.
> 
> 
> No it will not, I checked. The time is the same for 2GB and 32GB guests-
> the delay is caused by clearing the small DMA window which is small by
> the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
> (depends on the system) so the number of TCEs is much smaller.

I can't reproduce your results.  What changes did you make to get them?

> 
> 
> > 
> >>
> >> -global
> >> spapr-pci-host-bridge.dma_win_size=0x400
> > 
> > This option, speeds it up tremendously.  But than should this option be
> > enabled in qemu by default?  only for secure VMs? for both VMs?
> 
> 
> As discussed in slack, by default we do not need to clear the entire TCE
> table and we only have to map swiotlb buffer using the small window. It
> is a guest kernel change only. Thanks,

Can you tell me what code you are talking about here?  Where is the TCE
table getting cleared? What code needs to be changed so that it is not cleared?

Is it the code in tce_buildmulti_pSeriesLP(), the one that also does the
clearing?

>
>

But before I close, you have not told me clearly what the problem is
with 'share the page, make the H_PUT_TCE_INDIRECT hcall, unshare the page'.


Remember, this is the same page that is earmarked for doing
H_PUT_TCE_INDIRECT; it is not earmarked by my patch, but already by the
existing code. So it is not some random buffer that is picked. Second,
this page is temporarily shared and unshared; it does not stay shared
for life.  It does not slow the boot, and it does not need any
special command line options for qemu.

Shared-pages technology was put in place exactly for the purpose of
sharing data with the hypervisor, and we are using it exactly for that
purpose.  Finally, I agreed with your concern about shared pages staying
around, and I addressed that concern by unsharing the page.  At this
point, I fail to understand your concern.
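
For readers following along, the flow being defended looks roughly like
this (a sketch around the existing per-cpu TCE scratch page used by
tce_buildmulti_pSeriesLP(); the exact placement and error handling are
assumptions, not the final patch):

	/* Temporarily share the pre-existing scratch page (tcep) with
	 * the hypervisor around the hcall, then unshare it again.
	 * uv_share_page()/uv_unshare_page() take a PFN and a count. */
	if (is_secure_guest())
		uv_share_page(PHYS_PFN(__pa(tcep)), 1);

	rc = plpar_tce_put_indirect((u64)tbl->it_index,
				    (u64)tcenum << 12,
				    (u64)__pa(tcep), limit);

	if (is_secure_guest())
		uv_unshare_page(PHYS_PFN(__pa(tcep)), 1);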


RP



[PATCH v2 2/2] spi: fsl: simplify error path in of_fsl_spi_probe()

2019-12-10 Thread Christophe Leroy
No need to 'goto err;' just to do a return.
Return directly from where the error happens.

Signed-off-by: Christophe Leroy 
---
 drivers/spi/spi-fsl-spi.c | 27 +--
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
index 2d85c81983b1..e991c6ff4e7a 100644
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -725,8 +725,8 @@ static int of_fsl_spi_probe(struct platform_device *ofdev)
struct device_node *np = ofdev->dev.of_node;
struct spi_master *master;
struct resource mem;
-   int irq = 0, type;
-   int ret = -ENOMEM;
+   int irq, type;
+   int ret;
 
ret = of_mpc8xxx_spi_probe(ofdev);
if (ret)
@@ -741,10 +741,8 @@ static int of_fsl_spi_probe(struct platform_device *ofdev)
 
if (spisel_boot) {
pinfo->immr_spi_cs = ioremap(get_immrbase() + 
IMMR_SPI_CS_OFFSET, 4);
-   if (!pinfo->immr_spi_cs) {
-   ret = -ENOMEM;
-   goto err;
-   }
+   if (!pinfo->immr_spi_cs)
+   return -ENOMEM;
}
 #endif
/*
@@ -763,24 +761,17 @@ static int of_fsl_spi_probe(struct platform_device *ofdev)
 
ret = of_address_to_resource(np, 0, &mem);
if (ret)
-   goto err;
+   return ret;
 
irq = of_irq_to_resource(np, 0, NULL);
-   if (irq <= 0) {
-   ret = -EINVAL;
-   goto err;
-   }
+   if (irq <= 0)
+   return -EINVAL;
 
master = fsl_spi_probe(dev, &mem, irq);
-   if (IS_ERR(master)) {
-   ret = PTR_ERR(master);
-   goto err;
-   }
+   if (IS_ERR(master))
+   return PTR_ERR(master);
 
return 0;
-
-err:
-   return ret;
 }
 
 static int of_fsl_spi_remove(struct platform_device *ofdev)
-- 
2.13.3



[PATCH v2 1/2] spi: fsl: don't map irq during probe

2019-12-10 Thread Christophe Leroy
With the latest kernel, the following warning is observed at startup:

[1.500609] [ cut here ]
[1.505225] remove_proc_entry: removing non-empty directory 'irq/22', 
leaking at least 'fsl_spi'
[1.514234] WARNING: CPU: 0 PID: 1 at fs/proc/generic.c:682 
remove_proc_entry+0x198/0x1c0
[1.522403] CPU: 0 PID: 1 Comm: swapper Not tainted 
5.4.0-s3k-dev-02248-g93532430a4ff #2564
[1.530724] NIP:  c0197694 LR: c0197694 CTR: c0050d80
[1.535762] REGS: df4a5af0 TRAP: 0700   Not tainted  
(5.4.0-02248-g93532430a4ff)
[1.543818] MSR:  00029032   CR: 22028222  XER: 
[1.550524]
[1.550524] GPR00: c0197694 df4a5ba8 df4a 0054   
4a38 0010
[1.550524] GPR08: c07c5a30 0800  1032 22000208  
c0004b14 
[1.550524] GPR16:       
c083 c07fc078
[1.550524] GPR24: c08e8ca0 df665d10 df60ea98 c07c9db8 0001 df5d5ae3 
df5d5a80 df43f8e3
[1.585327] NIP [c0197694] remove_proc_entry+0x198/0x1c0
[1.590628] LR [c0197694] remove_proc_entry+0x198/0x1c0
[1.595829] Call Trace:
[1.598280] [df4a5ba8] [c0197694] remove_proc_entry+0x198/0x1c0 (unreliable)
[1.605321] [df4a5bd8] [c0067acc] unregister_irq_proc+0x5c/0x70
[1.611238] [df4a5bf8] [c005fbc4] free_desc+0x3c/0x80
[1.616286] [df4a5c18] [c005fe2c] irq_free_descs+0x70/0xa8
[1.621778] [df4a5c38] [c033d3fc] of_fsl_spi_probe+0xdc/0x3cc
[1.627525] [df4a5c88] [c02f0f64] platform_drv_probe+0x44/0xa4
[1.633350] [df4a5c98] [c02eee44] really_probe+0x1ac/0x418
[1.638829] [df4a5cc8] [c02ed3e8] bus_for_each_drv+0x64/0xb0
[1.644481] [df4a5cf8] [c02ef950] __device_attach+0xd4/0x128
[1.650132] [df4a5d28] [c02ed61c] bus_probe_device+0xa0/0xbc
[1.655783] [df4a5d48] [c02ebbe8] device_add+0x544/0x74c
[1.661096] [df4a5d88] [c0382b78] of_platform_device_create_pdata+0xa4/0x100
[1.668131] [df4a5da8] [c0382cf4] of_platform_bus_create+0x120/0x20c
[1.674474] [df4a5df8] [c0382d50] of_platform_bus_create+0x17c/0x20c
[1.680818] [df4a5e48] [c0382e88] of_platform_bus_probe+0x9c/0xf0
[1.686907] [df4a5e68] [c0751404] 
__machine_initcall_cmpcpro_cmpcpro_declare_of_platform_devices+0x74/0x1a4
[1.696629] [df4a5e98] [c072a4cc] do_one_initcall+0x8c/0x1d4
[1.702282] [df4a5ef8] [c072a768] kernel_init_freeable+0x154/0x204
[1.708455] [df4a5f28] [c0004b2c] kernel_init+0x18/0x110
[1.713769] [df4a5f38] [c00122ac] ret_from_kernel_thread+0x14/0x1c
[1.719926] Instruction dump:
[1.722889] 2c03 4182004c 3863ffb0 3c80c05f 80e3005c 388436a0 3c60c06d 
7fa6eb78
[1.730630] 7fe5fb78 38840280 38634178 4be8c611 <0fe0> 4b6c 3c60c071 
7fe4fb78
[1.738556] ---[ end trace 05d0720bf2e352e2 ]---

The problem comes from the error path which calls
irq_dispose_mapping() while the IRQ has been requested with
devm_request_irq().

The IRQ doesn't need to be mapped with irq_of_parse_and_map(); all that
is needed is the IRQ virtual number. For that, use
of_irq_to_resource() instead of the
irq_of_parse_and_map()/irq_dispose_mapping() pair.

To avoid build failure on SPARC which has a dummy version of
irq_of_parse_and_map() but not of of_irq_to_resource(), get this
driver dependent on OF_IRQ.

Reported-by: kbuild test robot 
[v2: adding dependency on CONFIG_OF_IRQ]
Fixes: 500a32abaf81 ("spi: fsl: Call irq_dispose_mapping in err path")
Cc: sta...@vger.kernel.org

Signed-off-by: Christophe Leroy 
---
 drivers/spi/Kconfig   | 2 +-
 drivers/spi/spi-fsl-spi.c | 5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/spi/Kconfig b/drivers/spi/Kconfig
index 870f7797b56b..fba933f31d40 100644
--- a/drivers/spi/Kconfig
+++ b/drivers/spi/Kconfig
@@ -378,7 +378,7 @@ config SPI_FSL_CPM
 
 config SPI_FSL_SPI
tristate "Freescale SPI controller and Aeroflex Gaisler GRLIB SPI 
controller"
-   depends on OF
+   depends on OF_IRQ
select SPI_FSL_LIB
select SPI_FSL_CPM if FSL_SOC
help
diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
index e2d878397efd..2d85c81983b1 100644
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -765,8 +765,8 @@ static int of_fsl_spi_probe(struct platform_device *ofdev)
if (ret)
goto err;
 
-   irq = irq_of_parse_and_map(np, 0);
-   if (!irq) {
+   irq = of_irq_to_resource(np, 0, NULL);
+   if (irq <= 0) {
ret = -EINVAL;
goto err;
}
@@ -780,7 +780,6 @@ static int of_fsl_spi_probe(struct platform_device *ofdev)
return 0;
 
 err:
-   irq_dispose_mapping(irq);
return ret;
 }
 
-- 
2.13.3



Re: [PATCH v2 2/4] kasan: use MAX_PTRS_PER_* for early shadow

2019-12-10 Thread Christophe Leroy




Le 10/12/2019 à 10:36, Balbir Singh a écrit :



On 10/12/19 3:47 pm, Daniel Axtens wrote:

This helps with powerpc support, and should have no effect on
anything else.

Suggested-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 


If you follow the recommendations by Christophe and I, you don't need this patch


I guess you mean Patch 1 (the one adding the const to all arches) is not 
needed. Of course this one (Patch 2) is needed as it is the one that 
changes kasan.h to use a constant table size instead of an impossible
variable table size.


And that would also fix the problem reported by the kbuild test robot.
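
To make the point concrete, a minimal sketch of what the series does
(names per the series; the fallback #define is an assumption about how
arches with a constant PTRS_PER_PTE would provide it):

/* The early shadow tables are static arrays, so their dimension must
 * be a compile-time constant. On powerpc PTRS_PER_PTE is a variable,
 * hence the per-arch MAX_PTRS_PER_* constants. */
#ifndef MAX_PTRS_PER_PTE
#define MAX_PTRS_PER_PTE PTRS_PER_PTE
#endif

extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];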

Christophe


Re: [PATCH v8 24/26] mm/gup: track FOLL_PIN pages

2019-12-10 Thread Jan Kara
On Mon 09-12-19 14:53:42, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: John Hubbard 

Looks nice, some comments below...

> +/*
> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
> + * flags-dependent amount.
> + *
> + * This has a default assumption of "use FOLL_GET behavior, if FOLL_PIN is 
> not
> + * set".
> + *
> + * "grab" names in this file mean, "look at flags to decide with to use 
> FOLL_PIN
> + * or FOLL_GET behavior, when incrementing the page's refcount.
> + */
> +static struct page *try_grab_compound_head(struct page *page, int refs,
> +unsigned int flags)
> +{
> + if (flags & FOLL_PIN)
> + return try_pin_compound_head(page, refs);
> +
> + return try_get_compound_head(page, refs);
> +}
> +
> +/**
> + * grab_page() - elevate a page's refcount by a flag-dependent amount
> + *
> + * This might not do anything at all, depending on the flags argument.
> + *
> + * "grab" names in this file mean, "look at flags to decide with to use 
> FOLL_PIN
   ^^^ whether

> + * or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * @page:pointer to page to be grabbed
> + * @flags:   gup flags: these are the FOLL_* flag values.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the 
> same
> + * time. (That's true throughout the get_user_pages*() and pin_user_pages*()
> + * APIs.) Cases:
> + *
> + *   FOLL_GET: page's refcount will be incremented by 1.
> + *   FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
> + */
> +void grab_page(struct page *page, unsigned int flags)
> +{
> + if (flags & FOLL_GET)
> + get_page(page);
> + else if (flags & FOLL_PIN) {
> + get_page(page);
> + WARN_ON_ONCE(flags & FOLL_GET);
> + /*
> +  * Use get_page(), above, to do the refcount error
> +  * checking. Then just add in the remaining references:
> +  */
> + page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1);

This is wrong for two reasons:

1) You miss compound_head() indirection from get_page() for this
page_ref_add().

2) page_ref_add() could overflow the counter without noticing.

Especially with GUP_PIN_COUNTING_BIAS being non-trivial, it is realistic
that an attacker might try to overflow the page refcount and we have to
protect the kernel against that. So I think that all the places that would
use grab_page() actually need to use try_grab_page() and then gracefully
deal with the failure.
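
Something along these lines is presumably what is being asked for; a
sketch only, with the overflow check kept deliberately simple (a real
version needs a stricter bound before adding GUP_PIN_COUNTING_BIAS):

static bool try_grab_page(struct page *page, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_page(page);

	if (flags & FOLL_PIN) {
		/* resolve the head page once, like get_page() does */
		page = compound_head(page);

		/* refuse, rather than blindly add, when the refcount
		 * looks unsafe; a real version must also guard against
		 * overflowing by GUP_PIN_COUNTING_BIAS here */
		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
			return false;

		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
	}

	return true;
}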

> @@ -278,11 +425,23 @@ static struct page *follow_page_pte(struct 
> vm_area_struct *vma,
>   goto retry;
>   }
>  
> - if (flags & FOLL_GET) {
> + if (flags & (FOLL_PIN | FOLL_GET)) {
> + /*
> +  * Allow try_get_page() to take care of error handling, for
> +  * both cases: FOLL_GET or FOLL_PIN:
> +  */
>   if (unlikely(!try_get_page(page))) {
>   page = ERR_PTR(-ENOMEM);
>   goto out;
>   }
> +
> + if (flags & FOLL_PIN) {
> + WARN_ON_ONCE(flags & FOLL_GET);
> +
> + /* We got a +1 refcount from try_get_page(), above. */
> + page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1);
> + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
> + }
>   }

The same problem here as above, plus this place should use the same
try_grab..() helper, shouldn't it?

> @@ -544,8 +703,8 @@ static struct page *follow_page_mask(struct 
> vm_area_struct *vma,
>   /* make this handle hugepd */
>   page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
>   if (!IS_ERR(page)) {
> - BUG_ON(flags & FOLL_GET);
> - return page;
> + WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
> +   

[Bug 205183] PPC64: Signal delivery fails with SIGSEGV if between about 1KB and 4KB bytes of stack remain

2019-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205183

Daniel Axtens (d...@axtens.net) changed:

   What|Removed |Added

 CC||d...@axtens.net

--- Comment #2 from Daniel Axtens (d...@axtens.net) ---
Hi, I'm starting to have a look at this for Daniel B.

Looking at the failing fault, I see that it's a fault with the NIP in the
_kernel_, rather than in userspace. Dumping the stack, we see:

[  118.917679] Call Trace:
[  118.917715] [c0007b457820] [c0b71538] dump_stack+0xbc/0x104
(unreliable)
[  118.917719] [c0007b457860] [c006e8f0]
__do_page_fault+0x860/0xf90
[  118.917721] [c0007b457940] [c000af68]
handle_page_fault+0x10/0x30
[  118.917725] --- interrupt: 301 at handle_rt_signal64+0x180/0x13a0
   LR = handle_rt_signal64+0x148/0x13a0
[  118.917726] [c0007b457d30] [c0023d30]
do_notify_resume+0x2e0/0x410
[  118.917728] [c0007b457e20] [c000e4c4]
ret_from_except_lite+0x70/0x74

I'm still debugging, but it looks like handle_rt_signal64 attempts to reserve a
stack frame for the signal, but computes a stack address that sits outside
valid stack space. Then when writing to it, it page faults, and because the NIP
is not a userland one, the kernel refuses to expand the stack.

I'll keep you up to date.

Regards,
Daniel A

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v8 23/26] mm/gup: pass flags arg to __gup_device_* functions

2019-12-10 Thread Jan Kara
On Mon 09-12-19 14:53:41, John Hubbard wrote:
> A subsequent patch requires access to gup flags, so pass the flags
> argument through to the __gup_device_* functions.
> 
> Also placate checkpatch.pl by shortening a nearby line.
> 
> TODO: Christoph Hellwig requested folding this into the patch the uses
> the gup flags arguments.

You should probably implement this TODO? :)

Honza

> 
> Reviewed-by: Jan Kara 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 
> ---
>  mm/gup.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 73aedcefa4bd..687d48506f04 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1957,7 +1957,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   int nr_start = *nr;
>   struct dev_pagemap *pgmap = NULL;
> @@ -1983,13 +1984,14 @@ static int __gup_device_huge(unsigned long pfn, 
> unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -2000,13 +2002,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t 
> *pmdp, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -2017,14 +2020,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t 
> *pudp, unsigned long addr,
>  }
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
>  }
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
> @@ -2136,7 +2141,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>   if (pmd_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> @@ -2157,7 +2163,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, unsigned int flags, struct page **pages, int 
> *nr)
> + unsigned long end, unsigned int flags,
> + struct page **pages, int *nr)
>  {
>   struct page *head, *page;
>   int refs;
> @@ -2168,7 +2175,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>   if (pud_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
> + return __gup_device_huge_pud(orig, pudp, addr, end, flags,
> +  

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-10 Thread Frank Rowand
On 12/10/19 2:17 AM, Frank Rowand wrote:
> On 12/9/19 7:51 PM, Rob Herring wrote:
>> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
>>  wrote:
>>>
>>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
 Is there a memory usage issue for the systems that led to this thread?
>>>
>>> No, no memory issue led to this thread. I was just testing my patch and
>>> I assumed that I did something wrong in the counting/lock drop/lock
>>> acquire/allocate path because the array was hardly used. So I started to
>>> look deeper…
>>> Once I figured out everything was fine, I was curious if everyone is
>>> aware of the different phandle creation by dtc vs POWER. And I posted
>>> the mail in the thread.
>>> Once you confirmed that everything is "known / not an issue" I was ready
>>> to take off [0].
>>>
>>> Later more replies came in such as one mail [1] from Rob describing the
>>> original reason with 814 phandles. _Here_ I was just surprised that 1024
>>> were used over 64 entries for a benefit of 60ms. I understand that this
>>> is low concern for you because that memory is released if modules are
>>> not enabled. I usually see that module support is left enabled.
>>>
>>> However, Rob suggested / asked about the fixed size array (this is how I
>>> understood it):
>>> |And yes, as mentioned earlier I don't like the complexity. I didn't
>>> |from the start and I'm  I'm still of the opinion we should have a
>>> |fixed or 1 time sized true cache (i.e. smaller than total # of
>>> |phandles). That would solve the RT memory allocation and locking issue
>>> |too.
>>>
>>> so I attempted to ask if we should have the fixed size array maybe
>>> with the hash_32() instead the mask. This would make my other patch
>>> obsolete because the fixed size array should not have a RT issue. The
>>> hash_32() part here would address the POWER issue where the cache is
>>> currently not used efficiently.
>>>
>>> If you want instead to keep things as-is then this is okay from my side.
>>> If you want to keep this cache off on POWER then I could contribute a
>>> patch doing so.
>>
>> It turns out there's actually a bug in the current implementation. If
>> we have multiple phandles with the same mask, then we leak node
>> references if we miss in the cache and re-assign the cache entry.
> 
> Aaargh.  Patch sent.
> 
>> Easily fixed I suppose, but holding a ref count for a cached entry
>> seems wrong. That means we never have a ref count of 0 on every node
>> with a phandle.
> 
> It will go to zero when the cache is freed.
> 
> My memory is that we free the cache as part of removing an overlay.  I'll
> verify whether my memory is correct.

And I'll look at non-overlay use of dynamic devicetree too.

-Frank

> 
> -Frank
> 
> 
>>
>> I've done some more experiments with the performance. I've come to the
>> conclusion that just measuring boot time is not a good way at least if
>> the time is not a significant percentage of the total. I couldn't get
>> any measurable data. I'm using a RK3399 Rock960 board. It has 190
>> phandles. I get about 1500 calls to of_find_node_by_phandle() during
>> boot. Note that about the first 300 are before we have any timekeeping
>> (the prior measurements also would not account for this). Those calls
>> have no cache in the current implementation and are cached in my
>> implementation.
>>
>> no cache:  20124 us
>> current cache: 819 us
>>
>> new cache (64 entry): 4342 us
>> new cache (128 entry): 2875 us
>> new cache (256 entry): 1235 us
>>
>> To get some idea on the time spent before timekeeping is up, if we
>> take the avg miss time is ~17us (20124/1200), then we're spending
>> about ~5ms on lookups before the cache is enabled. I'd estimate the
>> new cache takes ~400us before timekeeping is up as there's 11 misses
>> early.
>>
>> From these numbers, it seems the miss rate has a significant impact on
>> performance for the different sizes. But taken from the original 20+
>> ms, they all look like good improvement.
>>
>> Rob
>>
> 
> 
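
For reference, the fixed-size, hash-indexed cache discussed above would
look roughly like this; the size, the slow-path helper name and the
absence of locking/refcounting are illustrative assumptions only:

/* Index by hash_32() rather than (phandle & mask), so the sparse
 * phandle values produced by SLOF on ppc64 still spread across the
 * entries. Node lifetime handling (the refcount issue noted above)
 * and devtree_lock are deliberately omitted from this sketch. */
#define PHANDLE_CACHE_SZ	128	/* assumed size, see the numbers above */

static struct device_node *phandle_cache[PHANDLE_CACHE_SZ];

struct device_node *of_find_node_by_phandle(phandle handle)
{
	unsigned int idx = hash_32(handle, ilog2(PHANDLE_CACHE_SZ));
	struct device_node *np = phandle_cache[idx];

	if (np && np->phandle == handle)
		return of_node_get(np);

	np = of_find_node_by_phandle_slow(handle);	/* hypothetical full scan */
	if (np)
		phandle_cache[idx] = np;	/* the cache holds no reference */

	return np;
}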



Re: [PATCH 5/6] mm, memory_hotplug: Provide argument for the pgprot_t in arch_add_memory()

2019-12-10 Thread David Hildenbrand
On 10.12.19 11:34, Michal Hocko wrote:
> On Tue 10-12-19 11:09:46, David Hildenbrand wrote:
>> On 10.12.19 11:04, Michal Hocko wrote:
>>> On Mon 09-12-19 12:43:40, Dan Williams wrote:
 On Mon, Dec 9, 2019 at 12:24 PM Logan Gunthorpe  
 wrote:
>
>
>
> On 2019-12-09 12:23 p.m., David Hildenbrand wrote:
>> On 09.12.19 20:13, Logan Gunthorpe wrote:
>>> [...]
>>>  #ifdef CONFIG_MEMORY_HOTPLUG
>>> -int arch_add_memory(int nid, u64 start, u64 size,
>>> +int arch_add_memory(int nid, u64 start, u64 size, pgprot_t prot,
>>>  struct mhp_restrictions *restrictions)
>>
>> Can we fiddle that into "struct mhp_restrictions" instead?
>
> Yes, if that's what people want, it's pretty trivial to do. I chose not
> to do it that way because it doesn't get passed down to add_pages() and
> it's not really a "restriction". If I don't hear any objections, I will
> do that for v2.

 +1 to storing this information alongside the altmap in that structure.
 However, I agree struct mhp_restrictions, with the MHP_MEMBLOCK_API
 flag now gone, has lost all of its "restrictions". How about dropping
 the 'flags' property and renaming the struct to 'struct
 mhp_modifiers'?
>>>
>>> Hmm, this email somehow didn't end up in my inbox so I have missed it
>>> before replying.
>>>
>>> Well, mhp_modifiers makes some sense and it would reduce the API
>>> proliferation but how do you expect the prot part to be handled?
>>> I really do not want people to think about PAGE_KERNEL or which
>>> protection to use because my experience tells that this will get copied
>>> without much thinking or simply will break with some odd usecases.
>>> So how exactly this would be used?
>>
>> I was thinking about exactly the same "issue".
>>
>> 1. default initialization via a function
>>
>> memhp_modifier_default_init();
>>
>> 2. a flag that unlocks the prot field (default:0). Without the flag, it
>> is ignored. We can keep the current initialization then.
>>
>> Other ideas?
> 
> 3. a prot mask to apply on top of PAGE_KERNEL? Or would that be
> insufficient/clumsy?
> 

If it works for the given use case, I guess this would be simple and ok.
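
A rough sketch of how options 1 and 2/3 could combine (the names
mhp_modifiers and memhp_modifier_default_init are taken from the
discussion above, not from merged code):

/* Carry the page protection next to the altmap and give callers a
 * default initializer, so nobody has to pick PAGE_KERNEL by hand. */
struct mhp_modifiers {
	struct vmem_altmap *altmap;
	pgprot_t pgprot;
};

static inline void memhp_modifier_default_init(struct mhp_modifiers *m)
{
	m->altmap = NULL;
	m->pgprot = PAGE_KERNEL;
}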

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-10 Thread kbuild test robot
Hi Daniel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20191209]
[also build test ERROR on linus/master v5.5-rc1]
[cannot apply to powerpc/next asm-generic/master v5.4]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Daniel-Axtens/KASAN-for-powerpc64-radix-plus-generic-mm-change/20191210-171342
base:6cf8298daad041cd15dc514d8a4f93ca3636c84e
config: powerpc-allnoconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 7.5.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.5.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/printk.h:7:0,
from include/linux/kernel.h:15,
from arch/powerpc/kernel/prom.c:15:
   arch/powerpc/kernel/prom.c: In function 'early_reserve_mem':
>> include/linux/kern_levels.h:5:18: error: format '%llu' expects argument of 
>> type 'long long unsigned int', but argument 3 has type 'phys_addr_t {aka 
>> unsigned int}' [-Werror=format=]
#define KERN_SOH "\001"  /* ASCII Start Of Header */
 ^
   include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH'
#define KERN_ERR KERN_SOH "3" /* error conditions */
 ^~~~
   include/linux/printk.h:304:9: note: in expansion of macro 'KERN_ERR'
 printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
^~~~
>> arch/powerpc/kernel/prom.c:694:4: note: in expansion of macro 'pr_err'
   pr_err("===\n"
   ^~
   arch/powerpc/kernel/prom.c:697:48: note: format string is defined here
"The actual physical memory detected is %llu MB.\n"
~~~^
%u
   cc1: all warnings being treated as errors
--
   In file included from include/linux/printk.h:7:0,
from include/linux/kernel.h:15,
from arch/powerpc//kernel/prom.c:15:
   arch/powerpc//kernel/prom.c: In function 'early_reserve_mem':
>> include/linux/kern_levels.h:5:18: error: format '%llu' expects argument of 
>> type 'long long unsigned int', but argument 3 has type 'phys_addr_t {aka 
>> unsigned int}' [-Werror=format=]
#define KERN_SOH "\001"  /* ASCII Start Of Header */
 ^
   include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH'
#define KERN_ERR KERN_SOH "3" /* error conditions */
 ^~~~
   include/linux/printk.h:304:9: note: in expansion of macro 'KERN_ERR'
 printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
^~~~
   arch/powerpc//kernel/prom.c:694:4: note: in expansion of macro 'pr_err'
   pr_err("===\n"
   ^~
   arch/powerpc//kernel/prom.c:697:48: note: format string is defined here
"The actual physical memory detected is %llu MB.\n"
~~~^
%u
   cc1: all warnings being treated as errors
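
For context, phys_addr_t is only 32 bits on this configuration, so a
%llu format cannot match it directly; the usual fix is an explicit cast
(or the %pa specifier, which takes a pointer). A sketch with a
hypothetical variable name:

static void report_detected_memory(phys_addr_t top)
{
	/* the cast keeps %llu correct whether phys_addr_t is 32 or 64 bits */
	pr_err("The actual physical memory detected is %llu MB.\n",
	       (unsigned long long)(top >> 20));

	/* alternatively, %pa prints a phys_addr_t (by pointer) in hex */
	pr_err("top of detected physical memory: %pa\n", &top);
}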

vim +/pr_err +694 arch/powerpc/kernel/prom.c

   675  
   676  if (IS_ENABLED(CONFIG_KASAN) && 
IS_ENABLED(CONFIG_PPC_BOOK3S_64)) {
   677  kasan_memory_size =
   678  ((phys_addr_t)CONFIG_PHYS_MEM_SIZE_FOR_KASAN << 
20);
   679  
   680  if (top_phys_addr < kasan_memory_size) {
   681  /*
   682   * We are doomed. Attempts to call e.g. panic() 
are
   683   * likely to fail because they call out into
   684   * instrumented code, which will almost 
certainly
   685   * access memory beyond the end of physical
   686   * memory. Hang here so that at least the NIP 
points
   687   * somewhere that will help you debug it if you 
look at
   688   * it in qemu.
   689   */
   690  while (true)
   691  ;
   692  } else if (top_phys_addr > kasan_memory_size) {
   693  /* print a bg warning in hopes people 
notice */
 > 694  

Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-10 Thread Balbir Singh



On 10/12/19 3:47 pm, Daniel Axtens wrote:
> KASAN support on powerpc64 is challenging:
> 
>  - We want to be able to support inline instrumentation so as to be
>able to catch global and stack issues.
> 
>  - We run some code in real mode after boot, most notably a lot of
>KVM code. We'd like to be able to instrument this.
> 
>[For those not immersed in ppc64, in real mode, the top nibble or
>2 bits (depending on radix/hash mmu) of the address is ignored. The
>linear mapping is placed at 0xc000. This means that a
>pointer to part of the linear mapping will work both in real mode,
>where it will be interpreted as a physical address of the form
>0x000..., and out of real mode, where it will go via the linear
>mapping.]
> 
>  - Inline instrumentation requires a fixed offset.
> 
>  - Because of our running things in real mode, the offset has to
>point to valid memory both in and out of real mode.
> 
> This makes finding somewhere to put the KASAN shadow region challenging.
> 
> One approach is just to give up on inline instrumentation and override
> the address->shadow calculation. This way we can delay all checking
> until after we get everything set up to our satisfaction. However,
> we'd really like to do better.
> 
> What we can do - if we know _at compile time_ how much contiguous
> physical memory we have - is to set aside the top 1/8th of the memory
> and use that. This is a big hammer (hence the "heavyweight" name) and
> comes with 3 big consequences:
> 
>  - kernels will simply fail to boot on machines with less memory than
>specified when compiling.
> 
>  - kernels running on machines with more memory than specified when
>compiling will simply ignore the extra memory.
> 
>  - there's no nice way to handle physically discontiguous memory, so
>you are restricted to the first physical memory block.
> 
> If you can bear all this, you get full support for KASAN.
> 
> Despite the limitations, it can still find bugs,
> e.g. http://patchwork.ozlabs.org/patch/1103775/
> 
> The current implementation is Radix only.
> 
> Massive thanks to mpe, who had the idea for the initial design.
> 
> Signed-off-by: Daniel Axtens 
> 
> ---
> Changes since v1:
>  - Landed kasan vmalloc support upstream
>  - Lots of feedback from Christophe.
> 
> Changes since the rfc:
> 
>  - Boots real and virtual hardware, kvm works.
> 
>  - disabled reporting when we're checking the stack for exception
>frames. The behaviour isn't wrong, just incompatible with KASAN.
> 
>  - Documentation!
> 
>  - Dropped old module stuff in favour of KASAN_VMALLOC.
> 
> The bugs with ftrace and kuap were due to kernel bloat pushing
> prom_init calls to be done via the plt. Because we did not have
> a relocatable kernel, and they are done very early, this caused
> everything to explode. Compile with CONFIG_RELOCATABLE!
> ---
>  Documentation/dev-tools/kasan.rst |   8 +-
>  Documentation/powerpc/kasan.txt   | 102 +-
>  arch/powerpc/Kconfig  |   3 +
>  arch/powerpc/Kconfig.debug|  21 
>  arch/powerpc/Makefile |  11 ++
>  arch/powerpc/include/asm/kasan.h  |  20 +++-
>  arch/powerpc/kernel/process.c |   8 ++
>  arch/powerpc/kernel/prom.c|  59 +-
>  arch/powerpc/mm/kasan/Makefile|   3 +-
>  .../mm/kasan/{kasan_init_32.c => init_32.c}   |   0
>  arch/powerpc/mm/kasan/init_book3s_64.c|  67 
>  11 files changed, 293 insertions(+), 9 deletions(-)
>  rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)
>  create mode 100644 arch/powerpc/mm/kasan/init_book3s_64.c
> 
> diff --git a/Documentation/dev-tools/kasan.rst 
> b/Documentation/dev-tools/kasan.rst
> index 4af2b5d2c9b4..d99dc580bc11 100644
> --- a/Documentation/dev-tools/kasan.rst
> +++ b/Documentation/dev-tools/kasan.rst
> @@ -22,8 +22,9 @@ global variables yet.
>  Tag-based KASAN is only supported in Clang and requires version 7.0.0 or 
> later.
>  
>  Currently generic KASAN is supported for the x86_64, arm64, xtensa and s390
> -architectures. It is also supported on 32-bit powerpc kernels. Tag-based 
> KASAN
> -is supported only on arm64.
> +architectures. It is also supported on powerpc, for 32-bit kernels, and for
> +64-bit kernels running under the Radix MMU. Tag-based KASAN is supported only
> +on arm64.
>  
>  Usage
>  -
> @@ -256,7 +257,8 @@ CONFIG_KASAN_VMALLOC
>  
>  
>  With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
> -cost of greater memory usage. Currently this is only supported on x86.
> +cost of greater memory usage. Currently this is optional on x86, and
> +required on 64-bit powerpc.
>  
>  This works by hooking into vmalloc and vmap, and dynamically
>  allocating real shadow memory to back the mappings.
> diff --git a/Documentation/powerpc/kasan.txt 

Re: [PATCH v2 1/4] mm: define MAX_PTRS_PER_{PTE,PMD,PUD}

2019-12-10 Thread kbuild test robot
Hi Daniel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20191209]
[also build test WARNING on linus/master v5.5-rc1]
[cannot apply to powerpc/next asm-generic/master v5.4]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Daniel-Axtens/KASAN-for-powerpc64-radix-plus-generic-mm-change/20191210-171342
base:6cf8298daad041cd15dc514d8a4f93ca3636c84e
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-1) 7.5.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/pgtable_types.h:359:0,
from arch/x86/include/asm/processor.h:20,
from arch/x86/include/asm/cpufeature.h:5,
from arch/x86/include/asm/thread_info.h:53,
from include/linux/thread_info.h:38,
from arch/x86/include/asm/preempt.h:7,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:51,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from include/linux/crypto.h:19,
from arch/x86/kernel/asm-offsets.c:9:
>> include/asm-generic/pgtable-nopud.h:22:0: warning: "MAX_PTRS_PER_PUD" 
>> redefined
#define MAX_PTRS_PER_PUD 1

   In file included from arch/x86/include/asm/processor.h:20:0,
from arch/x86/include/asm/cpufeature.h:5,
from arch/x86/include/asm/thread_info.h:53,
from include/linux/thread_info.h:38,
from arch/x86/include/asm/preempt.h:7,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:51,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from include/linux/crypto.h:19,
from arch/x86/kernel/asm-offsets.c:9:
   arch/x86/include/asm/pgtable_types.h:261:0: note: this is the location of 
the previous definition
#define MAX_PTRS_PER_PUD PTRS_PER_PUD

   In file included from arch/x86/include/asm/pgtable_types.h:385:0,
from arch/x86/include/asm/processor.h:20,
from arch/x86/include/asm/cpufeature.h:5,
from arch/x86/include/asm/thread_info.h:53,
from include/linux/thread_info.h:38,
from arch/x86/include/asm/preempt.h:7,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:51,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from include/linux/crypto.h:19,
from arch/x86/kernel/asm-offsets.c:9:
>> include/asm-generic/pgtable-nopmd.h:21:0: warning: "MAX_PTRS_PER_PMD" 
>> redefined
#define MAX_PTRS_PER_PMD 1

   In file included from arch/x86/include/asm/processor.h:20:0,
from arch/x86/include/asm/cpufeature.h:5,
from arch/x86/include/asm/thread_info.h:53,
from include/linux/thread_info.h:38,
from arch/x86/include/asm/preempt.h:7,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:51,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from include/linux/crypto.h:19,
from arch/x86/kernel/asm-offsets.c:9:
   arch/x86/include/asm/pgtable_types.h:262:0: note: this is the location of 
the previous definition
#define MAX_PTRS_PER_PMD PTRS_PER_PMD

--
   In file included from arch/x86/include/asm/pgtable_types.h:359:0,
from arch/x86/include/asm/processor.h:20,
from arch/x86/include/asm/cpufeature.h:5,
from arch/x86/include/asm/thread_info.h:53,
from include/linux/thread_info.h:38,
from arch/x86/include/asm/preempt.h:7,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:51,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,

Re: [PATCH 5/6] mm, memory_hotplug: Provide argument for the pgprot_t in arch_add_memory()

2019-12-10 Thread Michal Hocko
On Tue 10-12-19 11:09:46, David Hildenbrand wrote:
> On 10.12.19 11:04, Michal Hocko wrote:
> > On Mon 09-12-19 12:43:40, Dan Williams wrote:
> >> On Mon, Dec 9, 2019 at 12:24 PM Logan Gunthorpe  
> >> wrote:
> >>>
> >>>
> >>>
> >>> On 2019-12-09 12:23 p.m., David Hildenbrand wrote:
>  On 09.12.19 20:13, Logan Gunthorpe wrote:
> > [...]
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> > -int arch_add_memory(int nid, u64 start, u64 size,
> > +int arch_add_memory(int nid, u64 start, u64 size, pgprot_t prot,
> >  struct mhp_restrictions *restrictions)
> 
>  Can we fiddle that into "struct mhp_restrictions" instead?
> >>>
> >>> Yes, if that's what people want, it's pretty trivial to do. I chose not
> >>> to do it that way because it doesn't get passed down to add_pages() and
> >>> it's not really a "restriction". If I don't hear any objections, I will
> >>> do that for v2.
> >>
> >> +1 to storing this information alongside the altmap in that structure.
> >> However, I agree struct mhp_restrictions, with the MHP_MEMBLOCK_API
> >> flag now gone, has lost all of its "restrictions". How about dropping
> >> the 'flags' property and renaming the struct to 'struct
> >> mhp_modifiers'?
> > 
> > Hmm, this email somehow didn't end up in my inbox so I have missed it
> > before replying.
> > 
> > Well, mhp_modifiers makes some sense and it would reduce the API
> > proliferation but how do you expect the prot part to be handled?
> > I really do not want people to think about PAGE_KERNEL or which
> > protection to use because my experience tells that this will get copied
> > without much thinking or simply will break with some odd usecases.
> > So how exactly this would be used?
> 
> I was thinking about exactly the same "issue".
> 
> 1. default initialization via a function
> 
> memhp_modifier_default_init();
> 
> 2. a flag that unlocks the prot field (default:0). Without the flag, it
> is ignored. We can keep the current initialization then.
> 
> Other ideas?

3. a prot mask to apply on top of PAGE_KERNEL? Or would that be
insufficient/clumsy?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v8 08/26] mm/gup: allow FOLL_FORCE for get_user_pages_fast()

2019-12-10 Thread Jan Kara
On Mon 09-12-19 14:53:26, John Hubbard wrote:
> Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
> only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
> This, combined with the fact that get_user_pages_fast() falls back to
> "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
> if you need FOLL_FORCE, you cannot call get_user_pages_fast().
> 
> There does not appear to be any reason for filtering out FOLL_FORCE.
> There is nothing in the _fast() implementation that requires that we
> avoid writing to the pages. So it appears to have been an oversight.
> 
> Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
> 
> Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
> Cc: Christoph Hellwig 
> Reviewed-by: Leon Romanovsky 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index c0c56888e7cc..958ab0757389 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2414,7 +2414,8 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>   unsigned long addr, len, end;
>   int nr = 0, ret = 0;
>  
> - if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> + if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
> +FOLL_FORCE)))
>   return -EINVAL;
>  
>   start = untagged_addr(start) & PAGE_MASK;
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: Found the commit for: 5.3.7 64-bits kernel doesn't boot on G5 Quad [regression]

2019-12-10 Thread John Paul Adrian Glaubitz
Hi!

On 12/10/19 9:35 AM, Romain Dolbeau wrote:
> Le sam. 16 nov. 2019 à 17:34, Romain Dolbeau  a écrit :
>> So it seems to me that 0034d395f89d9c092bb15adbabdca5283e258b41
>> introduced the bug that crashes the PowerMac G5
> 
> There's been some commits in that subsystem, so I tried again; as of
> 6794862a16ef41f753abd75c03a152836e4c8028, the kernel still crashes
> when trying to boot my PowerMac G5.

If Aneesh is currently unable to look at the problem, I would suggest reverting
the commit in question since I don't think it's acceptable that users are unable
to boot their machines anymore after a kernel upgrade.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


Re: [PATCH v8 17/26] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-12-10 Thread Jan Kara
On Mon 09-12-19 16:56:27, Andrew Morton wrote:
> On Mon, 9 Dec 2019 14:53:35 -0800 John Hubbard  wrote:
> 
> > After DMA is complete, and the device and CPU caches are synchronized,
> > it's still required to mark the CPU pages as dirty, if the data was
> > coming from the device. However, this driver was just issuing a
> > bare put_page() call, without any set_page_dirty*() call.
> > 
> > Fix the problem, by calling set_page_dirty_lock() if the CPU pages
> > were potentially receiving data from the device.
> > 
> > Reviewed-by: Christoph Hellwig 
> > Acked-by: Hans Verkuil 
> > Cc: Mauro Carvalho Chehab 
> > Cc: 
> 
> What are the user-visible effects of this change?

Presumably loss of captured video data if page writeback hits at the wrong
moment (i.e., after the page was faulted in but before the video HW has
stored data in it) and the page then gets evicted from the page cache.

Honza

-- 
Jan Kara 
SUSE Labs, CR
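
[Editor's note: a sketch of the release pattern the fix establishes; the helper
and its arguments are hypothetical, not the driver's actual code.]

static void release_dma_pages(struct page **pages, int npages,
			      bool dma_from_device)
{
	int i;

	for (i = 0; i < npages; i++) {
		/* Pages that may have received data from the device must be
		 * marked dirty before release, otherwise a racing writeback
		 * plus eviction throws the captured data away.
		 */
		if (dma_from_device)
			set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}
}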


Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops)

2019-12-10 Thread Peter Zijlstra
On Tue, Dec 10, 2019 at 04:38:54PM +1100, Michael Ellerman wrote:

> Good question, I'll have a look.
> 
> There seems to be confusion about what the type of the bit number is,
> which is leading to sign extension in some cases and not others.

Shiny.

> It looks like the type should be unsigned long?

I'm thinking unsigned makes the most sense. I mean, negative bit offsets
should 'work', but that's almost always guaranteed to be an out-of-bounds
operation.

As to 'long' vs 'int', I'm not sure; 4G bits is a long bitmap. But I
suppose since the bitmap itself is 'unsigned long', we might as well use
'unsigned long' for the bitnr too.

>   Documentation/core-api/atomic_ops.rst:  void __clear_bit_unlock(unsigned 
> long nr, unsigned long *addr);
>   arch/mips/include/asm/bitops.h:static inline void 
> __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
>   arch/powerpc/include/asm/bitops.h:static inline void 
> arch___clear_bit_unlock(int nr, volatile unsigned long *addr)
>   arch/riscv/include/asm/bitops.h:static inline void 
> __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
>   arch/s390/include/asm/bitops.h:static inline void 
> arch___clear_bit_unlock(unsigned long nr,
>   include/asm-generic/bitops/instrumented-lock.h:static inline void 
> __clear_bit_unlock(long nr, volatile unsigned long *addr)
>   include/asm-generic/bitops/lock.h:static inline void 
> __clear_bit_unlock(unsigned int nr,
> 
> So I guess step one is to convert our versions to use unsigned long, so
> we're at least not tripping over that difference when comparing the
> assembly.

Yeah, I'll look at fixing the generic code, bitops/atomic.h and
bitops/non-atomic.h don't even agree on the type of bitnr.
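
[Editor's note: an illustration of the signedness hazard under discussion;
these helpers are hypothetical, not the kernel's implementations.]

static inline void set_bit_signed(int nr, volatile unsigned long *addr)
{
	/* A negative nr gives a negative word index (an access below the
	 * bitmap) and a negative shift count; the mixed int/long prototypes
	 * across headers also make the compiler emit sign-extension that
	 * shows up when diffing the generated assembly.
	 */
	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static inline void set_bit_unsigned(unsigned long nr, volatile unsigned long *addr)
{
	/* The index can still run past the end of the bitmap, but it is
	 * never negative and no sign-extension is needed.
	 */
	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}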


Re: [PATCH 5/6] mm, memory_hotplug: Provide argument for the pgprot_t in arch_add_memory()

2019-12-10 Thread David Hildenbrand
On 10.12.19 11:04, Michal Hocko wrote:
> On Mon 09-12-19 12:43:40, Dan Williams wrote:
>> On Mon, Dec 9, 2019 at 12:24 PM Logan Gunthorpe  wrote:
>>>
>>>
>>>
>>> On 2019-12-09 12:23 p.m., David Hildenbrand wrote:
 On 09.12.19 20:13, Logan Gunthorpe wrote:
> [...]
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size,
> +int arch_add_memory(int nid, u64 start, u64 size, pgprot_t prot,
>  struct mhp_restrictions *restrictions)

 Can we fiddle that into "struct mhp_restrictions" instead?
>>>
>>> Yes, if that's what people want, it's pretty trivial to do. I chose not
>>> to do it that way because it doesn't get passed down to add_pages() and
>>> it's not really a "restriction". If I don't hear any objections, I will
>>> do that for v2.
>>
>> +1 to storing this information alongside the altmap in that structure.
>> However, I agree struct mhp_restrictions, with the MHP_MEMBLOCK_API
>> flag now gone, has lost all of its "restrictions". How about dropping
>> the 'flags' property and renaming the struct to 'struct
>> mhp_modifiers'?
> 
> Hmm, this email somehow didn't end up in my inbox so I have missed it
> before replying.
> 
> Well, mhp_modifiers makes some sense and it would reduce the API
> proliferation but how do you expect the prot part to be handled?
> I really do not want people to think about PAGE_KERNEL or which
> protection to use because my experience tells that this will get copied
> without much thinking or simply will break with some odd usecases.
> So how exactly this would be used?

I was thinking about exactly the same "issue".

1. default initialization via a function

memhp_modifier_default_init();

2. a flag that unlocks the prot field (default:0). Without the flag, it
is ignored. We can keep the current initialization then.

Other ideas?

-- 
Thanks,

David / dhildenb



Re: [PATCH 5/6] mm, memory_hotplug: Provide argument for the pgprot_t in arch_add_memory()

2019-12-10 Thread Michal Hocko
On Mon 09-12-19 12:43:40, Dan Williams wrote:
> On Mon, Dec 9, 2019 at 12:24 PM Logan Gunthorpe  wrote:
> >
> >
> >
> > On 2019-12-09 12:23 p.m., David Hildenbrand wrote:
> > > On 09.12.19 20:13, Logan Gunthorpe wrote:
[...]
> > >>  #ifdef CONFIG_MEMORY_HOTPLUG
> > >> -int arch_add_memory(int nid, u64 start, u64 size,
> > >> +int arch_add_memory(int nid, u64 start, u64 size, pgprot_t prot,
> > >>  struct mhp_restrictions *restrictions)
> > >
> > > Can we fiddle that into "struct mhp_restrictions" instead?
> >
> > Yes, if that's what people want, it's pretty trivial to do. I chose not
> > to do it that way because it doesn't get passed down to add_pages() and
> > it's not really a "restriction". If I don't hear any objections, I will
> > do that for v2.
> 
> +1 to storing this information alongside the altmap in that structure.
> However, I agree struct mhp_restrictions, with the MHP_MEMBLOCK_API
> flag now gone, has lost all of its "restrictions". How about dropping
> the 'flags' property and renaming the struct to 'struct
> mhp_modifiers'?

Hmm, this email somehow didn't end up in my inbox so I have missed it
before replying.

Well, mhp_modifiers makes some sense and it would reduce the API
proliferation but how do you expect the prot part to be handled?
I really do not want people to think about PAGE_KERNEL or which
protection to use because my experience tells that this will get copied
without much thinking or simply will break with some odd usecases.
So how exactly this would be used?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 5/6] mm, memory_hotplug: Provide argument for the pgprot_t in arch_add_memory()

2019-12-10 Thread Michal Hocko
On Mon 09-12-19 14:24:22, Logan Gunthorpe wrote:
> 
> 
> On 2019-12-09 1:41 p.m., Michal Hocko wrote:
> > On Mon 09-12-19 13:24:19, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-12-09 12:23 p.m., David Hildenbrand wrote:
> >>> On 09.12.19 20:13, Logan Gunthorpe wrote:
>  devm_memremap_pages() is currently used by the PCI P2PDMA code to create
>  struct page mappings for IO memory. At present, these mappings are 
>  created
>  with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
>  x86, an mtrr register will typically override this and force the cache
>  type to be UC-. In the case firmware doesn't set this register it is
>  effectively WB and will typically result in a machine check exception
>  when it's accessed.
> 
>  Other arches are not currently likely to function correctly seeing they
>  don't have any MTRR registers to fall back on.
> 
>  To solve this, add an argument to arch_add_memory() to explicitly
>  set the pgprot value to a specific value.
> 
>  Of the arches that support MEMORY_HOTPLUG: x86_64, s390 and arm64 is a
>  simple change to pass the pgprot_t down to their respective functions
>  which set up the page tables. For x86_32, set the page tables explicitly
>  using _set_memory_prot() (seeing they are already mapped). For sh, reject
>  anything but PAGE_KERNEL settings -- this should be fine, for now, seeing
>  sh doesn't support ZONE_DEVICE anyway.
> 
>  Cc: Dan Williams 
>  Cc: David Hildenbrand 
>  Cc: Michal Hocko 
>  Signed-off-by: Logan Gunthorpe 
>  ---
>   arch/arm64/mm/mmu.c| 4 ++--
>   arch/ia64/mm/init.c| 5 -
>   arch/powerpc/mm/mem.c  | 4 ++--
>   arch/s390/mm/init.c| 4 ++--
>   arch/sh/mm/init.c  | 5 -
>   arch/x86/mm/init_32.c  | 7 ++-
>   arch/x86/mm/init_64.c  | 4 ++--
>   include/linux/memory_hotplug.h | 2 +-
>   mm/memory_hotplug.c| 2 +-
>   mm/memremap.c  | 2 +-
>   10 files changed, 25 insertions(+), 14 deletions(-)
> 
>  diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>  index 60c929f3683b..48b65272df15 100644
>  --- a/arch/arm64/mm/mmu.c
>  +++ b/arch/arm64/mm/mmu.c
>  @@ -1050,7 +1050,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long 
>  addr)
>   }
>   
>   #ifdef CONFIG_MEMORY_HOTPLUG
>  -int arch_add_memory(int nid, u64 start, u64 size,
>  +int arch_add_memory(int nid, u64 start, u64 size, pgprot_t prot,
>   struct mhp_restrictions *restrictions)
> >>>
> >>> Can we fiddle that into "struct mhp_restrictions" instead?
> >>
> >> Yes, if that's what people want, it's pretty trivial to do. I chose not
> >> to do it that way because it doesn't get passed down to add_pages() and
> >> it's not really a "restriction". If I don't hear any objections, I will
> >> do that for v2.
> > 
> > I do agree that restriction is not the best fit. But I consider prot
> > argument to complicate the API to all users even though it is not really
> > clear whether we are going to have many users really benefiting from it.
> > Look at the vmalloc API and try to find how many users of __vmalloc do
> > not use PAGE_KERNEL.
> > 
> > So I can see two options. One of them is to add arch_add_memory_prot
> > that would allow to have give and extra prot argument or simply call
> > an arch independent API to change the protection after arch_add_memory.
> > The later sounds like much less code. The memory shouldn't be in use by
> > anybody at that stage yet AFAIU. Maybe there even is an API like that.
> 
> Yes, well, we tried something like this by calling set_memory_wc()
> inside memremap_pages(), but on large BARs (tens of GB) it was too slow
> (taking several seconds to complete) and on some hosts actually hit CPU
> watchdog errors.

Which looks like something to fix independently.

> So at the very least we'd have to add some cpu_relax() calls to that
> path. And it's also the case that set_memory_wc() is x86 only right now.
> So we'd have to create a new general interface to walk and fixup page
> tables for all arches.
> 
> But, in my opinion, setting up all those page tables twice is too large
> of an overhead and it's better to just add them correctly the first
> time. The changes I propose to do this aren't really a lot of code and
> probably less than creating a new interface for all arches.

OK, fair enough. Then I would suggest going with arch_add_memory_prot
then unless there is a wider disagreement with that.
-- 
Michal Hocko
SUSE Labs
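
[Editor's note: one possible shape for the arch_add_memory_prot() direction
suggested above; hypothetical, not from the posted series.]

int arch_add_memory_prot(int nid, u64 start, u64 size, pgprot_t prot,
			 struct mhp_restrictions *restrictions);

/* Ordinary hotplug keeps the existing signature and PAGE_KERNEL, while
 * memremap_pages()/P2PDMA can pass an explicit pgprot_t without touching
 * every existing caller.
 */
static inline int arch_add_memory(int nid, u64 start, u64 size,
				  struct mhp_restrictions *restrictions)
{
	return arch_add_memory_prot(nid, start, size, PAGE_KERNEL,
				    restrictions);
}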


Re: [PATCH v2 2/4] kasan: use MAX_PTRS_PER_* for early shadow

2019-12-10 Thread Balbir Singh



On 10/12/19 3:47 pm, Daniel Axtens wrote:
> This helps with powerpc support, and should have no effect on
> anything else.
> 
> Suggested-by: Christophe Leroy 
> Signed-off-by: Daniel Axtens 

If you follow the recommendations from Christophe and me, you don't need this patch.

Balbir Singh.

> ---
>  include/linux/kasan.h | 6 +++---
>  mm/kasan/init.c   | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index e18fe54969e9..d2f2a4ffcb12 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -15,9 +15,9 @@ struct task_struct;
>  #include 
>  
>  extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
> -extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
> -extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
> -extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
> +extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];
> +extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
> +extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
>  extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];
>  
>  int kasan_populate_early_shadow(const void *shadow_start,
> diff --git a/mm/kasan/init.c b/mm/kasan/init.c
> index ce45c491ebcd..8b54a96d3b3e 100644
> --- a/mm/kasan/init.c
> +++ b/mm/kasan/init.c
> @@ -46,7 +46,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
>  }
>  #endif
>  #if CONFIG_PGTABLE_LEVELS > 3
> -pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
> +pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
>  static inline bool kasan_pud_table(p4d_t p4d)
>  {
>   return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
> @@ -58,7 +58,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
>  }
>  #endif
>  #if CONFIG_PGTABLE_LEVELS > 2
> -pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
> +pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
>  static inline bool kasan_pmd_table(pud_t pud)
>  {
>   return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
> @@ -69,7 +69,7 @@ static inline bool kasan_pmd_table(pud_t pud)
>   return false;
>  }
>  #endif
> -pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
> +pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;
>  
>  static inline bool kasan_pte_table(pmd_t pmd)
>  {
> 


Re: [PATCH v2 1/4] mm: define MAX_PTRS_PER_{PTE,PMD,PUD}

2019-12-10 Thread Balbir Singh



On 10/12/19 3:47 pm, Daniel Axtens wrote:
> powerpc has boot-time configurable PTRS_PER_PTE, PMD and PUD. The
> values are selected based on the MMU under which the kernel is
> booted. This is much like how 4 vs 5-level paging on x86_64 leads to
> boot-time configurable PTRS_PER_P4D.
> 
> So far, this hasn't leaked out of arch/powerpc. But with KASAN, we
> have static arrays based on PTRS_PER_*, so for powerpc support must
> provide constant upper bounds for generic code.
> 
> Define MAX_PTRS_PER_{PTE,PMD,PUD} for this purpose.
> 
> I have configured these constants:
>  - in asm-generic headers
>  - on arches that implement KASAN: x86, s390, arm64, xtensa and powerpc
> 
> I haven't wired up any other arches just yet - there is no user of
> the constants outside of the KASAN code I add in the next patch, so
> missing the constants on arches that don't support KASAN shouldn't
> break anything.

I would suggest limiting this to powerpc for now and using

#ifndef MAX_PTRS_PER_PUD
#define MAX_PTRS_PER_PUDPTRS_PER_PUD
#endif

style code in KASAN. That keeps the change surface small until other
arches see value in migrating to it.
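
[Editor's note: a sketch of how that fallback would read in the generic KASAN
code; illustrative only, not a posted patch.]

/* Only architectures with boot-time variable page-table geometry (powerpc
 * book3s64 in this series) need to define MAX_PTRS_PER_*; everyone else
 * keeps the usual PTRS_PER_* sizing.
 */
#ifndef MAX_PTRS_PER_PTE
#define MAX_PTRS_PER_PTE PTRS_PER_PTE
#endif
#ifndef MAX_PTRS_PER_PMD
#define MAX_PTRS_PER_PMD PTRS_PER_PMD
#endif
#ifndef MAX_PTRS_PER_PUD
#define MAX_PTRS_PER_PUD PTRS_PER_PUD
#endif

/* The early shadow tables are then sized for the worst case: */
pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;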

> 
> Suggested-by: Christophe Leroy 
> Signed-off-by: Daniel Axtens 
> ---
>  arch/arm64/include/asm/pgtable-hwdef.h   | 3 +++
>  arch/powerpc/include/asm/book3s/64/hash.h| 4 
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +++
>  arch/powerpc/include/asm/book3s/64/radix.h   | 5 +
>  arch/s390/include/asm/pgtable.h  | 3 +++
>  arch/x86/include/asm/pgtable_types.h | 5 +
>  arch/xtensa/include/asm/pgtable.h| 1 +
>  include/asm-generic/pgtable-nop4d-hack.h | 9 +
>  include/asm-generic/pgtable-nopmd.h  | 9 +
>  include/asm-generic/pgtable-nopud.h  | 9 +
>  10 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index d9fbd433cc17..485e1f3c5c6f 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -41,6 +41,7 @@
>  #define ARM64_HW_PGTABLE_LEVEL_SHIFT(n)  ((PAGE_SHIFT - 3) * (4 - (n)) + 
> 3)
>  
>  #define PTRS_PER_PTE (1 << (PAGE_SHIFT - 3))
> +#define MAX_PTRS_PER_PTE PTRS_PER_PTE
>  
>  /*
>   * PMD_SHIFT determines the size a level 2 page table entry can map.
> @@ -50,6 +51,7 @@
>  #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
>  #define PMD_MASK (~(PMD_SIZE-1))
>  #define PTRS_PER_PMD PTRS_PER_PTE
> +#define MAX_PTRS_PER_PMD PTRS_PER_PMD
>  #endif
>  
>  /*
> @@ -60,6 +62,7 @@
>  #define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
>  #define PUD_MASK (~(PUD_SIZE-1))
>  #define PTRS_PER_PUD PTRS_PER_PTE
> +#define MAX_PTRS_PER_PUD PTRS_PER_PUD
>  #endif
>  
>  /*
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
> b/arch/powerpc/include/asm/book3s/64/hash.h
> index 2781ebf6add4..fce329b8452e 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -18,6 +18,10 @@
>  #include 
>  #endif
>  
> +#define H_PTRS_PER_PTE   (1 << H_PTE_INDEX_SIZE)
> +#define H_PTRS_PER_PMD   (1 << H_PMD_INDEX_SIZE)
> +#define H_PTRS_PER_PUD   (1 << H_PUD_INDEX_SIZE)
> +
>  /* Bits to set in a PMD/PUD/PGD entry valid bit*/
>  #define HASH_PMD_VAL_BITS(0x8000UL)
>  #define HASH_PUD_VAL_BITS(0x8000UL)
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index b01624e5c467..209817235a44 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -231,6 +231,13 @@ extern unsigned long __pmd_frag_size_shift;
>  #define PTRS_PER_PUD (1 << PUD_INDEX_SIZE)
>  #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE)
>  
> +#define MAX_PTRS_PER_PTE ((H_PTRS_PER_PTE > R_PTRS_PER_PTE) ? \
> +   H_PTRS_PER_PTE : R_PTRS_PER_PTE)
> +#define MAX_PTRS_PER_PMD ((H_PTRS_PER_PMD > R_PTRS_PER_PMD) ? \
> +   H_PTRS_PER_PMD : R_PTRS_PER_PMD)
> +#define MAX_PTRS_PER_PUD ((H_PTRS_PER_PUD > R_PTRS_PER_PUD) ? \
> +   H_PTRS_PER_PUD : R_PTRS_PER_PUD)
> +

How about reusing max()?

#define MAX_PTRS_PER_PTE  max(H_PTRS_PER_PTE, R_PTRS_PER_PTE)
#define MAX_PTRS_PER_PMD  max(H_PTRS_PER_PMD, R_PTRS_PER_PMD)
#define MAX_PTRS_PER_PUD  max(H_PTRS_PER_PUD, R_PTRS_PER_PUD)

Balbir Singh.



Re: Found the commit for: 5.3.7 64-bits kernel doesn't boot on G5 Quad [regression]

2019-12-10 Thread Romain Dolbeau
Hello,

On Sat, Nov 16, 2019 at 17:34, Romain Dolbeau wrote:
> So it seems to me that 0034d395f89d9c092bb15adbabdca5283e258b41
> introduced the bug that crashes the PowerMac G5

There's been some commits in that subsystem, so I tried again; as of
6794862a16ef41f753abd75c03a152836e4c8028, the kernel still crashes
when trying to boot my PowerMac G5.

Cordially,

-- 
Romain Dolbeau


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-10 Thread Frank Rowand
On 12/9/19 7:51 PM, Rob Herring wrote:
> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
>  wrote:
>>
>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
>>> Is there a memory usage issue for the systems that led to this thread?
>>
>> No, no memory issue led to this thread. I was just testing my patch and
>> I assumed that I did something wrong in the counting/lock drop/lock
>> acquire/allocate path because the array was hardly used. So I started to
>> look deeper…
>> Once I figured out everything was fine, I was curious if everyone is
>> aware of the different phandle creation by dtc vs POWER. And I posted
>> the mail in the thread.
>> Once you confirmed that everything is "known / not an issue" I was ready
>> to take off [0].
>>
>> Later more replies came in such as one mail [1] from Rob describing the
>> original reason with 814 phandles. _Here_ I was just surprised that 1024
>> entries were used rather than 64 for a benefit of 60ms. I understand that
>> this is of low concern to you because that memory is released if modules are
>> not enabled. I usually see that module support is left enabled.
>>
>> However, Rob suggested / asked about the fixed size array (this is how I
>> understood it):
>> |And yes, as mentioned earlier I don't like the complexity. I didn't
>> |from the start and I'm still of the opinion we should have a
>> |fixed or 1 time sized true cache (i.e. smaller than total # of
>> |phandles). That would solve the RT memory allocation and locking issue
>> |too.
>>
>> so I attempted to ask if we should have the fixed size array maybe
>> with the hash_32() instead of the mask. This would make my other patch
>> obsolete because the fixed size array should not have a RT issue. The
>> hash_32() part here would address the POWER issue where the cache is
>> currently not used efficiently.
>>
>> If you want instead to keep things as-is then this is okay from my side.
>> If you want to keep this cache off on POWER then I could contribute a
>> patch doing so.
> 
> It turns out there's actually a bug in the current implementation. If
> we have multiple phandles with the same mask, then we leak node
> references if we miss in the cache and re-assign the cache entry.

Aaargh.  Patch sent.

> Easily fixed I suppose, but holding a ref count for a cached entry
> seems wrong. That means we never get a ref count of 0 on any node
> with a phandle.

It will go to zero when the cache is freed.

My memory is that we free the cache as part of removing an overlay.  I'll
verify whether my memory is correct.

-Frank


> 
> I've done some more experiments with the performance. I've come to the
> conclusion that just measuring boot time is not a good approach, at least
> if the time is not a significant percentage of the total. I couldn't get
> any measurable data. I'm using a RK3399 Rock960 board. It has 190
> phandles. I get about 1500 calls to of_find_node_by_phandle() during
> boot. Note that about the first 300 are before we have any timekeeping
> (the prior measurements also would not account for this). Those calls
> have no cache in the current implementation and are cached in my
> implementation.
> 
> no cache:  20124 us
> current cache: 819 us
> 
> new cache (64 entry): 4342 us
> new cache (128 entry): 2875 us
> new cache (256 entry): 1235 us
> 
> To get some idea on the time spent before timekeeping is up, if we
> take the avg miss time is ~17us (20124/1200), then we're spending
> about ~5ms on lookups before the cache is enabled. I'd estimate the
> new cache takes ~400us before timekeeping is up as there's 11 misses
> early.
> 
> From these numbers, it seems the miss rate has a significant impact on
> performance for the different sizes. But taken from the original 20+
> ms, they all look like good improvement.
> 
> Rob
>
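
[Editor's note: a rough sketch of the fixed-size, hash_32()-based cache being
discussed; hypothetical code, not the in-tree implementation. Assumes
linux/of.h and linux/hash.h; devtree_lock and invalidation on node removal
are omitted, and for_each_of_allnodes() is the internal iterator from
drivers/of/of_private.h.]

#define OF_PHANDLE_CACHE_BITS	7	/* 128 entries, the middle row above */

static struct device_node *phandle_cache[1 << OF_PHANDLE_CACHE_BITS];

struct device_node *of_find_node_by_phandle(phandle handle)
{
	struct device_node *np;
	unsigned int idx = hash_32(handle, OF_PHANDLE_CACHE_BITS);

	if (!handle)
		return NULL;

	np = phandle_cache[idx];
	if (!np || np->phandle != handle) {
		/* Slow path: walk every node and cache the hit. The cache
		 * holds no reference, which is why removed nodes must be
		 * purged from the table elsewhere.
		 */
		for_each_of_allnodes(np)
			if (np->phandle == handle) {
				phandle_cache[idx] = np;
				break;
			}
	}
	return of_node_get(np);		/* of_node_get(NULL) returns NULL */
}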