Re: [PATCH v3 00/26] compat_ioctl: cleanups

2019-04-16 Thread Douglas Gilbert

On 2019-04-16 4:19 p.m., Arnd Bergmann wrote:

Hi Al,

It took me way longer than I had hoped to revisit this series, see
https://lore.kernel.org/lkml/20180912150142.157913-1-a...@arndb.de/
for the previously posted version.

I've come to the point where all conversion handlers and most
COMPATIBLE_IOCTL() entries are gone from this file, but for
now, this series only has the parts that have either been reviewed
previously, or that are simple enough to include.

The main missing piece is the SG_IO/SG_GET_REQUEST_TABLE conversion.
I'll post the patches I made for that later, as they need more
testing and review from the scsi maintainers.


Perhaps you could look at the document at this URL:
http://sg.danny.cz/sg/sg_v40.html

It is work-in-progress to modernize the SCSI generic driver. It
extends ioctl(sg_fd, SG_IO, _obj) to additionally accept the sg v4
interface as defined in include/uapi/linux/bsg.h . Currently only the
bsg driver uses the sg v4 interface. Since struct sg_io_v4 is all
explicitly sized integers, I'm guessing it is immune to "compat" problems.
[I can see no reference to bsg nor struct sg_io_v4 in the current
fs/compat_ioctl.c file.]

Other additions described in that document are these new ioctls:
  - SG_IOSUBMIT           ultimately to replace write(sg_fd, ...)
  - SG_IORECEIVE          to replace read(sg_fd, ...)
  - SG_IOABORT            abort SCSI cmd in progress; new functionality
  - SG_SET_GET_EXTENDED   has associated struct sg_extended_info

The first three take a pointer to a struct sg_io_hdr (v3 interface) or
a struct sg_io_v4 object. Both objects start with a 32-bit integer:
'S' identifies the v3 interface while 'Q' identifies the v4 interface.
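
As an aside, a minimal sketch of that dispatch (not taken from the sg_v40
patches themselves; sg_submit_v3()/sg_submit_v4() are hypothetical helpers)
could peek at the leading 32-bit value, which is sg_io_hdr.interface_id for
the v3 object and sg_io_v4.guard for the v4 object:

    #include <linux/errno.h>
    #include <linux/uaccess.h>

    int sg_submit_v3(void __user *p);       /* hypothetical v3 path */
    int sg_submit_v4(void __user *p);       /* hypothetical v4 path */

    static int sg_submit_common(void __user *p)
    {
            int guard;      /* first 32 bits of both sg_io_hdr and sg_io_v4 */

            if (copy_from_user(&guard, p, sizeof(guard)))
                    return -EFAULT;
            if (guard == 'S')               /* struct sg_io_hdr, v3 interface */
                    return sg_submit_v3(p);
            if (guard == 'Q')               /* struct sg_io_v4, v4 interface */
                    return sg_submit_v4(p);
            return -EPERM;
    }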

The SG_SET_GET_EXTENDED ioctl takes a pointer to a struct
sg_extended_info object which contains explicitly sized integers so it
may also be immune from "compat" problems. The ioctls section (13) of
that document referenced above has a table showing how many "sets and
gets" are hiding in the SG_SET_GET_EXTENDED ioctl.

BTW No change is proposed for this case:
ioctl(normal_block_device, SG_IO, _v3_obj)
which is handled by block/scsi_ioctl.c


This would be a good time for me to address any "compat" concerns in the
proposed sg driver update.

Doug Gilbert



I hope you can still take these for the coming merge window, unless
new problems come up.

   Arnd

Arnd Bergmann (26):
   compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
   compat_ioctl: move simple ppp command handling into driver
   compat_ioctl: avoid unused function warning for do_ioctl
   compat_ioctl: move PPPIOCSCOMPRESS32 to ppp-generic.c
   compat_ioctl: move PPPIOCSPASS32/PPPIOCSACTIVE32 to ppp_generic.c
   compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
   compat_ioctl: move rtc handling into rtc-dev.c
   compat_ioctl: add compat_ptr_ioctl()
   compat_ioctl: move drivers to compat_ptr_ioctl
   compat_ioctl: use correct compat_ptr() translation in drivers
   ceph: fix compat_ioctl for ceph_dir_operations
   compat_ioctl: move more drivers to compat_ptr_ioctl
   compat_ioctl: move tape handling into drivers
   compat_ioctl: move ATYFB_CLK handling to atyfb driver
   compat_ioctl: move isdn/capi ioctl translation into driver
   compat_ioctl: move rfcomm handlers into driver
   compat_ioctl: move hci_sock handlers into driver
   compat_ioctl: remove HCIUART handling
   compat_ioctl: remove HIDIO translation
   compat_ioctl: remove translation for sound ioctls
   compat_ioctl: remove IGNORE_IOCTL()
   compat_ioctl: remove /dev/random commands
   compat_ioctl: remove joystick ioctl translation
   compat_ioctl: remove PCI ioctl translation
   compat_ioctl: remove /dev/raw ioctl translation
   compat_ioctl: remove last RAID handling code

  Documentation/networking/ppp_generic.txt|   2 +
  arch/um/drivers/hostaudio_kern.c|   1 +
  drivers/android/binder.c|   2 +-
  drivers/char/ppdev.c|  12 +-
  drivers/char/random.c   |   1 +
  drivers/char/tpm/tpm_vtpm_proxy.c   |  12 +-
  drivers/crypto/qat/qat_common/adf_ctl_drv.c |   2 +-
  drivers/dma-buf/dma-buf.c   |   4 +-
  drivers/dma-buf/sw_sync.c   |   2 +-
  drivers/dma-buf/sync_file.c |   2 +-
  drivers/firewire/core-cdev.c|  12 +-
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c|   2 +-
  drivers/hid/hidraw.c|   4 +-
  drivers/hid/usbhid/hiddev.c |  11 +-
  drivers/hwtracing/stm/core.c|  12 +-
  drivers/ide/ide-tape.c  |  31 +-
  drivers/iio/industrialio-core.c |   2 +-
  drivers/infiniband/core/uverbs_main.c   |   4 +-
  drivers/isdn/capi/capi.c|  31 +
  drivers/isdn/i4l/isdn_ppp.c |  14 +-
  drivers/media/rc/lirc_dev.c |   4 +-
  drivers/mfd/cros_ec_dev.c   |   4 +-
  drivers/misc/cxl/flash.c|   8 +-
 

Re: Linux 5.1-rc5

2019-04-16 Thread Linus Torvalds
On Tue, Apr 16, 2019 at 8:38 PM Michael Ellerman  wrote:
>
> > That said, powerpc and s390 should at least look at maybe adding a
> > check for the page ref in their gup paths too. Powerpc has the special
> > gup_hugepte() case
>
> Which uses page_cache_add_speculative(), which handles the case of the
> refcount being zero but not overflow. So that looks like it needs
> fixing.

Note that unlike the zero check, the "too many refs" check does _not_
need to be atomic.

Because it's not a correctness issue right at some magical exact
point, it's a much more ambiguous "the refcount is now so large that
I'm not going to do GUP on this page any more". Being off by a number
of pages in case there's a race is just fine.

So you could do something like the appended patch (TOTALLY UNTESTED,
and whitespace-damaged on purpose - I don't want you to apply it
blindly).

> And we have a few uses of bare get_page() in KVM code which might be
> subject to the same attack.

Note that you really have to have not just a get_page(), but some way
of lining up *billions* of them. Which really tends to be pretty hard.

Linus



diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9e732bb2c84a..52db7ff7c756 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -523,7 +523,8 @@ struct page *follow_huge_pd(struct vm_area_struct *vma,
page = pte_page(*ptep);
page += ((address & mask) >> PAGE_SHIFT);
if (flags & FOLL_GET)
-   get_page(page);
+   if (!try_get_page(page))
+   page = NULL;
} else {
if (is_hugetlb_entry_migration(*ptep)) {
spin_unlock(ptl);
@@ -883,6 +884,8 @@ int gup_hugepte(pte_t *ptep, unsigned long sz,
unsigned long addr,

refs = 0;
head = pte_page(pte);
+   if (page_ref_count(head) < 0)
+   return 0;

page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
do {


Re: [PATCH v2 00/21] Convert hwmon documentation to ReST

2019-04-16 Thread Guenter Roeck

On 4/16/19 6:58 PM, Mauro Carvalho Chehab wrote:

Em Tue, 16 Apr 2019 13:31:14 -0700
Guenter Roeck  escreveu:


On Tue, Apr 16, 2019 at 02:19:49PM -0600, Jonathan Corbet wrote:

On Fri, 12 Apr 2019 20:09:16 -0700
Guenter Roeck  wrote:
   

The big real-world question is: Is the series good enough for you to accept,
or do you expect some level of user/kernel separation ?


I guess it can go in; it's forward progress, even if it doesn't make the
improvements I would like to see.

The real question, I guess, is who should take it.  I've been seeing a
fair amount of activity on hwmon, so I suspect that the potential for
conflicts is real.  Perhaps things would go smoother if it went through
your tree?
   

We'll see a number of conflicts, yes. In terms of timing, this is probably
the worst release in the last few years to make such a change. I currently
have 9 patches queued in hwmon-next which touch Documentation/hwmon.
Of course the changes made in those are all not ReST compatible, and I have
no idea what to look out for to make it compatible. So this is going to be
fun (in a negative sense) either way.

I don't really have a recommendation at this point; I think the best I could
do is to take the patches which don't generate conflicts and leave the rest
alone. But that would also be bad, since the new index file would not match
reality. No idea, really, what the best or even a useful approach would be.

Maybe automated changes like this (assuming they are indeed automated)
can be generated and pushed right after a commit window closes. Would
that by any chance be possible ?


No, those patches are hand-made, but I can surely rebase them on top of
your tree. Is your tree already merged at linux-next, or should I use some
other branch/tree for rebase?



linux-next merges hwmon-next. next-20190416 is missing one patch which touches
Documentation/hwmon, but that should be easy to deal with.

Thanks,
Guenter


Re: Linux 5.1-rc5

2019-04-16 Thread Michael Ellerman
[ Cc += Nick & Aneesh & Paul ]

Linus Torvalds  writes:
> On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  wrote:
>>
>> Can we please have the page refcount overflow fixes out on the list
>> for review, even if it is after the fact?
>
> They were actually on a list for review long before the fact, but it
> was the security mailing list. The issue actually got discussed back
> in January along with early versions of the patches, but then we
> dropped the ball because it just wasn't on anybody's radar and it got
> resurrected late March. Willy wrote a rather bigger patch-series, and
> review of that is what then resulted in those commits. So they may
> look recent, but that's just because the original patches got
> seriously edited down and rewritten.
>
> That said, powerpc and s390 should at least look at maybe adding a
> check for the page ref in their gup paths too. Powerpc has the special
> gup_hugepte() case

Which uses page_cache_add_speculative(), which handles the case of the
refcount being zero but not overflow. So that looks like it needs
fixing.

We also have follow_huge_pd() that should use try_get_page().

And we have a few uses of bare get_page() in KVM code which might be
subject to the same attack.

cheers


Re: [PATCH v3 7/8] powerpc/mm: Consolidate radix and hash address map details

2019-04-16 Thread Aneesh Kumar K.V

On 4/16/19 7:33 PM, Nicholas Piggin wrote:

Aneesh Kumar K.V's on April 16, 2019 8:07 pm:

We now have

4K page size config

  kernel_region_map_size = 16TB
  kernel vmalloc start   = 0xc0001000
  kernel IO start= 0xc0002000
  kernel vmemmap start   = 0xc0003000

with 64K page size config:

  kernel_region_map_size = 512TB
  kernel vmalloc start   = 0xc008
  kernel IO start= 0xc00a
  kernel vmemmap start   = 0xc00c


Hey Aneesh,

I like the series, I like consolidating the address spaces into 0xc,
and making the layouts match or similar isn't a bad thing. I don't
see any real reason to force limitations on one layout or another --
you could make the argument that 4k radix should match 64k radix
as much as matching 4k hash IMO.

I wouldn't like to tie them too strongly to the same base defines
that force them to stay in sync.

Can we drop this patch? Or at least keep the users of the H_ and R_
defines and set them to the same thing in map.h?




I did that based on the suggestion from Michael Ellerman. I guess he 
wanted the VMALLOC_START to match. I am not sure whether we should match 
the kernel_region_map_size too. I did mention that in the cover letter.


I agree with your suggestion above. I still can keep the VMALLOC_START 
at 16TB and keep the region_map_size as 512TB for radix 4k. I am not 
sure we want to do that.


I will wait for feedback from Michael to make the suggested changes.

-aneesh



Re: [PATCH v5 1/6] iommu: add generic boot option iommu.dma_mode

2019-04-16 Thread Leizhen (ThunderTown)



On 2019/4/16 23:21, Will Deacon wrote:
> On Fri, Apr 12, 2019 at 02:11:31PM +0100, Robin Murphy wrote:
>> On 12/04/2019 11:26, John Garry wrote:
>>> On 09/04/2019 13:53, Zhen Lei wrote:
 +static int __init iommu_dma_mode_setup(char *str)
 +{
 +if (!str)
 +goto fail;
 +
 +if (!strncmp(str, "passthrough", 11))
 +iommu_default_dma_mode = IOMMU_DMA_MODE_PASSTHROUGH;
 +else if (!strncmp(str, "lazy", 4))
 +iommu_default_dma_mode = IOMMU_DMA_MODE_LAZY;
 +else if (!strncmp(str, "strict", 6))
 +iommu_default_dma_mode = IOMMU_DMA_MODE_STRICT;
 +else
 +goto fail;
 +
 +pr_info("Force dma mode to be %d\n", iommu_default_dma_mode);
>>>
>>> What happens if the cmdline option iommu.dma_mode is passed multiple
>>> times? We get multiple - possibly conflicting - prints, right?
>>
>> Indeed; we ended up removing such prints for the existing options here,
>> specifically because multiple messages seemed more likely to be confusing
>> than useful.

I originally intended to be compatible with the x86 printing:

} else if (!strncmp(str, "strict", 6)) {
pr_info("Disable batched IOTLB flush\n");
intel_iommu_strict = 1;
}

>>
>>> And do we need to have backwards compatibility, such that the setting
>>> for iommu.strict or iommu.passthrough trumps iommu.dma_mode, regardless
>>> of order?
>>
>> As above I think it would be preferable to just keep using the existing
>> options anyway. The current behaviour works out as:
>>
>> iommu.passthrough |      Y      |             N
>> iommu.strict      |      x      |      Y      |      N
>> ------------------|-------------|-------------|------------
>> MODE              | PASSTHROUGH |   STRICT    |    LAZY
>>
>> which seems intuitive enough that a specific dma_mode option doesn't add
>> much value, and would more likely just overcomplicate things for users as
>> well as our implementation.
> 
> Agreed. We can't remove the existing options, and they do the job perfectly
> well so I don't see the need to add more options on top.

OK, I will remove the iommu.dma_mode option in the next version. Thanks to you
three.

I didn't want to add it at first, but later found that the boot options on
each ARCH are different, so I wanted to normalize them.
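
For reference, a hedged sketch of the mode selection in Robin's table above;
the IOMMU_DMA_MODE_* names follow the patch quoted earlier in this thread,
while the enum values, helper name and parameters are illustrative only:

    #include <linux/types.h>    /* bool */

    /* Illustrative only: these are not the actual kernel symbols. */
    enum { IOMMU_DMA_MODE_PASSTHROUGH, IOMMU_DMA_MODE_LAZY, IOMMU_DMA_MODE_STRICT };

    static int default_dma_mode(bool passthrough, bool strict)
    {
            if (passthrough)
                    return IOMMU_DMA_MODE_PASSTHROUGH;  /* iommu.strict ignored */
            return strict ? IOMMU_DMA_MODE_STRICT : IOMMU_DMA_MODE_LAZY;
    }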

In addition, do we need to keep the build option name IOMMU_DEFAULT_PASSTHROUGH
for compatibility, or change it to IOMMU_DEFAULT_DMA_MODE_PASSTHROUGH or
IOMMU_DEFAULT_MODE_PASSTHROUGH?

> 
> Will
> 
> .
> 

-- 
Thanks!
Best Regards



Re: [PATCH v2 00/21] Convert hwmon documentation to ReST

2019-04-16 Thread Mauro Carvalho Chehab
Em Tue, 16 Apr 2019 13:31:14 -0700
Guenter Roeck  escreveu:

> On Tue, Apr 16, 2019 at 02:19:49PM -0600, Jonathan Corbet wrote:
> > On Fri, 12 Apr 2019 20:09:16 -0700
> > Guenter Roeck  wrote:
> >   
> > > The big real-world question is: Is the series good enough for you to 
> > > accept,
> > > or do you expect some level of user/kernel separation ?  
> > 
> > I guess it can go in; it's forward progress, even if it doesn't make the
> > improvements I would like to see.
> > 
> > The real question, I guess, is who should take it.  I've been seeing a
> > fair amount of activity on hwmon, so I suspect that the potential for
> > conflicts is real.  Perhaps things would go smoother if it went through
> > your tree?
> >   
> We'll see a number of conflicts, yes. In terms of timing, this is probably
> the worst release in the last few years to make such a change. I currently
> have 9 patches queued in hwmon-next which touch Documentation/hwmon.
> Of course the changes made in those are all not ReST compatible, and I have
> no idea what to look out for to make it compatible. So this is going to be
> fun (in a negative sense) either way.
> 
> I don't really have a recommendation at this point; I think the best I could
> do is to take the patches which don't generate conflicts and leave the rest
> alone. But that would also be bad, since the new index file would not match
> reality. No idea, really, what the best or even a useful approach would be.
> 
> Maybe automated changes like this (assuming they are indeed automated)
> can be generated and pushed right after a commit window closes. Would
> that by any chance be possible ?

No, those patches are hand-made, but I can surely rebase them on top of
your tree. Is your tree already merged at linux-next, or should I use some
other branch/tree for rebase?

Thanks,
Mauro


Re: [PATCH v3 3/5] powerpc: Use the correct style for SPDX License Identifier

2019-04-16 Thread Andrew Donnellan

On 17/4/19 1:28 am, Nishad Kamdar wrote:

This patch corrects the SPDX License Identifier style
in the powerpc Hardware Architecture related files.

Suggested-by: Joe Perches 
Signed-off-by: Nishad Kamdar 
---
TIL there's a different style for source vs headers... sigh. :( Thanks 
for fixing.


Acked-by: Andrew Donnellan 


  arch/powerpc/include/asm/pnv-ocxl.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
b/arch/powerpc/include/asm/pnv-ocxl.h
index 208b5503f4ed..7de82647e761 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: GPL-2.0+
+/* SPDX-License-Identifier: GPL-2.0+ */
  // Copyright 2017 IBM Corp.
  #ifndef _ASM_PNV_OCXL_H
  #define _ASM_PNV_OCXL_H



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH 1/6] mm: change locked_vm's type from unsigned long to atomic64_t

2019-04-16 Thread Andrew Morton
On Thu, 11 Apr 2019 16:28:07 -0400 Daniel Jordan  
wrote:

> On Thu, Apr 11, 2019 at 10:55:43AM +0100, Mark Rutland wrote:
> > On Thu, Apr 11, 2019 at 02:22:23PM +1000, Alexey Kardashevskiy wrote:
> > > On 03/04/2019 07:41, Daniel Jordan wrote:
> > 
> > > > -   dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", 
> > > > current->pid,
> > > > +   dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %lld/%lu%s\n", 
> > > > current->pid,
> > > > incr ? '+' : '-', npages << PAGE_SHIFT,
> > > > -   current->mm->locked_vm << PAGE_SHIFT, 
> > > > rlimit(RLIMIT_MEMLOCK),
> > > > -   ret ? "- exceeded" : "");
> > > > +   (s64)atomic64_read(&current->mm->locked_vm) << 
> > > > PAGE_SHIFT,
> > > > +   rlimit(RLIMIT_MEMLOCK), ret ? "- exceeded" : "");
> > > 
> > > 
> > > 
> > > atomic64_read() returns "long" which matches "%ld", why this change (and
> > > similar below)? You did not do this in the two pr_debug()s above anyway.
> > 
> > Unfortunately, architectures return inconsistent types for atomic64 ops.
> > 
> > Some return long (e..g. powerpc), some return long long (e.g. arc), and
> > some return s64 (e.g. x86).
> 
> Yes, Mark said it all, I'm just chiming in to confirm that's why I added the
> cast.
> 
> Btw, thanks for doing this, Mark.

What's the status of this patchset, btw?

I have a note here that
powerpc-mmu-drop-mmap_sem-now-that-locked_vm-is-atomic.patch is to be
updated.
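
For context, the portability issue discussed above is that atomic64_read()
returns long, long long or s64 depending on the architecture, so the value
is cast to one fixed type before being handed to a printf-style format. A
hedged sketch, assuming the patchset's atomic64_t mm->locked_vm (the helper
name is illustrative):

    #include <linux/atomic.h>
    #include <linux/mm_types.h>
    #include <linux/printk.h>

    static void report_locked_vm(struct mm_struct *mm)
    {
            /* Cast to a fixed-width type, then print with a matching format. */
            long long pages = (s64)atomic64_read(&mm->locked_vm);

            pr_debug("locked_vm: %lld pages\n", pages);
    }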



Re: [PATCH v2 5/5] arm64/speculation: Support 'mitigations=' cmdline option

2019-04-16 Thread Will Deacon
On Tue, Apr 16, 2019 at 09:26:13PM +0200, Thomas Gleixner wrote:
> On Fri, 12 Apr 2019, Josh Poimboeuf wrote:
> 
> > Configure arm64 runtime CPU speculation bug mitigations in accordance
> > with the 'mitigations=' cmdline option.  This affects Meltdown, Spectre
> > v2, and Speculative Store Bypass.
> > 
> > The default behavior is unchanged.
> > 
> > Signed-off-by: Josh Poimboeuf 
> > ---
> > NOTE: This is based on top of Jeremy Linton's patches:
> >   https://lkml.kernel.org/r/20190410231237.52506-1-jeremy.lin...@arm.com
> 
> So I keep that out and we have to revisit that once the ARM64 stuff hits a
> tree, right? I can have a branch with just the 4 first patches applied
> which ARM64 folks can pull in when they apply Jeremy's patches before the
> merge window.

Yes, that would work for us, cheers. I should get to Jeremy's latest version
next week and I'm certainly planning to get them queued up for 5.2.

Will


[PATCH v3 3/5] powerpc: Use the correct style for SPDX License Identifier

2019-04-16 Thread Nishad Kamdar
This patch corrects the SPDX License Identifier style
in the powerpc Hardware Architecture related files.

Suggested-by: Joe Perches 
Signed-off-by: Nishad Kamdar 
---
 arch/powerpc/include/asm/pnv-ocxl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
b/arch/powerpc/include/asm/pnv-ocxl.h
index 208b5503f4ed..7de82647e761 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: GPL-2.0+
+/* SPDX-License-Identifier: GPL-2.0+ */
 // Copyright 2017 IBM Corp.
 #ifndef _ASM_PNV_OCXL_H
 #define _ASM_PNV_OCXL_H
-- 
2.17.1



[PATCH v12 08/31] mm: introduce INIT_VMA()

2019-04-16 Thread Laurent Dufour
Some VMA struct fields need to be initialized once the VMA structure is
allocated.
Currently this only concerns the anon_vma_chain field, but others will be
added to support the speculative page fault.

Instead of spreading the initialization calls all over the code, let's
introduce a dedicated inline function.

Signed-off-by: Laurent Dufour 
---
 fs/exec.c  | 1 +
 include/linux/mm.h | 5 +
 kernel/fork.c  | 2 +-
 mm/mmap.c  | 3 +++
 mm/nommu.c | 1 +
 5 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..9762e060295c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -266,6 +266,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
vma->vm_start = vma->vm_end - PAGE_SIZE;
vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
VM_STACK_INCOMPLETE_SETUP;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+   INIT_VMA(vma);
 
err = insert_vm_struct(mm, vma);
if (err)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4ba2f53f9d60..2ceb1d2869a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1407,6 +1407,11 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap 
*/
 };
 
+static inline void INIT_VMA(struct vm_area_struct *vma)
+{
+   INIT_LIST_HEAD(&vma->anon_vma_chain);
+}
+
 struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 pte_t pte, bool with_public_device);
 #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
diff --git a/kernel/fork.c b/kernel/fork.c
index 915be4918a2b..f8dae021c2e5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -341,7 +341,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
*orig)
 
if (new) {
*new = *orig;
-   INIT_LIST_HEAD(&new->anon_vma_chain);
+   INIT_VMA(new);
}
return new;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index bd7b9f293b39..5ad3a3228d76 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1765,6 +1765,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
+   INIT_VMA(vma);
 
if (file) {
if (vm_flags & VM_DENYWRITE) {
@@ -3037,6 +3038,7 @@ static int do_brk_flags(unsigned long addr, unsigned long 
len, unsigned long fla
}
 
vma_set_anonymous(vma);
+   INIT_VMA(vma);
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_pgoff = pgoff;
@@ -3395,6 +3397,7 @@ static struct vm_area_struct *__install_special_mapping(
if (unlikely(vma == NULL))
return ERR_PTR(-ENOMEM);
 
+   INIT_VMA(vma);
vma->vm_start = addr;
vma->vm_end = addr + len;
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 749276beb109..acf7ca72ca90 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1210,6 +1210,7 @@ unsigned long do_mmap(struct file *file,
region->vm_flags = vm_flags;
region->vm_pgoff = pgoff;
 
+   INIT_VMA(vma);
vma->vm_flags = vm_flags;
vma->vm_pgoff = pgoff;
 
-- 
2.21.0



[PATCH v12 00/31] Speculative page faults

2019-04-16 Thread Laurent Dufour
This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
page fault without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
process since the page fault handler will not wait for other threads memory
layout change to be done, assuming that this change is done in another part
of the process's memory space. This type of page fault is named speculative
page fault. If the speculative page fault fails because a concurrent change
has been detected or because the underlying PMD or PTE tables are not yet
allocated, its processing is aborted and a regular page fault is then
tried.

The speculative page fault (SPF) has to look for the VMA matching the fault
address without holding the mmap_sem, this is done by protecting the MM RB
tree with RCU and by using a reference counter on each VMA. When fetching a
VMA under the RCU protection, the VMA's reference counter is incremented to
ensure that the VMA will not be freed behind our back during the SPF
processing. Once that processing is done the VMA's reference counter is
decremented. To ensure that a VMA is still present when walking the RB tree
locklessly, the VMA's reference counter is incremented when that VMA is
linked in the RB tree. When the VMA is unlinked from the RB tree, its
reference counter will be decremented at the end of the RCU grace period,
ensuring it will be available during this time. This means that the VMA
freeing could be delayed and could delay the file closing for file
mapping. Since the SPF handler is not able to manage file mappings, the file is
closed synchronously and not during the RCU cleaning. This is safe since
the page fault handler is aborting if a file pointer is associated to the
VMA.

Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
benchmark [2].

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a per
VMA sequence lock. This sequence lock allows the speculative page fault
handler to fast check for parallel changes in progress and to abort the
speculative page fault in that case.

Once the VMA has been found, the speculative page fault handler would check
for the VMA's attributes to verify that the page fault has to be handled
correctly or not. Thus, the VMA is protected through a sequence lock which
allows fast detection of concurrent VMA changes. If such a change is
detected, the speculative page fault is aborted and a *classic* page fault
is tried.  VMA sequence lockings are added when VMA attributes which are
checked during the page fault are modified.
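
A hedged sketch of the reader-side pattern described above, using the
kernel's seqcount API; vma->vm_sequence is the per-VMA seqcount added by
this series (it is not in mainline mm_types.h) and the helper names are
illustrative:

    #include <linux/mm_types.h>
    #include <linux/seqlock.h>

    /* Returns false if a writer (mprotect, mremap, ...) is in progress. */
    static bool spf_vma_read_begin(struct vm_area_struct *vma, unsigned int *seq)
    {
            *seq = raw_read_seqcount(&vma->vm_sequence);
            return !(*seq & 1);
    }

    /* Returns true if the VMA changed while the page tables were walked,
     * in which case the speculative fault is aborted and the regular
     * page fault path is taken. */
    static bool spf_vma_read_retry(struct vm_area_struct *vma, unsigned int seq)
    {
            return read_seqcount_retry(&vma->vm_sequence, seq);
    }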

When the PTE is fetched, the VMA is checked to see if it has been changed,
so once the page table is locked, the VMA is valid, so any other changes
leading to touching this PTE will need to lock the page table, so no
parallel change is possible at this time.

The locking of the PTE is done with interrupts disabled, this allows
checking for the PMD to ensure that there is not an ongoing collapsing
operation. Since khugepaged first sets the PMD to pmd_none and then waits
for the other CPU to have caught the IPI interrupt, if the pmd is
valid at the time the PTE is locked, we have the guarantee that the
collapsing operation will have to wait on the PTE lock to move
forward. This allows the SPF handler to map the PTE safely. If the PMD
value is different from the one recorded at the beginning of the SPF
operation, the classic page fault handler will be called to handle the
operation while holding the mmap_sem. As the PTE lock is done with the
interrupts disabled, the lock is done using spin_trylock() to avoid dead
lock when handling a page fault while a TLB invalidate is requested by
another CPU holding the PTE.

In pseudo code, this could be seen as:
speculative_page_fault()
{
vma = find_vma_rcu()
check vma sequence count
check vma's support
disable interrupt
  check pgd,p4d,...,pte
  save pmd and pte in vmf
  save vma sequence counter in vmf
enable interrupt
check vma sequence count
handle_pte_fault(vma)
..
page = alloc_page()
pte_map_lock()
disable interrupt
abort if sequence counter has changed
abort if pmd or pte has changed
pte map and lock
enable interrupt
if abort
   free page
   abort
...
put_vma(vma)
}

arch_fault_handler()
{
if (speculative_page_fault())
   goto done
again:
lock(mmap_sem)
vma = find_vma();
 

[PATCH v12 24/31] mm: adding speculative page fault failure trace events

2019-04-16 Thread Laurent Dufour
This patch adds a set of new trace events to collect the speculative page fault
event failures.

Signed-off-by: Laurent Dufour 
---
 include/trace/events/pagefault.h | 80 
 mm/memory.c  | 57 ++-
 2 files changed, 125 insertions(+), 12 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index ..d9438f3e6bad
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include 
+#include 
+
+DECLARE_EVENT_CLASS(spf,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, caller)
+   __field(unsigned long, vm_start)
+   __field(unsigned long, vm_end)
+   __field(unsigned long, address)
+   ),
+
+   TP_fast_assign(
+   __entry->caller = caller;
+   __entry->vm_start   = vma->vm_start;
+   __entry->vm_end = vma->vm_end;
+   __entry->address= address;
+   ),
+
+   TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/mm/memory.c b/mm/memory.c
index 1991da97e2db..509851ad7c95 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -81,6 +81,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
 #endif
@@ -2100,8 +2103,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
 
 again:
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2109,8 +2114,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 #endif
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
@@ -2121,6 +2128,7 @@ static bool pte_spinlock(struct vm_fault *vmf)
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -2154,8 +2162,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 */
 again:
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2163,8 +2173,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 #endif
 
/*
@@ -2184,6 +2196,7 @@ static bool pte_map_lock(struct vm_fault *vmf)
 
if (vma_has_changed(vmf)) {

[PATCH v12 17/31] mm: introduce __page_add_new_anon_rmap()

2019-04-16 Thread Laurent Dufour
When dealing with the speculative page fault handler, we may race with a VMA
being split or merged. In this case the vma->vm_start and vma->vm_end
fields may not match the address at which the page fault is occurring.

This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma the new->anon_vma is set to src->anon_vma when
*new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA in which it has been merged
must have the same anon_vma pointer otherwise the merge can't be done.

So in all cases we know that the anon_vma is valid, since we have
checked before starting the speculative page fault that the anon_vma
pointer is valid for this VMA, and since there is an anon_vma this
means that at one time a page has been backed and that before the VMA
is cleaned, the page table lock would have to be grabbed to clean the
PTE, and the anon_vma field is checked once the PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline one
which does the check.

When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.

Signed-off-by: Laurent Dufour 
---
 include/linux/rmap.h | 12 ++--
 mm/memory.c  |  8 
 mm/rmap.c|  5 ++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..a5d282573093 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct 
vm_area_struct *,
unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-   unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
diff --git a/mm/memory.c b/mm/memory.c
index be93f2c8ebe0..46f877b6abea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 * thread doing COW.
 */
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -2897,7 +2897,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
@@ -3049,7 +3049,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
}
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
@@ -3328,7 +3328,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
mem_cgroup *memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..2148e8ce6e34 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1140,7 

Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-16 Thread Laurent Dufour

Le 16/04/2019 à 16:27, Mark Rutland a écrit :

On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:

From: Mahendran Ganesh 

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
enables Speculative Page Fault handler.

Signed-off-by: Ganesh Mahendran 


This is missing your S-o-B.


You're right, I missed that...



The first patch noted that the ARCH_SUPPORTS_* option was there because
the arch code had to make an explicit call to try to handle the fault
speculatively, but that isn't added until patch 30.

Why is this separate from that code?


Andrew recommended this a long time ago for bisection purposes. This
allows building the code with CONFIG_SPECULATIVE_PAGE_FAULT before the
code that triggers the SPF handler is added to each architecture's code.


Thanks,
Laurent.


Thanks,
Mark.


---
  arch/arm64/Kconfig | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 870ef86a64ed..8e86934d598b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
  ARM 64-bit (AArch64) Linux support.
  
--

2.21.0







[PATCH v12 22/31] mm: provide speculative fault infrastructure

2019-04-16 Thread Laurent Dufour
From: Peter Zijlstra 

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) 

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmd.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
 path]
[Check PMD against concurrent collapsing operation]
[Try spin lock the pte during the speculative path to avoid deadlock with
 other CPU's invalidating the TLB and requiring this CPU to catch the
 inter processor's interrupt]
[Move define of FAULT_FLAG_SPECULATIVE here]
[Introduce __handle_speculative_fault() and add a check against
 mm->mm_users in handle_speculative_fault() defined in mm.h]
[Abort if vm_ops->fault is set instead of checking only vm_ops]
[Use find_vma_rcu() and call put_vma() when we are done with the VMA]
Signed-off-by: Laurent Dufour 
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h |  30 +++
 include/linux/pagemap.h|   4 +-
 mm/internal.h  |  15 ++
 mm/memory.c| 344 -
 5 files changed, 389 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-   return !!(vma->vm_flags & VM_HUGETLB);
+   return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f761a9c65c74..ec609cbad25a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER0x40/* The fault originated in 
userspace */
 #define FAULT_FLAG_REMOTE  0x80/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100  /* The fault was during an instruction 
fetch */
+#define FAULT_FLAG_SPECULATIVE 0x200   /* Speculative fault, not holding 
mmap_sem */
 
 #define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
@@ -409,6 +410,10 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations 
*/
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned int sequence;
+   pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,

[PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()

2019-04-16 Thread Laurent Dufour
Vinayak Menon and Ganesh Mahendran reported that the following scenario may
lead to a thread being blocked due to data corruption:

CPU 1   CPU 2CPU 3
Process 1,  Process 1,   Process 1,
Thread AThread B Thread C

while (1) { while (1) {  while(1) {
pthread_mutex_lock(l)   pthread_mutex_lock(l)fork
pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
}   }

In the details this happens because :

CPU 1CPU 2   CPU 3
fork()
copy_pte_range()
  set PTE rdonly
got to next VMA...
 .   PTE is seen rdonly  PTE still writable
 .   thread is writing to page
 .   -> page fault
 . copy the page Thread writes to page
 .  .-> no page fault
 . update the PTE
 . flush TLB for that PTE
   flush TLBPTE are now rdonly

So the write done by the CPU 3 is interfering with the page copy operation
done by CPU 2, leading to the data corruption.

To avoid this we mark all the VMAs involved in the COW mechanism as changing
by calling vm_write_begin(). This ensures that the speculative page fault
handler will not try to handle a fault on these pages.
The marker is set until the TLB is flushed, ensuring that all the CPUs will
now see the PTE as not writable.
Once the TLB is flushed, the marker is removed by calling vm_write_end().

The variable last is used to keep track of the latest VMA marked, to
handle the error path where only part of the VMAs may have been marked.

Since multiple VMAs from the same mm may have the sequence count increased
during this process, the use of the vm_raw_write_begin/end() is required to
avoid lockdep false warning messages.

Reported-by: Ganesh Mahendran 
Reported-by: Vinayak Menon 
Signed-off-by: Laurent Dufour 
---
 kernel/fork.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f8dae021c2e5..2992d2c95256 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
 {
-   struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
+   struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
struct rb_node **rb_link, *rb_parent;
int retval;
unsigned long charge;
@@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
rb_parent = &tmp->vm_rb;
 
mm->map_count++;
-   if (!(tmp->vm_flags & VM_WIPEONFORK))
+   if (!(tmp->vm_flags & VM_WIPEONFORK)) {
+   if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+   /*
+* Mark this VMA as changing to prevent the
+* speculative page fault handler to process
+* it until the TLB are flushed below.
+*/
+   last = mpnt;
+   vm_raw_write_begin(mpnt);
+   }
retval = copy_page_range(mm, oldmm, mpnt);
+   }
 
if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
@@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 out:
up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
+
+   if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+   /*
+* Since the TLB has been flushed, we can safely unmark the
+* copied VMAs and allow the speculative page fault handler to
+* process them again.
+* Walk back the VMA list from the last marked VMA.
+*/
+   for (; last; last = last->vm_prev) {
+   if (last->vm_flags & VM_DONTCOPY)
+   continue;
+   if (!(last->vm_flags & VM_WIPEONFORK))
+   vm_raw_write_end(last);
+   }
+   }
+
up_write(&oldmm->mmap_sem);
dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
-- 
2.21.0



[PATCH v12 23/31] mm: don't do swap readahead during speculative page fault

2019-04-16 Thread Laurent Dufour
Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA leading to a VMA
split.
This raised a panic in swap_vma_readahead() because the VMA's boundaries
no longer matched the faulting address.

To avoid this, if the page is not found in the swap, the speculative page
fault is aborted to retry a regular page fault.

Reported-by: Vinayak Menon 
Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 6e6bf61c0e5c..1991da97e2db 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,6 +2900,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
lru_cache_add_anon(page);
swap_readpage(page, true);
}
+   } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   /*
+* Don't try readahead during a speculative page fault
+* as the VMA's boundaries may change in our back.
+* If the page is not in the swap cache and synchronous
+* read is disabled, fall back to the regular page
+* fault mechanism.
+*/
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   ret = VM_FAULT_RETRY;
+   goto out;
} else {
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
vmf);
-- 
2.21.0



[PATCH v12 25/31] perf: add a speculative page fault sw event

2019-04-16 Thread Laurent Dufour
Add a new software event to count succeeded speculative page faults.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/uapi/linux/perf_event.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..3b4356c55caa 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
-- 
2.21.0



[PATCH v12 20/31] mm: introduce vma reference counter

2019-04-16 Thread Laurent Dufour
The final goal is to be able to use a VMA structure without holding the
mmap_sem and to be sure that the structure will not be freed behind our back.

The lockless use of the VMA will be done through RCU protection and thus a
dedicated freeing service is required to manage it asynchronously.

As reported in a 2010's thread [1], this may impact file handling when a
file is still referenced while the mapping is no longer there.  As the final
goal is to handle anonymous VMA in a speculative way and not file backed
mapping, we could close and free the file pointer in a synchronous way, as
soon as we are guaranteed to not use it without holding the mmap_sem. For
sanity reasons, as a minimal effort, the vm_file file pointer is unset once
the file pointer is put.

[1] https://lore.kernel.org/linux-mm/20100104182429.833180...@chello.nl/

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h   |  4 
 include/linux/mm_types.h |  3 +++
 mm/internal.h| 27 +++
 mm/mmap.c| 13 +
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f14b2c9ddfd4..f761a9c65c74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -529,6 +529,9 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   atomic_set(&vma->vm_ref_count, 1);
+#endif
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
@@ -1418,6 +1421,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
INIT_LIST_HEAD(&vma->anon_vma_chain);
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_init(&vma->vm_sequence);
+   atomic_set(&vma->vm_ref_count, 1);
 #endif
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 24b3f8ce9e42..6a6159e11a3f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -285,6 +285,9 @@ struct vm_area_struct {
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   atomic_t vm_ref_count;
+#endif
struct rb_node vm_rb;
 
/*
diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b95166..302382bed406 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,33 @@ void page_writeback_init(void);
 
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
+
+extern void __free_vma(struct vm_area_struct *vma);
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void get_vma(struct vm_area_struct *vma)
+{
+   atomic_inc(&vma->vm_ref_count);
+}
+
+static inline void put_vma(struct vm_area_struct *vma)
+{
+   if (atomic_dec_and_test(&vma->vm_ref_count))
+   __free_vma(vma);
+}
+
+#else
+
+static inline void get_vma(struct vm_area_struct *vma)
+{
+}
+
+static inline void put_vma(struct vm_area_struct *vma)
+{
+   __free_vma(vma);
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index f7f6027a7dff..c106440dcae7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -188,6 +188,12 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
 }
 #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 
+void __free_vma(struct vm_area_struct *vma)
+{
+   mpol_put(vma_policy(vma));
+   vm_area_free(vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -200,8 +206,8 @@ static struct vm_area_struct *remove_vma(struct 
vm_area_struct *vma)
vma->vm_ops->close(vma);
if (vma->vm_file)
fput(vma->vm_file);
-   mpol_put(vma_policy(vma));
-   vm_area_free(vma);
+   vma->vm_file = NULL;
+   put_vma(vma);
return next;
 }
 
@@ -990,8 +996,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
if (next->anon_vma)
anon_vma_merge(vma, next);
mm->map_count--;
-   mpol_put(vma_policy(next));
-   vm_area_free(next);
+   put_vma(next);
/*
 * In mprotect's case 6 (see comments on vma_merge),
 * we must remove another next too. It would clutter
-- 
2.21.0



[PATCH v12 10/31] mm: protect VMA modifications using VMA sequence count

2019-04-16 Thread Laurent Dufour
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modification done in :
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE() to prevent the writes from being
split and intermediate values from being seen by other CPUs.

Signed-off-by: Laurent Dufour 
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 
 mm/khugepaged.c|  3 +++
 mm/madvise.c   |  6 +-
 mm/mempolicy.c | 51 ++
 mm/mlock.c | 13 +++-
 mm/mmap.c  | 28 -
 mm/mprotect.c  |  4 +++-
 mm/swap_state.c| 10 ++---
 9 files changed, 95 insertions(+), 42 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 01d4eb0e6bd1..0864c050b2de 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1162,8 +1162,11 @@ static ssize_t clear_refs_write(struct file *file, const 
char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   vma->vm_flags &= ~VM_SOFTDIRTY;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags,
+vma->vm_flags & ~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+   vm_write_end(vma);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3b30301c90ec..2e0f98cadd81 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -667,8 +667,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct 
list_head *fcs)
 
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+   vm_write_end(vma);
return 0;
}
 
@@ -908,8 +911,10 @@ static int userfaultfd_release(struct inode *inode, struct 
file *file)
vma = prev;
else
prev = vma;
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
}
 skip_mm:
up_write(&mm->mmap_sem);
@@ -1474,8 +1479,10 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+   vm_write_end(vma);
 
skip:
prev = vma;
@@ -1636,8 +1643,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
 
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..6a0cbca3885e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
 
+   vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);
 
pte = pte_offset_map(pmd, address);
@@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+   vm_write_end(vma);
result = SCAN_FAIL;
goto out;
}
@@ -1081,6 +1083,7 @@ static 

[PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-16 Thread Laurent Dufour
From: Mahendran Ganesh 

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
enables Speculative Page Fault handler.

Signed-off-by: Ganesh Mahendran 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 870ef86a64ed..8e86934d598b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
  ARM 64-bit (AArch64) Linux support.
 
-- 
2.21.0



[PATCH v12 28/31] x86/mm: add speculative pagefault handling

2019-04-16 Thread Laurent Dufour
From: Peter Zijlstra 

Try a speculative fault before acquiring mmap_sem; if it returns with
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) 

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with usual fault path in the case VM_ERROR is returned by
 handle_speculative_fault(). This allows signal to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Handle memory protection key fault]
Signed-off-by: Laurent Dufour 
---
 arch/x86/mm/fault.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 667f1da36208..4390d207a7a1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1401,6 +1401,18 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 #endif
 
+   /*
+* Do not try to do a speculative page fault if the fault was due to
+* protection keys since it can't be resolved.
+*/
+   if (!(hw_error_code & X86_PF_PK)) {
+   fault = handle_speculative_fault(mm, address, flags);
+   if (fault != VM_FAULT_RETRY) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+   goto done;
+   }
+   }
+
/*
 * Kernel-mode access to the user address space should only occur
 * on well-defined single instructions listed in the exception
@@ -1499,6 +1511,8 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 
up_read(&mm->mmap_sem);
+
+done:
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, hw_error_code, address, fault);
return;
-- 
2.21.0



[PATCH v12 11/31] mm: protect mremap() against SPF handler

2019-04-16 Thread Laurent Dufour
If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the ptes have been moved by the other thread. This means that the moved
ptes will overwrite those created by the page fault handler, leading to
leaked pages.

CPU 1   CPU2
enter mremap()
unmap the dest area
copy_vma()  Enter speculative page fault handler
   >> at this time the dest area is present in the RB tree
fetch the vma matching dest area
create a pte as the VMA matched
Exit the SPF handler

move_ptes()
  > it is assumed that the dest area is empty,
  > the move ptes overwrite the page mapped by the CPU2.

To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as not available to the SPF handler.
The usual way to do so is to rely on vm_write_begin()/end().
This is already done in __vma_adjust(), called by copy_vma() (through
vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
which creates a window for another thread.
This patch adds a new parameter to vma_merge() which is passed down to
__vma_adjust().
The assumption is that copy_vma() returns a vma which should be released
by calling vm_raw_write_end() by the caller once the ptes have been
moved.
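
For illustration, the intended calling protocol on the mm/mremap.c side
looks roughly like this (a sketch of move_vma(), not the exact hunk):

        /* copy_vma() hands back a VMA on which vm_raw_write_begin() is
         * still held, so the SPF handler cannot use it until the ptes
         * have been moved. */
        new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
                           &need_rmap_locks);
        if (!new_vma)
                return -ENOMEM;

        moved_len = move_page_tables(vma, old_addr, new_vma, new_addr,
                                     old_len, need_rmap_locks);
        /* ... error handling, accounting ... */

        /* the destination area is populated, make the VMA visible to
         * the SPF handler again */
        vm_raw_write_end(new_vma);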

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 24 -
 mm/mmap.c  | 53 +++---
 mm/mremap.c| 13 
 3 files changed, 73 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 906b9e06f18e..5d45b7d8718d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2343,18 +2343,32 @@ void anon_vma_interval_tree_verify(struct 
anon_vma_chain *node);
 
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int 
cap_sys_admin);
+
 extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand);
+   struct vm_area_struct *expand, bool keep_locked);
+
 static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
-   return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+   return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
 }
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+
+extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
+   struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+   unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+   pgoff_t pgoff, struct mempolicy *mpol,
+   struct vm_userfaultfd_ctx uff, bool keep_locked);
+
+static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
-   unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-   struct mempolicy *, struct vm_userfaultfd_ctx);
+   unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+   pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+   return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
+  pol, uff, false);
+}
+
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index b77ec0149249..13460b38b0fb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -714,7 +714,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand)
+   struct vm_area_struct *expand, bool keep_locked)
 {
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -830,8 +830,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
 
importer->anon_vma = exporter->anon_vma;
error = anon_vma_clone(importer, exporter);
-   if (error)
+   if (error) {
+   if (next && next != vma)
+   vm_raw_write_end(next);
+   vm_raw_write_end(vma);
return error;
+   }
}
}
 again:
@@ -1025,7 +1029,8 @@ int __vma_adjust(struct vm_area_struct *vma, 

[PATCH v12 15/31] mm: introduce __lru_cache_add_active_or_unevictable

2019-04-16 Thread Laurent Dufour
The speculative page fault handler, which runs without holding the
mmap_sem, calls lru_cache_add_active_or_unevictable(), but the vma's
vm_flags are not guaranteed to remain constant.
Introduce __lru_cache_add_active_or_unevictable(), which takes the vma
flags value as a parameter instead of the vma pointer.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/linux/swap.h | 10 --
 mm/memory.c  |  8 
 mm/swap.c|  6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..d33b94eb3c69 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -343,8 +343,14 @@ extern void deactivate_file_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
-   struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+   unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+   struct vm_area_struct *vma)
+{
+   return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index 56802850e72c..85ec5ce5c0a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(new_page, vma);
+   __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
 * We call the notify macro here because, when using secondary
 * mmu page tables (such as kvm shadow page tables), we want the
@@ -2896,7 +2896,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -3048,7 +3048,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3327,7 +3327,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 3a75722e68a9..a55f0505b563 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -450,12 +450,12 @@ void lru_cache_add(struct page *page)
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
-struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+  unsigned long vma_flags)
 {
VM_BUG_ON_PAGE(PageLRU(page), page);
 
-   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+   if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
else if (!TestSetPageMlocked(page)) {
/*
-- 
2.21.0



[PATCH v12 19/31] mm: protect the RB tree with a sequence lock

2019-04-16 Thread Laurent Dufour
Introduce a per mm_struct seqlock, the mm_seq field, to protect the changes
made in the MM RB tree. This allows walking the RB tree without grabbing
the mmap_sem and, once the walk is done, double checking that the sequence
counter was stable during the walk.

The mm seqlock is held while inserting and removing entries in the MM RB
tree.  Later in this series, it will be checked when looking up a VMA
without holding the mmap_sem.

This is based on the initial work from Peter Zijlstra:
https://lore.kernel.org/linux-mm/20100104182813.479668...@chello.nl/

Signed-off-by: Laurent Dufour 
---
 include/linux/mm_types.h |  3 +++
 kernel/fork.c|  3 +++
 mm/init-mm.c |  3 +++
 mm/mmap.c| 48 +++-
 4 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e78f72eb2576..24b3f8ce9e42 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -358,6 +358,9 @@ struct mm_struct {
struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqlock_t mm_seq;
+#endif
u64 vmacache_seqnum;   /* per-thread vmacache */
 #ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 2992d2c95256..3a1739197ebc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1008,6 +1008,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqlock_init(&mm->mm_seq);
+#endif
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a787a319211e..69346b883a4e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -27,6 +27,9 @@
  */
 struct mm_struct init_mm = {
.mm_rb  = RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   .mm_seq = __SEQLOCK_UNLOCKED(init_mm.mm_seq),
+#endif
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
diff --git a/mm/mmap.c b/mm/mmap.c
index 13460b38b0fb..f7f6027a7dff 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -170,6 +170,24 @@ void unlink_file_vma(struct vm_area_struct *vma)
}
 }
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void mm_write_seqlock(struct mm_struct *mm)
+{
+   write_seqlock(&mm->mm_seq);
+}
+static inline void mm_write_sequnlock(struct mm_struct *mm)
+{
+   write_sequnlock(&mm->mm_seq);
+}
+#else
+static inline void mm_write_seqlock(struct mm_struct *mm)
+{
+}
+static inline void mm_write_sequnlock(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -445,26 +463,32 @@ static void vma_gap_update(struct vm_area_struct *vma)
 }
 
 static inline void vma_rb_insert(struct vm_area_struct *vma,
-struct rb_root *root)
+struct mm_struct *mm)
 {
+   struct rb_root *root = &mm->mm_rb;
+
/* All rb_subtree_gap values must be consistent prior to insertion */
validate_mm_rb(root, NULL);
 
rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
 }
 
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+   struct rb_root *root = &mm->mm_rb;
+
/*
 * Note rb_erase_augmented is a fairly large inline function,
 * so make sure we instantiate it only once with our desired
 * augmented rbtree callbacks.
 */
+   mm_write_seqlock(mm);
rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+   mm_write_sequnlock(mm); /* wmb */
 }
 
 static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
-   struct rb_root *root,
+   struct mm_struct *mm,
struct vm_area_struct *ignore)
 {
/*
@@ -472,21 +496,21 @@ static __always_inline void vma_rb_erase_ignore(struct 
vm_area_struct *vma,
 * with the possible exception of the "next" vma being erased if
 * next->vm_start was reduced.
 */
-   validate_mm_rb(root, ignore);
+   validate_mm_rb(&mm->mm_rb, ignore);
 
-   __vma_rb_erase(vma, root);
+   __vma_rb_erase(vma, mm);
 }
 
 static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
-struct rb_root *root)
+struct mm_struct *mm)
 {
/*
 * All rb_subtree_gap values must be consistent prior to erase,
 

[PATCH v12 26/31] perf tools: add support for the SPF perf event

2019-04-16 Thread Laurent Dufour
Add support for the new speculative faults event.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 tools/include/uapi/linux/perf_event.h | 1 +
 tools/perf/util/evsel.c   | 1 +
 tools/perf/util/parse-events.c| 4 
 tools/perf/util/parse-events.l| 1 +
 tools/perf/util/python.c  | 1 +
 5 files changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..3b4356c55caa 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 66d066f18b5b..1f3bea4379b2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -435,6 +435,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+   "speculative-faults",
 };
 
 static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 5ef4939408f2..effa8929cc90 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -140,6 +140,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias  = "",
},
+   [PERF_COUNT_SW_SPF] = {
+   .symbol = "speculative-faults",
+   .alias  = "spf",
+   },
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 7805c71aaae2..d28a6edd0a95 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -324,6 +324,7 @@ emulation-faults{ return 
sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EM
 dummy  { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 duration_time  { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 bpf-output { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }
 
/*
 * We have to handle the kernel PMU event 
cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index dda0ac978b1e..c617a4751549 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1200,6 +1200,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+   PERF_CONST(COUNT_SW_SPF),
 
PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
-- 
2.21.0



[PATCH v12 05/31] mm: prepare for FAULT_FLAG_SPECULATIVE

2019-04-16 Thread Laurent Dufour
From: Peter Zijlstra 

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
 implemented as the vm_fault structure in the kernel]
[move pte_map_lock()'s definition upper in the file]
[move the define of FAULT_FLAG_SPECULATIVE later in the series]
[review error path in do_swap_page(), do_anonymous_page() and
 wp_page_copy()]
Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 87 +++--
 1 file changed, 58 insertions(+), 29 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c6ddadd9d2b7..fc3698d13cb5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+  vmf->address, &vmf->ptl);
+   return true;
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically.  Before making any commitment, on those architectures
@@ -2261,25 +2268,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
int page_copied = 0;
struct mem_cgroup *memcg;
struct mmu_notifier_range range;
+   int ret = VM_FAULT_OOM;
 
if (unlikely(anon_vma_prepare(vma)))
-   goto oom;
+   goto out;
 
if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
  vmf->address);
if (!new_page)
-   goto oom;
+   goto out;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
-   goto oom;
+   goto out;
cow_user_page(new_page, old_page, vmf->address, vma);
}
 
if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg,
false))
-   goto oom_free_new;
+   goto out_free_new;
 
__SetPageUptodate(new_page);
 
@@ -2291,7 +2299,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/*
 * Re-check the pte - we dropped the lock
 */
-   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   ret = VM_FAULT_RETRY;
+   goto out_uncharge;
+   }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2378,12 +2389,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
-oom_free_new:
+out_uncharge:
+   mem_cgroup_cancel_charge(new_page, memcg, false);
+out_free_new:
put_page(new_page);
-oom:
+out:
if (old_page)
put_page(old_page);
-   return VM_FAULT_OOM;
+   return ret;
 }
 
 /**
@@ -2405,8 +2418,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
 {
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-  &vmf->ptl);
+   if (!pte_map_lock(vmf))
+   return VM_FAULT_RETRY;
/*
 * We might have raced with another page fault while we released the
 * pte_offset_map_lock.
@@ -2527,8 +2540,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
lock_page(vmf->page);
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   unlock_page(vmf->page);
+   put_page(vmf->page);
+   return VM_FAULT_RETRY;
+   }
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2744,11 +2760,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
if (!page) {
/*
-   

[PATCH v12 30/31] arm64/mm: add speculative page fault

2019-04-16 Thread Laurent Dufour
From: Mahendran Ganesh 

This patch enables the speculative page fault on the arm64
architecture.

I completed the SPF porting in 4.9. From the test results,
we can see app launching time improved by about 10% on average.
For the apps which have more than 50 threads, an improvement of 15% or
even more can be obtained.

Signed-off-by: Ganesh Mahendran 

[handle_speculative_fault() is no more returning the vma pointer]
Signed-off-by: Laurent Dufour 
---
 arch/arm64/mm/fault.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 4f343e603925..b5e2a93f9c21 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -485,6 +485,16 @@ static int __kprobes do_page_fault(unsigned long addr, 
unsigned int esr,
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
+   /*
+* let's try a speculative page fault without grabbing the
+* mmap_sem.
+*/
+   fault = handle_speculative_fault(mm, addr, mm_flags);
+   if (fault != VM_FAULT_RETRY) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, addr);
+   goto done;
+   }
+
/*
 * As per x86, we may deadlock here. However, since the kernel only
 * validly references user space from well defined areas of the code,
@@ -535,6 +545,8 @@ static int __kprobes do_page_fault(unsigned long addr, 
unsigned int esr,
}
up_read(>mmap_sem);
 
+done:
+
/*
 * Handle the "normal" (no error) case first.
 */
-- 
2.21.0



[PATCH v12 07/31] mm: make pte_unmap_same compatible with SPF

2019-04-16 Thread Laurent Dufour
pte_unmap_same() assumes that the page tables are still around because the
mmap_sem is held.
This is no longer the case when running a speculative page fault, and an
additional check must be made to ensure that the final page tables are
still there.

This is now done by calling pte_spinlock() to check the VMA's
consistency while locking the page tables.

This requires passing a vm_fault structure to pte_unmap_same(), which
contains all the needed parameters.

As pte_spinlock() may fail in the case of a speculative page fault, if the
VMA has been modified behind our back, pte_unmap_same() now distinguishes 3
cases :
1. the ptes are the same (0)
2. the ptes are different (VM_FAULT_PTNOTSAME)
3. a VMA change has been detected (VM_FAULT_RETRY)

Case 2 is handled by the introduction of a new VM_FAULT flag named
VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
page fault while holding the mmap_sem.
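
In do_swap_page() the new return value is then handled roughly as follows
(sketch of the caller side):

        ret = pte_unmap_same(vmf);
        if (ret) {
                /*
                 * If the ptes are different, another thread already did
                 * the swap operation behind our back, so there is
                 * nothing left to do.
                 */
                if (ret == VM_FAULT_PTNOTSAME)
                        ret = 0;
                /* VM_FAULT_RETRY is passed up unchanged so the fault is
                 * retried while holding the mmap_sem. */
                goto out;
        }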

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/linux/mm_types.h |  6 +-
 mm/memory.c  | 37 +++--
 2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8ec38b11b361..fd7d38ee2e33 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -652,6 +652,8 @@ typedef __bitwise unsigned int vm_fault_t;
  * @VM_FAULT_NEEDDSYNC:->fault did not modify page tables and 
needs
  * fsync() to complete (for synchronous page faults
  * in DAX)
+ * @VM_FAULT_PTNOTSAME Page table entries have changed during a
+ * speculative page fault handling.
  * @VM_FAULT_HINDEX_MASK:  mask HINDEX value
  *
  */
@@ -669,6 +671,7 @@ enum vm_fault_reason {
VM_FAULT_FALLBACK   = (__force vm_fault_t)0x000800,
VM_FAULT_DONE_COW   = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC  = (__force vm_fault_t)0x002000,
+   VM_FAULT_PTNOTSAME  = (__force vm_fault_t)0x004000,
VM_FAULT_HINDEX_MASK= (__force vm_fault_t)0x0f,
 };
 
@@ -693,7 +696,8 @@ enum vm_fault_reason {
{ VM_FAULT_RETRY,   "RETRY" },  \
{ VM_FAULT_FALLBACK,"FALLBACK" },   \
{ VM_FAULT_DONE_COW,"DONE_COW" },   \
-   { VM_FAULT_NEEDDSYNC,   "NEEDDSYNC" }
+   { VM_FAULT_NEEDDSYNC,   "NEEDDSYNC" },  \
+   { VM_FAULT_PTNOTSAME,   "PTNOTSAME" }
 
 struct vm_special_mapping {
const char *name;   /* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 221ccdf34991..d5bebca47d98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2094,21 +2094,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
  * parts, do_swap_page must check under lock before unmapping the pte and
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
+ *
+ * pte_unmap_same() returns:
+ * 0   if the PTE are the same
+ * VM_FAULT_PTNOTSAME  if the PTE are different
+ * VM_FAULT_RETRY  if the VMA has changed in our back during
+ * a speculative page fault handling.
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
+static inline vm_fault_t pte_unmap_same(struct vm_fault *vmf)
 {
-   int same = 1;
+   int ret = 0;
+
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
+   if (pte_spinlock(vmf)) {
+   if (!pte_same(*vmf->pte, vmf->orig_pte))
+   ret = VM_FAULT_PTNOTSAME;
+   spin_unlock(vmf->ptl);
+   } else
+   ret = VM_FAULT_RETRY;
}
 #endif
-   pte_unmap(page_table);
-   return same;
+   pte_unmap(vmf->pte);
+   return ret;
 }
 
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
@@ -2714,8 +2722,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
int exclusive = 0;
vm_fault_t ret = 0;
 
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+   ret = pte_unmap_same(vmf);
+   if (ret) {
+   /*
+* If pte != orig_pte, this means another thread did the
+* swap operation in our back.
+* So nothing else to do.
+*/
+   if (ret == VM_FAULT_PTNOTSAME)
+

[PATCH v12 31/31] mm: Add a speculative page fault switch in sysctl

2019-04-16 Thread Laurent Dufour
This allows the use of the speculative page fault handler to be turned
on and off at run time, e.g. through /proc/sys/vm/speculative_page_fault.

By default it is turned on.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 3 +++
 kernel/sysctl.c| 9 +
 mm/memory.c| 3 +++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec609cbad25a..f5bf13a2197a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1531,6 +1531,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct 
*vma,
unsigned long address, unsigned int flags);
 
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern int sysctl_speculative_page_fault;
 extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
 unsigned long address,
 unsigned int flags);
@@ -1538,6 +1539,8 @@ static inline vm_fault_t handle_speculative_fault(struct 
mm_struct *mm,
  unsigned long address,
  unsigned int flags)
 {
+   if (unlikely(!sysctl_speculative_page_fault))
+   return VM_FAULT_RETRY;
/*
 * Try speculative page fault for multithreaded user space task only.
 */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9df14b07a488..3a712e52c14a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1295,6 +1295,15 @@ static struct ctl_table vm_table[] = {
.extra1 = ,
.extra2 = ,
},
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   {
+   .procname   = "speculative_page_fault",
+   .data   = &sysctl_speculative_page_fault,
+   .maxlen = sizeof(sysctl_speculative_page_fault),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
{
.procname   = "panic_on_oom",
.data   = &sysctl_panic_on_oom,
diff --git a/mm/memory.c b/mm/memory.c
index c65e8011d285..a12a60891350 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -83,6 +83,9 @@
 
 #define CREATE_TRACE_POINTS
 #include 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+int sysctl_speculative_page_fault = 1;
+#endif
 
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
-- 
2.21.0



[PATCH v12 02/31] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-16 Thread Laurent Dufour
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT, which turns on the
Speculative Page Fault handler when building for 64-bit.

Cc: Thomas Gleixner 
Signed-off-by: Laurent Dufour 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0f2ab09da060..8bd575184d0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86_64
select SWIOTLB
select X86_DEV_DMA_OPS
select ARCH_HAS_SYSCALL_WRAPPER
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
 
 #
 # Arch settings
-- 
2.21.0



[PATCH v12 29/31] powerpc/mm: add speculative page fault

2019-04-16 Thread Laurent Dufour
This patch enables the speculative page fault on the PowerPC
architecture.

A speculative page fault is tried without holding the mmap_sem; if it
returns with VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

The speculative path is only tried for multithreaded processes as there is
no risk of contention on the mmap_sem otherwise.

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/mm/fault.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ec74305fa330..5d48016073cb 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -491,6 +491,21 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+   /*
+* Try speculative page fault before grabbing the mmap_sem.
+* The Page fault is done if VM_FAULT_RETRY is not returned.
+* But if the memory protection keys are active, we don't know if the
+* fault is due to a key mismatch or due to a classic protection check.
+* To differentiate that, we will need the VMA which we no longer have, so
+* let's retry with the mmap_sem held.
+*/
+   fault = handle_speculative_fault(mm, address, flags);
+   if (fault != VM_FAULT_RETRY && (IS_ENABLED(CONFIG_PPC_MEM_KEYS) &&
+   fault != VM_FAULT_SIGSEGV)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+   goto done;
+   }
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -600,6 +615,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
up_read(&current->mm->mmap_sem);
 
+done:
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
-- 
2.21.0



[PATCH v12 06/31] mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2019-04-16 Thread Laurent Dufour
When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking will have to be done while ensuring
that the VMA is not modified behind our back.

So move the fetch and locking operations into a dedicated function.
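
A simplified sketch of where this is heading (assuming the
FAULT_FLAG_SPECULATIVE flag and the vmf->sequence snapshot introduced by
later patches in the series; the real version also rechecks the sequence
count once the lock is held):

        static inline bool pte_spinlock(struct vm_fault *vmf)
        {
                /* speculative case: bail out if the VMA changed under us */
                if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
                    read_seqcount_retry(&vmf->vma->vm_sequence,
                                        vmf->sequence))
                        return false;

                vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
                spin_lock(vmf->ptl);
                return true;
        }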

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fc3698d13cb5..221ccdf34991 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   spin_lock(vmf->ptl);
+   return true;
+}
+
 static inline bool pte_map_lock(struct vm_fault *vmf)
 {
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -3656,8 +3663,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 * validation through pte_unmap_same(). It's of NUMA type but
 * the pfn may be screwed if the read is non atomic.
 */
-   vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -3850,8 +3857,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
-   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
-- 
2.21.0



[PATCH v12 09/31] mm: VMA sequence count

2019-04-16 Thread Laurent Dufour
From: Peter Zijlstra 

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The calls to vm_write_begin/end() in unmap_page_range() are
used to detect when a VMA is being unmapped and thus that new page faults
should not be satisfied for this VMA. If the seqcount hasn't changed when
the page tables are locked, this means we are safe to satisfy the page
fault.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

The VMA's sequence counter is also used to detect changes to various VMA
fields used during the page fault handling (see the sketch after the list
below), such as:
 - vm_start, vm_end
 - vm_pgoff
 - vm_flags, vm_page_prot
 - anon_vma
 - vm_policy
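
A minimal sketch of the resulting pattern (writer under the mmap_sem,
speculative reader without it):

        /* writer side, mmap_sem held for writing */
        vm_write_begin(vma);
        /* ... modify vm_start, vm_flags, anon_vma, vm_policy, ... */
        vm_write_end(vma);

        /* speculative reader side */
        unsigned int seq = raw_read_seqcount(&vma->vm_sequence);

        if (seq & 1)
                return VM_FAULT_RETRY; /* a writer is in progress */
        /* ... read the VMA fields, prepare the new pte ... */
        if (read_seqcount_retry(&vma->vm_sequence, seq))
                return VM_FAULT_RETRY; /* the VMA changed under us */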

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
[Introduce vm_write_* inline function depending on
 CONFIG_SPECULATIVE_PAGE_FAULT]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
 using vm_raw_write* functions]
[Fix a lock dependency warning in mmap_region() when entering the error
 path]
[move sequence initialisation INIT_VMA()]
[Review the patch description about unmap_page_range()]
Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h   | 44 
 include/linux/mm_types.h |  3 +++
 mm/memory.c  |  2 ++
 mm/mmap.c| 30 +++
 4 files changed, 79 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2ceb1d2869a6..906b9e06f18e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1410,6 +1410,9 @@ struct zap_details {
 static inline void INIT_VMA(struct vm_area_struct *vma)
 {
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqcount_init(&vma->vm_sequence);
+#endif
 }
 
 struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1534,6 +1537,47 @@ static inline void unmap_shared_mapping_range(struct 
address_space *mapping,
unmap_mapping_range(mapping, holebegin, holelen, 0);
 }
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+   write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+int subclass)
+{
+   write_seqcount_begin_nested(&vma->vm_sequence, subclass);
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+   write_seqcount_end(&vma->vm_sequence);
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+   raw_write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+   raw_write_seqcount_end(&vma->vm_sequence);
+}
+#else
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+int subclass)
+{
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd7d38ee2e33..e78f72eb2576 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -337,6 +337,9 @@ struct vm_area_struct {
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
 #endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqcount_t vm_sequence;
+#endif
 } __randomize_layout;
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index d5bebca47d98..423fa8ea0569 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1256,6 +1256,7 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long next;
 
BUG_ON(addr >= end);
+   vm_write_begin(vma);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -1265,6 +1266,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
+   vm_write_end(vma);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ad3a3228d76..a4e4d52a5148 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -726,6 +726,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
long adjust_next = 0;
int remove_next = 0;
 
+   /*
+* Why using vm_raw_write*() functions here to avoid lockdep's warning ?
+   

[PATCH v12 12/31] mm: protect SPF handler against anon_vma changes

2019-04-16 Thread Laurent Dufour
The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.

In addition, don't try a speculative page fault if the VMA doesn't have an
anon_vma structure allocated, because its allocation should be
protected by the mmap_sem.

In __vma_adjust(), when importer->anon_vma is set, there is no need to
protect against speculative page faults since the speculative page fault
is aborted if vma->anon_vma is not set.

When calling page_add_new_anon_rmap(), vma->anon_vma is necessarily
valid since we checked for it when locking the pte, and the anon_vma is
only removed once the pte is unlocked. So the handler is safe even if
the speculative page fault runs concurrently with do_munmap(): the pte is
locked in unmap_region() - through unmap_vmas() - and the anon_vma is
unlinked later, while the vma sequence counter is updated in
unmap_page_range() before the pte is locked, and again in free_pgtables(),
so the change will be detected when locking the pte.

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 423fa8ea0569..2cf7b6185daa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -377,7 +377,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 * Hide vma from rmap and truncate_pagecache before freeing
 * pgtables
 */
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
@@ -391,7 +393,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
   && !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
-- 
2.21.0



[PATCH v12 14/31] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2019-04-16 Thread Laurent Dufour
migrate_misplaced_page() is only called during page fault handling, so
it's better to pass a pointer to the struct vm_fault instead of the vma.

This way, during the speculative page fault path, the saved vma->vm_flags
can be used.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/linux/migrate.h | 4 ++--
 mm/memory.c | 2 +-
 mm/migrate.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf2f9a5..0197e40325f8 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -125,14 +125,14 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-struct vm_area_struct *vma, int node)
+struct vm_fault *vmf, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index d0de58464479..56802850e72c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3747,7 +3747,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, vma, target_nid);
+   migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index a9138093a8e2..633bd9abac54 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1938,7 +1938,7 @@ bool pmd_trans_migrating(pmd_t pmd)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
   int node)
 {
pg_data_t *pgdat = NODE_DATA(node);
@@ -1951,7 +1951,7 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
 * with execute permissions as they are probably shared libraries.
 */
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
-   (vma->vm_flags & VM_EXEC))
+   (vmf->vma_flags & VM_EXEC))
goto out;
 
/*
-- 
2.21.0



[PATCH v12 03/31] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-16 Thread Laurent Dufour
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for BOOK3S_64. This enables
the Speculative Page Fault handler.

Support is currently only provided for BOOK3S_64 because:
- CONFIG_PPC_STD_MMU is required because of the checks done in
  set_access_flags_filter()
- BOOK3S is required because we can't support book3e_hugetlb_preload(),
  which is called by update_mmu_cache()

Cc: Michael Ellerman 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d0be82c3061..a29887ea5383 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -238,6 +238,7 @@ config PPC
select PCI_SYSCALL  if PCI
select RTC_LIB
select SPARSE_IRQ
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select VIRT_TO_BUS  if !PPC64
-- 
2.21.0



[PATCH v12 16/31] mm: introduce __vm_normal_page()

2019-04-16 Thread Laurent Dufour
When dealing with the speculative fault path, we should use the VMA's
field values cached in the vm_fault structure.

Currently vm_normal_page() is using the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page() which
receives the vm_flags value as a parameter.

Note: The speculative path is only turned on for architectures providing
support for the special PTE flag, so only the first block of
vm_normal_page() is used during the speculative path.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 18 +++---
 mm/memory.c| 21 -
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f465bb2b049e..f14b2c9ddfd4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1421,9 +1421,21 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
 #endif
 }
 
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device);
-#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags);
+static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
+   unsigned long addr, pte_t pte,
+   bool with_public_device)
+{
+   return __vm_normal_page(vma, addr, pte, with_public_device,
+   vma->vm_flags);
+}
+static inline struct page *vm_normal_page(struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte)
+{
+   return _vm_normal_page(vma, addr, pte, false);
+}
 
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 85ec5ce5c0a8..be93f2c8ebe0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -533,7 +533,8 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
 }
 
 /*
- * vm_normal_page -- This function gets the "struct page" associated with a 
pte.
+ * __vm_normal_page -- This function gets the "struct page" associated with
+ * a pte.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
@@ -574,8 +575,9 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
  * PFNMAP mappings in order to support COWable mappings.
  *
  */
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags)
 {
unsigned long pfn = pte_pfn(pte);
 
@@ -584,7 +586,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
-   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+   if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
return NULL;
@@ -620,8 +622,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
 
/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
 
-   if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-   if (vma->vm_flags & VM_MIXEDMAP) {
+   if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+   if (vma_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
goto out;
@@ -630,7 +632,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off)
return NULL;
-   if (!is_cow_mapping(vma->vm_flags))
+   if (!is_cow_mapping(vma_flags))
return NULL;
}
}
@@ -2532,7 +2534,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
-   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+   vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
+vmf->vma_flags);
if (!vmf->page) {
/*
 * VM_MIXEDMAP !pfn_valid() 

[PATCH v12 27/31] mm: add speculative page fault vmstats

2019-04-16 Thread Laurent Dufour
Add a speculative_pgfault vmstat counter to count successfully handled
speculative page faults.

Also fix a minor typo in the CONFIG_VM_EVENT_COUNTERS comment in
mm/vmstat.c.

Signed-off-by: Laurent Dufour 
---
 include/linux/vm_event_item.h | 3 +++
 mm/memory.c   | 3 +++
 mm/vmstat.c   | 5 -
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..137666e91074 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -109,6 +109,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   SPECULATIVE_PGFAULT,
 #endif
NR_VM_EVENT_ITEMS
 };
diff --git a/mm/memory.c b/mm/memory.c
index 509851ad7c95..c65e8011d285 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4367,6 +4367,9 @@ vm_fault_t __handle_speculative_fault(struct mm_struct 
*mm,
 
put_vma(vma);
 
+   if (ret != VM_FAULT_RETRY)
+   count_vm_event(SPECULATIVE_PGFAULT);
+
/*
 * The task may have entered a memcg OOM situation but
 * if the allocation error was handled gracefully (no
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a7d493366a65..93f54b31e150 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1288,7 +1288,10 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
 #endif
-#endif /* CONFIG_VM_EVENTS_COUNTERS */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   "speculative_pgfault",
+#endif
+#endif /* CONFIG_VM_EVENT_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
 
-- 
2.21.0



[PATCH v12 13/31] mm: cache some VMA fields in the vm_fault structure

2019-04-16 Thread Laurent Dufour
When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released, so
there is no longer any guarantee that these fields will not change behind
our back. They will therefore be saved in the vm_fault structure before
the VMA is checked for changes.

In detail, when we deal with a speculative page fault, the mmap_sem is
not taken, so parallel VMA changes can occur. When a VMA change is done
which will impact the page fault processing, we assume that the VMA
sequence counter will be changed.  In the page fault processing, at the
time the PTE is locked, we check the VMA sequence counter to detect
changes done behind our back. If no change is detected we can continue
further. But this doesn't prevent the VMA from being changed behind our
back while the PTE is locked. So the VMA fields which are used while the
PTE is locked must be saved to ensure that we are using *static* values.
This is important since the PTE changes will be made with regard to these
VMA fields and they need to be consistent. This concerns the vma->vm_flags
and vma->vm_page_prot VMA fields.
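
A minimal sketch of how the speculative handler is expected to snapshot
these fields (assuming 'seq' holds the vm_sequence value sampled when the
VMA was looked up; the actual code lands in the speculative fault entry
point later in the series):

        /* snapshot the fields that will be used once the PTE is locked */
        vmf.vma_flags = READ_ONCE(vma->vm_flags);
        vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);

        /* if the VMA changed while reading them, fall back to the
         * classic, mmap_sem protected, fault path */
        if (read_seqcount_retry(&vma->vm_sequence, seq))
                return VM_FAULT_RETRY;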

This patch also sets the fields in hugetlb_no_page() and
__collapse_huge_page_swapin() even if they are not needed by the callee.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 10 +++--
 mm/huge_memory.c   |  6 +++---
 mm/hugetlb.c   |  2 ++
 mm/khugepaged.c|  2 ++
 mm/memory.c| 53 --
 mm/migrate.c   |  2 +-
 6 files changed, 44 insertions(+), 31 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d45b7d8718d..f465bb2b049e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -439,6 +439,12 @@ struct vm_fault {
 * page table to avoid allocation from
 * atomic context.
 */
+   /*
+* These entries are required when handling speculative page fault.
+* This way the page handling is done using consistent field values.
+*/
+   unsigned long vma_flags;
+   pgprot_t vma_page_prot;
 };
 
 /* page entry size for vm->huge_fault() */
@@ -781,9 +787,9 @@ void free_compound_page(struct page *page);
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
  */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
 {
-   if (likely(vma->vm_flags & VM_WRITE))
+   if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 823688414d27..865886a689ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1244,8 +1244,8 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct 
vm_fault *vmf,
 
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t entry;
-   entry = mk_pte(pages[i], vma->vm_page_prot);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = mk_pte(pages[i], vmf->vma_page_prot);
+   entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
@@ -2228,7 +2228,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
-   entry = maybe_mkwrite(entry, vma);
+   entry = maybe_mkwrite(entry, vma->vm_flags);
if (!write)
entry = pte_wrprotect(entry);
if (!young)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 109f5de82910..13246da4bc50 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3812,6 +3812,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
.vma = vma,
.address = haddr,
.flags = flags,
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
/*
 * Hard to debug if it ends up being
 * used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6a0cbca3885e..42469037240a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -888,6 +888,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
*mm,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
.pgoff = linear_page_index(vma, 

[PATCH v12 21/31] mm: Introduce find_vma_rcu()

2019-04-16 Thread Laurent Dufour
This allows searching for a VMA structure without holding the mmap_sem.

The search is repeated while the mm seqlock is changing, until a valid VMA
is found.

While under RCU protection, a reference is taken on the VMA, so the
caller must call put_vma() once it no longer needs the VMA structure.

At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
reference to the VMA is taken by calling get_vma().

When removing a VMA from the MM RB tree, the VMA is not released
immediately but at the end of the RCU grace period through vm_rcu_put().
This ensures that the VMA remains allocated until the end of the RCU grace
period.

Since the vm_file pointer, if valid, is released in put_vma(), there is no
guarantee that the file pointer will be valid on the returned VMA.
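
For illustration, the lookup loop looks roughly like this (a sketch only;
__find_vma_rbtree() stands for a helper walking the RB tree without
taking the mmap_sem):

        struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
                                            unsigned long addr)
        {
                struct vm_area_struct *vma = NULL;
                unsigned int seq;

                do {
                        if (vma)
                                put_vma(vma);

                        seq = read_seqbegin(&mm->mm_seq);

                        rcu_read_lock();
                        vma = __find_vma_rbtree(mm, addr);
                        if (vma)
                                get_vma(vma);
                        rcu_read_unlock();
                } while (read_seqretry(&mm->mm_seq, seq));

                return vma;
        }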

Signed-off-by: Laurent Dufour 
---
 include/linux/mm_types.h |  1 +
 mm/internal.h|  5 ++-
 mm/mmap.c| 76 ++--
 3 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a6159e11a3f..9af6694cb95d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -287,6 +287,7 @@ struct vm_area_struct {
 
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
atomic_t vm_ref_count;
+   struct rcu_head vm_rcu;
 #endif
struct rb_node vm_rb;
 
diff --git a/mm/internal.h b/mm/internal.h
index 302382bed406..1e368e4afe3c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
__free_vma(vma);
 }
 
-#else
+extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
+  unsigned long addr);
+
+#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
 
 static inline void get_vma(struct vm_area_struct *vma)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index c106440dcae7..34bf261dc2c8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
 {
write_sequnlock(&mm->mm_seq);
 }
+
+static void __vm_rcu_put(struct rcu_head *head)
+{
+   struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+   put_vma(vma);
+}
+static void vm_rcu_put(struct vm_area_struct *vma)
+{
+   VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
+   call_rcu(&vma->vm_rcu, __vm_rcu_put);
+}
 #else
 static inline void mm_write_seqlock(struct mm_struct *mm)
 {
@@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
 
 void __free_vma(struct vm_area_struct *vma)
 {
+   if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
+   VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
mpol_put(vma_policy(vma));
vm_area_free(vma);
 }
@@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
 /*
  * Close a vm structure and free it, returning the next.
  */
-static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
 {
struct vm_area_struct *next = vma->vm_next;
 
might_sleep();
+   if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
+   !RB_EMPTY_NODE(&vma->vm_rb)) {
+   /*
+* If the VMA is still linked in the RB tree, we must release
+* that reference by calling put_vma().
+* This should only happen when called from exit_mmap().
+* We forcibly clear the node to satisfy the check in
+* __free_vma(). This is safe since the RB tree is not walked
+* anymore.
+*/
+   RB_CLEAR_NODE(&vma->vm_rb);
+   put_vma(vma);
+   }
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
if (vma->vm_file)
@@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct 
vm_area_struct *vma)
return next;
 }
 
+static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+{
+   if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
+   VM_BUG_ON_VMA(!RB_EMPTY_NODE(>vm_rb), vma);
+   return __remove_vma(vma);
+}
+
 static int do_brk_flags(unsigned long addr, unsigned long request, unsigned 
long flags,
struct list_head *uf);
 SYSCALL_DEFINE1(brk, unsigned long, brk)
@@ -475,7 +509,7 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
 
/* All rb_subtree_gap values must be consistent prior to insertion */
validate_mm_rb(root, NULL);
-
+   get_vma(vma);
rb_insert_augmented(>vm_rb, root, _gap_callbacks);
 }
 
@@ -491,6 +525,14 @@ static void __vma_rb_erase(struct vm_area_struct *vma, 
struct mm_struct *mm)
mm_write_seqlock(mm);
rb_erase_augmented(>vm_rb, root, _gap_callbacks);
mm_write_sequnlock(mm); /* wmb */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   /*
+* Ensure the removal is 

[PATCH v12 01/31] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2019-04-16 Thread Laurent Dufour
This configuration variable will be used to build the code needed to
handle speculative page faults.

By default it is turned off, and it is activated depending on architecture
support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.

The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to invoke the speculative handler.

The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
does processing that is not compatible with the speculative handling in the
case ARCH_HAS_PTE_SPECIAL is not set.

Suggested-by: Thomas Gleixner 
Suggested-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 mm/Kconfig | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 0eada3f818fa..ff278ac9978a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -761,4 +761,26 @@ config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
bool
 
+config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+   def_bool n
+
+config SPECULATIVE_PAGE_FAULT
+   bool "Speculative page faults"
+   default y
+   depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+   depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
+   help
+ Try to handle user space page faults without holding the mmap_sem.
+
+ This should allow better concurrency for massively threaded processes
+ since the page fault handler will not wait for another thread's memory
+ layout change to be done, assuming that this change is done in
+ another part of the process's memory space. This type of page fault
+ is named speculative page fault.
+
+ If the speculative page fault fails because a concurrent modification
+ is detected or because underlying PMD or PTE tables are not yet
+ allocated, the speculative handling fails and a classic page fault
+ is then tried.
+
 endmenu
-- 
2.21.0



Re: [PATCH v2 00/21] Convert hwmon documentation to ReST

2019-04-16 Thread Guenter Roeck
On Tue, Apr 16, 2019 at 02:19:49PM -0600, Jonathan Corbet wrote:
> On Fri, 12 Apr 2019 20:09:16 -0700
> Guenter Roeck  wrote:
> 
> > The big real-world question is: Is the series good enough for you to accept,
> > or do you expect some level of user/kernel separation ?
> 
> I guess it can go in; it's forward progress, even if it doesn't make the
> improvements I would like to see.
> 
> The real question, I guess, is who should take it.  I've been seeing a
> fair amount of activity on hwmon, so I suspect that the potential for
> conflicts is real.  Perhaps things would go smoother if it went through
> your tree?
> 
We'll see a number of conflicts, yes. In terms of timing, this is probably
the worst release in the last few years to make such a change. I currently
have 9 patches queued in hwmon-next which touch Documentation/hwmon.
Of course the changes made in those are all not ReST compatible, and I have
no idea what to look out for to make it compatible. So this is going to be
fun (in a negative sense) either way.

I don't really have a recommendation at this point; I think the best I could
do is to take the patches which don't generate conflicts and leave the rest
alone. But that would also be bad, since the new index file would not match
reality. No idea, really, what the best or even a useful approach would be.

Maybe automated changes like this (assuming they are indeed automated)
can be generated and pushed right after a commit window closes. Would
that by any chance be possible?

Guenter


[PATCH v3 10/26] compat_ioctl: use correct compat_ptr() translation in drivers

2019-04-16 Thread Arnd Bergmann
A handful of drivers all have a trivial wrapper around their ioctl
handler, but don't call the compat_ptr() conversion function at the
moment. In practice this does not matter, since none of them are used
on the s390 architecture and for all other architectures, compat_ptr()
does not do anything, but using the new compat_ptr_ioctl()
helper makes it more correct in theory, and simplifies the code.
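
For reference, the helper added earlier in this series boils down to a sketch
like the following (shown here only to make the conversions below easier to
follow):

	long compat_ptr_ioctl(struct file *file, unsigned int cmd,
			      unsigned long arg)
	{
		if (!file->f_op->unlocked_ioctl)
			return -ENOIOCTLCMD;

		/* convert the 32-bit pointer and forward to the native handler */
		return file->f_op->unlocked_ioctl(file, cmd,
						  (unsigned long)compat_ptr(arg));
	}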

Acked-by: Greg Kroah-Hartman 
Acked-by: Andrew Donnellan 
Acked-by: Felipe Balbi 
Signed-off-by: Arnd Bergmann 
---
 drivers/misc/cxl/flash.c|  8 +---
 drivers/misc/genwqe/card_dev.c  | 23 +--
 drivers/scsi/megaraid/megaraid_mm.c | 28 +---
 drivers/usb/gadget/function/f_fs.c  | 12 +---
 4 files changed, 4 insertions(+), 67 deletions(-)

diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c
index 4d6836f19489..cb9cca35a226 100644
--- a/drivers/misc/cxl/flash.c
+++ b/drivers/misc/cxl/flash.c
@@ -473,12 +473,6 @@ static long device_ioctl(struct file *file, unsigned int 
cmd, unsigned long arg)
return -EINVAL;
 }
 
-static long device_compat_ioctl(struct file *file, unsigned int cmd,
-   unsigned long arg)
-{
-   return device_ioctl(file, cmd, arg);
-}
-
 static int device_close(struct inode *inode, struct file *file)
 {
struct cxl *adapter = file->private_data;
@@ -514,7 +508,7 @@ static const struct file_operations fops = {
.owner  = THIS_MODULE,
.open   = device_open,
.unlocked_ioctl = device_ioctl,
-   .compat_ioctl   = device_compat_ioctl,
+   .compat_ioctl   = compat_ptr_ioctl,
.release= device_close,
 };
 
diff --git a/drivers/misc/genwqe/card_dev.c b/drivers/misc/genwqe/card_dev.c
index 8c1b63a4337b..5de0796f2786 100644
--- a/drivers/misc/genwqe/card_dev.c
+++ b/drivers/misc/genwqe/card_dev.c
@@ -1221,34 +1221,13 @@ static long genwqe_ioctl(struct file *filp, unsigned 
int cmd,
return rc;
 }
 
-#if defined(CONFIG_COMPAT)
-/**
- * genwqe_compat_ioctl() - Compatibility ioctl
- *
- * Called whenever a 32-bit process running under a 64-bit kernel
- * performs an ioctl on /dev/genwqe_card.
- *
- * @filp:file pointer.
- * @cmd: command.
- * @arg: user argument.
- * Return:   zero on success or negative number on failure.
- */
-static long genwqe_compat_ioctl(struct file *filp, unsigned int cmd,
-   unsigned long arg)
-{
-   return genwqe_ioctl(filp, cmd, arg);
-}
-#endif /* defined(CONFIG_COMPAT) */
-
 static const struct file_operations genwqe_fops = {
.owner  = THIS_MODULE,
.open   = genwqe_open,
.fasync = genwqe_fasync,
.mmap   = genwqe_mmap,
.unlocked_ioctl = genwqe_ioctl,
-#if defined(CONFIG_COMPAT)
-   .compat_ioctl   = genwqe_compat_ioctl,
-#endif
+   .compat_ioctl   = compat_ptr_ioctl,
.release= genwqe_release,
 };
 
diff --git a/drivers/scsi/megaraid/megaraid_mm.c 
b/drivers/scsi/megaraid/megaraid_mm.c
index 3ce837e4b24c..21ee5751c04e 100644
--- a/drivers/scsi/megaraid/megaraid_mm.c
+++ b/drivers/scsi/megaraid/megaraid_mm.c
@@ -45,10 +45,6 @@ static int mraid_mm_setup_dma_pools(mraid_mmadp_t *);
 static void mraid_mm_free_adp_resources(mraid_mmadp_t *);
 static void mraid_mm_teardown_dma_pools(mraid_mmadp_t *);
 
-#ifdef CONFIG_COMPAT
-static long mraid_mm_compat_ioctl(struct file *, unsigned int, unsigned long);
-#endif
-
 MODULE_AUTHOR("LSI Logic Corporation");
 MODULE_DESCRIPTION("LSI Logic Management Module");
 MODULE_LICENSE("GPL");
@@ -72,9 +68,7 @@ static wait_queue_head_t wait_q;
 static const struct file_operations lsi_fops = {
.open   = mraid_mm_open,
.unlocked_ioctl = mraid_mm_unlocked_ioctl,
-#ifdef CONFIG_COMPAT
-   .compat_ioctl = mraid_mm_compat_ioctl,
-#endif
+   .compat_ioctl = compat_ptr_ioctl,
.owner  = THIS_MODULE,
.llseek = noop_llseek,
 };
@@ -228,7 +222,6 @@ mraid_mm_unlocked_ioctl(struct file *filep, unsigned int 
cmd,
 {
int err;
 
-   /* inconsistent: mraid_mm_compat_ioctl doesn't take the BKL */
mutex_lock(_mm_mutex);
err = mraid_mm_ioctl(filep, cmd, arg);
mutex_unlock(_mm_mutex);
@@ -1232,25 +1225,6 @@ mraid_mm_init(void)
 }
 
 
-#ifdef CONFIG_COMPAT
-/**
- * mraid_mm_compat_ioctl   - 32bit to 64bit ioctl conversion routine
- * @filep  : file operations pointer (ignored)
- * @cmd: ioctl command
- * @arg: user ioctl packet
- */
-static long
-mraid_mm_compat_ioctl(struct file *filep, unsigned int cmd,
- unsigned long arg)
-{
-   int err;
-
-   err = mraid_mm_ioctl(filep, cmd, arg);
-
-   return err;
-}
-#endif
-
 /**
  * mraid_mm_exit   - Module exit point
  */
diff --git a/drivers/usb/gadget/function/f_fs.c 
b/drivers/usb/gadget/function/f_fs.c
index 

[PATCH v3 00/26] compat_ioctl: cleanups

2019-04-16 Thread Arnd Bergmann
Hi Al,

It took me way longer than I had hoped to revisit this series, see
https://lore.kernel.org/lkml/20180912150142.157913-1-a...@arndb.de/
for the previously posted version.

I've come to the point where all conversion handlers and most
COMPATIBLE_IOCTL() entries are gone from this file, but for
now, this series only has the parts that have either been reviewed
previously, or that are simple enough to include.

The main missing piece is the SG_IO/SG_GET_REQUEST_TABLE conversion.
I'll post the patches I made for that later, as they need more
testing and review from the scsi maintainers.

I hope you can still take these for the coming merge window, unless
new problems come up.

  Arnd

Arnd Bergmann (26):
  compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
  compat_ioctl: move simple ppp command handling into driver
  compat_ioctl: avoid unused function warning for do_ioctl
  compat_ioctl: move PPPIOCSCOMPRESS32 to ppp-generic.c
  compat_ioctl: move PPPIOCSPASS32/PPPIOCSACTIVE32 to ppp_generic.c
  compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
  compat_ioctl: move rtc handling into rtc-dev.c
  compat_ioctl: add compat_ptr_ioctl()
  compat_ioctl: move drivers to compat_ptr_ioctl
  compat_ioctl: use correct compat_ptr() translation in drivers
  ceph: fix compat_ioctl for ceph_dir_operations
  compat_ioctl: move more drivers to compat_ptr_ioctl
  compat_ioctl: move tape handling into drivers
  compat_ioctl: move ATYFB_CLK handling to atyfb driver
  compat_ioctl: move isdn/capi ioctl translation into driver
  compat_ioctl: move rfcomm handlers into driver
  compat_ioctl: move hci_sock handlers into driver
  compat_ioctl: remove HCIUART handling
  compat_ioctl: remove HIDIO translation
  compat_ioctl: remove translation for sound ioctls
  compat_ioctl: remove IGNORE_IOCTL()
  compat_ioctl: remove /dev/random commands
  compat_ioctl: remove joystick ioctl translation
  compat_ioctl: remove PCI ioctl translation
  compat_ioctl: remove /dev/raw ioctl translation
  compat_ioctl: remove last RAID handling code

 Documentation/networking/ppp_generic.txt|   2 +
 arch/um/drivers/hostaudio_kern.c|   1 +
 drivers/android/binder.c|   2 +-
 drivers/char/ppdev.c|  12 +-
 drivers/char/random.c   |   1 +
 drivers/char/tpm/tpm_vtpm_proxy.c   |  12 +-
 drivers/crypto/qat/qat_common/adf_ctl_drv.c |   2 +-
 drivers/dma-buf/dma-buf.c   |   4 +-
 drivers/dma-buf/sw_sync.c   |   2 +-
 drivers/dma-buf/sync_file.c |   2 +-
 drivers/firewire/core-cdev.c|  12 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c|   2 +-
 drivers/hid/hidraw.c|   4 +-
 drivers/hid/usbhid/hiddev.c |  11 +-
 drivers/hwtracing/stm/core.c|  12 +-
 drivers/ide/ide-tape.c  |  31 +-
 drivers/iio/industrialio-core.c |   2 +-
 drivers/infiniband/core/uverbs_main.c   |   4 +-
 drivers/isdn/capi/capi.c|  31 +
 drivers/isdn/i4l/isdn_ppp.c |  14 +-
 drivers/media/rc/lirc_dev.c |   4 +-
 drivers/mfd/cros_ec_dev.c   |   4 +-
 drivers/misc/cxl/flash.c|   8 +-
 drivers/misc/genwqe/card_dev.c  |  23 +-
 drivers/misc/mei/main.c |  22 +-
 drivers/misc/vmw_vmci/vmci_host.c   |   2 +-
 drivers/mtd/ubi/cdev.c  |  36 +-
 drivers/net/ppp/ppp_generic.c   |  99 +++-
 drivers/net/ppp/pppoe.c |   7 +
 drivers/net/ppp/pptp.c  |   3 +
 drivers/net/tap.c   |  12 +-
 drivers/nvdimm/bus.c|   4 +-
 drivers/nvme/host/core.c|   2 +-
 drivers/pci/switch/switchtec.c  |   2 +-
 drivers/platform/x86/wmi.c  |   2 +-
 drivers/rpmsg/rpmsg_char.c  |   4 +-
 drivers/rtc/dev.c   |  13 +-
 drivers/rtc/rtc-vr41xx.c|  10 +
 drivers/s390/char/tape_char.c   |  41 +-
 drivers/sbus/char/display7seg.c |   2 +-
 drivers/sbus/char/envctrl.c |   4 +-
 drivers/scsi/3w-.c  |   4 +-
 drivers/scsi/cxlflash/main.c|   2 +-
 drivers/scsi/esas2r/esas2r_main.c   |   2 +-
 drivers/scsi/megaraid/megaraid_mm.c |  28 +-
 drivers/scsi/osst.c |  34 +-
 drivers/scsi/pmcraid.c  |   4 +-
 drivers/scsi/st.c   |  35 +-
 drivers/staging/android/ion/ion.c   |   4 +-
 drivers/staging/pi433/pi433_if.c|  12 +-
 drivers/staging/vme/devices/vme_user.c  |   2 +-
 drivers/tee/tee_core.c  |   2 +-
 drivers/usb/class/cdc-wdm.c |   2 +-
 drivers/usb/class/usbtmc.c  |   4 +-
 drivers/usb/core/devio.c   

Re: [PATCH v2 5/5] arm64/speculation: Support 'mitigations=' cmdline option

2019-04-16 Thread Josh Poimboeuf
On Tue, Apr 16, 2019 at 09:26:13PM +0200, Thomas Gleixner wrote:
> On Fri, 12 Apr 2019, Josh Poimboeuf wrote:
> 
> > Configure arm64 runtime CPU speculation bug mitigations in accordance
> > with the 'mitigations=' cmdline option.  This affects Meltdown, Spectre
> > v2, and Speculative Store Bypass.
> > 
> > The default behavior is unchanged.
> > 
> > Signed-off-by: Josh Poimboeuf 
> > ---
> > NOTE: This is based on top of Jeremy Linton's patches:
> >   https://lkml.kernel.org/r/20190410231237.52506-1-jeremy.lin...@arm.com
> 
> So I keep that out and we have to revisit that once the ARM64 stuff hits a
> tree, right? I can have a branch with just the first 4 patches applied
> which ARM64 folks can pull in when they apply Jeremy's patches before the
> merge window.

Sounds good to me (though I guess it's up to the arm64 maintainers how
they want to handle the dependencies).

-- 
Josh


Re: [PATCH v2 00/21] Convert hwmon documentation to ReST

2019-04-16 Thread Jonathan Corbet
On Fri, 12 Apr 2019 20:09:16 -0700
Guenter Roeck  wrote:

> The big real-world question is: Is the series good enough for you to accept,
> or do you expect some level of user/kernel separation ?

I guess it can go in; it's forward progress, even if it doesn't make the
improvements I would like to see.

The real question, I guess, is who should take it.  I've been seeing a
fair amount of activity on hwmon, so I suspect that the potential for
conflicts is real.  Perhaps things would go smoother if it went through
your tree?

Thanks,

jon


Re: [PATCH v2 5/5] arm64/speculation: Support 'mitigations=' cmdline option

2019-04-16 Thread Thomas Gleixner
On Fri, 12 Apr 2019, Josh Poimboeuf wrote:

> Configure arm64 runtime CPU speculation bug mitigations in accordance
> with the 'mitigations=' cmdline option.  This affects Meltdown, Spectre
> v2, and Speculative Store Bypass.
> 
> The default behavior is unchanged.
> 
> Signed-off-by: Josh Poimboeuf 
> ---
> NOTE: This is based on top of Jeremy Linton's patches:
>   https://lkml.kernel.org/r/20190410231237.52506-1-jeremy.lin...@arm.com

So I keep that out and we have to revisit that once the ARM64 stuff hits a
tree, right? I can have a branch with just the first 4 patches applied
which ARM64 folks can pull in when they apply Jeremy's patches before the
merge window.

Thanks,

tglx


Re: [PATCH v2 1/5] arm64: Fix vDSO clock_getres()

2019-04-16 Thread Will Deacon
On Tue, Apr 16, 2019 at 05:24:33PM +0100, Catalin Marinas wrote:
> On Tue, Apr 16, 2019 at 05:14:30PM +0100, Vincenzo Frascino wrote:
> > diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> > index 2d419006ad43..5f5759d51c33 100644
> > --- a/arch/arm64/kernel/vdso.c
> > +++ b/arch/arm64/kernel/vdso.c
> > @@ -245,6 +245,8 @@ void update_vsyscall(struct timekeeper *tk)
> > vdso_data->cs_shift = tk->tkr_mono.shift;
> > }
> >  
> > +   vdso_data->hrtimer_res  = hrtimer_resolution;
> 
> This should be a WRITE_ONCE(), just in case.
> 
> > +
> > smp_wmb();
> > ++vdso_data->tb_seq_count;
> >  }
> > diff --git a/arch/arm64/kernel/vdso/gettimeofday.S 
> > b/arch/arm64/kernel/vdso/gettimeofday.S
> > index c39872a7b03c..e2e9dfe9ba4a 100644
> > --- a/arch/arm64/kernel/vdso/gettimeofday.S
> > +++ b/arch/arm64/kernel/vdso/gettimeofday.S
> > @@ -296,32 +296,32 @@ ENDPROC(__kernel_clock_gettime)
> >  /* int __kernel_clock_getres(clockid_t clock_id, struct timespec *res); */
> >  ENTRY(__kernel_clock_getres)
> > .cfi_startproc
> > +   adr vdso_data, _vdso_data
> > cmp w0, #CLOCK_REALTIME
> > ccmpw0, #CLOCK_MONOTONIC, #0x4, ne
> > ccmpw0, #CLOCK_MONOTONIC_RAW, #0x4, ne
> > -   b.ne1f
> > +   b.ne2f
> >  
> > -   ldr x2, 5f
> > -   b   2f
> > -1:
> > +1: /* Get hrtimer_res */
> > +   ldr x2, [vdso_data, #CLOCK_REALTIME_RES]
> 
> And here we need an "ldr w2, ..." since hrtimer_res is u32.
> 
> With the above (which Will can fix up):
> 
> Reviewed-by: Catalin Marinas 

Applied, with the above and a few extra cleanups.

Will


Re: [PATCH v2 5/5] kselftest: Extend vDSO selftest to clock_getres

2019-04-16 Thread Will Deacon
On Tue, Apr 16, 2019 at 05:14:34PM +0100, Vincenzo Frascino wrote:
> The current version of the multiarch vDSO selftest verifies only
> gettimeofday.
> 
> Extend the vDSO selftest to clock_getres, to verify that the
> syscall and the vDSO library function return the same information.
> 
> The extension has been used to verify the hrtimer_resolution fix.
> 
> Cc: Shuah Khan 
> Signed-off-by: Vincenzo Frascino 
> ---
>  tools/testing/selftests/vDSO/Makefile |   2 +
>  .../selftests/vDSO/vdso_clock_getres.c| 108 ++
>  2 files changed, 110 insertions(+)
>  create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c

Assuming this will go via Shuah's tree.

Will


Re: Linux 5.1-rc5

2019-04-16 Thread Linus Torvalds
On Tue, Apr 16, 2019 at 9:16 AM Linus Torvalds
 wrote:
>
> We actually already *have* this function.
>
> It's called "gup_fast_permitted()" and it's used by x86-64 to verify
> the proper address range. Exactly like s390 needs..
>
> Could you please use that instead?

IOW, something like the attached.

Obviously untested. And maybe 'current' isn't declared in
, in which case you'd need to modify it to instead make
the inline function be "s390_gup_fast_permitted()" that takes a
pointer to the mm, and do something like

  #define gup_fast_permitted(start, pages) \
 s390_gup_fast_permitted(current->mm, start, pages)

instead.
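
Spelled out, that variant would look roughly like this (a sketch combining
the macro above with the range check from the attached patch; untested, as
noted):

	static inline bool s390_gup_fast_permitted(struct mm_struct *mm,
						   unsigned long start, int nr_pages)
	{
		unsigned long len = (unsigned long)nr_pages << PAGE_SHIFT;
		unsigned long end = start + len;

		if (end < start)
			return false;
		return end <= mm->context.asce_limit;
	}

	/* expanded at the call site in mm/gup.c, where 'current' is available */
	#define gup_fast_permitted(start, pages) \
		s390_gup_fast_permitted(current->mm, start, pages)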

But I think you get the idea..

Linus
 arch/s390/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 76dc344edb8c..a08248995f50 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1659,4 +1659,16 @@ static inline void check_pgt_cache(void) { }
 
 #include 
 
+static inline bool gup_fast_permitted(unsigned long start, int nr_pages)
+{
+	unsigned long len, end;
+
+	len = (unsigned long)nr_pages << PAGE_SHIFT;
+	end = start + len;
+	if (end < start)
+		return false;
+	return end <= current->mm->context.asce_limit;
+}
+#define gup_fast_permitted gup_fast_permitted
+
 #endif /* _S390_PAGE_H */


Re: [PATCH v2 1/5] arm64: Fix vDSO clock_getres()

2019-04-16 Thread Catalin Marinas
On Tue, Apr 16, 2019 at 05:14:30PM +0100, Vincenzo Frascino wrote:
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index 2d419006ad43..5f5759d51c33 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -245,6 +245,8 @@ void update_vsyscall(struct timekeeper *tk)
>   vdso_data->cs_shift = tk->tkr_mono.shift;
>   }
>  
> + vdso_data->hrtimer_res  = hrtimer_resolution;

This should be a WRITE_ONCE(), just in case.

> +
>   smp_wmb();
>   ++vdso_data->tb_seq_count;
>  }
> diff --git a/arch/arm64/kernel/vdso/gettimeofday.S 
> b/arch/arm64/kernel/vdso/gettimeofday.S
> index c39872a7b03c..e2e9dfe9ba4a 100644
> --- a/arch/arm64/kernel/vdso/gettimeofday.S
> +++ b/arch/arm64/kernel/vdso/gettimeofday.S
> @@ -296,32 +296,32 @@ ENDPROC(__kernel_clock_gettime)
>  /* int __kernel_clock_getres(clockid_t clock_id, struct timespec *res); */
>  ENTRY(__kernel_clock_getres)
>   .cfi_startproc
> + adr vdso_data, _vdso_data
>   cmp w0, #CLOCK_REALTIME
>   ccmpw0, #CLOCK_MONOTONIC, #0x4, ne
>   ccmpw0, #CLOCK_MONOTONIC_RAW, #0x4, ne
> - b.ne1f
> + b.ne2f
>  
> - ldr x2, 5f
> - b   2f
> -1:
> +1:   /* Get hrtimer_res */
> + ldr x2, [vdso_data, #CLOCK_REALTIME_RES]

And here we need an "ldr w2, ..." since hrtimer_res is u32.

With the above (which Will can fix up):

Reviewed-by: Catalin Marinas 


Re: Linux 5.1-rc5

2019-04-16 Thread Linus Torvalds
On Tue, Apr 16, 2019 at 5:08 AM Martin Schwidefsky
 wrote:
>
> This is not nice, would a patch like the following be acceptable?

Umm.

We actually already *have* this function.

It's called "gup_fast_permitted()" and it's used by x86-64 to verify
the proper address range. Exactly like s390 needs..

Could you please use that instead?

Linus


[PATCH v2 5/5] kselftest: Extend vDSO selftest to clock_getres

2019-04-16 Thread Vincenzo Frascino
The current version of the multiarch vDSO selftest verifies only
gettimeofday.

Extend the vDSO selftest to clock_getres, to verify that the
syscall and the vDSO library function return the same information.

The extension has been used to verify the hrtimer_resolution fix.

Cc: Shuah Khan 
Signed-off-by: Vincenzo Frascino 
---
 tools/testing/selftests/vDSO/Makefile |   2 +
 .../selftests/vDSO/vdso_clock_getres.c| 108 ++
 2 files changed, 110 insertions(+)
 create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c

diff --git a/tools/testing/selftests/vDSO/Makefile 
b/tools/testing/selftests/vDSO/Makefile
index 9e03d61f52fd..d5c5bfdf1ac1 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -5,6 +5,7 @@ uname_M := $(shell uname -m 2>/dev/null || echo not)
 ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
 TEST_GEN_PROGS := $(OUTPUT)/vdso_test
+TEST_GEN_PROGS += $(OUTPUT)/vdso_clock_getres
 ifeq ($(ARCH),x86)
 TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
 endif
@@ -18,6 +19,7 @@ endif
 
 all: $(TEST_GEN_PROGS)
 $(OUTPUT)/vdso_test: parse_vdso.c vdso_test.c
+$(OUTPUT)/vdso_clock_getres: vdso_clock_getres.c
 $(OUTPUT)/vdso_standalone_test_x86: vdso_standalone_test_x86.c parse_vdso.c
$(CC) $(CFLAGS) $(CFLAGS_vdso_standalone_test_x86) \
vdso_standalone_test_x86.c parse_vdso.c \
diff --git a/tools/testing/selftests/vDSO/vdso_clock_getres.c 
b/tools/testing/selftests/vDSO/vdso_clock_getres.c
new file mode 100644
index ..b1b9652972eb
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_clock_getres.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * vdso_clock_getres.c: Sample code to test clock_getres.
+ * Copyright (c) 2019 Arm Ltd.
+ *
+ * Compile with:
+ * gcc -std=gnu99 vdso_clock_getres.c
+ *
+ * Tested on ARM, ARM64, MIPS32, x86 (32-bit and 64-bit),
+ * Power (32-bit and 64-bit), S390x (32-bit and 64-bit).
+ * Might work on other architectures.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+static long syscall_clock_getres(clockid_t _clkid, struct timespec *_ts)
+{
+   long ret;
+
+   ret = syscall(SYS_clock_getres, _clkid, _ts);
+
+   return ret;
+}
+
+const char *vdso_clock_name[12] = {
+   "CLOCK_REALTIME",
+   "CLOCK_MONOTONIC",
+   "CLOCK_PROCESS_CPUTIME_ID",
+   "CLOCK_THREAD_CPUTIME_ID",
+   "CLOCK_MONOTONIC_RAW",
+   "CLOCK_REALTIME_COARSE",
+   "CLOCK_MONOTONIC_COARSE",
+   "CLOCK_BOOTTIME",
+   "CLOCK_REALTIME_ALARM",
+   "CLOCK_BOOTTIME_ALARM",
+   "CLOCK_SGI_CYCLE",
+   "CLOCK_TAI",
+};
+
+/*
+ * Macro to call clock_getres in vdso and by system call
+ * with different values for clock_id.
+ */
+#define vdso_test_clock(clock_id)  \
+do {   \
+   struct timespec x, y;   \
+   printf("clock_id: %s", vdso_clock_name[clock_id]);  \
+   clock_getres(clock_id, &x); \
+   syscall_clock_getres(clock_id, &y); \
+   if ((x.tv_sec != y.tv_sec) || (x.tv_nsec != y.tv_nsec)) { \
+   printf(" [FAIL]\n");\
+   return KSFT_SKIP;   \
+   } else {\
+   printf(" [PASS]\n");\
+   }   \
+} while (0)
+
+int main(int argc, char **argv)
+{
+
+#if _POSIX_TIMERS > 0
+
+#ifdef CLOCK_REALTIME
+   vdso_test_clock(CLOCK_REALTIME);
+#endif
+
+#ifdef CLOCK_BOOTTIME
+   vdso_test_clock(CLOCK_BOOTTIME);
+#endif
+
+#ifdef CLOCK_TAI
+   vdso_test_clock(CLOCK_TAI);
+#endif
+
+#ifdef CLOCK_REALTIME_COARSE
+   vdso_test_clock(CLOCK_REALTIME_COARSE);
+#endif
+
+#ifdef CLOCK_MONOTONIC
+   vdso_test_clock(CLOCK_MONOTONIC);
+#endif
+
+#ifdef CLOCK_MONOTONIC_RAW
+   vdso_test_clock(CLOCK_MONOTONIC_RAW);
+#endif
+
+#ifdef CLOCK_MONOTONIC_COARSE
+   vdso_test_clock(CLOCK_MONOTONIC_COARSE);
+#endif
+
+#endif
+
+   return 0;
+}
-- 
2.21.0



[PATCH v2 4/5] nds32: Fix vDSO clock_getres()

2019-04-16 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
as posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on whether the high resolution timers
are enabled, which can happen either at compile time or at run time.

Fix the nds32 vdso implementation of clock_getres, keeping a copy of
hrtimer_resolution in the vdso data page and using that directly.

Cc: Greentime Hu 
Cc: Vincent Chen 
Signed-off-by: Vincenzo Frascino 
---
 arch/nds32/include/asm/vdso_datapage.h | 1 +
 arch/nds32/kernel/vdso.c   | 1 +
 arch/nds32/kernel/vdso/gettimeofday.c  | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/nds32/include/asm/vdso_datapage.h 
b/arch/nds32/include/asm/vdso_datapage.h
index 79db5a12ca5e..34d80548297f 100644
--- a/arch/nds32/include/asm/vdso_datapage.h
+++ b/arch/nds32/include/asm/vdso_datapage.h
@@ -20,6 +20,7 @@ struct vdso_data {
u32 xtime_clock_sec;/* CLOCK_REALTIME - seconds */
u32 cs_mult;/* clocksource multiplier */
u32 cs_shift;   /* Cycle to nanosecond divisor (power of two) */
+   u32 hrtimer_res;/* hrtimer resolution */
 
u64 cs_cycle_last;  /* last cycle value */
u64 cs_mask;/* clocksource mask */
diff --git a/arch/nds32/kernel/vdso.c b/arch/nds32/kernel/vdso.c
index 016f15891f6d..90bcae6f8554 100644
--- a/arch/nds32/kernel/vdso.c
+++ b/arch/nds32/kernel/vdso.c
@@ -220,6 +220,7 @@ void update_vsyscall(struct timekeeper *tk)
vdso_data->xtime_coarse_sec = tk->xtime_sec;
vdso_data->xtime_coarse_nsec = tk->tkr_mono.xtime_nsec >>
tk->tkr_mono.shift;
+   vdso_data->hrtimer_res = hrtimer_resolution;
vdso_write_end(vdso_data);
 }
 
diff --git a/arch/nds32/kernel/vdso/gettimeofday.c 
b/arch/nds32/kernel/vdso/gettimeofday.c
index 038721af40e3..b02581891c33 100644
--- a/arch/nds32/kernel/vdso/gettimeofday.c
+++ b/arch/nds32/kernel/vdso/gettimeofday.c
@@ -208,6 +208,8 @@ static notrace int clock_getres_fallback(clockid_t _clk_id,
 
 notrace int __vdso_clock_getres(clockid_t clk_id, struct timespec *res)
 {
+   struct vdso_data *vdata = __get_datapage();
+
if (res == NULL)
return 0;
switch (clk_id) {
@@ -215,7 +217,7 @@ notrace int __vdso_clock_getres(clockid_t clk_id, struct 
timespec *res)
case CLOCK_MONOTONIC:
case CLOCK_MONOTONIC_RAW:
res->tv_sec = 0;
-   res->tv_nsec = CLOCK_REALTIME_RES;
+   res->tv_nsec = vdata->hrtimer_res;
break;
case CLOCK_REALTIME_COARSE:
case CLOCK_MONOTONIC_COARSE:
-- 
2.21.0



[PATCH v2 3/5] s390: Fix vDSO clock_getres()

2019-04-16 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
as posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on whether the high resolution timers
are enabled, which can happen either at compile time or at run time.

Fix the s390 vdso implementation of clock_getres, keeping a copy of
hrtimer_resolution in the vdso data page and using that directly.

Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Signed-off-by: Vincenzo Frascino 
Acked-by: Martin Schwidefsky 
---
 arch/s390/include/asm/vdso.h   |  1 +
 arch/s390/kernel/asm-offsets.c |  2 +-
 arch/s390/kernel/time.c|  1 +
 arch/s390/kernel/vdso32/clock_getres.S | 12 +++-
 arch/s390/kernel/vdso64/clock_getres.S | 10 +-
 5 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h
index 169d7604eb80..f3ba84fa9bd1 100644
--- a/arch/s390/include/asm/vdso.h
+++ b/arch/s390/include/asm/vdso.h
@@ -36,6 +36,7 @@ struct vdso_data {
__u32 tk_shift; /* Shift used for xtime_nsec0x60 */
__u32 ts_dir;   /* TOD steering direction   0x64 */
__u64 ts_end;   /* TOD steering end 0x68 */
+   __u32 hrtimer_res;  /* hrtimer resolution   0x70 */
 };
 
 struct vdso_per_cpu_data {
diff --git a/arch/s390/kernel/asm-offsets.c b/arch/s390/kernel/asm-offsets.c
index 164bec175628..36db4a9ee703 100644
--- a/arch/s390/kernel/asm-offsets.c
+++ b/arch/s390/kernel/asm-offsets.c
@@ -75,6 +75,7 @@ int main(void)
OFFSET(__VDSO_TK_SHIFT, vdso_data, tk_shift);
OFFSET(__VDSO_TS_DIR, vdso_data, ts_dir);
OFFSET(__VDSO_TS_END, vdso_data, ts_end);
+   OFFSET(__VDSO_CLOCK_REALTIME_RES, vdso_data, hrtimer_res);
OFFSET(__VDSO_ECTG_BASE, vdso_per_cpu_data, ectg_timer_base);
OFFSET(__VDSO_ECTG_USER, vdso_per_cpu_data, ectg_user_time);
OFFSET(__VDSO_CPU_NR, vdso_per_cpu_data, cpu_nr);
@@ -86,7 +87,6 @@ int main(void)
DEFINE(__CLOCK_REALTIME_COARSE, CLOCK_REALTIME_COARSE);
DEFINE(__CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_COARSE);
DEFINE(__CLOCK_THREAD_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID);
-   DEFINE(__CLOCK_REALTIME_RES, MONOTONIC_RES_NSEC);
DEFINE(__CLOCK_COARSE_RES, LOW_RES_NSEC);
BLANK();
/* idle data offsets */
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index e8766beee5ad..8ea9db599d38 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -310,6 +310,7 @@ void update_vsyscall(struct timekeeper *tk)
 
vdso_data->tk_mult = tk->tkr_mono.mult;
vdso_data->tk_shift = tk->tkr_mono.shift;
+   vdso_data->hrtimer_res = hrtimer_resolution;
smp_wmb();
++vdso_data->tb_update_count;
 }
diff --git a/arch/s390/kernel/vdso32/clock_getres.S 
b/arch/s390/kernel/vdso32/clock_getres.S
index eaf9cf1417f6..fecd7684c645 100644
--- a/arch/s390/kernel/vdso32/clock_getres.S
+++ b/arch/s390/kernel/vdso32/clock_getres.S
@@ -18,20 +18,22 @@
 __kernel_clock_getres:
CFI_STARTPROC
basr%r1,0
-   la  %r1,4f-.(%r1)
+10:al  %r1,4f-10b(%r1)
+   l   %r0,__VDSO_CLOCK_REALTIME_RES(%r1)
chi %r2,__CLOCK_REALTIME
je  0f
chi %r2,__CLOCK_MONOTONIC
je  0f
-   la  %r1,5f-4f(%r1)
+   basr%r1,0
+   la  %r1,5f-.(%r1)
+   l   %r0,0(%r1)
chi %r2,__CLOCK_REALTIME_COARSE
je  0f
chi %r2,__CLOCK_MONOTONIC_COARSE
jne 3f
 0: ltr %r3,%r3
jz  2f  /* res == NULL */
-1: l   %r0,0(%r1)
-   xc  0(4,%r3),0(%r3) /* set tp->tv_sec to zero */
+1: xc  0(4,%r3),0(%r3) /* set tp->tv_sec to zero */
st  %r0,4(%r3)  /* store tp->tv_usec */
 2: lhi %r2,0
br  %r14
@@ -39,6 +41,6 @@ __kernel_clock_getres:
svc 0
br  %r14
CFI_ENDPROC
-4: .long   __CLOCK_REALTIME_RES
+4: .long   _vdso_data - 10b
 5: .long   __CLOCK_COARSE_RES
.size   __kernel_clock_getres,.-__kernel_clock_getres
diff --git a/arch/s390/kernel/vdso64/clock_getres.S 
b/arch/s390/kernel/vdso64/clock_getres.S
index 081435398e0a..022b58c980db 100644
--- a/arch/s390/kernel/vdso64/clock_getres.S
+++ b/arch/s390/kernel/vdso64/clock_getres.S
@@ -17,12 +17,14 @@
.type  __kernel_clock_getres,@function
 __kernel_clock_getres:
CFI_STARTPROC
-   larl%r1,4f
+   larl%r1,3f
+   lg  %r0,0(%r1)
cghi%r2,__CLOCK_REALTIME_COARSE
je  0f
cghi%r2,__CLOCK_MONOTONIC_COARSE
je  0f
-   larl%r1,3f
+   larl%r1,_vdso_data
+   l   %r0,__VDSO_CLOCK_REALTIME_RES(%r1)
cghi

[PATCH v2 2/5] powerpc: Fix vDSO clock_getres()

2019-04-16 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
as posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on whether the high resolution timers
are enabled, which can happen either at compile time or at run time.

Fix the powerpc vdso implementation of clock_getres, keeping a copy of
hrtimer_resolution in the vdso data page and using that directly.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Signed-off-by: Vincenzo Frascino 
---
 arch/powerpc/include/asm/vdso_datapage.h  | 2 ++
 arch/powerpc/kernel/asm-offsets.c | 2 +-
 arch/powerpc/kernel/time.c| 1 +
 arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +--
 arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +--
 5 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/vdso_datapage.h 
b/arch/powerpc/include/asm/vdso_datapage.h
index bbc06bd72b1f..4333b9a473dc 100644
--- a/arch/powerpc/include/asm/vdso_datapage.h
+++ b/arch/powerpc/include/asm/vdso_datapage.h
@@ -86,6 +86,7 @@ struct vdso_data {
__s32 wtom_clock_nsec;  /* Wall to monotonic clock nsec 
*/
__s64 wtom_clock_sec;   /* Wall to monotonic clock sec 
*/
struct timespec stamp_xtime;/* xtime as at tb_orig_stamp */
+   __u32 hrtimer_res;  /* hrtimer resolution */
__u32 syscall_map_64[SYSCALL_MAP_SIZE]; /* map of syscalls  */
__u32 syscall_map_32[SYSCALL_MAP_SIZE]; /* map of syscalls */
 };
@@ -107,6 +108,7 @@ struct vdso_data {
__s32 wtom_clock_nsec;
struct timespec stamp_xtime;/* xtime as at tb_orig_stamp */
__u32 stamp_sec_fraction;   /* fractional seconds of stamp_xtime */
+   __u32 hrtimer_res;  /* hrtimer resolution */
__u32 syscall_map_32[SYSCALL_MAP_SIZE]; /* map of syscalls */
__u32 dcache_block_size;/* L1 d-cache block size */
__u32 icache_block_size;/* L1 i-cache block size */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 86a61e5f8285..52e4b98a8492 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -383,6 +383,7 @@ int main(void)
OFFSET(WTOM_CLOCK_NSEC, vdso_data, wtom_clock_nsec);
OFFSET(STAMP_XTIME, vdso_data, stamp_xtime);
OFFSET(STAMP_SEC_FRAC, vdso_data, stamp_sec_fraction);
+   OFFSET(CLOCK_REALTIME_RES, vdso_data, hrtimer_res);
OFFSET(CFG_ICACHE_BLOCKSZ, vdso_data, icache_block_size);
OFFSET(CFG_DCACHE_BLOCKSZ, vdso_data, dcache_block_size);
OFFSET(CFG_ICACHE_LOGBLOCKSZ, vdso_data, icache_log_block_size);
@@ -413,7 +414,6 @@ int main(void)
DEFINE(CLOCK_REALTIME_COARSE, CLOCK_REALTIME_COARSE);
DEFINE(CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_COARSE);
DEFINE(NSEC_PER_SEC, NSEC_PER_SEC);
-   DEFINE(CLOCK_REALTIME_RES, MONOTONIC_RES_NSEC);
 
 #ifdef CONFIG_BUG
DEFINE(BUG_ENTRY_SIZE, sizeof(struct bug_entry));
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index bc0503ef9c9c..62c04a6746d8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -955,6 +955,7 @@ void update_vsyscall(struct timekeeper *tk)
vdso_data->wtom_clock_nsec = tk->wall_to_monotonic.tv_nsec;
vdso_data->stamp_xtime = xt;
vdso_data->stamp_sec_fraction = frac_sec;
+   vdso_data->hrtimer_res = hrtimer_resolution;
smp_wmb();
++(vdso_data->tb_update_count);
 }
diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S 
b/arch/powerpc/kernel/vdso32/gettimeofday.S
index afd516b572f8..2b5f9e83c610 100644
--- a/arch/powerpc/kernel/vdso32/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso32/gettimeofday.S
@@ -160,12 +160,15 @@ V_FUNCTION_BEGIN(__kernel_clock_getres)
crorcr0*4+eq,cr0*4+eq,cr1*4+eq
bne cr0,99f
 
+   mflrr12
+  .cfi_register lr,r12
+   bl  __get_datapage@local
+   lwz r5,CLOCK_REALTIME_RES(r3)
+   mtlrr12
li  r3,0
cmpli   cr0,r4,0
crclr   cr0*4+so
beqlr
-   lis r5,CLOCK_REALTIME_RES@h
-   ori r5,r5,CLOCK_REALTIME_RES@l
stw r3,TSPC32_TV_SEC(r4)
stw r5,TSPC32_TV_NSEC(r4)
blr
diff --git a/arch/powerpc/kernel/vdso64/gettimeofday.S 
b/arch/powerpc/kernel/vdso64/gettimeofday.S
index 1f324c28705b..f07730f73d5e 100644
--- a/arch/powerpc/kernel/vdso64/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso64/gettimeofday.S
@@ -190,12 +190,15 @@ V_FUNCTION_BEGIN(__kernel_clock_getres)
crorcr0*4+eq,cr0*4+eq,cr1*4+eq
bne cr0,99f
 
+   mflrr12
+  .cfi_register lr,r12
+   bl  V_LOCAL_FUNC(__get_datapage)
+   lwz r5,CLOCK_REALTIME_RES(r3)
+   mtlrr12
li  r3,0
cmpldi  cr0,r4,0
crclr   cr0*4+so
beqlr
-

[PATCH v2 1/5] arm64: Fix vDSO clock_getres()

2019-04-16 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
as posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on whether the high resolution timers
are enabled, which can happen either at compile time or at run time.

Fix the arm64 vdso implementation of clock_getres, keeping a copy of
hrtimer_resolution in the vdso data page and using that directly.

Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Vincenzo Frascino 
---
 arch/arm64/include/asm/vdso_datapage.h |  1 +
 arch/arm64/kernel/asm-offsets.c|  2 +-
 arch/arm64/kernel/vdso.c   |  2 ++
 arch/arm64/kernel/vdso/gettimeofday.S  | 22 +++---
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/vdso_datapage.h 
b/arch/arm64/include/asm/vdso_datapage.h
index 2b9a63771eda..f89263c8e11a 100644
--- a/arch/arm64/include/asm/vdso_datapage.h
+++ b/arch/arm64/include/asm/vdso_datapage.h
@@ -38,6 +38,7 @@ struct vdso_data {
__u32 tz_minuteswest;   /* Whacky timezone stuff */
__u32 tz_dsttime;
__u32 use_syscall;
+   __u32 hrtimer_res;
 };
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 7f40dcbdd51d..e10e2a5d9ddc 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -94,7 +94,7 @@ int main(void)
   DEFINE(CLOCK_REALTIME,   CLOCK_REALTIME);
   DEFINE(CLOCK_MONOTONIC,  CLOCK_MONOTONIC);
   DEFINE(CLOCK_MONOTONIC_RAW,  CLOCK_MONOTONIC_RAW);
-  DEFINE(CLOCK_REALTIME_RES,   MONOTONIC_RES_NSEC);
+  DEFINE(CLOCK_REALTIME_RES,   offsetof(struct vdso_data, hrtimer_res));
   DEFINE(CLOCK_REALTIME_COARSE,CLOCK_REALTIME_COARSE);
   DEFINE(CLOCK_MONOTONIC_COARSE,CLOCK_MONOTONIC_COARSE);
   DEFINE(CLOCK_COARSE_RES, LOW_RES_NSEC);
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 2d419006ad43..5f5759d51c33 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -245,6 +245,8 @@ void update_vsyscall(struct timekeeper *tk)
vdso_data->cs_shift = tk->tkr_mono.shift;
}
 
+   vdso_data->hrtimer_res  = hrtimer_resolution;
+
smp_wmb();
++vdso_data->tb_seq_count;
 }
diff --git a/arch/arm64/kernel/vdso/gettimeofday.S 
b/arch/arm64/kernel/vdso/gettimeofday.S
index c39872a7b03c..e2e9dfe9ba4a 100644
--- a/arch/arm64/kernel/vdso/gettimeofday.S
+++ b/arch/arm64/kernel/vdso/gettimeofday.S
@@ -296,32 +296,32 @@ ENDPROC(__kernel_clock_gettime)
 /* int __kernel_clock_getres(clockid_t clock_id, struct timespec *res); */
 ENTRY(__kernel_clock_getres)
.cfi_startproc
+   adr vdso_data, _vdso_data
cmp w0, #CLOCK_REALTIME
ccmpw0, #CLOCK_MONOTONIC, #0x4, ne
ccmpw0, #CLOCK_MONOTONIC_RAW, #0x4, ne
-   b.ne1f
+   b.ne2f
 
-   ldr x2, 5f
-   b   2f
-1:
+1: /* Get hrtimer_res */
+   ldr x2, [vdso_data, #CLOCK_REALTIME_RES]
+   b   3f
+2:
cmp w0, #CLOCK_REALTIME_COARSE
ccmpw0, #CLOCK_MONOTONIC_COARSE, #0x4, ne
-   b.ne4f
+   b.ne5f
ldr x2, 6f
-2:
-   cbz x1, 3f
+3:
+   cbz x1, 4f
stp xzr, x2, [x1]
 
-3: /* res == NULL. */
+4: /* res == NULL. */
mov w0, wzr
ret
 
-4: /* Syscall fallback. */
+5: /* Syscall fallback. */
mov x8, #__NR_clock_getres
svc #0
ret
-5:
-   .quad   CLOCK_REALTIME_RES
 6:
.quad   CLOCK_COARSE_RES
.cfi_endproc
-- 
2.21.0



[PATCH v2 0/5] Fix vDSO clock_getres()

2019-04-16 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
as posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on whether the high resolution timers
are enabled, which can happen either at compile time or at run time.

A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in the vdso data page and using that
directly [1].
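
In C terms, the common pattern across the patches below is roughly the
following fragment (illustrative only; the field name follows the arm64 and
nds32 patches, and each architecture has its own variant):

	/* timekeeping side, in update_vsyscall(): publish the current value */
	vdso_data->hrtimer_res = hrtimer_resolution;

	/* vDSO side, clock_getres() for non-coarse clocks: report that value
	 * instead of the compile-time MONOTONIC_RES_NSEC constant
	 */
	res->tv_sec = 0;
	res->tv_nsec = vdso_data->hrtimer_res;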

This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.

Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.

[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2

Changes:

v2:
  - Rebased on 5.1-rc5.
  - Addressed review comments.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Greentime Hu 
Cc: Vincent Chen 
Cc: Shuah Khan 
Cc: Thomas Gleixner 
Cc: Arnd Bergmann 
Signed-off-by: Vincenzo Frascino 

Vincenzo Frascino (5):
  arm64: Fix vDSO clock_getres()
  powerpc: Fix vDSO clock_getres()
  s390: Fix vDSO clock_getres()
  nds32: Fix vDSO clock_getres()
  kselftest: Extend vDSO selftest to clock_getres

 arch/arm64/include/asm/vdso_datapage.h|   1 +
 arch/arm64/kernel/asm-offsets.c   |   2 +-
 arch/arm64/kernel/vdso.c  |   2 +
 arch/arm64/kernel/vdso/gettimeofday.S |  22 ++--
 arch/nds32/include/asm/vdso_datapage.h|   1 +
 arch/nds32/kernel/vdso.c  |   1 +
 arch/nds32/kernel/vdso/gettimeofday.c |   4 +-
 arch/powerpc/include/asm/vdso_datapage.h  |   2 +
 arch/powerpc/kernel/asm-offsets.c |   2 +-
 arch/powerpc/kernel/time.c|   1 +
 arch/powerpc/kernel/vdso32/gettimeofday.S |   7 +-
 arch/powerpc/kernel/vdso64/gettimeofday.S |   7 +-
 arch/s390/include/asm/vdso.h  |   1 +
 arch/s390/kernel/asm-offsets.c|   2 +-
 arch/s390/kernel/time.c   |   1 +
 arch/s390/kernel/vdso32/clock_getres.S|  12 +-
 arch/s390/kernel/vdso64/clock_getres.S|  10 +-
 tools/testing/selftests/vDSO/Makefile |   2 +
 .../selftests/vDSO/vdso_clock_getres.c| 108 ++
 19 files changed, 159 insertions(+), 29 deletions(-)
 create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c

-- 
2.21.0



Re: [PATCH v2 1/5] cpu/speculation: Add 'mitigations=' cmdline option

2019-04-16 Thread Josh Poimboeuf
On Tue, Apr 16, 2019 at 04:13:35PM +0200, Borislav Petkov wrote:
> On Fri, Apr 12, 2019 at 03:39:28PM -0500, Josh Poimboeuf wrote:
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 38890f62f9a8..aed9083f8eac 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -2320,3 +2320,18 @@ void __init boot_cpu_hotplug_init(void)
> >  #endif
> > this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
> >  }
> > +
> > +enum cpu_mitigations cpu_mitigations __ro_after_init = 
> > CPU_MITIGATIONS_AUTO;
> > +
> > +static int __init mitigations_cmdline(char *arg)
> 
> Forgot the verb: "mitigations_parse_cmdline".

Sure.

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aed9083f8eac..cf9fea42d8fc 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2323,7 +2323,7 @@ void __init boot_cpu_hotplug_init(void)
 
 enum cpu_mitigations cpu_mitigations __ro_after_init = CPU_MITIGATIONS_AUTO;
 
-static int __init mitigations_cmdline(char *arg)
+static int __init mitigations_parse_cmdline(char *arg)
 {
if (!strcmp(arg, "off"))
cpu_mitigations = CPU_MITIGATIONS_OFF;
@@ -2334,4 +2334,4 @@ static int __init mitigations_cmdline(char *arg)
 
return 0;
 }
-early_param("mitigations", mitigations_cmdline);
+early_param("mitigations", mitigations_parse_cmdline);


Re: [PATCH v5 1/6] iommu: add generic boot option iommu.dma_mode

2019-04-16 Thread Will Deacon
On Fri, Apr 12, 2019 at 02:11:31PM +0100, Robin Murphy wrote:
> On 12/04/2019 11:26, John Garry wrote:
> > On 09/04/2019 13:53, Zhen Lei wrote:
> > > +static int __init iommu_dma_mode_setup(char *str)
> > > +{
> > > +    if (!str)
> > > +    goto fail;
> > > +
> > > +    if (!strncmp(str, "passthrough", 11))
> > > +    iommu_default_dma_mode = IOMMU_DMA_MODE_PASSTHROUGH;
> > > +    else if (!strncmp(str, "lazy", 4))
> > > +    iommu_default_dma_mode = IOMMU_DMA_MODE_LAZY;
> > > +    else if (!strncmp(str, "strict", 6))
> > > +    iommu_default_dma_mode = IOMMU_DMA_MODE_STRICT;
> > > +    else
> > > +    goto fail;
> > > +
> > > +    pr_info("Force dma mode to be %d\n", iommu_default_dma_mode);
> > 
> > What happens if the cmdline option iommu.dma_mode is passed multiple
> > times? We get mutliple - possibily conflicting - prints, right?
> 
> Indeed; we ended up removing such prints for the existing options here,
> specifically because multiple messages seemed more likely to be confusing
> than useful.
> 
> > And do we need to have backwards compatibility, such that the setting
> > for iommu.strict or iommu.passthrough trumps iommu.dma_mode, regardless
> > of order?
> 
> As above I think it would be preferable to just keep using the existing
> options anyway. The current behaviour works out as:
> 
> iommu.passthrough |      Y      |          N
>      iommu.strict |      x      |     Y    |     N
> ------------------|-------------|----------|----------
>              MODE | PASSTHROUGH |  STRICT  |   LAZY
> 
> which seems intuitive enough that a specific dma_mode option doesn't add
> much value, and would more likely just overcomplicate things for users as
> well as our implementation.

Agreed. We can't remove the existing options, and they do the job perfectly
well so I don't see the need to add more options on top.

Will


Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-16 Thread Mark Rutland
On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
> Le 16/04/2019 à 16:27, Mark Rutland a écrit :
> > On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
> > > From: Mahendran Ganesh 
> > > 
> > > Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
> > > enables Speculative Page Fault handler.
> > > 
> > > Signed-off-by: Ganesh Mahendran 
> > 
> > This is missing your S-o-B.
> 
> You're right, I missed that...
> 
> > The first patch noted that the ARCH_SUPPORTS_* option was there because
> > the arch code had to make an explicit call to try to handle the fault
> > speculatively, but that isn't added until patch 30.
> > 
> > Why is this separate from that code?
> 
> Andrew recommended this a long time ago for bisection purposes. This
> allows building the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
> that triggers the spf handler is added to each architecture's code.

Ok. I think it would be worth noting that in the commit message, to
avoid anyone else asking the same question. :)

Thanks,
Mark.


Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-16 Thread Mark Rutland
On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
> From: Mahendran Ganesh 
> 
> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
> enables Speculative Page Fault handler.
> 
> Signed-off-by: Ganesh Mahendran 

This is missing your S-o-B.

The first patch noted that the ARCH_SUPPORTS_* option was there because
the arch code had to make an explicit call to try to handle the fault
speculatively, but that isn't added until patch 30.

Why is this separate from that code?

Thanks,
Mark.

> ---
>  arch/arm64/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 870ef86a64ed..8e86934d598b 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -174,6 +174,7 @@ config ARM64
>   select SWIOTLB
>   select SYSCTL_EXCEPTION_TRACE
>   select THREAD_INFO_IN_TASK
> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   help
> ARM 64-bit (AArch64) Linux support.
>  
> -- 
> 2.21.0
> 


Re: [PATCH 1/5] arm64: Fix vDSO clock_getres()

2019-04-16 Thread Will Deacon
On Tue, Apr 16, 2019 at 01:42:58PM +0100, Vincenzo Frascino wrote:
> On 15/04/2019 18:35, Catalin Marinas wrote:
> > On Mon, Apr 01, 2019 at 12:51:48PM +0100, Vincenzo Frascino wrote:
> >> +1:/* Get hrtimer_res */
> >> +  seqcnt_acquire
> >> +  syscall_check fail=5f
> >> +  ldr x2, [vdso_data, #CLOCK_REALTIME_RES]
> >> +  seqcnt_check fail=1b
> >> +  b   3f
> >> +2:
> > 
> > We talked briefly but I'm still confused why we need the fallback to the
> > syscall here if archdata.vdso_direct is false. Is it because if the
> > timer driver code sets vdso_direct to false, we don't support
> > highres timers? If my understanding is correct, you may want to move the
> > hrtimer_res setting in update_vsyscall() to the !use_syscall block.
> > 
> 
> Ok, so let me try to provide more details on what I mentioned yesterday:
> - the clock_getres syscall follows the rules defined in posix-timers.c
> - based on the clock_id, which for this purpose can be separated into coarse
> and non-coarse, it calls either posix_get_coarse_res() or posix_get_hrtimer_res().
> - if the clock id is set to a coarse clock and posix_get_coarse_res() is invoked,
> the following happens:
> 
> static int posix_get_coarse_res(const clockid_t which_clock,
>   struct timespec64 *tp)
> {
>   *tp = ktime_to_timespec64(KTIME_LOW_RES);
>   return 0;
> }
> 
> Note that since CONFIG_1HZ does not seem to be supported by the kernel
> (jiffies.h), in this case we do not need rounding in our vDSO implementation.
> 
> - if the clock id is set to non-coarse and posix_get_hrtimer_res() is invoked,
> the following happens:
> 
> static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
> {
>   tp->tv_sec = 0;
>   tp->tv_nsec = hrtimer_resolution;
>   return 0;
> }
> 
> hrtimer_resolution can be high res or low res depending on the call of
> hrtimer_switch_to_hres(). For us the only way to preserve the correct value is
> to keep it in the vdso data page.
> 
> - The assembly code mimics exactly the behaviour detailed above, with one
> difference, related to the use_syscall parameter, which is specific to arm64.
> The use_syscall parameter is set by arm_arch_timer and consumed by
> update_vsyscall(). To mirror what update_vsyscall() does, I check
> "syscall_check fail=5f" in the clock_getres vdso function.
> 
> That said, even if it is functionally the same thing, I think it is logically
> more correct to have the hrtimer_res setting inside the !use_syscall block,
> hence I am going to change it in the next iteration.
> 
> Please let me know your thoughts.

I think you can ignore the syscall_check, just like we seem to do for
CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE in clock_gettime().

Will


Re: [PATCH v2 1/5] cpu/speculation: Add 'mitigations=' cmdline option

2019-04-16 Thread Borislav Petkov
On Fri, Apr 12, 2019 at 03:39:28PM -0500, Josh Poimboeuf wrote:
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 38890f62f9a8..aed9083f8eac 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2320,3 +2320,18 @@ void __init boot_cpu_hotplug_init(void)
>  #endif
>   this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
>  }
> +
> +enum cpu_mitigations cpu_mitigations __ro_after_init = CPU_MITIGATIONS_AUTO;
> +
> +static int __init mitigations_cmdline(char *arg)

Forgot the verb: "mitigations_parse_cmdline".

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 7/8] powerpc/mm: Consolidate radix and hash address map details

2019-04-16 Thread Nicholas Piggin
Aneesh Kumar K.V's on April 16, 2019 8:07 pm:
> We now have
> 
> 4K page size config
> 
>  kernel_region_map_size = 16TB
>  kernel vmalloc start   = 0xc0001000
>  kernel IO start= 0xc0002000
>  kernel vmemmap start   = 0xc0003000
> 
> with 64K page size config:
> 
>  kernel_region_map_size = 512TB
>  kernel vmalloc start   = 0xc008
>  kernel IO start= 0xc00a
>  kernel vmemmap start   = 0xc00c

Hey Aneesh,

I like the series, I like consolidating the address spaces into 0xc,
and making the layouts match or similar isn't a bad thing. I don't
see any real reason to force limitations on one layout or another --
you could make the argument that 4k radix should match 64k radix
as much as matching 4k hash IMO.

I wouldn't like to tie them too strongly to the same base defines
that force them to stay in sync.

Can we drop this patch? Or at least keep the users of the H_ and R_
defines and set them to the same thing in map.h?


> diff --git a/arch/powerpc/include/asm/book3s/64/map.h 
> b/arch/powerpc/include/asm/book3s/64/map.h
> new file mode 100644
> index ..5c01f8c18d61
> --- /dev/null
> +++ b/arch/powerpc/include/asm/book3s/64/map.h
> @@ -0,0 +1,80 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_BOOK3S_64_MAP_H_
> +#define _ASM_POWERPC_BOOK3S_64_MAP_H_
> +
> +/*
> + * We use MAX_EA_BITS_PER_CONTEXT (hash specific) here just to make sure we 
> pick
> + * the same value for hash and radix.
> + */
> +#ifdef CONFIG_PPC_64K_PAGES
> +
> +/*
> + * Each context is 512TB size. SLB miss for first context/default context
> + * is handled in the hotpath.

Now everything is handled in the slowpath :P I guess that's a copy
paste of the comment which my SLB miss patch should have fixed.

Thanks,
Nick


Re: [PATCH 1/5] arm64: Fix vDSO clock_getres()

2019-04-16 Thread Vincenzo Frascino
Hi Catalin,

On 15/04/2019 18:35, Catalin Marinas wrote:
> On Mon, Apr 01, 2019 at 12:51:48PM +0100, Vincenzo Frascino wrote:
>> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
>> index 2d419006ad43..47ba72345739 100644
>> --- a/arch/arm64/kernel/vdso.c
>> +++ b/arch/arm64/kernel/vdso.c
>> @@ -245,6 +245,8 @@ void update_vsyscall(struct timekeeper *tk)
>>  vdso_data->cs_shift = tk->tkr_mono.shift;
>>  }
>>  
>> +vdso_data->hrtimer_res  = hrtimer_resolution;
>> +
>>  smp_wmb();
>>  ++vdso_data->tb_seq_count;
>>  }
>> diff --git a/arch/arm64/kernel/vdso/gettimeofday.S 
>> b/arch/arm64/kernel/vdso/gettimeofday.S
>> index c39872a7b03c..7a2cd2f8e13a 100644
>> --- a/arch/arm64/kernel/vdso/gettimeofday.S
>> +++ b/arch/arm64/kernel/vdso/gettimeofday.S
>> @@ -296,32 +296,35 @@ ENDPROC(__kernel_clock_gettime)
>>  /* int __kernel_clock_getres(clockid_t clock_id, struct timespec *res); */
>>  ENTRY(__kernel_clock_getres)
>>  .cfi_startproc
>> +adr vdso_data, _vdso_data
>>  cmp w0, #CLOCK_REALTIME
>>  ccmpw0, #CLOCK_MONOTONIC, #0x4, ne
>>  ccmpw0, #CLOCK_MONOTONIC_RAW, #0x4, ne
>> -b.ne1f
>> +b.ne2f
>>  
>> -ldr x2, 5f
>> -b   2f
>> -1:
>> +1:  /* Get hrtimer_res */
>> +seqcnt_acquire
>> +syscall_check fail=5f
>> +ldr x2, [vdso_data, #CLOCK_REALTIME_RES]
>> +seqcnt_check fail=1b
>> +b   3f
>> +2:
> 
> We talked briefly but I'm still confused why we need the fallback to the
> syscall here if archdata.vdso_direct is false. Is it because if the
> timer driver code sets vdso_direct to false, we don't support
> highres timers? If my understanding is correct, you may want to move the
> hrtimer_res setting in update_vsyscall() to the !use_syscall block.
> 

Ok, so let me try to provide more details on what I mentioned yesterday:
- the clock_getres syscall follows the rules defined in posix-timers.c
- based on the clock_id, which for this purpose can be separated into coarse
and non-coarse, it calls either posix_get_coarse_res() or posix_get_hrtimer_res().
- if the clock id is set to a coarse clock and posix_get_coarse_res() is invoked,
the following happens:

static int posix_get_coarse_res(const clockid_t which_clock,
struct timespec64 *tp)
{
*tp = ktime_to_timespec64(KTIME_LOW_RES);
return 0;
}

Note that since CONFIG_1HZ does not seem to be supported by the kernel
(jiffies.h), in this case we do not need rounding in our vDSO implementation.

- if the clock id is set to non-coarse and posix_get_hrtimer_res() is invoked,
the following happens:

static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
{
tp->tv_sec = 0;
tp->tv_nsec = hrtimer_resolution;
return 0;
}

hrtimer_resolution can be high res or low res depending on the call of
hrtimer_switch_to_hres(). For us the only way to preserve the correct value is
to keep it in the vdso data page.

- The assembly code mimics exactly the behaviour detailed above, with one
difference, related to the use_syscall parameter, which is specific to arm64.
The use_syscall parameter is set by arm_arch_timer and consumed by
update_vsyscall(). To mirror what update_vsyscall() does, I check
"syscall_check fail=5f" in the clock_getres vdso function.

That said, even if it is functionally the same thing, I think it is logically
more correct to have the hrtimer_res setting inside the !use_syscall block,
hence I am going to change it in the next iteration.

Please let me know your thoughts.

-- 
Regards,
Vincenzo


Re: [PATCH] Linux: Define struct termios2 in under _GNU_SOURCE [BZ #10339]

2019-04-16 Thread Adhemerval Zanella



On 16/04/2019 06:59, Florian Weimer wrote:
> * hpa:
> 
>> Using symbol versioning doesn't really help much since the real
>> problem is that struct termios can be passed around in userspace, and
>> the interfaces between user space libraries don't have any
>> versioning. However, my POC code deals with that too by only seeing
>> BOTHER when necessary, so if the structure is extended garbage in the
>> extra fields will be ignored unless new baud rates are in use.
> 
> That still doesn't solve the problem of changing struct offsets after a
> struct field of type struct termios.

We will need symbol versioning at least on sparc, since it currently
defines NCCS as 17 and termios-c_cc.h defines 16 control characters (there
is no space to squeeze more fields into termios).  And the WIP branch
gratuitously changes the termios struct size on that architecture.

I am not sure which would be the best option to avoid the user space
libraries compatibility issue. It is unlikely to happen, since it would
require using old libraries along with newer libraries built against a
newer glibc.  Not sure how often this scenario arises in the real world
(especially on sparc).
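
To make the struct-offset concern concrete, a hypothetical illustration (the
structure and field names below are invented for the example, not from any
real library): a user space library that embeds struct termios in its public
ABI. If glibc enlarges struct termios (e.g. more c_cc slots on sparc), the
offset of every member after the embedded termios changes, so a library built
against the old glibc and an application built against the new one disagree
on the layout even though neither of them was modified.

#include <termios.h>

/* Hypothetical public structure exported by some user space library. */
struct serial_port {
	int fd;
	struct termios saved;		/* size depends on glibc's termios */
	unsigned int baud_override;	/* offset shifts if struct termios grows */
};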

I think MIPS would be fine with lowering NCCS to 24, as the WIP branch does.
And alpha is also fine since it already provides the c_* fields.

> 
>> Exporting termios2 to user space feels a bit odd at this stage as it
>> would only be usable as a fallback on old glibc. Call it
>> kernel_termios2 at least.
> 
> I'm not sure why we should do that?  The kernel calls it struct termios2
> in its UAPI headers.  If that name is not appropriate, it should be
> changed first in the UAPI headers.
> 
> Thanks,
> Florian
> 


Re: Linux 5.1-rc5

2019-04-16 Thread Martin Schwidefsky
On Tue, 16 Apr 2019 11:09:06 +0200
Martin Schwidefsky  wrote:

> On Mon, 15 Apr 2019 09:17:10 -0700
> Linus Torvalds  wrote:
> 
> > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  
> > wrote:  
> > >
> > > Can we please have the page refcount overflow fixes out on the list
> > > for review, even if it is after the fact?
> > 
> > They were actually on a list for review long before the fact, but it
> > was the security mailing list. The issue actually got discussed back
> > in January along with early versions of the patches, but then we
> > dropped the ball because it just wasn't on anybody's radar and it got
> > resurrected late March. Willy wrote a rather bigger patch-series, and
> > review of that is what then resulted in those commits. So they may
> > look recent, but that's just because the original patches got
> > seriously edited down and rewritten.  
> 
> First time I hear about this, thanks for the heads up.
>  
> > That said, powerpc and s390 should at least look at maybe adding a
> > check for the page ref in their gup paths too. Powerpc has the special
> > gup_hugepte() case, and s390 has its own version of gup entirely. I
> > was actually hoping the s390 guys would look at using the generic gup
> > code.  
> 
> We did look at converting the s390 gup code to CONFIG_HAVE_GENERIC_GUP,
> there are some details that need careful consideration. The top one
> is access_ok(), for s390 we always return true. The generic gup code
> relies on the fact that a page table walk with a specific address is
> doable if access_ok() returned true, the s390 specific check is slightly
> different:
> 
> if ((end <= start) || (end > mm->context.asce_limit))
> return 0;
> 
> The obvious approach would be to modify access_ok() to check against
> the asce_limit. I will try and see if anything breaks, e.g. the automatic
> page table upgrade.

I tested the waters in regard to access_ok() and the generic gup code.
The good news is that mm/gup.c with CONFIG_HAVE_GENERIC_GUP=y seems to
work just fine if the access_ok() issue is taken care of. But..

Bloat-o-meter with a non-empty access_ok() that checks against
current->mm->context.asce_limit:

add/remove: 8/2 grow/shrink: 611/11 up/down: 61352/-1914 (59438)

with CONFIG_HAVE_GENERIC_GUP on top of that

add/remove: 10/2 grow/shrink: 612/12 up/down: 63568/-3280 (60288)

This is not nice, would a patch like the following be acceptable?
--
Subject: [PATCH] mm: introduce mm_pgd_walk_ok

Add the architecture overrideable function mm_pgd_walk_ok() to check
if a block of memory is inside the limits of the page table hierarchy
of a given mm struct.

Signed-off-by: Martin Schwidefsky 
---
 include/asm-generic/pgtable.h | 4 
 mm/gup.c  | 4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fa782fba51ee..7d2a8a58f1c1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1186,4 +1186,8 @@ static inline bool arch_has_pfn_modify_check(void)
 #define mm_pmd_folded(mm)  __is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef mm_pgd_walk_ok
+#define mm_pgd_walk_ok(mm, addr, size) access_ok(addr, size)
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff --git a/mm/gup.c b/mm/gup.c
index 91819b8ad9cc..b3eb3f45d237 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1990,7 +1990,7 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;
 
-   if (unlikely(!access_ok((void __user *)start, len)))
+   if (unlikely(!mm_pgd_walk_ok(current->mm, (void __user *)start, len)))
return 0;
 
/*
@@ -2044,7 +2044,7 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
if (nr_pages <= 0)
return 0;
 
-   if (unlikely(!access_ok((void __user *)start, len)))
+   if (unlikely(!mm_pgd_walk_ok(current->mm, (void __user *)start, len)))
return -EFAULT;
 
if (gup_fast_permitted(start, nr_pages)) {
-- 
2.16.4

With an empty access_ok() but a "real" mm_pgd_walk_ok() the results are
much more reasonable:

add/remove: 2/0 grow/shrink: 2/1 up/down: 2186/-1382 (804)
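
For reference, a hypothetical sketch (not part of the patch above) of what an
s390 definition of mm_pgd_walk_ok() could look like, based on the asce_limit
check quoted earlier:

#define mm_pgd_walk_ok(mm, addr, size)					\
({									\
	unsigned long __start = (unsigned long)(addr);			\
	unsigned long __end = __start + (unsigned long)(size);		\
	(__end > __start) && (__end <= (mm)->context.asce_limit);	\
})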

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [PATCH] [v2] arch: add pidfd and io_uring syscalls everywhere

2019-04-16 Thread Catalin Marinas
On Mon, Apr 15, 2019 at 04:22:57PM +0200, Arnd Bergmann wrote:
> Add the io_uring and pidfd_send_signal system calls to all architectures.
> 
> These system calls are designed to handle both native and compat tasks,
> so all entries are the same across architectures, only arm-compat and
> the generic table still use an old format.
> 
> Acked-by: Michael Ellerman  (powerpc)
> Acked-by: Heiko Carstens  (s390)
> Acked-by: Geert Uytterhoeven 
> Signed-off-by: Arnd Bergmann 
> ---
> Changes since v1:
> - fix s390 table
> - use 'n64' tag in mips-n64 instead of common.
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl  | 4 
>  arch/arm/tools/syscall.tbl  | 4 
>  arch/arm64/include/asm/unistd.h | 2 +-
>  arch/arm64/include/asm/unistd32.h   | 8 
>  arch/ia64/kernel/syscalls/syscall.tbl   | 4 
>  arch/m68k/kernel/syscalls/syscall.tbl   | 4 
>  arch/microblaze/kernel/syscalls/syscall.tbl | 4 
>  arch/mips/kernel/syscalls/syscall_n32.tbl   | 4 
>  arch/mips/kernel/syscalls/syscall_n64.tbl   | 4 
>  arch/mips/kernel/syscalls/syscall_o32.tbl   | 4 
>  arch/parisc/kernel/syscalls/syscall.tbl | 4 
>  arch/powerpc/kernel/syscalls/syscall.tbl| 4 
>  arch/s390/kernel/syscalls/syscall.tbl   | 4 
>  arch/sh/kernel/syscalls/syscall.tbl | 4 
>  arch/sparc/kernel/syscalls/syscall.tbl  | 4 
>  arch/xtensa/kernel/syscalls/syscall.tbl | 4 
>  16 files changed, 65 insertions(+), 1 deletion(-)

For arm64:

Acked-by: Catalin Marinas 


[PATCH v2 16/16] powernv/fadump: update documentation about option to release opalcore

2019-04-16 Thread Hari Bathini
With /proc/opalcore support available on OPAL based machines, and an
option to release the memory used by the kernel in exporting /proc/opalcore,
update the FADump documentation with these details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index fa35593..6411449 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -107,6 +107,16 @@ capture kernel boot to process this crash data. Kernel 
config
 option CONFIG_PRESERVE_FA_DUMP has to be enabled on such kernel
 to ensure that crash data is preserved to process later.
 
+-- On OPAL based machines (PowerNV), if the kernel is built with
+   CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also
+   exported as /proc/opalcore file. This procfs file is helpful
+   in debugging OPAL crashes with GDB. The kernel memory used
+   for exporting this procfs file can be released by echo'ing
+   '1' to /sys/kernel/fadump_release_opalcore node.
+
+   e.g.
+ # echo 1 > /sys/kernel/fadump_release_opalcore
+
 Implementation details:
 --
 
@@ -260,6 +270,15 @@ Here is the list of files under kernel sysfs:
 enhanced to use this interface to release the memory reserved for
 dump and continue without 2nd reboot.
 
+ /sys/kernel/fadump_release_opalcore
+
+This file is available only on OPAL based machines when FADump is
+active during capture kernel. This is used to release the memory
+used by the kernel to export /proc/opalcore file. To release this
+memory, echo '1' to it:
+
+echo 1  > /sys/kernel/fadump_release_opalcore
+
 Here is the list of files under powerpc debugfs:
 (Assuming debugfs is mounted on /sys/kernel/debug directory.)
 



[PATCH v2 15/16] powernv/fadump: consider f/w load area

2019-04-16 Thread Hari Bathini
OPAL loads kernel & initrd at 512MB offset (256MB size), also exported
as ibm,opal/dump/fw-load-area. So, if boot memory size of FADump is
less than 768MB, kernel memory to be exported as '/proc/vmcore' would
be overwritten by f/w while loading kernel & initrd. To avoid such a
scenario, enforce a minimum boot memory size of 768MB on OPAL platform.

Also, skip using FADump if a newer F/W version loads kernel & initrd
above 768MB.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump-common.h  |   15 +--
 arch/powerpc/kernel/fadump.c |8 
 arch/powerpc/platforms/powernv/opal-fadump.c |   23 +++
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index 1bd3aeb..f59fdc7 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -24,14 +24,25 @@
 #define RMA_END(ppc64_rma_size)
 
 /*
+ * With kernel & initrd loaded at 512MB (with 256MB size), enforce a minimum
+ * boot memory size of 768MB to ensure f/w loading kernel and initrd doesn't
+ * mess with crash'ed kernel's memory during MPIPL.
+ */
+#define OPAL_MIN_BOOT_MEM  (0x3000UL)
+
+/*
  * On some Power systems where RMO is 128MB, it still requires minimum of
  * 256MB for kernel to boot successfully. When kdump infrastructure is
  * configured to save vmcore over network, we run into OOM issue while
  * loading modules related to network setup. Hence we need additional 64M
  * of memory to avoid OOM issue.
  */
-#define MIN_BOOT_MEM   (((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : RMA_END) \
-   + (0x1UL << 26))
+#define PSERIES_MIN_BOOT_MEM   (((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : \
+RMA_END) + (0x1UL << 26))
+
+#define MIN_BOOT_MEM   ((fw_dump.fadump_platform ==\
+FADUMP_PLATFORM_POWERNV) ? OPAL_MIN_BOOT_MEM : \
+PSERIES_MIN_BOOT_MEM)
 
 /* The upper limit percentage for user specified boot memory size (25%) */
 #define MAX_BOOT_MEM_RATIO 4
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ba26169..3c3adc2 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -582,6 +582,14 @@ int __init fadump_reserve_mem(void)
ALIGN(fw_dump.boot_memory_size,
FADUMP_CMA_ALIGNMENT);
 #endif
+
+   if ((fw_dump.fadump_platform == FADUMP_PLATFORM_POWERNV) &&
+   (fw_dump.boot_memory_size < OPAL_MIN_BOOT_MEM)) {
+   pr_err("Can't enable fadump with boot memory size 
(0x%lx) less than 0x%lx\n",
+  fw_dump.boot_memory_size, OPAL_MIN_BOOT_MEM);
+   goto error_out;
+   }
+
fw_dump.rmr_source_len = fw_dump.boot_memory_size;
if (!fadump_get_rmr_regions()) {
pr_err("Too many holes in boot memory area to enable 
fadump\n");
diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
b/arch/powerpc/platforms/powernv/opal-fadump.c
index f530df0..0a22257 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -528,6 +528,29 @@ int __init opal_dt_scan_fadump(struct fw_dump 
*fadump_conf, ulong node)
fadump_conf->cpu_state_entry_size =
of_read_number(prop, 1);
}
+   } else {
+   int i, len;
+
+   prop = of_get_flat_dt_prop(dn, "fw-load-area", &len);
+   if (prop) {
+   /*
+* Each f/w load area is an (address,size) pair,
+* 2 cells each, totalling 4 cells per range.
+*/
+   for (i = 0; i < len / (sizeof(*prop) * 4); i++) {
+   u64 base, end;
+
+   base = of_read_number(prop + (i * 4) + 0, 2);
+   end = base;
+   end += of_read_number(prop + (i * 4) + 2, 2);
+   if (end > OPAL_MIN_BOOT_MEM) {
+   pr_err("F/W load area: 0x%llx-0x%llx\n",
+  base, end);
+   pr_err("F/W version not supported!\n");
+   return 1;
+   }
+   }
+   }
}
 
	fadump_conf->ops = &opal_fadump_ops;



[PATCH v2 14/16] powernv/opalcore: provide an option to invalidate /proc/opalcore file

2019-04-16 Thread Hari Bathini
Writing '1' to /sys/kernel/fadump_release_opalcore would release the
memory held by the kernel in exporting the /proc/opalcore file.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/platforms/powernv/opal-core.c |   39 
 1 file changed, 39 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-core.c 
b/arch/powerpc/platforms/powernv/opal-core.c
index 8bf687d..5503b8b 100644
--- a/arch/powerpc/platforms/powernv/opal-core.c
+++ b/arch/powerpc/platforms/powernv/opal-core.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -532,6 +534,36 @@ static void opalcore_cleanup(void)
 }
 __exitcall(opalcore_cleanup);
 
+static ssize_t fadump_release_opalcore_store(struct kobject *kobj,
+struct kobj_attribute *attr,
+const char *buf, size_t count)
+{
+   int input = -1;
+
+   if (kstrtoint(buf, 0, &input))
+   return -EINVAL;
+
+   if (input == 1) {
+   if (oc_conf == NULL) {
+   pr_err("'/proc/opalcore' file does not exist!\n");
+   return -EPERM;
+   }
+
+   /*
+* Take away '/proc/opalcore' and release all memory
+* used for exporting this file.
+*/
+   opalcore_cleanup();
+   } else
+   return -EINVAL;
+
+   return count;
+}
+
+static struct kobj_attribute opalcore_rel_attr = 
__ATTR(fadump_release_opalcore,
+   0200, NULL,
+   fadump_release_opalcore_store);
+
 /* Init function for opalcore module. */
 static int __init opalcore_init(void)
 {
@@ -558,6 +590,13 @@ static int __init opalcore_init(void)
 _opalcore_operations);
if (oc_conf->proc_opalcore)
proc_set_size(oc_conf->proc_opalcore, oc_conf->opalcore_size);
+
+   rc = sysfs_create_file(kernel_kobj, &opalcore_rel_attr.attr);
+   if (rc) {
+   pr_warn("unable to create sysfs file fadump_release_opalcore 
(%d)\n",
+   rc);
+   }
+
return 0;
 }
 fs_initcall(opalcore_init);



[PATCH v2 13/16] powernv/fadump: Skip processing /proc/vmcore when only OPAL core exists

2019-04-16 Thread Hari Bathini
If OPAL crashes when the kernel is not registered for FADump, F/W still
exports the OPAL core through the result-table DT node. Make sure '/proc/vmcore'
processing is skipped, as only data relevant to the OPAL core is exported in
such a scenario.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/platforms/powernv/opal-fadump.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
b/arch/powerpc/platforms/powernv/opal-fadump.c
index 65db21a..f530df0 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -108,6 +108,18 @@ static void update_fadump_config(struct fw_dump 
*fadump_conf,
be64_to_cpu(fdm->section[i].dest_size);
}
}
+
+   /*
+* If dump is active and no kernel memory region is found in
+* result-table, it means OPAL crashed on system with MPIPL
+* support and the kernel was not registered for FADump at the
+* time of crash. Skip processing /proc/vmcore in that case.
+*/
+   if (j == 0) {
+   fadump_conf->dump_active = 0;
+   return;
+   }
+
fadump_conf->rmr_regions_cnt = j;
pr_debug("Real memory regions count: %lu\n",
 fadump_conf->rmr_regions_cnt);



[PATCH v2 12/16] powerpc/powernv: export /proc/opalcore for analysing opal crashes

2019-04-16 Thread Hari Bathini
From: Hari Bathini 

Export /proc/opalcore file to analyze opal crashes. Since opalcore can
be generated independent of CONFIG_FA_DUMP support in kernel, add this
support under a new kernel config option CONFIG_OPAL_CORE. Also, avoid
code duplication by moving common code used for processing the register
state data to export /proc/vmcore and/or /proc/opalcore file(s).

Signed-off-by: Hari Bathini 
---
 arch/powerpc/Kconfig |9 
 arch/powerpc/platforms/powernv/Makefile  |1 
 arch/powerpc/platforms/powernv/opal-core.c   |  563 ++
 arch/powerpc/platforms/powernv/opal-fadump.c |   94 +---
 arch/powerpc/platforms/powernv/opal-fadump.h |   72 +++
 5 files changed, 669 insertions(+), 70 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-core.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ac3259e..2c76203 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -579,6 +579,15 @@ config PRESERVE_FA_DUMP
  memory preserving kernel boot would process this crash data.
  Petitboot kernel is the typical usecase for this option.
 
+config OPAL_CORE
+   bool "Export OPAL memory as /proc/opalcore"
+   depends on PPC64 && PPC_POWERNV
+   help
+ This option uses the MPIPL support in firmware to provide
+ an ELF core of OPAL memory after a crash. The ELF core is
+ exported as /proc/opalcore file which is helpful in debugging
+ opal crashes using GDB.
+
 config IRQ_ALL_CPUS
bool "Distribute interrupts on all CPUs by default"
depends on SMP
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index b4a8022..e659afd 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -8,6 +8,7 @@ obj-y   += opal-kmsg.o opal-powercap.o 
opal-psr.o opal-sensor-groups.o
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_FA_DUMP)  += opal-fadump.o
 obj-$(CONFIG_PRESERVE_FA_DUMP) += opal-fadump.o
+obj-$(CONFIG_OPAL_CORE)+= opal-core.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
 obj-$(CONFIG_CXL_BASE) += pci-cxl.o
 obj-$(CONFIG_EEH)  += eeh-powernv.o
diff --git a/arch/powerpc/platforms/powernv/opal-core.c 
b/arch/powerpc/platforms/powernv/opal-core.c
new file mode 100644
index 000..8bf687d
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-core.c
@@ -0,0 +1,563 @@
+/*
+ * Interface for exporting the OPAL ELF core.
+ * Heavily inspired from fs/proc/vmcore.c
+ *
+ * Copyright 2018-2019, IBM Corp.
+ * Author: Hari Bathini 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#undef DEBUG
+#define pr_fmt(fmt) "opalcore: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "../../kernel/fadump-common.h"
+#include "opal-fadump.h"
+
+#define MAX_PT_LOAD_CNT8
+
+/* NT_AUXV note related info */
+#define AUXV_CNT   1
+#define AUXV_DESC_SZ   (((2 * AUXV_CNT) + 1) * sizeof(Elf64_Off))
+
+struct opalcore_config {
+   unsigned intnum_cpus;
+   /* PIR value of crashing CPU */
+   unsigned intcrashing_cpu;
+
+   /* CPU state data info from F/W */
+   unsigned long   cpu_state_destination_addr;
+   unsigned long   cpu_state_data_size;
+   unsigned long   cpu_state_entry_size;
+
+   /* OPAL memory to be exported as PT_LOAD segments */
+   unsigned long   ptload_addr[MAX_PT_LOAD_CNT];
+   unsigned long   ptload_size[MAX_PT_LOAD_CNT];
+   unsigned long   ptload_cnt;
+
+   /* Pointer to the first PT_LOAD in the ELF core file */
+   Elf64_Phdr  *ptload_phdr;
+
+   /* Total size of opalcore file. */
+   size_t  opalcore_size;
+
+   struct proc_dir_entry   *proc_opalcore;
+
+   /* Buffer for all the ELF core headers and the PT_NOTE */
+   size_t  opalcorebuf_sz;
+   char*opalcorebuf;
+
+   /* NT_AUXV buffer */
+   charauxv_buf[AUXV_DESC_SZ];
+};
+
+struct opalcore {
+   struct list_head list;
+   unsigned long long paddr;
+   unsigned long long size;
+   loff_t offset;
+};
+
+static LIST_HEAD(opalcore_list);
+static struct opalcore_config *oc_conf;
+static const struct opal_fadump_mem_struct *fdm_active;
+
+/*
+ * Set crashing CPU's signal to SIGUSR1 if the crash is triggered
+ * by the kernel, SIGTERM otherwise.
+ */
+bool kernel_initiated;
+
+static struct opalcore * __init get_new_element(void)
+{
+   return kzalloc(sizeof(struct opalcore), 

[PATCH v2 11/16] powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP

2019-04-16 Thread Hari Bathini
Kernel config option CONFIG_PRESERVE_FA_DUMP is introduced to ensure
crash data, from a previously crash'ed kernel, is preserved. Update the
documentation with these details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index 844a229..fa35593 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -98,6 +98,15 @@ firmware versions on PSeries (PowerVM) platform and Power9
 and above systems with recent firmware versions on PowerNV
 (OPAL) platform.
 
+On OPAL based machines, the system first boots into an intermediate
+kernel (referred to as petitboot kernel) before booting into the
+capture kernel. This kernel would have minimal kernel and/or
+userspace support to process crash data. Such kernel needs to
+preserve previously crash'ed kernel's memory for the subsequent
+capture kernel boot to process this crash data. Kernel config
+option CONFIG_PRESERVE_FA_DUMP has to be enabled on such kernel
+to ensure that crash data is preserved to process later.
+
 Implementation details:
 --
 



[PATCH v2 10/16] powernv/fadump: add support to preserve crash data on FADUMP disabled kernel

2019-04-16 Thread Hari Bathini
Add a new kernel config option, CONFIG_PRESERVE_FA_DUMP that ensures
that crash data, from previously crash'ed kernel, is preserved. This
helps in cases where FADump is not enabled but the subsequent memory
preserving kernel boot is likely to process this crash data. One
typical usecase for this config option is petitboot kernel.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/Kconfig |9 +
 arch/powerpc/include/asm/fadump.h|9 +++--
 arch/powerpc/kernel/Makefile |6 +++
 arch/powerpc/kernel/fadump-common.h  |8 
 arch/powerpc/kernel/fadump.c |   47 +++---
 arch/powerpc/kernel/prom.c   |4 +-
 arch/powerpc/platforms/powernv/Makefile  |1 +
 arch/powerpc/platforms/powernv/opal-fadump.c |   37 +++-
 8 files changed, 106 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2366a84..ac3259e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -570,6 +570,15 @@ config FA_DUMP
  If unsure, say "y". Only special kernels like petitboot may
  need to say "N" here.
 
+config PRESERVE_FA_DUMP
+   bool "Preserve Firmware-assisted dump"
+   depends on PPC64 && PPC_POWERNV && !FA_DUMP
+   help
+ On a kernel with FA_DUMP disabled, this option helps to preserve
+ crash data from a previously crash'ed kernel. Useful when the next
+ memory preserving kernel boot would process this crash data.
+ Petitboot kernel is the typical usecase for this option.
+
 config IRQ_ALL_CPUS
bool "Distribute interrupts on all CPUs by default"
depends on SMP
diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index d27cde7..d09b77b 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -27,9 +27,6 @@
 extern int crashing_cpu;
 
 extern int is_fadump_memory_area(u64 addr, ulong size);
-extern int early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
- int depth, void *data);
-extern int fadump_reserve_mem(void);
 extern int setup_fadump(void);
 extern int is_fadump_active(void);
 extern int should_fadump_crash(void);
@@ -41,4 +38,10 @@ static inline int is_fadump_active(void) { return 0; }
 static inline int should_fadump_crash(void) { return 0; }
 static inline void crash_fadump(struct pt_regs *regs, const char *str) { }
 #endif /* !CONFIG_FA_DUMP */
+
+#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
+extern int early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
+ int depth, void *data);
+extern int fadump_reserve_mem(void);
+#endif
 #endif /* __PPC64_FA_DUMP_H__ */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index fbecfba..42c24f8 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -65,7 +65,11 @@ obj-$(CONFIG_EEH)  += eeh.o eeh_pe.o eeh_dev.o 
eeh_cache.o \
  eeh_driver.o eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_GENERIC_TBSYNC)   += smp-tbsync.o
 obj-$(CONFIG_CRASH_DUMP)   += crash_dump.o
-obj-$(CONFIG_FA_DUMP)  += fadump.o fadump-common.o
+ifeq ($(CONFIG_FA_DUMP),y)
+obj-y  += fadump.o fadump-common.o
+else
+obj-$(CONFIG_PRESERVE_FA_DUMP) += fadump.o
+endif
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
 endif
diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index 8d47382..1bd3aeb 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -13,6 +13,7 @@
 #ifndef __PPC64_FA_DUMP_INTERNAL_H__
 #define __PPC64_FA_DUMP_INTERNAL_H__
 
+#ifndef CONFIG_PRESERVE_FA_DUMP
 /*
  * The RMA region will be saved for later dumping when kernel crashes.
  * RMA is Real Mode Area, the first block of logical memory address owned
@@ -88,6 +89,7 @@ struct fadump_crash_info_header {
 
 /* Platform specific callback functions */
 struct fadump_ops;
+#endif /* !CONFIG_PRESERVE_FA_DUMP */
 
 /* Firmware-Assisted Dump platforms */
 enum fadump_platform_type {
@@ -157,9 +159,12 @@ struct fw_dump {
unsigned long   nocma:1;
 
enum fadump_platform_type   fadump_platform;
+#ifndef CONFIG_PRESERVE_FA_DUMP
struct fadump_ops   *ops;
+#endif
 };
 
+#ifndef CONFIG_PRESERVE_FA_DUMP
 struct fadump_ops {
ulong   (*init_fadump_mem_struct)(struct fw_dump *fadump_config);
int (*register_fadump)(struct fw_dump *fadump_config);
@@ -181,8 +186,9 @@ u32 *fadump_regs_to_elf_notes(u32 *buf, struct pt_regs 
*regs);
 void fadump_update_elfcore_header(struct fw_dump *fadump_config, char *bufp);
 int is_boot_memory_area_contiguous(struct fw_dump *fadump_conf);
 int is_reserved_memory_area_contiguous(struct fw_dump *fadump_conf);
+#endif /* 

[PATCH v2 09/16] powernv/fadump: process architected register state data provided by firmware

2019-04-16 Thread Hari Bathini
From: Hari Bathini 

Firmware provides architected register state data at the time of crash.
Process this data and build CPU notes to append to ELF core.

Signed-off-by: Hari Bathini 
Signed-off-by: Vasant Hegde 
---

Changes in v2:
* Updated reg type values according to recent OPAL changes


 arch/powerpc/include/asm/opal-api.h  |   23 +++
 arch/powerpc/kernel/fadump-common.h  |3 
 arch/powerpc/platforms/powernv/opal-fadump.c |  187 --
 arch/powerpc/platforms/powernv/opal-fadump.h |4 +
 4 files changed, 206 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 75471c2..91f2735 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -976,6 +976,29 @@ struct opal_sg_list {
  * Firmware-Assisted Dump (FADump)
  */
 
+/* FADump thread header for register entries */
+struct opal_fadump_thread_hdr {
+   __be32  pir;
+   /* 0x00 - 0x0F - The corresponding stop state of the core */
+   u8  core_state;
+   u8  reserved[3];
+
+   __be32  offset; /* Offset to Register Entries array */
+   __be32  ecnt;   /* Number of entries */
+   __be32  esize;  /* Alloc size of each array entry in bytes */
+   __be32  eactsz; /* Actual size of each array entry in bytes */
+} __packed;
+
+#define OPAL_REG_TYPE_GPR  0x01
+#define OPAL_REG_TYPE_SPR  0x02
+
+/* FADump register entry. */
+struct opal_fadump_reg_entry {
+   __be32  reg_type;
+   __be32  reg_num;
+   __be64  reg_val;
+};
+
 /* The maximum number of dump sections supported by OPAL */
 #define OPAL_FADUMP_NR_SECTIONS64
 
diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index ff764d4..8d47382 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -117,6 +117,9 @@ struct fadump_memory_range {
 
 /* Firmware-assisted dump configuration details. */
 struct fw_dump {
+   unsigned long   cpu_state_destination_addr;
+   unsigned long   cpu_state_data_version;
+   unsigned long   cpu_state_entry_size;
unsigned long   cpu_state_data_size;
unsigned long   hpte_region_size;
unsigned long   boot_memory_size;
diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
b/arch/powerpc/platforms/powernv/opal-fadump.c
index da8480d..853f663 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -94,6 +94,12 @@ static void update_fadump_config(struct fw_dump *fadump_conf,
 
last_end = base + size;
j++;
+   } else if (fdm->section[i].src_type ==
+  OPAL_FADUMP_CPU_STATE_DATA) {
+   fadump_conf->cpu_state_destination_addr =
+   be64_to_cpu(fdm->section[i].dest_addr);
+   fadump_conf->cpu_state_data_size =
+   be64_to_cpu(fdm->section[i].dest_size);
}
}
fadump_conf->rmr_regions_cnt = j;
@@ -199,6 +205,75 @@ static int opal_invalidate_fadump(struct fw_dump 
*fadump_conf)
return 0;
 }
 
+static inline void fadump_set_regval_regnum(struct pt_regs *regs, u32 reg_type,
+   u32 reg_num, u64 reg_val)
+{
+   if (reg_type == OPAL_REG_TYPE_GPR) {
+   if (reg_num < 32)
+   regs->gpr[reg_num] = reg_val;
+   return;
+   }
+
+   switch (reg_num) {
+   case 2000:
+   regs->nip = reg_val;
+   break;
+   case 2001:
+   regs->msr = reg_val;
+   break;
+   case 9:
+   regs->ctr = reg_val;
+   break;
+   case 8:
+   regs->link = reg_val;
+   break;
+   case 1:
+   regs->xer = reg_val;
+   break;
+   case 2002:
+   regs->ccr = reg_val;
+   break;
+   case 19:
+   regs->dar = reg_val;
+   break;
+   case 18:
+   regs->dsisr = reg_val;
+   break;
+   }
+}
+
+static inline void fadump_read_registers(char *bufp, unsigned int regs_cnt,
+unsigned int reg_entry_size,
+struct pt_regs *regs)
+{
+   int i;
+   struct opal_fadump_reg_entry *reg_entry;
+
+   memset(regs, 0, sizeof(struct pt_regs));
+
+   for (i = 0; i < regs_cnt; i++, bufp += reg_entry_size) {
+   reg_entry = (struct opal_fadump_reg_entry *)bufp;
+   fadump_set_regval_regnum(regs,
+be32_to_cpu(reg_entry->reg_type),
+  

[PATCH v2 08/16] powerpc/fadump: consider reserved ranges while releasing memory

2019-04-16 Thread Hari Bathini
Commit 0962e8004e97 ("powerpc/prom: Scan reserved-ranges node for
memory reservations") enabled support to parse 'reserved-ranges' DT
node to reserve kernel memory falling in these ranges for firmware
purposes. Along with the preserved area memory, also ensure memory
in reserved ranges is not overlapped with memory released by the capture
kernel after saving the vmcore. Also, fix the off-by-one error in the
fadump_release_reserved_area() function while releasing memory.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump.c |   59 +-
 1 file changed, 41 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 39b6670..fd06571 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -123,7 +123,7 @@ static int __init fadump_cma_init(void) { return 1; }
  * Sort the reserved ranges in-place and merge adjacent ranges
  * to minimize the reserved ranges count.
  */
-static void __init sort_and_merge_reserved_ranges(void)
+static void sort_and_merge_reserved_ranges(void)
 {
unsigned long long base, size;
struct fadump_memory_range tmp_range;
@@ -164,8 +164,7 @@ static void __init sort_and_merge_reserved_ranges(void)
reserved_ranges_cnt = idx + 1;
 }
 
-static int __init add_reserved_range(unsigned long base,
-unsigned long size)
+static int add_reserved_range(unsigned long base, unsigned long size)
 {
int i;
 
@@ -1126,33 +1125,57 @@ static void fadump_release_reserved_area(unsigned long 
start, unsigned long end)
if (tend == end_pfn)
break;
 
-   start_pfn = tend + 1;
+   start_pfn = tend;
}
}
 }
 
 /*
- * Release the memory that was reserved in early boot to preserve the memory
- * contents. The released memory will be available for general use.
+ * Release the memory that was reserved during early boot to preserve the
+ * crash'ed kernel's memory contents except reserved dump area (permanent
+ * reservation) and reserved ranges used by F/W. The released memory will
+ * be available for general use.
  */
 static void fadump_release_memory(unsigned long begin, unsigned long end)
 {
+   int i;
unsigned long ra_start, ra_end;
-
-   ra_start = fw_dump.reserve_dump_area_start;
-   ra_end = ra_start + fw_dump.reserve_dump_area_size;
+   unsigned long tstart;
 
/*
-* exclude the dump reserve area. Will reuse it for next
-* fadump registration.
+* Add memory to permanently preserve to reserved ranges list
+* and exclude all these ranges while releasing memory.
 */
-   if (begin < ra_end && end > ra_start) {
-   if (begin < ra_start)
-   fadump_release_reserved_area(begin, ra_start);
-   if (end > ra_end)
-   fadump_release_reserved_area(ra_end, end);
-   } else
-   fadump_release_reserved_area(begin, end);
+   i = add_reserved_range(fw_dump.reserve_dump_area_start,
+  fw_dump.reserve_dump_area_size);
+   if (i == 0) {
+   /*
+* Reached the MAX reserved ranges count. To ensure reserved
+* dump area is excluded (as it will be reused for next
+* FADump registration), ignore the last reserved range and
+* add reserved dump area instead.
+*/
+   reserved_ranges_cnt--;
+   add_reserved_range(fw_dump.reserve_dump_area_start,
+  fw_dump.reserve_dump_area_size);
+   }
+   sort_and_merge_reserved_ranges();
+
+   tstart = begin;
+   for (i = 0; i < reserved_ranges_cnt; i++) {
+   ra_start = reserved_ranges[i].base;
+   ra_end = ra_start + reserved_ranges[i].size;
+
+   if (tstart >= ra_end)
+   continue;
+
+   if (tstart < ra_start)
+   fadump_release_reserved_area(tstart, ra_start);
+   tstart = ra_end;
+   }
+
+   if (tstart < end)
+   fadump_release_reserved_area(tstart, end);
 }
 
 static void fadump_invalidate_release_mem(void)



[PATCH v2 07/16] powerpc/fadump: consider reserved ranges while reserving memory

2019-04-16 Thread Hari Bathini
Commit 0962e8004e97 ("powerpc/prom: Scan reserved-ranges node for
memory reservations") enabled support to parse reserved-ranges DT
node and reserve kernel memory falling in these ranges for F/W
purposes. Ensure memory in these ranges is not overlapped with
memory reserved for FADump.

Also, when a previous attempt to reserve memory for FADump fails due
to memory holes and/or reserved ranges, skip ahead by a smaller offset
(instead of by the size of the memory to be reserved) before making
another reservation attempt, to reduce the likelihood of memory
reservation failure.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump-common.h |   11 +++
 arch/powerpc/kernel/fadump.c|  137 ++-
 2 files changed, 145 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index 8ad98db..ff764d4 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -101,6 +101,17 @@ struct fadump_memory_range {
unsigned long long  size;
 };
 
+/*
+ * Amount of memory (1024MB) to skip before making another attempt at
+ * reserving memory (after the previous attempt to reserve memory for
+ * FADump failed due to memory holes and/or reserved ranges) to reduce
+ * the likelihood of memory reservation failure.
+ */
+#define OFFSET_SIZE0x4000U
+
+/* Maximum no. of reserved ranges supported for processing. */
+#define MAX_RESERVED_RANGES128
+
 /* Maximum no. of real memory regions supported by the kernel */
 #define MAX_REAL_MEM_REGIONS   8
 
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 913ab6e..39b6670 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -53,6 +53,9 @@ int crash_memory_ranges_size;
 int crash_mem_ranges;
 int max_crash_mem_ranges;
 
+struct fadump_memory_range reserved_ranges[MAX_RESERVED_RANGES];
+int reserved_ranges_cnt;
+
 #ifdef CONFIG_CMA
 static struct cma *fadump_cma;
 
@@ -116,12 +119,116 @@ int __init fadump_cma_init(void)
 static int __init fadump_cma_init(void) { return 1; }
 #endif /* CONFIG_CMA */
 
+/*
+ * Sort the reserved ranges in-place and merge adjacent ranges
+ * to minimize the reserved ranges count.
+ */
+static void __init sort_and_merge_reserved_ranges(void)
+{
+   unsigned long long base, size;
+   struct fadump_memory_range tmp_range;
+   int i, j, idx;
+
+   if (!reserved_ranges_cnt)
+   return;
+
+   /* Sort the reserved ranges */
+   for (i = 0; i < reserved_ranges_cnt; i++) {
+   idx = i;
+   for (j = i + 1; j < reserved_ranges_cnt; j++) {
+   if (reserved_ranges[idx].base > reserved_ranges[j].base)
+   idx = j;
+   }
+   if (idx != i) {
+   tmp_range = reserved_ranges[idx];
+   reserved_ranges[idx] = reserved_ranges[i];
+   reserved_ranges[i] = tmp_range;
+   }
+   }
+
+   /* Merge adjacent reserved ranges */
+   idx = 0;
+   for (i = 1; i < reserved_ranges_cnt; i++) {
+   base = reserved_ranges[i-1].base;
+   size = reserved_ranges[i-1].size;
+   if (reserved_ranges[i].base == (base + size))
+   reserved_ranges[idx].size += reserved_ranges[i].size;
+   else {
+   idx++;
+   if (i == idx)
+   continue;
+
+   reserved_ranges[idx] = reserved_ranges[i];
+   }
+   }
+   reserved_ranges_cnt = idx + 1;
+}
+
+static int __init add_reserved_range(unsigned long base,
+unsigned long size)
+{
+   int i;
+
+   if (reserved_ranges_cnt == MAX_RESERVED_RANGES) {
+   /* Compact reserved ranges and try again. */
+   sort_and_merge_reserved_ranges();
+   if (reserved_ranges_cnt == MAX_RESERVED_RANGES)
+   return 0;
+   }
+
+   i = reserved_ranges_cnt++;
+   reserved_ranges[i].base = base;
+   reserved_ranges[i].size = size;
+   return 1;
+}
+
+/*
+ * Scan reserved-ranges to consider them while reserving/releasing
+ * memory for FADump.
+ */
+static void __init early_init_dt_scan_reserved_ranges(unsigned long node)
+{
+   int len, ret;
+   unsigned long i;
+   const __be32 *prop;
+
+   /* reserved-ranges already scanned */
+   if (reserved_ranges_cnt != 0)
+   return;
+
+   prop = of_get_flat_dt_prop(node, "reserved-ranges", &len);
+
+   if (!prop)
+   return;
+
+   /*
+* Each reserved range is an (address,size) pair, 2 cells each,
+* totalling 4 cells per range.
+*/
+   for (i = 0; i < len / (sizeof(*prop) * 4); i++) {
+   u64 

[PATCH v2 06/16] powerpc/fadump: Update documentation about OPAL platform support

2019-04-16 Thread Hari Bathini
With FADump support now available on both pseries and OPAL platforms,
update FADump documentation with these details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   90 --
 1 file changed, 51 insertions(+), 39 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index 62e75ef..844a229 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -70,7 +70,8 @@ as follows:
normal.
 
 -- The freshly booted kernel will notice that there is a new
-   node (ibm,dump-kernel) in the device tree, indicating that
+   node (ibm,dump-kernel on PSeries or ibm,opal/dump/result-table
+   on OPAL platform) in the device tree, indicating that
there is crash data available from a previous boot. During
the early boot OS will reserve rest of the memory above
boot memory size effectively booting with restricted memory
@@ -93,7 +94,9 @@ as follows:
 
 Please note that the firmware-assisted dump feature
 is only available on Power6 and above systems with recent
-firmware versions.
+firmware versions on PSeries (PowerVM) platform and Power9
+and above systems with recent firmware versions on PowerNV
+(OPAL) platform.
 
 Implementation details:
 --
@@ -108,57 +111,66 @@ that are run. If there is dump data, then the
 /sys/kernel/fadump_release_mem file is created, and the reserved
 memory is held.
 
-If there is no waiting dump data, then only the memory required
-to hold CPU state, HPTE region, boot memory dump and elfcore
-header, is usually reserved at an offset greater than boot memory
-size (see Fig. 1). This area is *not* released: this region will
-be kept permanently reserved, so that it can act as a receptacle
-for a copy of the boot memory content in addition to CPU state
-and HPTE region, in the case a crash does occur. Since this reserved
-memory area is used only after the system crash, there is no point in
-blocking this significant chunk of memory from production kernel.
-Hence, the implementation uses the Linux kernel's Contiguous Memory
-Allocator (CMA) for memory reservation if CMA is configured for kernel.
-With CMA reservation this memory will be available for applications to
-use it, while kernel is prevented from using it. With this FADump will
-still be able to capture all of the kernel memory and most of the user
-space memory except the user pages that were present in CMA region.
+If there is no waiting dump data, then only the memory required to
+hold CPU state, HPTE region, boot memory dump, FADump header and
+elfcore header, is usually reserved at an offset greater than boot
+memory size (see Fig. 1). This area is *not* released: this region
+will be kept permanently reserved, so that it can act as a receptacle
+for a copy of the boot memory content in addition to CPU state and
+HPTE region, in the case a crash does occur.
+
+Since this reserved memory area is used only after the system crash,
+there is no point in blocking this significant chunk of memory from
+production kernel. Hence, the implementation uses the Linux kernel's
+Contiguous Memory Allocator (CMA) for memory reservation if CMA is
+configured for kernel. With CMA reservation this memory will be
+available for applications to use it, while kernel is prevented from
+using it. With this FADump will still be able to capture all of the
+kernel memory and most of the user space memory except the user pages
+that were present in CMA region.
 
   o Memory Reservation during first kernel
 
-  Low memoryTop of memory
-  0  boot memory size  |<--Reserved dump area --->|  |
-  |   ||   Permanent Reservation  |  |
-  V   V|   (Preserve area)|  V
-  +---+--/ /---+---+++---++--+
-  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
-  +---+--/ /---+---+++---++--+
-|   ^  ^
-|   |  |
-\   /  |
- --- FADump Header
-  Boot memory content gets transferred   (meta area)
-  to reserved area by firmware at the
-  time of crash
-
+  Low memory Top of memory
+  0  boot memory size|<--- Reserved dump area --->|   |
+  |   |  |Permanent Reservatio|   |
+  V   V  |   (Preserve area)  |   V
+  +---+/ /---+---++---+-+-+---+
+  |   |  |///||  DUMP | HDR | ELF |   |
+  +---+/ /---+---++---+-+-+---+
+| 

[PATCH v2 05/16] powerpc/fadump: enable fadump support on OPAL based POWER platform

2019-04-16 Thread Hari Bathini
From: Hari Bathini 

Firmware-assisted dump support is enabled for OPAL based POWER platforms
in P9 firmware. Make the corresponding updates in kernel to enable fadump
support for such platforms.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Updated API number for FADump according to recent OPAL changes


 arch/powerpc/Kconfig |5 
 arch/powerpc/include/asm/opal-api.h  |   35 ++
 arch/powerpc/include/asm/opal.h  |1 
 arch/powerpc/kernel/fadump-common.c  |   27 ++
 arch/powerpc/kernel/fadump-common.h  |   44 ++-
 arch/powerpc/kernel/fadump.c |  259 ++
 arch/powerpc/platforms/powernv/Makefile  |1 
 arch/powerpc/platforms/powernv/opal-call.c   |1 
 arch/powerpc/platforms/powernv/opal-fadump.c |  375 ++
 arch/powerpc/platforms/powernv/opal-fadump.h |   40 +++
 arch/powerpc/platforms/pseries/rtas-fadump.c |   18 -
 11 files changed, 716 insertions(+), 90 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d0be82..2366a84 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -556,7 +556,7 @@ config CRASH_DUMP
 
 config FA_DUMP
bool "Firmware-assisted dump"
-   depends on PPC64 && PPC_RTAS
+   depends on PPC64 && (PPC_RTAS || PPC_POWERNV)
select CRASH_CORE
select CRASH_DUMP
help
@@ -567,7 +567,8 @@ config FA_DUMP
  is meant to be a kdump replacement offering robustness and
  speed not possible without system firmware assistance.
 
- If unsure, say "N"
+ If unsure, say "y". Only special kernels like petitboot may
+ need to say "N" here.
 
 config IRQ_ALL_CPUS
bool "Distribute interrupts on all CPUs by default"
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 870fb7b..75471c2 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -210,7 +210,8 @@
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   165
 #defineOPAL_NX_COPROC_INIT 167
-#define OPAL_LAST  167
+#define OPAL_CONFIGURE_FADUMP  173
+#define OPAL_LAST  173
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
@@ -972,6 +973,37 @@ struct opal_sg_list {
 };
 
 /*
+ * Firmware-Assisted Dump (FADump)
+ */
+
+/* The maximum number of dump sections supported by OPAL */
+#define OPAL_FADUMP_NR_SECTIONS64
+
+/* Kernel Dump section info */
+struct opal_fadump_section {
+   u8  src_type;
+   u8  reserved[7];
+   __be64  src_addr;
+   __be64  src_size;
+   __be64  dest_addr;
+   __be64  dest_size;
+};
+
+/*
+ * FADump memory structure for registering dump support with
+ * POWER f/w through opal call.
+ */
+struct opal_fadump_mem_struct {
+
+   __be16  section_size;   /* sizeof(struct fadump_section) */
+   __be16  section_count;  /* number of sections */
+   __be32  crashing_cpu;   /* Thread on which OPAL crashed */
+   __be64  reserved;
+
+   struct opal_fadump_section  section[OPAL_FADUMP_NR_SECTIONS];
+};
+
+/*
  * Dump region ID range usable by the OS
  */
 #define OPAL_DUMP_REGION_HOST_START0x80
@@ -1051,6 +1083,7 @@ enum {
OPAL_REBOOT_NORMAL  = 0,
OPAL_REBOOT_PLATFORM_ERROR  = 1,
OPAL_REBOOT_FULL_IPL= 2,
+   OPAL_REBOOT_OS_ERROR= 3,
 };
 
 /* Argument to OPAL_PCI_TCE_KILL */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a55b01c..2123b3f 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -43,6 +43,7 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t 
bdfn,
uint64_t PE_handle);
 int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
uint64_t rate_phys, uint32_t size);
+int64_t opal_configure_fadump(uint64_t command, void *data, uint64_t 
data_size);
 int64_t opal_console_write(int64_t term_number, __be64 *length,
   const uint8_t *buffer);
 int64_t opal_console_read(int64_t term_number, __be64 *length,
diff --git a/arch/powerpc/kernel/fadump-common.c 
b/arch/powerpc/kernel/fadump-common.c
index 0182886..514bbb5 100644
--- a/arch/powerpc/kernel/fadump-common.c
+++ b/arch/powerpc/kernel/fadump-common.c
@@ -10,6 +10,9 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#undef DEBUG
+#define pr_fmt(fmt) "fadump: " fmt
+
 #include 
 #include 
 #include 
@@ -48,6 +51,15 @@ void 

[PATCH v2 04/16] powerpc/fadump: use FADump instead of fadump for how it is pronounced

2019-04-16 Thread Hari Bathini
Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   56 +++---
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index 059993b..62e75ef 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -8,18 +8,18 @@ a crashed system, and to do so from a fully-reset system, and
 to minimize the total elapsed time until the system is back
 in production use.
 
-- Firmware assisted dump (fadump) infrastructure is intended to replace
+- Firmware-Assisted Dump (FADump) infrastructure is intended to replace
   the existing phyp assisted dump.
 - Fadump uses the same firmware interfaces and memory reservation model
   as phyp assisted dump.
-- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
+- Unlike phyp dump, FADump exports the memory dump through /proc/vmcore
   in the ELF format in the same way as kdump. This helps us reuse the
   kdump infrastructure for dump capture and filtering.
 - Unlike phyp dump, userspace tool does not need to refer any sysfs
   interface while reading /proc/vmcore.
-- Unlike phyp dump, fadump allows user to release all the memory reserved
+- Unlike phyp dump, FADump allows user to release all the memory reserved
   for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
-- Once enabled through kernel boot parameter, fadump can be
+- Once enabled through kernel boot parameter, FADump can be
   started/stopped through /sys/kernel/fadump_registered interface (see
   sysfs files section below) and can be easily integrated with kdump
   service start/stop init scripts.
@@ -33,7 +33,7 @@ dump offers several strong, practical advantages:
in a clean, consistent state.
 -- Once the dump is copied out, the memory that held the dump
is immediately available to the running kernel. And therefore,
-   unlike kdump, fadump doesn't need a 2nd reboot to get back
+   unlike kdump, FADump doesn't need a 2nd reboot to get back
the system to the production configuration.
 
 The above can only be accomplished by coordination with,
@@ -61,7 +61,7 @@ as follows:
  boot successfully. For syntax of crashkernel= parameter,
  refer to Documentation/kdump/kdump.txt. If any offset is
  provided in crashkernel= parameter, it will be ignored
- as fadump uses a predefined offset to reserve memory
+ as FADump uses a predefined offset to reserve memory
  for boot memory dump preservation in case of a crash.
 
 -- After the low memory (boot memory) area has been saved, the
@@ -120,7 +120,7 @@ blocking this significant chunk of memory from production 
kernel.
 Hence, the implementation uses the Linux kernel's Contiguous Memory
 Allocator (CMA) for memory reservation if CMA is configured for kernel.
 With CMA reservation this memory will be available for applications to
-use it, while kernel is prevented from using it. With this fadump will
+use it, while kernel is prevented from using it. With this FADump will
 still be able to capture all of the kernel memory and most of the user
 space memory except the user pages that were present in CMA region.
 
@@ -170,14 +170,14 @@ KDump, as dump mechanism.
 The tools to examine the dump will be same as the ones
 used for kdump.
 
-How to enable firmware-assisted dump (fadump):
+How to enable firmware-assisted dump (FADump):
 -
 
 1. Set config option CONFIG_FA_DUMP=y and build kernel.
-2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
-   By default, fadump reserved memory will be initialized as CMA area.
-   Alternatively, user can boot linux kernel with 'fadump=nocma' to
-   prevent fadump to use CMA.
+2. Boot into linux kernel with 'FADump=on' kernel cmdline option.
+   By default, FADump reserved memory will be initialized as CMA area.
+   Alternatively, user can boot linux kernel with 'FADump=nocma' to
+   prevent FADump to use CMA.
 3. Optionally, user can also set 'crashkernel=' kernel cmdline
to specify size of the memory to reserve for boot memory dump
preservation.
@@ -190,7 +190,7 @@ NOTE: 1. 'fadump_reserve_mem=' parameter has been 
deprecated. Instead
  option is set at kernel cmdline.
   3. if user wants to capture all of user space memory and ok with
  reserved memory not available to production system, then
- 'fadump=nocma' kernel parameter can be used to fallback to
+ 'FADump=nocma' kernel parameter can be used to fallback to
  old behaviour.
 
 Sysfs/debugfs files:
@@ -203,29 +203,29 @@ Here is the list of files under kernel sysfs:
 
  /sys/kernel/fadump_enabled
 
-This is used to display the fadump status.
-0 = fadump is disabled
-1 = fadump is enabled
+This is used to display the FADump status.
+0 = FADump is 

[PATCH v2 03/16] pseries/fadump: move out platform specific support from generic code

2019-04-16 Thread Hari Bathini
Introduce callbacks for platform specific operations like register,
unregister, invalidate & such, and move pseries specific code into
platform code.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* pSeries specific fadump code files are named rtas-fadump.*
  instead of pseries_fadump.*


 arch/powerpc/include/asm/fadump.h|   75 
 arch/powerpc/kernel/fadump-common.h  |   39 ++
 arch/powerpc/kernel/fadump.c |  501 ++--
 arch/powerpc/platforms/pseries/Makefile  |1 
 arch/powerpc/platforms/pseries/rtas-fadump.c |  538 ++
 arch/powerpc/platforms/pseries/rtas-fadump.h |   96 +
 6 files changed, 711 insertions(+), 539 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.c
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.h

diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index 028a8ef..d27cde7 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -24,79 +24,8 @@
 
 #ifdef CONFIG_FA_DUMP
 
-/* Firmware provided dump sections */
-#define FADUMP_CPU_STATE_DATA  0x0001
-#define FADUMP_HPTE_REGION 0x0002
-#define FADUMP_REAL_MODE_REGION0x0011
-
-/* Dump request flag */
-#define FADUMP_REQUEST_FLAG0x0001
-
-/* Dump status flag */
-#define FADUMP_ERROR_FLAG  0x2000
-
-/* Utility macros */
-#define SKIP_TO_NEXT_CPU(reg_entry)\
-({ \
-   while (be64_to_cpu(reg_entry->reg_id) != REG_ID("CPUEND"))  \
-   reg_entry++;\
-   reg_entry++;\
-})
-
 extern int crashing_cpu;
 
-/* Kernel Dump section info */
-struct fadump_section {
-   __be32  request_flag;
-   __be16  source_data_type;
-   __be16  error_flags;
-   __be64  source_address;
-   __be64  source_len;
-   __be64  bytes_dumped;
-   __be64  destination_address;
-};
-
-/* ibm,configure-kernel-dump header. */
-struct fadump_section_header {
-   __be32  dump_format_version;
-   __be16  dump_num_sections;
-   __be16  dump_status_flag;
-   __be32  offset_first_dump_section;
-
-   /* Fields for disk dump option. */
-   __be32  dd_block_size;
-   __be64  dd_block_offset;
-   __be64  dd_num_blocks;
-   __be32  dd_offset_disk_path;
-
-   /* Maximum time allowed to prevent an automatic dump-reboot. */
-   __be32  max_time_auto;
-};
-
-/*
- * Firmware Assisted dump memory structure. This structure is required for
- * registering future kernel dump with power firmware through rtas call.
- *
- * No disk dump option. Hence disk dump path string section is not included.
- */
-struct fadump_mem_struct {
-   struct fadump_section_headerheader;
-
-   /* Kernel dump sections */
-   struct fadump_section   cpu_state_data;
-   struct fadump_section   hpte_region;
-   struct fadump_section   rmr_region;
-};
-
-#define REGSAVE_AREA_MAGIC STR_TO_HEX("REGSAVE")
-
-/* Register save area header. */
-struct fadump_reg_save_area_header {
-   __be64  magic_number;
-   __be32  version;
-   __be32  num_cpu_offset;
-};
-
 extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
  int depth, void *data);
@@ -111,5 +40,5 @@ extern void fadump_cleanup(void);
 static inline int is_fadump_active(void) { return 0; }
 static inline int should_fadump_crash(void) { return 0; }
 static inline void crash_fadump(struct pt_regs *regs, const char *str) { }
-#endif
-#endif
+#endif /* !CONFIG_FA_DUMP */
+#endif /* __PPC64_FA_DUMP_H__ */
diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index 8ccd96d..f926145 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -47,6 +47,12 @@
 #define FADUMP_UNREGISTER  2
 #define FADUMP_INVALIDATE  3
 
+/* Firmware-Assisted Dump platforms */
+enum fadump_platform_type {
+   FADUMP_PLATFORM_UNKNOWN = 0,
+   FADUMP_PLATFORM_PSERIES,
+};
+
 #define FADUMP_CPU_ID_MASK ((1UL << 32) - 1)
 
 #define CPU_UNKNOWN   (~((u32)0))
@@ -91,6 +97,9 @@ struct fad_crash_memory_ranges {
unsigned long long  size;
 };
 
+/* Platform specific callback functions */
+struct fadump_ops;
+
 /* Firmware-assisted dump configuration details. */
 struct fw_dump {
unsigned long   cpu_state_data_size;
@@ -98,6 +107,8 @@ struct fw_dump {
unsigned long   boot_memory_size;
unsigned long   reserve_dump_area_start;
unsigned long   reserve_dump_area_size;
+   unsigned long   meta_area_start;
+   unsigned long   

[PATCH v2 02/16] powerpc/fadump: Improve fadump documentation

2019-04-16 Thread Hari Bathini
The figures depicting FADump's (Firmware-Assisted Dump) memory layout
are missing some finer details like different memory regions and what
they represent. Improve the documentation by updating those details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   65 --
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index 18c5fee..059993b 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -74,8 +74,9 @@ as follows:
there is crash data available from a previous boot. During
the early boot OS will reserve rest of the memory above
boot memory size effectively booting with restricted memory
-   size. This will make sure that the second kernel will not
-   touch any of the dump memory area.
+   size. This will make sure that this kernel (also, referred
+   to as second kernel or capture kernel) will not touch any
+   of the dump memory area.
 
 -- User-space tools will read /proc/vmcore to obtain the contents
of memory, which holds the previous crashed kernel dump in ELF
@@ -125,48 +126,52 @@ space memory except the user pages that were present in 
CMA region.
 
   o Memory Reservation during first kernel
 
-  Low memory Top of memory
-  0  boot memory size   |
-  |   ||<--Reserved dump area -->|  |
-  V   V|   Permanent Reservation |  V
-  +---+--/ /---+---++---++--+
-  |   ||CPU|HPTE|  DUMP |ELF |  |
-  +---+--/ /---+---++---++--+
-|   ^
-|   |
-\   /
- ---
-  Boot memory content gets transferred to
-  reserved area by firmware at the time of
-  crash
+  Low memoryTop of memory
+  0  boot memory size  |<--Reserved dump area --->|  |
+  |   ||   Permanent Reservation  |  |
+  V   V|   (Preserve area)|  V
+  +---+--/ /---+---+++---++--+
+  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
+  +---+--/ /---+---+++---++--+
+|   ^  ^
+|   |  |
+\   /  |
+ --- FADump Header
+  Boot memory content gets transferred   (meta area)
+  to reserved area by firmware at the
+  time of crash
+
Fig. 1
 
+
   o Memory Reservation during second kernel after crash
 
-  Low memoryTop of memory
-  0  boot memory size   |
-  |   |<- Reserved dump area --- -->|
-  V   V V
-  +---+--/ /---+---++---++--+
-  |   ||CPU|HPTE|  DUMP |ELF |  |
-  +---+--/ /---+---++---++--+
+  Low memoryTop of memory
+  0  boot memory size|
+  |   |<- Reserved dump area --->|
+  V   V|< Preserve area ->|  V
+  +---+--/ /---+---+++---++--+
+  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
+  +---+--/ /---+---+++---++--+
 |  |
 V  V
Used by second/proc/vmcore
kernel to boot
Fig. 2
 
-Currently the dump will be copied from /proc/vmcore to a
-a new file upon user intervention. The dump data available through
-/proc/vmcore will be in ELF format. Hence the existing kdump
-infrastructure (kdump scripts) to save the dump works fine with
-minor modifications.
+Currently the dump will be copied from /proc/vmcore to a new file upon
+user intervention. The dump data available through /proc/vmcore will be
+in ELF format. Hence the existing kdump infrastructure (kdump scripts)
+to save the dump works fine with minor modifications. KDump scripts on
+major Distro releases have already been modified to work seamlessly (no
+user intervention in saving the dump) when FADump is used, 

[PATCH v2 01/16] powerpc/fadump: move internal fadump code to a new file

2019-04-16 Thread Hari Bathini
Refactoring fadump code means internal fadump code is referenced from
different places. For ease, move internal code to a new file.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Using fadump-common.* instead of fadump_internal.*


 arch/powerpc/include/asm/fadump.h   |  112 
 arch/powerpc/kernel/Makefile|2 
 arch/powerpc/kernel/fadump-common.c |  184 +
 arch/powerpc/kernel/fadump-common.h |  126 +++
 arch/powerpc/kernel/fadump.c|  194 ++-
 5 files changed, 324 insertions(+), 294 deletions(-)
 create mode 100644 arch/powerpc/kernel/fadump-common.c
 create mode 100644 arch/powerpc/kernel/fadump-common.h

diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index 188776b..028a8ef 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -24,34 +24,6 @@
 
 #ifdef CONFIG_FA_DUMP
 
-/*
- * The RMA region will be saved for later dumping when kernel crashes.
- * RMA is Real Mode Area, the first block of logical memory address owned
- * by logical partition, containing the storage that may be accessed with
- * translate off.
- */
-#define RMA_START  0x0
-#define RMA_END(ppc64_rma_size)
-
-/*
- * On some Power systems where RMO is 128MB, it still requires minimum of
- * 256MB for kernel to boot successfully. When kdump infrastructure is
- * configured to save vmcore over network, we run into OOM issue while
- * loading modules related to network setup. Hence we need aditional 64M
- * of memory to avoid OOM issue.
- */
-#define MIN_BOOT_MEM   (((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : RMA_END) \
-   + (0x1UL << 26))
-
-/* The upper limit percentage for user specified boot memory size (25%) */
-#define MAX_BOOT_MEM_RATIO 4
-
-#define memblock_num_regions(memblock_type)   (memblock.memblock_type.cnt)
-
-/* Alignement per CMA requirement. */
-#define FADUMP_CMA_ALIGNMENT   (PAGE_SIZE <<   \
-   max_t(unsigned long, MAX_ORDER - 1, pageblock_order))
-
 /* Firmware provided dump sections */
 #define FADUMP_CPU_STATE_DATA  0x0001
 #define FADUMP_HPTE_REGION 0x0002
@@ -60,18 +32,9 @@
 /* Dump request flag */
 #define FADUMP_REQUEST_FLAG   0x0001
 
-/* FAD commands */
-#define FADUMP_REGISTER   1
-#define FADUMP_UNREGISTER  2
-#define FADUMP_INVALIDATE  3
-
 /* Dump status flag */
 #define FADUMP_ERROR_FLAG  0x2000
 
-#define FADUMP_CPU_ID_MASK ((1UL << 32) - 1)
-
-#define CPU_UNKNOWN   (~((u32)0))
-
 /* Utility macros */
 #define SKIP_TO_NEXT_CPU(reg_entry)\
 ({ \
@@ -125,59 +88,8 @@ struct fadump_mem_struct {
struct fadump_section   rmr_region;
 };
 
-/* Firmware-assisted dump configuration details. */
-struct fw_dump {
-   unsigned long   cpu_state_data_size;
-   unsigned long   hpte_region_size;
-   unsigned long   boot_memory_size;
-   unsigned long   reserve_dump_area_start;
-   unsigned long   reserve_dump_area_size;
-   /* cmd line option during boot */
-   unsigned long   reserve_bootvar;
-
-   unsigned long   fadumphdr_addr;
-   unsigned long   cpu_notes_buf;
-   unsigned long   cpu_notes_buf_size;
-
-   int ibm_configure_kernel_dump;
-
-   unsigned long   fadump_enabled:1;
-   unsigned long   fadump_supported:1;
-   unsigned long   dump_active:1;
-   unsigned long   dump_registered:1;
-   unsigned long   nocma:1;
-};
-
-/*
- * Copy the ascii values for first 8 characters from a string into u64
- * variable at their respective indexes.
- * e.g.
- *  The string "FADMPINF" will be converted into 0x4641444d50494e46
- */
-static inline u64 str_to_u64(const char *str)
-{
-   u64 val = 0;
-   int i;
-
-   for (i = 0; i < sizeof(val); i++)
-   val = (*str) ? (val << 8) | *str++ : val << 8;
-   return val;
-}
-#define STR_TO_HEX(x)  str_to_u64(x)
-#define REG_ID(x)  str_to_u64(x)
-
-#define FADUMP_CRASH_INFO_MAGIC   STR_TO_HEX("FADMPINF")
 #define REGSAVE_AREA_MAGIC STR_TO_HEX("REGSAVE")
 
-/* The firmware-assisted dump format.
- *
- * The register save area is an area in the partition's memory used to preserve
- * the register contents (CPU state data) for the active CPUs during a firmware
- * assisted dump. The dump format contains register save area header followed
- * by register entries. Each list of registers for a CPU starts with
- * "CPUSTRT" and ends with "CPUEND".
- */
-
 /* Register save area header. */
 struct fadump_reg_save_area_header {
__be64  magic_number;
@@ -185,29 +97,9 @@ struct fadump_reg_save_area_header {
__be32  num_cpu_offset;
 };
 
-/* Register entry. */
-struct 

[PATCH v2 00/16] Add FADump support on PowerNV platform

2019-04-16 Thread Hari Bathini
Firmware-Assisted Dump (FADump) is currently supported only on the pseries
platform. This patch series adds support for the powernv platform too.

The first and third patches refactor the FADump code to make use of common
code across multiple platforms. The fifth patch adds basic FADump support
for powernv platform. Patches seven & eight honour reserved-ranges DT node
while reserving/releasing memory used by FADump. The next patch processes
CPU state data provided by firmware to create and append core notes to the
ELF core file. The tenth patch adds support for preserving crash data for
subsequent boots (useful in cases like petitboot). Patch twelve provides
support to export opalcore. This is to make debugging of failures in OPAL
code easier. The subsequent patch ensures vmcore processing is skipped
when only OPAL core is exported by f/w. The next patch provides option to
release the kernel memory used to export opalcore. The remaining patches
update Firmware-Assisted Dump documentation appropriately.

The patch series is tested with the latest firmware plus the below skiboot
changes for MPIPL support:

https://patchwork.ozlabs.org/project/skiboot/list/?series=102588
("MPIPL support")


Changes in v2:
  * Rebased to latest upstream kernel version.
  * Updated according to latest OPAL changes.
  * Dropped patch seventeen from the previous version, as the increase in
    robustness due to it doesn't seem worth breaking backward compatibility
    for older kernel versions.
---

Hari Bathini (16):
  powerpc/fadump: move internal fadump code to a new file
  powerpc/fadump: Improve fadump documentation
  pseries/fadump: move out platform specific support from generic code
  powerpc/fadump: use FADump instead of fadump for how it is pronounced
  powerpc/fadump: enable fadump support on OPAL based POWER platform
  powerpc/fadump: Update documentation about OPAL platform support
  powerpc/fadump: consider reserved ranges while reserving memory
  powerpc/fadump: consider reserved ranges while releasing memory
  powernv/fadump: process architected register state data provided by 
firmware
  powernv/fadump: add support to preserve crash data on FADUMP disabled 
kernel
  powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
  powerpc/powernv: export /proc/opalcore for analysing opal crashes
  powernv/fadump: Skip processing /proc/vmcore when only OPAL core exists
  powernv/opalcore: provide an option to invalidate /proc/opalcore file
  powernv/fadump: consider f/w load area
  powernv/fadump: update documentation about option to release opalcore


 Documentation/powerpc/firmware-assisted-dump.txt |  193 ++--
 arch/powerpc/Kconfig |   23 
 arch/powerpc/include/asm/fadump.h|  190 
 arch/powerpc/include/asm/opal-api.h  |   58 +
 arch/powerpc/include/asm/opal.h  |1 
 arch/powerpc/kernel/Makefile |6 
 arch/powerpc/kernel/fadump-common.c  |  205 
 arch/powerpc/kernel/fadump-common.h  |  222 
 arch/powerpc/kernel/fadump.c | 1163 --
 arch/powerpc/kernel/prom.c   |4 
 arch/powerpc/platforms/powernv/Makefile  |3 
 arch/powerpc/platforms/powernv/opal-call.c   |1 
 arch/powerpc/platforms/powernv/opal-core.c   |  602 +++
 arch/powerpc/platforms/powernv/opal-fadump.c |  562 +++
 arch/powerpc/platforms/powernv/opal-fadump.h |  116 ++
 arch/powerpc/platforms/pseries/Makefile  |1 
 arch/powerpc/platforms/pseries/rtas-fadump.c |  534 ++
 arch/powerpc/platforms/pseries/rtas-fadump.h |   96 ++
 18 files changed, 2998 insertions(+), 982 deletions(-)
 create mode 100644 arch/powerpc/kernel/fadump-common.c
 create mode 100644 arch/powerpc/kernel/fadump-common.h
 create mode 100644 arch/powerpc/platforms/powernv/opal-core.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.h
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.c
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.h



[PATCH v3 8/8] powerpc/mm/hash: Rename KERNEL_REGION_ID to LINEAR_MAP_REGION_ID

2019-04-16 Thread Aneesh Kumar K.V
The region actually points to the linear map. Rename the #define to
clarify that.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash.h | 4 ++--
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 2 +-
 arch/powerpc/mm/copro_fault.c | 4 ++--
 arch/powerpc/mm/slb.c | 4 ++--
 arch/powerpc/platforms/cell/spu_base.c| 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index c6850a5a931d..e86c338f3ad7 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -77,7 +77,7 @@
  * Region IDs
  */
 #define USER_REGION_ID 0
-#define KERNEL_REGION_ID   1
+#define LINEAR_MAP_REGION_ID   1
 #define VMALLOC_REGION_ID  NON_LINEAR_REGION_ID(VMALLOC_START)
 #define IO_REGION_ID   NON_LINEAR_REGION_ID(KERN_IO_START)
 #define VMEMMAP_REGION_ID  NON_LINEAR_REGION_ID(VMEMMAP_BASE)
@@ -108,7 +108,7 @@ static inline int get_region_id(unsigned long ea)
return USER_REGION_ID;
 
if (ea < KERN_VIRT_START)
-   return KERNEL_REGION_ID;
+   return LINEAR_MAP_REGION_ID;
 
VM_BUG_ON(id != 0xc);
BUILD_BUG_ON(NON_LINEAR_REGION_ID(VMALLOC_START) != 2);
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index b146448109fd..5d2adf3c1325 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -788,7 +788,7 @@ static inline unsigned long get_kernel_context(unsigned 
long ea)
 * Depending on Kernel config, kernel region can have one context
 * or more.
 */
-   if (region_id == KERNEL_REGION_ID) {
+   if (region_id == LINEAR_MAP_REGION_ID) {
/*
 * We already verified ea to be not beyond the addr limit.
 */
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 9b0321061bc8..f137286740cb 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -129,8 +129,8 @@ int copro_calculate_slb(struct mm_struct *mm, u64 ea, 
struct copro_slb *slb)
vsid = get_kernel_vsid(ea, mmu_kernel_ssize);
vsidkey = SLB_VSID_KERNEL;
break;
-   case KERNEL_REGION_ID:
-   pr_devel("%s: 0x%llx -- KERNEL_REGION_ID\n", __func__, ea);
+   case LINEAR_MAP_REGION_ID:
+   pr_devel("%s: 0x%llx -- LINEAR_MAP_REGION_ID\n", __func__, ea);
psize = mmu_linear_psize;
ssize = mmu_kernel_ssize;
vsid = get_kernel_vsid(ea, mmu_kernel_ssize);
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 508573c56411..756cf087590b 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -691,7 +691,7 @@ static long slb_allocate_kernel(unsigned long ea, unsigned 
long id)
unsigned long flags;
int ssize;
 
-   if (id == KERNEL_REGION_ID) {
+   if (id == LINEAR_MAP_REGION_ID) {
 
/* We only support upto MAX_PHYSMEM_BITS */
if ((ea & EA_MASK) > (1UL << MAX_PHYSMEM_BITS))
@@ -790,7 +790,7 @@ long do_slb_fault(struct pt_regs *regs, unsigned long ea)
 * first class kernel code. But for performance it's probably nicer
 * if they go via fast_exception_return too.
 */
-   if (id >= KERNEL_REGION_ID) {
+   if (id >= LINEAR_MAP_REGION_ID) {
long err;
 #ifdef CONFIG_DEBUG_VM
/* Catch recursive kernel SLB faults. */
diff --git a/arch/powerpc/platforms/cell/spu_base.c 
b/arch/powerpc/platforms/cell/spu_base.c
index 4770cce1bfe2..6646f152d57b 100644
--- a/arch/powerpc/platforms/cell/spu_base.c
+++ b/arch/powerpc/platforms/cell/spu_base.c
@@ -224,7 +224,7 @@ static void __spu_kernel_slb(void *addr, struct copro_slb 
*slb)
unsigned long ea = (unsigned long)addr;
u64 llp;
 
-   if (get_region_id(ea) == KERNEL_REGION_ID)
+   if (get_region_id(ea) == LINEAR_MAP_REGION_ID)
llp = mmu_psize_defs[mmu_linear_psize].sllp;
else
llp = mmu_psize_defs[mmu_virtual_psize].sllp;
-- 
2.20.1



Re: [PATCH v4 0/5] powerpc/perf: IMC trace-mode support

2019-04-16 Thread Anju T Sudhakar



On 4/16/19 3:14 PM, Anju T Sudhakar wrote:

Hi,

Kindly ignore this series, since patch 5/5 in this series doesn't
incorporate the event-format change that I've done in v4 of this series.

Apologies for the inconvenience. I will post the updated v5 soon.



s/v5/v4



Thanks,

Anju

On 4/15/19 3:41 PM, Anju T Sudhakar wrote:

IMC (In-Memory Collection counters) is a hardware monitoring facility
that collects a large number of hardware performance events.
POWER9 supports two modes for IMC: Accumulation mode and Trace mode.
In Accumulation mode, event counts are accumulated in system memory and
the hypervisor then reads the posted counts periodically or when requested.
In IMC Trace mode, the 64-bit trace scom value is initialized with the
event information. The CPMC*SEL and CPMC_LOAD fields in the trace scom
specify the event to be monitored and the sampling duration. On each
overflow in CPMC*SEL, hardware snapshots the program counter along with
event counts and writes into the memory pointed to by LDBAR. LDBAR has
bits to indicate whether the hardware is configured for accumulation or
trace mode.
Currently the event monitored for trace mode is fixed as cycles.

Trace-IMC Implementation:
--
To enable trace-imc, we need to

* Add a trace node in the DTS file for power9, so that the new trace node
  can be discovered by the kernel.

The information included in the DTS file is as follows (a snippet from
the ima-catalog):

TRACE_IMC: trace-events {
    #address-cells = <0x1>;
    #size-cells = <0x1>;
    event@1020 {
        event-name = "cycles" ;
        reg = <0x1020 0x8>;
        desc = "Reference cycles" ;
    };
};
trace@0 {
    compatible = "ibm,imc-counters";
    events-prefix = "trace_";
    reg = <0x0 0x8>;
    events = < &TRACE_IMC >;
    type = <0x2>;
    size = <0x4>;
};

The OP-BUILD changes needed to include the "trace node" are already pulled
into the ima-catalog repo:

https://github.com/open-power/op-build/commit/d3e75dc26d1283d7d5eb444bff1ec9e40d5dfc07



* Enhance the opal_imc_counters_* calls to support this new trace mode
  in IMC. Add support to initialize the trace-mode scom.

TRACE_IMC_SCOM bit representation:

0:1 : SAMPSEL
2:33    : CPMC_LOAD
34:40   : CPMC1SEL
41:47   : CPMC2SEL
48:50   : BUFFERSIZE
51:63   : RESERVED

CPMC_LOAD contains the sampling duration. SAMPSEL and CPMC*SEL determine
the event to count. BUFFERSIZE indicates the memory range. On each
overflow, hardware snapshots the program counter along with event counts,
updates the memory and reloads the CPMC_LOAD value for the next sampling
duration. IMC hardware does not support exceptions, so it quietly wraps
around if the memory buffer reaches the end.

OPAL support for IMC trace mode is already upstream.

* Set LDBAR spr to enable imc-trace mode.
   LDBAR Layout:
   0 : Enable/Disable
   1 : 0 -> Accumulation Mode
   1 -> Trace Mode
   2:3   : Reserved
   4-6   : PB scope
   7 : Reserved
   8:50  : Counter Address
   51:63 : Reserved
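
As a rough illustration of the two layouts above (IBM bit numbering, bit 0
is the most significant bit; the helper and function names below are made
up for this sketch and are not the kernel's):

#include <stdint.h>

/* place value into the IBM-numbered bit range [start, end] of a 64-bit reg */
static inline uint64_t ibm_field(uint64_t val, int start, int end)
{
	return (val & ((1ULL << (end - start + 1)) - 1)) << (63 - end);
}

/* trace scom: SAMPSEL | CPMC_LOAD | CPMC1SEL | CPMC2SEL | BUFFERSIZE */
static uint64_t make_trace_scom(uint64_t sampsel, uint64_t load,
				uint64_t cpmc1, uint64_t cpmc2, uint64_t bufsz)
{
	return ibm_field(sampsel, 0, 1)  | ibm_field(load, 2, 33) |
	       ibm_field(cpmc1, 34, 40)  | ibm_field(cpmc2, 41, 47) |
	       ibm_field(bufsz, 48, 50);
}

/* LDBAR: enable | mode (0 = accumulation, 1 = trace) | counter address */
static uint64_t make_ldbar(uint64_t enable, uint64_t trace_mode, uint64_t addr)
{
	return ibm_field(enable, 0, 0) | ibm_field(trace_mode, 1, 1) |
	       ibm_field(addr, 8, 50);
}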

--

PMI interrupt handling is avoided, since IMC trace mode snapshots the
program counter and updates the memory directly. This also provides a way
for the operating system to do instruction sampling in real time without
PMI (Performance Monitoring Interrupt) processing overhead.
Performance data using 'perf top' with and without trace-imc event:

PMI interrupt count when the `perf top` command is executed without the
trace-imc event:


# cat /proc/interrupts  (a snippet from the output)
9944  1072    804    804   1644    804 1306
804    804    804    804    804 804    804
804    804   1961   1602    804    804 1258
[-]
803    803    803    803    803 803    803
803    803    803    803    804 804    804
804    804    804    804    804 804    803
803    803    803    803    803 1306    803
803   Performance monitoring interrupts


`perf top` with trace-imc (executed right after 'perf top' without 
trace-imc event):


# perf top -e trace_imc/trace_cycles/
12.50%  [kernel]  [k] arch_cpu_idle
11.81%  [kernel]  [k] __next_timer_interrupt
11.22%  [kernel]  [k] rcu_idle_enter
10.25%  [kernel]  [k] find_next_bit
  7.91%  [kernel]  [k] do_idle
  7.69%  [kernel]  [k] rcu_dynticks_eqs_exit
  5.20%  [kernel]  [k] tick_nohz_idle_stop_tick
  [---]

# cat /proc/interrupts (a snippet from the output)

9944  1072    804    804   1644    804 1306
804    804    804    804    804 804    804
804    804   1961   1602    804    804 1258
[-]
803    803    803    803    803 803    803
803    803    803    804 

[PATCH v3 7/8] powerpc/mm: Consolidate radix and hash address map details

2019-04-16 Thread Aneesh Kumar K.V
We now have

4K page size config

 kernel_region_map_size = 16TB
 kernel vmalloc start   = 0xc0001000
 kernel IO start= 0xc0002000
 kernel vmemmap start   = 0xc0003000

with 64K page size config:

 kernel_region_map_size = 512TB
 kernel vmalloc start   = 0xc008
 kernel IO start= 0xc00a
 kernel vmemmap start   = 0xc00c
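
(In other words, with the layout used by this series: vmalloc starts at
KERN_VIRT_START, the IO region starts one kernel_region_map_size above
that, and vmemmap starts two kernel_region_map_sizes above that.)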

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  | 21 -
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 18 -
 arch/powerpc/include/asm/book3s/64/hash.h | 28 ++-
 arch/powerpc/include/asm/book3s/64/map.h  | 80 +++
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 35 +---
 arch/powerpc/include/asm/book3s/64/radix.h| 19 -
 arch/powerpc/mm/hash_utils_64.c   | 11 +--
 arch/powerpc/mm/pgtable-hash64.c  |  2 +-
 arch/powerpc/mm/pgtable-radix.c   | 13 +--
 arch/powerpc/mm/pgtable_64.c  | 10 ---
 arch/powerpc/mm/ptdump/hashpagetable.c|  4 -
 arch/powerpc/mm/ptdump/ptdump.c   |  5 --
 arch/powerpc/mm/slb.c |  6 +-
 13 files changed, 101 insertions(+), 151 deletions(-)
 create mode 100644 arch/powerpc/include/asm/book3s/64/map.h

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 64eaf187f891..fa47d8a237b2 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -7,27 +7,6 @@
 #define H_PUD_INDEX_SIZE  9
 #define H_PGD_INDEX_SIZE  9
 
-/*
- * Each context is 512TB. But on 4k we restrict our max TASK size to 64TB
- * Hence also limit max EA bits to 64TB.
- */
-#define MAX_EA_BITS_PER_CONTEXT   46
-
-#define REGION_SHIFT   (MAX_EA_BITS_PER_CONTEXT - 2)
-
-/*
- * Our page table limit us to 64TB. Hence for the kernel mapping,
- * each MAP area is limited to 16 TB.
- * The four map areas are:  linear mapping, vmap, IO and vmemmap
- */
-#define H_KERN_MAP_SIZE   (ASM_CONST(1) << REGION_SHIFT)
-
-/*
- * Define the address range of the kernel non-linear virtual area
- * 16TB
- */
-#define H_KERN_VIRT_START  ASM_CONST(0xc0001000)
-
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE   (sizeof(pte_t) << H_PTE_INDEX_SIZE)
 #define H_PMD_TABLE_SIZE   (sizeof(pmd_t) << H_PMD_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 24ca63beba14..1deddf73033c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -7,24 +7,6 @@
 #define H_PUD_INDEX_SIZE  10
 #define H_PGD_INDEX_SIZE  8
 
-/*
- * Each context is 512TB size. SLB miss for first context/default context
- * is handled in the hotpath.
- */
-#define MAX_EA_BITS_PER_CONTEXT   49
-#define REGION_SHIFT   MAX_EA_BITS_PER_CONTEXT
-
-/*
- * We use one context for each MAP area.
- */
-#define H_KERN_MAP_SIZE   (1UL << MAX_EA_BITS_PER_CONTEXT)
-
-/*
- * Define the address range of the kernel non-linear virtual area
- * 2PB
- */
-#define H_KERN_VIRT_START  ASM_CONST(0xc008)
-
 /*
  * 64k aligned address free up few of the lower bits of RPN for us
  * We steal that here. For more deatils look at pte_pfn/pfn_pte()
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index cd9be5fb189b..c6850a5a931d 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -45,10 +45,6 @@
 #define H_PUD_CACHE_INDEX  (H_PUD_INDEX_SIZE)
 #endif
 
-/*
- * One context each will be used for vmap, IO and vmemmap
- */
-#define H_KERN_VIRT_SIZE   (H_KERN_MAP_SIZE * 3)
 /*
  * +--+
  * |  |
@@ -75,28 +71,16 @@
  * +--+  Kernel linear (0xc.)
  */
 
-#define H_VMALLOC_START   H_KERN_VIRT_START
-#define H_VMALLOC_SIZE H_KERN_MAP_SIZE
-#define H_VMALLOC_END  (H_VMALLOC_START + H_VMALLOC_SIZE)
-
-#define H_KERN_IO_START   H_VMALLOC_END
-#define H_KERN_IO_SIZE H_KERN_MAP_SIZE
-#define H_KERN_IO_END  (H_KERN_IO_START + H_KERN_IO_SIZE)
-
-#define H_VMEMMAP_START   H_KERN_IO_END
-#define H_VMEMMAP_SIZE H_KERN_MAP_SIZE
-#define H_VMEMMAP_END  (H_VMEMMAP_START + H_VMEMMAP_SIZE)
-
-#define NON_LINEAR_REGION_ID(ea)   ((((unsigned long)ea - H_KERN_VIRT_START) >> REGION_SHIFT) + 2)
+#define NON_LINEAR_REGION_ID(ea)   ((((unsigned long)ea - KERN_VIRT_START) >> REGION_SHIFT) + 2)
 
 /*
  * Region IDs
  */
 #define USER_REGION_ID 0
 #define KERNEL_REGION_ID   1
-#define VMALLOC_REGION_ID  NON_LINEAR_REGION_ID(H_VMALLOC_START)
-#define IO_REGION_ID   NON_LINEAR_REGION_ID(H_KERN_IO_START)
-#define VMEMMAP_REGION_ID  

[PATCH v3 6/8] powerpc/mm: Print kernel map details to dmesg

2019-04-16 Thread Aneesh Kumar K.V
This helps in debugging. We can look at the dmesg to find out
different kernel mapping details.

On 4K config this shows

 kernel vmalloc start   = 0xc0001000
 kernel IO start= 0xc0002000
 kernel vmemmap start   = 0xc0003000

On 64K config:

 kernel vmalloc start   = 0xc008
 kernel IO start= 0xc00a
 kernel vmemmap start   = 0xc00c

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/kernel/setup-common.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 2e5dfb6e0823..a7ab9638ebd9 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -831,6 +831,9 @@ static __init void print_system_info(void)
pr_info("htab_address  = 0x%p\n", htab_address);
if (htab_hash_mask)
pr_info("htab_hash_mask= 0x%lx\n", htab_hash_mask);
+   pr_info("kernel vmalloc start   = 0x%lx\n", KERN_VIRT_START);
+   pr_info("kernel IO start= 0x%lx\n", KERN_IO_START);
+   pr_info("kernel vmemmap start   = 0x%lx\n", (unsigned long)vmemmap);
 #endif
 #ifdef CONFIG_PPC_BOOK3S_32
if (Hash)
-- 
2.20.1



[PATCH v3 5/8] powerpc/mm/hash: Simplify the region id calculation.

2019-04-16 Thread Aneesh Kumar K.V
This reduces multiple comparisons in get_region_id to a bit shift operation.
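
As an illustration (not part of the patch), a stand-alone sketch of the
resulting classification; the constants are stand-ins patterned on the 64K
config in this series, the real values come from the per-page-size headers:

#include <stdio.h>

#define REGION_SHIFT        49                      /* stand-in: 64K config */
#define KERN_VIRT_START     0xc008000000000000UL    /* stand-in start value */

#define NON_LINEAR_REGION_ID(ea) \
	((((unsigned long)(ea) - KERN_VIRT_START) >> REGION_SHIFT) + 2)

enum { USER_REGION_ID = 0, LINEAR_MAP_REGION_ID = 1 };

static int get_region_id(unsigned long ea)
{
	if ((ea >> 60) == 0)
		return USER_REGION_ID;
	if (ea < KERN_VIRT_START)
		return LINEAR_MAP_REGION_ID;	/* 0xc linear map */
	return NON_LINEAR_REGION_ID(ea);	/* 2/3/4 = vmalloc/IO/vmemmap */
}

int main(void)
{
	printf("%d\n", get_region_id(0xc000000012340000UL));	/* 1: linear map */
	printf("%d\n", get_region_id(KERN_VIRT_START));		/* 2: vmalloc */
	printf("%d\n", get_region_id(KERN_VIRT_START + (1UL << REGION_SHIFT))); /* 3: IO */
	return 0;
}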

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  4 ++-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  1 +
 arch/powerpc/include/asm/book3s/64/hash.h | 31 +--
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +-
 4 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 0dd62287f56c..64eaf187f891 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -13,12 +13,14 @@
  */
 #define MAX_EA_BITS_PER_CONTEXT   46
 
+#define REGION_SHIFT   (MAX_EA_BITS_PER_CONTEXT - 2)
+
 /*
  * Our page table limit us to 64TB. Hence for the kernel mapping,
  * each MAP area is limited to 16 TB.
  * The four map areas are:  linear mapping, vmap, IO and vmemmap
  */
-#define H_KERN_MAP_SIZE   (ASM_CONST(1) << (MAX_EA_BITS_PER_CONTEXT - 2))
+#define H_KERN_MAP_SIZE   (ASM_CONST(1) << REGION_SHIFT)
 
 /*
  * Define the address range of the kernel non-linear virtual area
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index e392cf17b457..24ca63beba14 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,6 +12,7 @@
  * is handled in the hotpath.
  */
 #define MAX_EA_BITS_PER_CONTEXT   49
+#define REGION_SHIFT   MAX_EA_BITS_PER_CONTEXT
 
 /*
  * We use one context for each MAP area.
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 523b9191a1e2..cd9be5fb189b 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -87,26 +87,26 @@
 #define H_VMEMMAP_SIZE H_KERN_MAP_SIZE
 #define H_VMEMMAP_END  (H_VMEMMAP_START + H_VMEMMAP_SIZE)
 
+#define NON_LINEAR_REGION_ID(ea)   ((((unsigned long)ea - H_KERN_VIRT_START) >> REGION_SHIFT) + 2)
+
 /*
  * Region IDs
  */
-#define USER_REGION_ID 1
-#define KERNEL_REGION_ID   2
-#define VMALLOC_REGION_ID  3
-#define IO_REGION_ID   4
-#define VMEMMAP_REGION_ID  5
+#define USER_REGION_ID 0
+#define KERNEL_REGION_ID   1
+#define VMALLOC_REGION_ID  NON_LINEAR_REGION_ID(H_VMALLOC_START)
+#define IO_REGION_ID   NON_LINEAR_REGION_ID(H_KERN_IO_START)
+#define VMEMMAP_REGION_ID  NON_LINEAR_REGION_ID(H_VMEMMAP_START)
 
 /*
  * Defines the address of the vmemap area, in its own region on
  * hash table CPUs.
  */
-
 #ifdef CONFIG_PPC_MM_SLICES
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
 #endif /* CONFIG_PPC_MM_SLICES */
 
-
 /* PTEIDX nibble */
 #define _PTEIDX_SECONDARY  0x8
 #define _PTEIDX_GROUP_IX   0x7
@@ -117,22 +117,21 @@
 #ifndef __ASSEMBLY__
 static inline int get_region_id(unsigned long ea)
 {
+   int region_id;
int id = (ea >> 60UL);
 
if (id == 0)
return USER_REGION_ID;
 
-   VM_BUG_ON(id != 0xc);
-   VM_BUG_ON(ea >= H_VMEMMAP_END);
+   if (ea < H_KERN_VIRT_START)
+   return KERNEL_REGION_ID;
 
-   if (ea >= H_VMEMMAP_START)
-   return VMEMMAP_REGION_ID;
-   else if (ea >= H_KERN_IO_START)
-   return IO_REGION_ID;
-   else if (ea >= H_VMALLOC_START)
-   return VMALLOC_REGION_ID;
+   VM_BUG_ON(id != 0xc);
+   BUILD_BUG_ON(NON_LINEAR_REGION_ID(H_VMALLOC_START) != 2);
 
-   return KERNEL_REGION_ID;
+   region_id = NON_LINEAR_REGION_ID(ea);
+   VM_BUG_ON(region_id > VMEMMAP_REGION_ID);
+   return region_id;
 }
 
 #define hash__pmd_bad(pmd)  (pmd_val(pmd) & H_PMD_BAD_BITS)
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index b3f256c042aa..b146448109fd 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -794,7 +794,7 @@ static inline unsigned long get_kernel_context(unsigned 
long ea)
 */
ctx =  1 + ((ea & EA_MASK) >> MAX_EA_BITS_PER_CONTEXT);
} else
-   ctx = region_id + MAX_KERNEL_CTX_CNT - 2;
+   ctx = region_id + MAX_KERNEL_CTX_CNT - 1;
return ctx;
 }
 
-- 
2.20.1



[PATCH v3 4/8] powerpc/mm: Drop the unnecessary region check

2019-04-16 Thread Aneesh Kumar K.V
All the regions are now mapped with top nibble 0xc. Hence the region id
check is not needed for virt_addr_valid()

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page.h | 12 
 1 file changed, 12 deletions(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 918228f2205b..748f5db2e2b7 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -132,19 +132,7 @@ static inline bool pfn_valid(unsigned long pfn)
 #define virt_to_page(kaddr)   pfn_to_page(virt_to_pfn(kaddr))
 #define pfn_to_kaddr(pfn)  __va((pfn) << PAGE_SHIFT)
 
-#ifdef CONFIG_PPC_BOOK3S_64
-/*
- * On hash the vmalloc and other regions alias to the kernel region when passed
- * through __pa(), which virt_to_pfn() uses. That means virt_addr_valid() can
- * return true for some vmalloc addresses, which is incorrect. So explicitly
- * check that the address is in the kernel region.
- */
-/* may be can drop get_region_id */
-#define virt_addr_valid(kaddr) (get_region_id((unsigned long)kaddr) == 
KERNEL_REGION_ID && \
-   pfn_valid(virt_to_pfn(kaddr)))
-#else
 #define virt_addr_valid(kaddr) pfn_valid(virt_to_pfn(kaddr))
-#endif
 
 /*
  * On Book-E parts we need __va to parse the device tree and we can't
-- 
2.20.1



[PATCH v3 3/8] powerpc/mm: Validate address values against different region limits

2019-04-16 Thread Aneesh Kumar K.V
This adds an explicit check in various functions.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_utils_64.c  | 18 +++---
 arch/powerpc/mm/pgtable-hash64.c | 13 ++---
 arch/powerpc/mm/pgtable-radix.c  | 16 
 arch/powerpc/mm/pgtable_64.c |  5 +
 4 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index c6b39e7694ba..ef0ca3bf555d 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -786,9 +786,16 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
 
 int hash__create_section_mapping(unsigned long start, unsigned long end, int 
nid)
 {
-   int rc = htab_bolt_mapping(start, end, __pa(start),
-  pgprot_val(PAGE_KERNEL), mmu_linear_psize,
-  mmu_kernel_ssize);
+   int rc;
+
+   if (end >= H_VMALLOC_START) {
+   pr_warn("Outside the supported range\n");
+   return -1;
+   }
+
+   rc = htab_bolt_mapping(start, end, __pa(start),
+  pgprot_val(PAGE_KERNEL), mmu_linear_psize,
+  mmu_kernel_ssize);
 
if (rc < 0) {
int rc2 = htab_remove_mapping(start, end, mmu_linear_psize,
@@ -929,6 +936,11 @@ static void __init htab_initialize(void)
DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
base, size, prot);
 
+   if ((base + size) >= H_VMALLOC_START) {
+   pr_warn("Outside the supported range\n");
+   continue;
+   }
+
BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
prot, mmu_linear_psize, mmu_kernel_ssize));
}
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index c08d49046a96..d934de4e2b3a 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -112,9 +112,16 @@ int __meminit hash__vmemmap_create_mapping(unsigned long 
start,
   unsigned long page_size,
   unsigned long phys)
 {
-   int rc = htab_bolt_mapping(start, start + page_size, phys,
-  pgprot_val(PAGE_KERNEL),
-  mmu_vmemmap_psize, mmu_kernel_ssize);
+   int rc;
+
+   if ((start + page_size) >= H_VMEMMAP_END) {
+   pr_warn("Outside the supported range\n");
+   return -1;
+   }
+
+   rc = htab_bolt_mapping(start, start + page_size, phys,
+  pgprot_val(PAGE_KERNEL),
+  mmu_vmemmap_psize, mmu_kernel_ssize);
if (rc < 0) {
int rc2 = htab_remove_mapping(start, start + page_size,
  mmu_vmemmap_psize,
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index ba485fbd81f1..c9b24bf78819 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -334,6 +334,12 @@ void __init radix_init_pgtable(void)
 * page tables will be allocated within the range. No
 * need or a node (which we don't have yet).
 */
+
+   if ((reg->base + reg->size) >= RADIX_VMALLOC_START) {
+   pr_warn("Outside the supported range\n");
+   continue;
+   }
+
WARN_ON(create_physical_mapping(reg->base,
reg->base + reg->size,
-1));
@@ -866,6 +872,11 @@ static void __meminit remove_pagetable(unsigned long 
start, unsigned long end)
 
 int __meminit radix__create_section_mapping(unsigned long start, unsigned long 
end, int nid)
 {
+   if (end >= RADIX_VMALLOC_START) {
+   pr_warn("Outside the supported range\n");
+   return -1;
+   }
+
return create_physical_mapping(start, end, nid);
 }
 
@@ -893,6 +904,11 @@ int __meminit radix__vmemmap_create_mapping(unsigned long 
start,
int nid = early_pfn_to_nid(phys >> PAGE_SHIFT);
int ret;
 
+   if ((start + page_size) >= RADIX_VMEMMAP_END) {
+   pr_warn("Outside the supported range\n");
+   return -1;
+   }
+
ret = __map_kernel_page_nid(start, phys, __pgprot(flags), page_size, 
nid);
BUG_ON(ret);
 
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 56068cac2a3c..72f58c076e26 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -121,6 +121,11 @@ void __iomem *__ioremap_at(phys_addr_t pa, void *ea, 
unsigned long size, pgprot_
if (pgprot_val(prot) & H_PAGE_4K_PFN)
return NULL;
 
+   if ((ea + size) >= (void *)IOREMAP_END) {
+  

[PATCH v3 2/8] powerpc/mm/hash64: Map all the kernel regions in the same 0xc range

2019-04-16 Thread Aneesh Kumar K.V
This patch maps the vmalloc, IO and vmemmap regions in the 0xc address range
instead of the current 0xd and 0xf ranges. This brings the mapping closer
to radix translation mode.

With the hash 64K page size each of these regions is 512TB, whereas with the
4K config we are limited by the maximum page table range of 64TB and hence
these regions are 16TB each.
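
For reference, those sizes follow directly from the constants in this patch:

  64K pages: H_KERN_MAP_SIZE = 1 << MAX_EA_BITS_PER_CONTEXT       = 1 << 49 = 512TB
   4K pages: H_KERN_MAP_SIZE = 1 << (MAX_EA_BITS_PER_CONTEXT - 2) = 1 << 44 =  16TB
             (the 64TB page-table range split across the four map areas)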

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  | 13 +++
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 11 +++
 arch/powerpc/include/asm/book3s/64/hash.h | 95 ---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 31 +++---
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  1 -
 arch/powerpc/include/asm/book3s/64/radix.h| 41 
 arch/powerpc/include/asm/page.h   |  3 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |  2 +-
 arch/powerpc/mm/copro_fault.c | 14 ++-
 arch/powerpc/mm/hash_utils_64.c   | 26 ++---
 arch/powerpc/mm/pgtable-radix.c   |  3 +-
 arch/powerpc/mm/pgtable_64.c  |  2 -
 arch/powerpc/mm/ptdump/hashpagetable.c|  2 +-
 arch/powerpc/mm/ptdump/ptdump.c   |  3 +-
 arch/powerpc/mm/slb.c | 22 +++--
 arch/powerpc/platforms/cell/spu_base.c|  4 +-
 drivers/misc/cxl/fault.c  |  2 +-
 drivers/misc/ocxl/link.c  |  2 +-
 18 files changed, 170 insertions(+), 107 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index cf5ba5254299..0dd62287f56c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -13,6 +13,19 @@
  */
 #define MAX_EA_BITS_PER_CONTEXT   46
 
+/*
+ * Our page table limit us to 64TB. Hence for the kernel mapping,
+ * each MAP area is limited to 16 TB.
+ * The four map areas are:  linear mapping, vmap, IO and vmemmap
+ */
+#define H_KERN_MAP_SIZE   (ASM_CONST(1) << (MAX_EA_BITS_PER_CONTEXT - 2))
+
+/*
+ * Define the address range of the kernel non-linear virtual area
+ * 16TB
+ */
+#define H_KERN_VIRT_START  ASM_CONST(0xc0001000)
+
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE   (sizeof(pte_t) << H_PTE_INDEX_SIZE)
 #define H_PMD_TABLE_SIZE   (sizeof(pmd_t) << H_PMD_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index f82ee8a3b561..e392cf17b457 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -13,6 +13,17 @@
  */
 #define MAX_EA_BITS_PER_CONTEXT   49
 
+/*
+ * We use one context for each MAP area.
+ */
+#define H_KERN_MAP_SIZE   (1UL << MAX_EA_BITS_PER_CONTEXT)
+
+/*
+ * Define the address range of the kernel non-linear virtual area
+ * 2PB
+ */
+#define H_KERN_VIRT_START  ASM_CONST(0xc008)
+
 /*
  * 64k aligned address free up few of the lower bits of RPN for us
  * We steal that here. For more deatils look at pte_pfn/pfn_pte()
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 8cbc4106d449..523b9191a1e2 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -29,6 +29,10 @@
 #define H_PGTABLE_EADDR_SIZE   (H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE + \
 H_PUD_INDEX_SIZE + H_PGD_INDEX_SIZE + 
PAGE_SHIFT)
 #define H_PGTABLE_RANGE   (ASM_CONST(1) << H_PGTABLE_EADDR_SIZE)
+/*
+ * Top 2 bits are ignored in page table walk.
+ */
+#define EA_MASK   (~(0xcUL << 60))
 
 /*
  * We store the slot details in the second half of page table.
@@ -42,53 +46,60 @@
 #endif
 
 /*
- * Define the address range of the kernel non-linear virtual area. In contrast
- * to the linear mapping, this is managed using the kernel page tables and then
- * inserted into the hash page table to actually take effect, similarly to user
- * mappings.
+ * One context each will be used for vmap, IO and vmemmap
  */
-#define H_KERN_VIRT_START ASM_CONST(0xD000)
-
+#define H_KERN_VIRT_SIZE   (H_KERN_MAP_SIZE * 3)
 /*
- * Allow virtual mapping of one context size.
- * 512TB for 64K page size
- * 64TB for 4K page size
+ * +--+
+ * |  |
+ * |  |
+ * |  |
+ * +--+  Kernel virtual map end 
(0xc00e)
+ * |  |
+ * |  |
+ * |  512TB/16TB of vmemmap   |
+ * |  |
+ * |  |
+ * +--+  Kernel vmemmap  start
+ * |  |
+ * |  512TB/16TB of IO map|
+ * |  |
+ * +--+  Kernel IO map start
+ * 

[PATCH v3 1/8] powerpc/mm/hash64: Add a variable to track the end of IO mapping

2019-04-16 Thread Aneesh Kumar K.V
This makes it easy to update the region mapping in a later patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash.h| 3 ++-
 arch/powerpc/include/asm/book3s/64/pgtable.h | 8 +---
 arch/powerpc/include/asm/book3s/64/radix.h   | 1 +
 arch/powerpc/mm/hash_utils_64.c  | 1 +
 arch/powerpc/mm/pgtable-radix.c  | 1 +
 arch/powerpc/mm/pgtable_64.c | 2 ++
 6 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 54b7af6cd27f..8cbc4106d449 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -69,7 +69,8 @@
 #define H_VMALLOC_SIZE (H_KERN_VIRT_SIZE - H_KERN_IO_SIZE)
 #define H_VMALLOC_END  (H_VMALLOC_START + H_VMALLOC_SIZE)
 
-#define H_KERN_IO_START H_VMALLOC_END
+#define H_KERN_IO_START   H_VMALLOC_END
+#define H_KERN_IO_END  (H_KERN_VIRT_START + H_KERN_VIRT_SIZE)
 
 /*
  * Region IDs
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 581f91be9dd4..51190a6d1c8a 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -277,9 +277,12 @@ extern unsigned long __vmalloc_end;
 extern unsigned long __kernel_virt_start;
 extern unsigned long __kernel_virt_size;
 extern unsigned long __kernel_io_start;
+extern unsigned long __kernel_io_end;
 #define KERN_VIRT_START __kernel_virt_start
 #define KERN_VIRT_SIZE  __kernel_virt_size
 #define KERN_IO_START  __kernel_io_start
+#define KERN_IO_END __kernel_io_end
+
 extern struct page *vmemmap;
 extern unsigned long ioremap_bot;
 extern unsigned long pci_io_base;
@@ -296,8 +299,7 @@ extern unsigned long pci_io_base;
 
 #include 
 /*
- * The second half of the kernel virtual space is used for IO mappings,
- * it's itself carved into the PIO region (ISA and PHB IO space) and
+ * IO space itself carved into the PIO region (ISA and PHB IO space) and
  * the ioremap space
  *
  *  ISA_IO_BASE = KERN_IO_START, 64K reserved area
@@ -310,7 +312,7 @@ extern unsigned long pci_io_base;
 #define  PHB_IO_BASE   (ISA_IO_END)
 #define  PHB_IO_END   (KERN_IO_START + FULL_IO_SIZE)
 #define IOREMAP_BASE   (PHB_IO_END)
-#define IOREMAP_END   (KERN_VIRT_START + KERN_VIRT_SIZE)
+#define IOREMAP_END   (KERN_IO_END)
 
 /* Advertise special mapping type for AGP */
 #define HAVE_PAGE_AGP
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 5ab134eeed20..6d760a083d62 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -111,6 +111,7 @@
 #define RADIX_VMEMMAP_BASE (RADIX_VMALLOC_END)
 
 #define RADIX_KERN_IO_START   (RADIX_KERN_VIRT_START + (RADIX_KERN_VIRT_SIZE >> 1))
+#define RADIX_KERN_IO_END   (RADIX_KERN_VIRT_START + RADIX_KERN_VIRT_SIZE)
 
 #ifndef __ASSEMBLY__
 #define RADIX_PTE_TABLE_SIZE   (sizeof(pte_t) << RADIX_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0a4f939a8161..394dd969002f 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1017,6 +1017,7 @@ void __init hash__early_init_mmu(void)
__vmalloc_start = H_VMALLOC_START;
__vmalloc_end = H_VMALLOC_END;
__kernel_io_start = H_KERN_IO_START;
+   __kernel_io_end = H_KERN_IO_END;
vmemmap = (struct page *)H_VMEMMAP_BASE;
ioremap_bot = IOREMAP_BASE;
 
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 154472a28c77..bca1bf66c56e 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -578,6 +578,7 @@ void __init radix__early_init_mmu(void)
__vmalloc_start = RADIX_VMALLOC_START;
__vmalloc_end = RADIX_VMALLOC_END;
__kernel_io_start = RADIX_KERN_IO_START;
+   __kernel_io_end = RADIX_KERN_IO_END;
vmemmap = (struct page *)RADIX_VMEMMAP_BASE;
ioremap_bot = IOREMAP_BASE;
 
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index fb1375c07e8c..7cea39bdf05f 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -98,6 +98,8 @@ unsigned long __vmalloc_end;
 EXPORT_SYMBOL(__vmalloc_end);
 unsigned long __kernel_io_start;
 EXPORT_SYMBOL(__kernel_io_start);
+unsigned long __kernel_io_end;
+EXPORT_SYMBOL(__kernel_io_end);
 struct page *vmemmap;
 EXPORT_SYMBOL(vmemmap);
 unsigned long __pte_frag_nr;
-- 
2.20.1



[PATCH v3 0/8] Update hash MMU kernel mapping to be in sync with radix

2019-04-16 Thread Aneesh Kumar K.V
This patch series maps all the kernel regions (vmalloc, IO and vmemmap) using
a 0xc top-nibble address. This brings the hash translation kernel mapping in
sync with radix. Each of these regions can now map 512TB. We use one context
to map these regions, hence the 512TB limit. We also update radix to use the
same limit even though we don't have context-related restrictions there.

For 4K page size, Michael Ellerman requested to keep VMALLOC_START the same
as what we have with hash 64K. I did try to implement that but found that the
code gets complicated with no real benefit. To assist in debugging I am adding
the patch "powerpc/mm: Print kernel map details to dmesg", which should show
the different mapping regions for the booted kernel.

Also note that we now have the same map for both hash and radix on 4K page
size. This limits 4K radix to a 16TB linear mapping range. This was done to
make sure we have a similar mapping between hash and radix. If we think this
is unnecessarily limiting radix translation mode, I can drop this.

Aneesh Kumar K.V (8):
  powerpc/mm/hash64: Add a variable to track the end of IO mapping
  powerpc/mm/hash64: Map all the kernel regions in the same 0xc range
  powerpc/mm: Validate address values against different region limits
  powerpc/mm: Drop the unnecessary region check
  powerpc/mm/hash: Simplify the region id calculation.
  powerpc/mm: Print kernel map details to dmesg
  powerpc/mm: Consolidate radix and hash address map details
  powerpc/mm/hash: Rename KERNEL_REGION_ID to LINEAR_MAP_REGION_ID

 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  6 --
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  6 --
 arch/powerpc/include/asm/book3s/64/hash.h | 89 +++
 arch/powerpc/include/asm/book3s/64/map.h  | 80 +
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 33 ---
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 34 +--
 arch/powerpc/include/asm/book3s/64/radix.h| 33 ++-
 arch/powerpc/include/asm/page.h   | 11 ---
 arch/powerpc/kernel/setup-common.c|  3 +
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |  2 +-
 arch/powerpc/mm/copro_fault.c | 18 ++--
 arch/powerpc/mm/hash_utils_64.c   | 48 ++
 arch/powerpc/mm/pgtable-hash64.c  | 13 ++-
 arch/powerpc/mm/pgtable-radix.c   | 23 +++--
 arch/powerpc/mm/pgtable_64.c  | 15 ++--
 arch/powerpc/mm/ptdump/hashpagetable.c|  4 -
 arch/powerpc/mm/ptdump/ptdump.c   |  4 -
 arch/powerpc/mm/slb.c | 26 +++---
 arch/powerpc/platforms/cell/spu_base.c|  4 +-
 drivers/misc/cxl/fault.c  |  2 +-
 drivers/misc/ocxl/link.c  |  2 +-
 21 files changed, 255 insertions(+), 201 deletions(-)
 create mode 100644 arch/powerpc/include/asm/book3s/64/map.h

-- 
2.20.1



Re: [PATCH] Linux: Define struct termios2 in under _GNU_SOURCE [BZ #10339]

2019-04-16 Thread Florian Weimer
* hpa:

> Using symbol versioning doesn't really help much since the real
> problem is that struct termios can be passed around in userspace, and
> the interfaces between user space libraries don't have any
> versioning. However, my POC code deals with that too by only seeing
> BOTHER when necessary, so if the structure is extended garbage in the
> extra fields will be ignored unless new baud rates are in use.

That still doesn't solve the problem of changing struct offsets after a
struct field of type struct termios.

> Exporting termios2 to user space feels a bit odd at this stage as it
> would only be usable as a fallback on old glibc. Call it
> kernel_termios2 at least.

I'm not sure why we should do that?  The kernel calls it struct termios2
in its UAPI headers.  If that name is not appropriate, it should be
changed first in the UAPI headers.

Thanks,
Florian
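
For readers following the BOTHER discussion above, a minimal user-space sketch
of the termios2 path on Linux (assuming the toolchain exposes the kernel UAPI
via <asm/termbits.h> on the target architecture; the device path and rate are
placeholders):

#include <asm/termbits.h>	/* struct termios2, BOTHER; don't mix with <termios.h> */
#include <sys/ioctl.h>		/* ioctl(), TCGETS2/TCSETS2 via <asm/ioctls.h> */
#include <fcntl.h>
#include <stdio.h>

/* Set an arbitrary line rate through the kernel's termios2 interface. */
static int set_custom_baud(int fd, int speed)
{
	struct termios2 tio;

	if (ioctl(fd, TCGETS2, &tio) < 0)
		return -1;

	tio.c_cflag &= ~CBAUD;		/* drop the legacy Bnnn encoding */
	tio.c_cflag |= BOTHER;		/* rate comes from c_ospeed/c_ispeed */
	tio.c_ospeed = speed;
	tio.c_ispeed = speed;

	return ioctl(fd, TCSETS2, &tio);
}

int main(void)
{
	int fd = open("/dev/ttyUSB0", O_RDWR | O_NOCTTY);	/* placeholder device */

	if (fd >= 0 && set_custom_baud(fd, 250000) == 0)
		printf("custom baud set\n");
	return 0;
}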


  1   2   >