Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding

2020-09-15 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>  unsigned int flags, struct page **pages, int *nr)
> ...
> pudp = pud_offset(, addr);
> 
> This function passes a reference on that local value copy to pXd_offset,
> and might get the very same pointer in return. This happens when the
> level is folded (on most arches), and that pointer should not be iterated.
> 
> On s390 due to the fact that each task might have different 5,4 or
> 3-level address translation and hence different levels folded the logic
> is more complex and non-iteratable pointer to a local copy leads to
> severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-levels paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>  unsigned int flags, struct page **pages, int *nr)
> {
> unsigned long next;
> pud_t *pudp;
> 
> // pud_offset returns  itself (a pointer to a value on stack)
> pudp = pud_offset(, addr);
> do {
> // on second iteratation reading "random" stack value
> pud_t pud = READ_ONCE(*pudp);
> 
> // next = 0x1008000, due to PUD_SIZE/MASK != 
> PGDIR_SIZE/MASK on s390
> next = pud_addr_end(addr, end);
> ...
> } while (pudp++, addr = next, addr != end); // pudp++ iterating over 
> stack
> 
> return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc:  # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast 
> code")
> Reviewed-by: Gerald Schaefer 
> Reviewed-by: Alexander Gordeev 
> Signed-off-by: Vasily Gorbik 
> ---
> v2: added brackets  -> &(pgd)

Reviewed-by: Jason Gunthorpe 

Regards,
Jason


Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 04:40:00PM -0300, Jason Gunthorpe wrote:
> These would probably be better as static inlines though, as only s390
> compiles will type check pudp like this.

Never mind, it must be a macro - still need brackets though

Jason


Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 09:03:06PM +0200, Vasily Gorbik wrote:
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..e899d3506671 100644
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define mm_pmd_folded(mm)__is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>  
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(, address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(, address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(, address)

Needs brackets: &(pgd) 

These would probably be better as static inlines though, as only s390
compiles will type check pudp like this.

Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 09:09:39AM +0200, pet...@infradead.org wrote:
> On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote:
> > So, I suggest pXX_offset_unlocked()
> 
> Urgh, no. Elsewhere in gup _unlocked() means it will take the lock
> itself (get_user_pages_unlocked()) -- although often it seems to mean
> the lock is already held (git grep _unlocked and marvel).
>
> What we want is _lockless().

This is clear to me!

Thanks,
Jason 


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Jason Gunthorpe
On Thu, Sep 10, 2020 at 07:57:49PM +0200, Gerald Schaefer wrote:
> On Thu, 10 Sep 2020 10:02:33 -0300
> Jason Gunthorpe  wrote:
> 
> > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > 
> > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > Hopefully, one could make sense ot of it.
> > 
> > I would say the page table API requires this invariant:
> > 
> > pud = pud_offset(p4d, addr);
> > do {
> > WARN_ON(pud != pud_offset(p4d, addr);
> > next = pud_addr_end(addr, end);
> > } while (pud++, addr = next, addr != end);
> > 
> > ie pud++ is supposed to be a shortcut for 
> >   pud_offset(p4d, next)
> > 
> 
> Hmm, IIUC, all architectures with static folding will simply return
> the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level
> pagetables.

It is probably moot now, but since other arch's don't crash they also
return pud_addr_end() == end so the loop only does one iteration.

ie pud == pud_offset(p4d, addr) for all iterations as the pud++ never
happens.

Which is what this addr_end patch does for s390..

Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Jason Gunthorpe
On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote:

> Or am I way off here, and it really is possible (aside from the current
> s390 situation) to observe something that "is no longer a page table"?

Yes, that is the issue. Remember there is no locking for GUP
fast. While a page table cannot be freed there is nothing preventing
the page table entry from being concurrently modified.

Without the stack variable it looks like this:

   pud_t pud = READ_ONCE(*pudp);
   if (!pud_present(pud))
return
   pmd_offset(pudp, address);

And pmd_offset() expands to

return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);

Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value
of *pud can change, eg to !pud_present.

Then pud_page_vaddr(*pud) will crash. It is not use after free, it
is using data that has not been validated.

Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Jason Gunthorpe
On Thu, Sep 10, 2020 at 12:32:05PM -0700, Linus Torvalds wrote:
> Yeah, I get hung up on naming sometimes. I don't tend to care much
> about private local variables ("i" is a perfectly fine variable name),
> but these kinds of somewhat subtle cross-architecture definitions I
> feel matter.

One of the first replys to this patch was to ask "when would I use
_orig vs normal", so you are not alone. The name should convey it..

So, I suggest pXX_offset_unlocked()

Since it is safe to call without the page table lock, while pXX_offset()
requires the page table lock to be held as the internal *pXX is a data
race otherwise.

Patch 1 might be OK for a stable backport, but to get to a clear
pXX_offset_unlocked() all the arches would want to be changed to
implement that API and the generic code would provide the wrapper:

#define pXX_offset(pXXp, address) pXX_offset_unlocked(pXXp, *(pXXp), address)

Arches would not have a *pXX inside their code.

Then we can talk about auditing call sites of pXX_offset and think
about using the _unlocked version in places where the page table lock
is not held.

For instance mm/pagewalk.c should be changed. So should
huge_pte_offset() and probably other places. These places might
already be exsting data-race bugs.

It is code-as-documentation indicating an unlocked page table walk.

Now it is not just a S390 story but a change that makes the data
concurrency much clearer, so I think I prefer this version to the
addr_end one too.

Regards,
Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Jason Gunthorpe
On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote:
> On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev
>  wrote:
> >
> > It is only gup_fast case that exposes the issue. It hits because
> > pointers to stack copies are passed to gup_pXd_range iterators, not
> > pointers to real page tables itself.
> 
> Can we possibly change fast-gup to not do the stack copies?
>
> I'd actually rather do something like that, than the "addr_end" thing.

> As you say, none of the other page table walking code does what the
> GUP code does, and I don't think it's required.

As I understand it, the requirement is because fast-gup walks without
the page table spinlock, or mmap_sem held so it must READ_ONCE the
*pXX.

It then checks that it is a valid page table pointer, then calls
pXX_offset().

The arch implementation of pXX_offset() derefs again the passed pXX
pointer. So it defeats the READ_ONCE and the 2nd load could observe
something that is no longer a page table pointer and crash.

Passing it the address of the stack value is a way to force
pXX_offset() to use the READ_ONCE result which has already been tested
to be a page table pointer.

Other page walking code that holds the mmap_sem tends to use
pmd_trans_unstable() which solves this problem by injecting a
barrier. The load hidden in pte_offset() after a pmd_trans_unstable()
can't be re-ordered and will only see a page table entry under the
mmap_sem.

However, I think that logic would have been much clearer following the
GUP model of READ_ONCE vs extra reads and a hidden barrier. At least
it took me a long time to work it out :(

I also think there are real bugs here where places are reading *pXX
multiple times without locking the page table. One was found recently
in the wild in the huge tlb code IIRC.

The mm/pagewalk.c has these missing READ_ONCE bugs too.

So.. To change away from the stack option I think we'd have to pass
the READ_ONCE value to pXX_offset() as an extra argument instead of it
derefing the pointer internally.

Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Jason Gunthorpe
On Thu, Sep 10, 2020 at 07:07:57PM +0200, Gerald Schaefer wrote:

> I might have lost track a bit. Are we still talking about possible
> functional impacts of either our current pagetable walking with s390
> (apart from gup_fast), or the proposed generic change (for s390, or
> others?)?

I'm looking for an more understandable explanation what is wrong with
the S390 implementation.

If the page operations require the invariant I described then it is
quite easy to explain the problem and understand the solution.

Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Jason Gunthorpe
On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote:
> On Thu, 10 Sep 2020 10:02:33 -0300
> Jason Gunthorpe  wrote:
> 
> > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > 
> > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > Hopefully, one could make sense ot of it.  
> > 
> > I would say the page table API requires this invariant:
> > 
> > pud = pud_offset(p4d, addr);
> > do {
> > WARN_ON(pud != pud_offset(p4d, addr);
> > next = pud_addr_end(addr, end);
> > } while (pud++, addr = next, addr != end);
> > 
> > ie pud++ is supposed to be a shortcut for 
> >   pud_offset(p4d, next)
> > 
> > While S390 does not follow this. Fixing addr_end brings it into
> > alignment by preventing pud++ from happening.
> > 
> > The only currently known side effect is that gup_fast crashes, but it
> > sure is an unexpected thing.
> 
> It only is unexpected in a "top-level folding" world, see my other reply.
> Consider it an optimization, which was possible because of how our dynamic
> folding works, and e.g. because we can determine the correct pagetable
> level from a pXd value in pXd_offset.

No, I disagree. The page walker API the arch presents has to have well
defined semantics. For instance, there is an effort to define tests
and invarients for the page table accesses to bring this understanding
and uniformity:

 mm/debug_vm_pgtable.c

If we fix S390 using the pX_addr_end() change then the above should be
updated with an invariant to check it. I've added Anshuman for some
thoughts..

For better or worse, that invariant does exclude arches from using
other folding techniques.

The other solution would be to address the other side of != and adjust
the pud++

eg replcae pud++ with something like:
  pud = pud_next_entry(p4d, pud, next)

Such that:
  pud_next_entry(p4d, pud, next) === pud_offset(p4d, next)

In which case the invarient changes to 'callers can never do pointer
arithmetic on the result of pXX_offset()' which is a bit harder to
enforce.

Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Jason Gunthorpe
On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:

> As Gerald mentioned, it is very difficult to explain in a clear way.
> Hopefully, one could make sense ot of it.

I would say the page table API requires this invariant:

pud = pud_offset(p4d, addr);
do {
WARN_ON(pud != pud_offset(p4d, addr);
next = pud_addr_end(addr, end);
} while (pud++, addr = next, addr != end);

ie pud++ is supposed to be a shortcut for 
  pud_offset(p4d, next)

While S390 does not follow this. Fixing addr_end brings it into
alignment by preventing pud++ from happening.

The only currently known side effect is that gup_fast crashes, but it
sure is an unexpected thing.

This suggests another fix, which is to say that pud++ is undefined and
pud_offset() must always be called, but I think that would cause worse
codegen on all other archs.

Jason



Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;

2020-09-09 Thread Jason Gunthorpe
On Wed, Sep 09, 2020 at 01:06:39PM -0700, Joe Perches wrote:
> fallthrough to a separate case/default label break; isn't very readable.
> 
> Convert pseudo-keyword fallthrough; statements to a simple break; when
> the next label is case or default and the only statement in the next
> label block is break;
> 
> Found using:
> 
> $ grep-2.5.4 -rP --include=*.[ch] -n 
> "fallthrough;(\s*(case\s+\w+|default)\s*:\s*){1,7}break;" *
> 
> Miscellanea:
> 
> o Move or coalesce a couple label blocks above a default: block.
> 
> Signed-off-by: Joe Perches 
> ---
> 
> Compiled allyesconfig x86-64 only.
> A few files for other arches were not compiled.

IB part looks OK, I prefer it like this

You could do the same for continue as well, I saw a few of those..

Thanks,
Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Jason Gunthorpe
On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> I actually had to draw myself a picture to get some hold of
> this, or rather a walk-through with a certain pud-crossing
> range in a folded 3-level scenario. Not sure if I would have
> understood my explanation above w/o that, but I hope you can
> make some sense out of it. Or draw yourself a picture :-)

What I don't understand is how does anything work with S390 today?

If the fix is only to change pxx_addr_end() then than generic code
like mm/pagewalk.c will iterate over a *different list* of page table
entries. 

It's choice of entries to look at is entirely driven by pxx_addr_end().

Which suggest to me that mm/pagewalk.c also doesn't work properly
today on S390 and this issue is not really about stack variables?

Fundamentally if pXX_offset() and pXX_addr_end() must be consistent
together, if pXX_offset() is folded then pXX_addr_end() must cause a
single iteration of that level.

Jason


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Jason Gunthorpe
On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev 
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 
> So, even though the dereferenced page-table entries are not used on
> archs other than s390, and are optimized out by the compiler, there
> is a small change in kernel size and this is what bloat-o-meter reports:

This looks pretty clean and straightfoward, only
__munlock_pagevec_fill() had any real increased complexity.

Thanks,
Jason


Re: [PATCH 0/8 v2] PCI: Align return values of PCIe capability and PCI accessors

2020-06-30 Thread Jason Gunthorpe
On Mon, Jun 15, 2020 at 09:32:17AM +0200, refactormys...@gmail.com wrote:
> Bolarinwa Olayemi Saheed (8):
>   IB/hfi1: Convert PCIBIOS_* errors to generic -E* errors
>   IB/hfi1: Convert PCIBIOS_* errors to generic -E* errors

Applied to rdma for-next thanks

Jason


Re: [PATCH 00/17] spelling.txt: /decriptors/descriptors/

2020-06-15 Thread Jason Gunthorpe
On Tue, Jun 09, 2020 at 01:45:53PM +0100, Kieran Bingham wrote:
> I wouldn't normally go through spelling fixes, but I caught sight of
> this typo twice, and then foolishly grepped the tree for it, and saw how
> pervasive it was.
> 
> so here I am ... fixing a typo globally... but with an addition in
> scripts/spelling.txt so it shouldn't re-appear ;-)
> 
> Cc: linux-arm-ker...@lists.infradead.org (moderated list:TI DAVINCI MACHINE 
> SUPPORT)
> Cc: linux-ker...@vger.kernel.org (open list)
> Cc: linux...@vger.kernel.org (open list:DEVICE FREQUENCY EVENT 
> (DEVFREQ-EVENT))
> Cc: linux-g...@vger.kernel.org (open list:GPIO SUBSYSTEM)
> Cc: dri-de...@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: linux-r...@vger.kernel.org (open list:HFI1 DRIVER)
> Cc: linux-in...@vger.kernel.org (open list:INPUT (KEYBOARD, MOUSE, JOYSTICK, 
> TOUCHSCREEN)...)
> Cc: linux-...@lists.infradead.org (open list:NAND FLASH SUBSYSTEM)
> Cc: net...@vger.kernel.org (open list:NETWORKING DRIVERS)
> Cc: ath...@lists.infradead.org (open list:QUALCOMM ATHEROS ATH10K WIRELESS 
> DRIVER)
> Cc: linux-wirel...@vger.kernel.org (open list:NETWORKING DRIVERS (WIRELESS))
> Cc: linux-s...@vger.kernel.org (open list:IBM Power Virtual FC Device Drivers)
> Cc: linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 
> 64-BIT))
> Cc: linux-...@vger.kernel.org (open list:USB SUBSYSTEM)
> Cc: virtualizat...@lists.linux-foundation.org (open list:VIRTIO CORE AND NET 
> DRIVERS)
> Cc: linux...@kvack.org (open list:MEMORY MANAGEMENT)
> 
> 
> Kieran Bingham (17):
>   arch: arm: mach-davinci: Fix trivial spelling
>   drivers: infiniband: Fix trivial spelling
>   drivers: infiniband: Fix trivial spelling

I took these two RDMA patches and merged them, thanks

Jason


Re: [PATCH 0/8 v2] PCI: Align return values of PCIe capability and PCI accessors

2020-06-15 Thread Jason Gunthorpe
On Mon, Jun 15, 2020 at 09:32:17AM +0200, refactormys...@gmail.com wrote:
> From: Bolarinwa Olayemi Saheed 
> 
> 
> PATCH 1/8 to 7/8:
> PCIBIOS_ error codes have positive values and they are passed down the
> call heirarchy from accessors. For functions which are meant to return
> only a negative value on failure, passing on this value is a bug.
> To mitigate this, call pcibios_err_to_errno() before passing on return
> value from PCIe capability accessors call heirarchy. This function
> converts any positive PCIBIOS_ error codes to negative generic error
> values.
> 
> PATCH 8/8:
> The PCIe capability accessors can return 0, -EINVAL, or any PCIBIOS_ error
> code. The pci accessor on the other hand can only return 0 or any PCIBIOS_
> error code.This inconsistency among these accessor makes it harder for
> callers to check for errors.
> Return PCIBIOS_BAD_REGISTER_NUMBER instead of -EINVAL in all PCIe
> capability accessors.
> 
> MERGING:
> These may all be merged via the PCI tree, since it is a collection of
> similar fixes. This way they all get merged at once.

I prefer this not happen for active trees, it just risks needless
merge conflicts.

I will take the hfi1 patches at least, let me know when they are
reviewed

Thanks,
Jason


Re: [PATCH 08/12] docs: fix broken references to text files

2020-03-17 Thread Jason Gunthorpe
On Tue, Mar 17, 2020 at 02:10:47PM +0100, Mauro Carvalho Chehab wrote:
> Several references got broken due to txt to ReST conversion.
> 
> Several of them can be automatically fixed with:
> 
>   scripts/documentation-file-ref-check --fix
> 
> Signed-off-by: Mauro Carvalho Chehab 
>  Documentation/admin-guide/kernel-parameters.txt  |  2 +-
>  Documentation/memory-barriers.txt|  2 +-
>  Documentation/process/submit-checklist.rst   |  2 +-
>  .../translations/it_IT/process/submit-checklist.rst  |  2 +-
>  Documentation/translations/ko_KR/memory-barriers.txt |  2 +-
>  .../translations/zh_CN/filesystems/sysfs.txt |  2 +-
>  .../translations/zh_CN/process/submit-checklist.rst  |  2 +-
>  Documentation/virt/kvm/arm/pvtime.rst|  2 +-
>  Documentation/virt/kvm/devices/vcpu.rst  |  2 +-
>  Documentation/virt/kvm/hypercalls.rst|  4 ++--
>  arch/powerpc/include/uapi/asm/kvm_para.h |  2 +-
>  drivers/gpu/drm/Kconfig  |  2 +-
>  drivers/gpu/drm/drm_ioctl.c  |  2 +-
>  drivers/hwtracing/coresight/Kconfig  |  2 +-
>  fs/fat/Kconfig   |  8 
>  fs/fuse/Kconfig  |  2 +-
>  fs/fuse/dev.c|  2 +-
>  fs/overlayfs/Kconfig |  6 +++---
>  include/linux/mm.h   |  4 ++--
>  include/uapi/linux/ethtool_netlink.h |  2 +-
>  include/uapi/rdma/rdma_user_ioctl_cmds.h |  2 +-

For the rdma files

Acked-by: Jason Gunthorpe 

Jason


Re: [PATCH v3 0/7] Allow setting caching mode in arch_add_memory() for P2PDMA

2020-02-27 Thread Jason Gunthorpe
On Thu, Feb 27, 2020 at 09:55:04AM -0800, Dan Williams wrote:
> On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe  wrote:
> >
> > On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> > >
> > >
> > > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > > >> Instead of this, this series proposes a change to arch_add_memory()
> > > >> to take the pgprot required by the mapping which allows us to
> > > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > > >
> > > > Is there a particular reason why WC was selected here? I thought for
> > > > the p2pdma cases there was no kernel user that touched the memory?
> > >
> > > Yes, that's correct. I choose WC here because the existing users are
> > > registering memory blocks without side effects which fit the WC
> > > semantics well.
> >
> > Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
> > Linux, so while it is true the memory has no side effects, there would
> > be surprising concurrency risks if anything in the kernel tried to
> > write to it.
> >
> > Not compatible means the locks don't contain stores to WC memory the
> > way you would expect. AFAIK on many CPUs extra barriers are required
> > to keep WC stores ordered, the same way ARM already has extra barriers
> > to keep UC stores ordered with locking..
> >
> > The spinlocks are defined to contain UC stores though.
> 
> How are spinlocks and mutexes getting into p2pdma ranges in the first
> instance? Even with UC, the system has bigger problems if it's trying
> to send bus locks targeting PCI, see the flurry of activity of trying
> to trigger faults on split locks [1].

This is not what I was trying to explain.

Consider

 static spinlock lock; // CPU DRAM
 static idx = 0;
 u64 *wc_memory = [..];

 spin_lock();
 wc_memory[0] = idx++;
 spin_unlock();

You'd expect that the PCI device will observe stores where idx is
strictly increasing, but this is not guarenteed. idx may decrease, idx
may skip. It just won't duplicate.

Or perhaps

 wc_memory[0] = foo;
 writel(doorbell)

foo is not guarenteed observable by the device before doorbell reaches
the device.

All of these are things that do not happen with UC or NC memory, and
are surprising violations of our programming model.

Generic kernel code should never touch WC memory unless the code is
specifically designed to handle it.

Jason


Re: [PATCH v3 0/7] Allow setting caching mode in arch_add_memory() for P2PDMA

2020-02-27 Thread Jason Gunthorpe
On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> >> Instead of this, this series proposes a change to arch_add_memory()
> >> to take the pgprot required by the mapping which allows us to
> >> explicitly set pagetable entries for P2PDMA memory to WC.
> > 
> > Is there a particular reason why WC was selected here? I thought for
> > the p2pdma cases there was no kernel user that touched the memory?
> 
> Yes, that's correct. I choose WC here because the existing users are
> registering memory blocks without side effects which fit the WC
> semantics well.

Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
Linux, so while it is true the memory has no side effects, there would
be surprising concurrency risks if anything in the kernel tried to
write to it.

Not compatible means the locks don't contain stores to WC memory the
way you would expect. AFAIK on many CPUs extra barriers are required
to keep WC stores ordered, the same way ARM already has extra barriers
to keep UC stores ordered with locking..

The spinlocks are defined to contain UC stores though.

If there is no actual need today for WC I would suggest using UC as
the default.

Jason


Re: [PATCH v3 0/7] Allow setting caching mode in arch_add_memory() for P2PDMA

2020-02-27 Thread Jason Gunthorpe
On Fri, Feb 21, 2020 at 11:24:56AM -0700, Logan Gunthorpe wrote:
> Hi,
> 
> This is v3 of the patchset which cleans up a number of minor issues
> from the feedback of v2 and rebases onto v5.6-rc2. Additional feedback
> is welcome.
> 
> Thanks,
> 
> Logan
> 
> --
> 
> Changes in v3:
>  * Rebased onto v5.6-rc2
>  * Rename mhp_modifiers to mhp_params per David with an updated kernel
>doc per Dan
>  * Drop support for s390 per David seeing it does not support
>ZONE_DEVICE yet and there was a potential problem with huge pages.
>  * Added WARN_ON_ONCE in cases where arches recieve non PAGE_KERNEL
>parameters
>  * Collected David and Micheal's Reviewed-By and Acked-by Tags
> 
> Changes in v2:
>  * Rebased onto v5.5-rc5
>  * Renamed mhp_restrictions to mhp_modifiers and added the pgprot field
>to that structure instead of using an argument for
>arch_add_memory().
>  * Add patch to drop the unused flags field in mhp_restrictions
> 
> A git branch is available here:
> 
> https://github.com/sbates130272/linux-p2pmem remap_pages_cache_v3
> 
> --
> 
> Currently, the page tables created using memremap_pages() are always
> created with the PAGE_KERNEL cacheing mode. However, the P2PDMA code
> is creating pages for PCI BAR memory which should never be accessed
> through the cache and instead use either WC or UC. This still works in
> most cases, on x86, because the MTRR registers typically override the
> caching settings in the page tables for all of the IO memory to be
> UC-. However, this tends not to work so well on other arches or
> some rare x86 machines that have firmware which does not setup the
> MTRR registers in this way.
> 
> Instead of this, this series proposes a change to arch_add_memory()
> to take the pgprot required by the mapping which allows us to
> explicitly set pagetable entries for P2PDMA memory to WC.

Is there a particular reason why WC was selected here? I thought for
the p2pdma cases there was no kernel user that touched the memory?

I definitely forsee devices where we want UC instead.

Even so, the whole idea looks like the right direction to me.

Jason


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-23 Thread Jason Gunthorpe
On Fri, Dec 20, 2019 at 04:32:13PM -0800, Dan Williams wrote:

> > > There's already a limit, it's just a much larger one. :) What does "no 
> > > limit"
> > > really mean, numerically, to you in this case?
> >
> > I guess I mean 'hidden limit' - hitting the limit and failing would
> > be managable.
> >
> > I think 7 is probably too low though, but we are not using 1GB huge
> > pages, only 2M..
> 
> What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?

I don't think the failing testing is doing that.

It is also less likely that 1GB regions will need multi-mapping, IMHO.

Jason


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Jason Gunthorpe
On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > Hi,
> > > > > 
> > > > > This implements an API naming change (put_user_page*() -->
> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. 
> > > > > It
> > > > > extends that tracking to a few select subsystems. More subsystems will
> > > > > be added in follow up work.
> > > > 
> > > > Hi John,
> > > > 
> > > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > > allocated single memory block and registered multiple MRs using the 
> > > > single
> > > > block.
> > > > 
> > > > The possible bad flow is:
> > > >ib_umem_geti() ->
> > > > pin_user_pages_fast(FOLL_WRITE) ->
> > > >  internal_get_user_pages_fast(FOLL_WRITE) ->
> > > >   gup_pgd_range() ->
> > > >gup_huge_pd() ->
> > > > gup_hugepte() ->
> > > >  try_grab_compound_head() ->
> > > 
> > > Hi Leon,
> > > 
> > > Thanks very much for the detailed report! So we're overflowing...
> > > 
> > > At first look, this seems likely to be hitting a weak point in the
> > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > with huge pages if the pages have a *lot* of subpages.
> > > 
> > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > 
> > Considering that establishing these pins is entirely under user
> > control, we can't have a limit here.
> 
> There's already a limit, it's just a much larger one. :) What does "no limit"
> really mean, numerically, to you in this case?

I guess I mean 'hidden limit' - hitting the limit and failing would
be managable.

I think 7 is probably too low though, but we are not using 1GB huge
pages, only 2M..

> > If the number of allowed pins are exhausted then the
> > pin_user_pages_fast() must fail back to the user.
> 
> I'll poke around the IB call stack and see how much of that return
> path is in place, if any. Because it's the same situation for
> get_user_pages_fast().  This code just added a warning on overflow
> so we could spot it early.

All GUP callers must be prepared for failure, IB should be fine...

Jason


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread Jason Gunthorpe
On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated single memory block and registered multiple MRs using the single
> > block.
> > 
> > The possible bad flow is:
> >   ib_umem_geti() ->
> >pin_user_pages_fast(FOLL_WRITE) ->
> > internal_get_user_pages_fast(FOLL_WRITE) ->
> >  gup_pgd_range() ->
> >   gup_huge_pd() ->
> >gup_hugepte() ->
> > try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.

Considering that establishing these pins is entirely under user
control, we can't have a limit here.

If the number of allowed pins are exhausted then the
pin_user_pages_fast() must fail back to the user.

> 3. It would be nice if I could reproduce this. I have a two-node mlx5 
> Infiniband
> test setup, but I have done only the tiniest bit of user space IB coding, so
> if you have any test programs that aren't too hard to deal with that could
> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> that I'm not an advanced IB programmer. At all. :)

Clone this:

https://github.com/linux-rdma/rdma-core.git

Install all the required deps to build it (notably cython), see the README.md

$ ./build.sh
$ build/bin/run_tests.py 

If you get things that far I think Leon can get a reproduction for you

Jason


Re: [PATCH v7 07/24] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-11-24 Thread Jason Gunthorpe
On Sun, Nov 24, 2019 at 04:05:16PM -0800, John Hubbard wrote:
 
> I looked into this, and I believe that the problem is in gup.c. There appears 
> to
> have been an oversight, in commit 817be129e6f2 ("mm: validate 
> get_user_pages_fast
> flags"), in filtering out FOLL_FORCE. There is nothing in the _fast() 
> implementation
> that requires that we avoid writing to the pages.

I think it is too late to be doing these kinds of changes, I will
revert the patch and this will miss this merge window.

Jason

>From ec6cb45292d21d1af9b9d95997b8cf204bbe854c Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe 
Date: Sun, 24 Nov 2019 20:47:59 -0400
Subject: [PATCH] Revert "IB/umem: use get_user_pages_fast() to pin DMA pages"

This reverts commit c9a7a2ed837c563f9f89743a6db732591cb4035b.

This was merged before enough testing was done, and it triggers a WARN_ON()
in get_user_pages_fast():

  WARNING: CPU: 1 PID: 2557 at mm/gup.c:2404 get_user_pages_fast+0x115/0x180
  Call Trace:
   ib_umem_get+0x298/0x550 [ib_uverbs]
   mlx5_ib_db_map_user+0xad/0x130 [mlx5_ib]
   mlx5_ib_create_cq+0x1e8/0xaa0 [mlx5_ib]
   create_cq+0x1c8/0x2d0 [ib_uverbs]
   ib_uverbs_create_cq+0x70/0xa0 [ib_uverbs]
   ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
   ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
   ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
   ? kvm_clock_get_cycles+0xd/0x10
   ? kmem_cache_alloc+0x176/0x1c0
   ? filemap_map_pages+0x18c/0x350
   ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
   do_vfs_ioctl+0xa1/0x610
   ksys_ioctl+0x70/0x80
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x42/0x110
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

2404 if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
2405 return -EINVAL;

While we think this WARN_ON is probably bogus, resolving this will have to
wait.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/umem.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 214e87aa609d6e..7a3b99597eada1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,13 +266,16 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   ret = get_user_pages_fast(cur_base,
- min_t(unsigned long, npages,
-   PAGE_SIZE /
-   sizeof(struct page *)),
- gup_flags | FOLL_LONGTERM, page_list);
-   if (ret < 0)
+   down_read(>mmap_sem);
+   ret = get_user_pages(cur_base,
+min_t(unsigned long, npages,
+  PAGE_SIZE / sizeof (struct page *)),
+gup_flags | FOLL_LONGTERM,
+page_list, NULL);
+   if (ret < 0) {
+   up_read(>mmap_sem);
goto umem_release;
+   }
 
cur_base += ret * PAGE_SIZE;
npages   -= ret;
@@ -280,6 +283,8 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = ib_umem_add_sg_table(sg, page_list, ret,
dma_get_max_seg_size(context->device->dma_device),
>sg_nents);
+
+   up_read(>mmap_sem);
}
 
sg_mark_end(sg);
-- 
2.24.0



Re: [PATCH v7 07/24] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-11-21 Thread Jason Gunthorpe
On Thu, Nov 21, 2019 at 12:07:46AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 20, 2019 at 11:13:37PM -0800, John Hubbard wrote:
> > And get rid of the mmap_sem calls, as part of that. Note
> > that get_user_pages_fast() will, if necessary, fall back to
> > __gup_longterm_unlocked(), which takes the mmap_sem as needed.
> > 
> > Reviewed-by: Jan Kara 
> > Reviewed-by: Jason Gunthorpe 
> > Reviewed-by: Ira Weiny 
> > Signed-off-by: John Hubbard 
> 
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig 
> 
> Jason, can you queue this up for 5.5 to reduce this patch stack a bit?

Yes, I said I'd do this in an earlier revision. Now that it is clear this
won't go through Andrew's tree, applied to rdma for-next

Thanks,
Jason


Re: [PATCH v5 12/24] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

2019-11-15 Thread Jason Gunthorpe
On Thu, Nov 14, 2019 at 09:53:28PM -0800, John Hubbard wrote:
> Convert infiniband to use the new pin_user_pages*() calls.
> 
> Also, revert earlier changes to Infiniband ODP that had it using
> put_user_page(). ODP is "Case 3" in
> Documentation/core-api/pin_user_pages.rst, which is to say, normal
> get_user_pages() and put_page() is the API to use there.
> 
> The new pin_user_pages*() calls replace corresponding get_user_pages*()
> calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
> caller must return the pages via put_user_page*() calls, but infiniband
> was already doing that as part of an earlier commit.
> 
> Signed-off-by: John Hubbard 
> ---
>  drivers/infiniband/core/umem.c  |  2 +-
>  drivers/infiniband/core/umem_odp.c  | 13 ++---
>  drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
>  drivers/infiniband/hw/mthca/mthca_memfree.c |  2 +-
>  drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
>  drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
>  drivers/infiniband/sw/siw/siw_mem.c |  2 +-
>  8 files changed, 13 insertions(+), 14 deletions(-)

Ok

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v5 09/24] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-15 Thread Jason Gunthorpe
On Thu, Nov 14, 2019 at 09:53:25PM -0800, John Hubbard wrote:
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments accordingly, and update the VFIO caller
> to take advantage of this, fixing a bug as a result: the VFIO caller
> is logically a FOLL_LONGTERM user.
> 
> Also, remove an unnessary pair of calls that were releasing and
> reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
> just in order to call page_to_pfn().
> 
> Also, move the DAX check ("if a VMA is DAX, don't allow long term
> pinning") from the VFIO call site, all the way into the internals
> of get_user_pages_remote() and __gup_longterm_locked(). That is:
> get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> calls check_dax_vmas(). It's lightly explained in the comments as well.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> and to Dan Williams for helping clarify the DAX refactoring.
> 
> Suggested-by: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Jerome Glisse 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 30 +-
>  mm/gup.c    | 27 ++-
>  2 files changed, 27 insertions(+), 30 deletions(-)

Looks OK now  

Reviewed-by: Jason Gunthorpe 


Re: [PATCH v4 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-13 Thread Jason Gunthorpe
On Tue, Nov 12, 2019 at 08:26:55PM -0800, John Hubbard wrote:
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments accordingly, and update the VFIO caller
> to take advantage of this, fixing a bug as a result: the VFIO caller
> is logically a FOLL_LONGTERM user.
> 
> Also, remove an unnessary pair of calls that were releasing and
> reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
> just in order to call page_to_pfn().
> 
> Also, move the DAX check ("if a VMA is DAX, don't allow long term
> pinning") from the VFIO call site, all the way into the internals
> of get_user_pages_remote() and __gup_longterm_locked(). That is:
> get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> calls check_dax_vmas(). It's lightly explained in the comments as well.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> and to Dan Williams for helping clarify the DAX refactoring.
> 
> Suggested-by: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Jerome Glisse 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 
>  drivers/vfio/vfio_iommu_type1.c | 25 ++---
>  mm/gup.c| 27 ++-
>  2 files changed, 24 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d864277ea16f..7301b710c9a4 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -340,7 +340,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>  {
>   struct page *page[1];
>   struct vm_area_struct *vma;
> - struct vm_area_struct *vmas[1];
>   unsigned int flags = 0;
>   int ret;
>  
> @@ -348,33 +347,13 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>   flags |= FOLL_WRITE;
>  
>   down_read(>mmap_sem);
> - if (mm == current->mm) {
> - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> -  vmas);
> - } else {
> - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - vmas, NULL);
> - /*
> -  * The lifetime of a vaddr_get_pfn() page pin is
> -  * userspace-controlled. In the fs-dax case this could
> -  * lead to indefinite stalls in filesystem operations.
> -  * Disallow attempts to pin fs-dax pages via this
> -  * interface.
> -  */
> - if (ret > 0 && vma_is_fsdax(vmas[0])) {
> - ret = -EOPNOTSUPP;
> - put_page(page[0]);
> - }
> - }
> - up_read(>mmap_sem);
> -
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> + page, NULL, NULL);
>   if (ret == 1) {
>   *pfn = page_to_pfn(page[0]);
>   return 0;

Mind the return with the lock held this needs some goto unwind

Jason


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Jason Gunthorpe
On Tue, Nov 12, 2019 at 02:45:51PM -0800, Dan Williams wrote:
> On Tue, Nov 12, 2019 at 2:43 PM John Hubbard  wrote:
> >
> > On 11/12/19 12:43 PM, Jason Gunthorpe wrote:
> > ...
> > >> -}
> > >> +ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | 
> > >> FOLL_LONGTERM,
> > >> +page, vmas, NULL);
> > >> +/*
> > >> + * The lifetime of a vaddr_get_pfn() page pin is
> > >> + * userspace-controlled. In the fs-dax case this could
> > >> + * lead to indefinite stalls in filesystem operations.
> > >> + * Disallow attempts to pin fs-dax pages via this
> > >> + * interface.
> > >> + */
> > >> +if (ret > 0 && vma_is_fsdax(vmas[0])) {
> > >> +ret = -EOPNOTSUPP;
> > >> +put_page(page[0]);
> > >>  }
> > >
> > > AFAIK this chunk is redundant now as it is some hack to emulate
> > > FOLL_LONGTERM? So vmas can be deleted too.
> >
> > Let me first make sure I understand what Dan has in mind for the vma
> > checking, in the other thread...
> 
> It's not redundant relative to upstream which does not do anything the
> FOLL_LONGTERM in the gup-slow path... but I have not looked at patches
> 1-7 to see if something there made it redundant.

Oh, the hunk John had below for get_user_pages_remote() also needs to
call __gup_longterm_locked() when FOLL_LONGTERM is specified, then
that calls check_dax_vmas() which duplicates the vma_is_fsdax() check
above.

Certainly no caller of FOLL_LONGTERM should have to do dax specific
VMA checking.

Jason


Re: [PATCH v3 11/23] IB/{core,hw,umem}: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()

2019-11-12 Thread Jason Gunthorpe
On Mon, Nov 11, 2019 at 04:06:48PM -0800, John Hubbard wrote:
> @@ -542,7 +541,7 @@ static int ib_umem_odp_map_dma_single_page(
>   }
>  
>  out:
> - put_user_page(page);
> + put_page(page);
>  
>   if (remove_existing_mapping) {
>   ib_umem_notifier_start_account(umem_odp);
> @@ -639,13 +638,14 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp 
> *umem_odp, u64 user_virt,
>   /*
>* Note: this might result in redundent page getting. We can
>* avoid this by checking dma_list to be 0 before calling
> -  * get_user_pages. However, this make the code much more
> -  * complex (and doesn't gain us much performance in most use
> -  * cases).
> +  * get_user_pages. However, this makes the code much
> +  * more complex (and doesn't gain us much performance in most
> +  * use cases).
>*/
>   npages = get_user_pages_remote(owning_process, owning_mm,
> - user_virt, gup_num_pages,
> - flags, local_page_list, NULL, NULL);
> +user_virt, gup_num_pages,
> +flags, local_page_list, NULL,
> +NULL);
>   up_read(_mm->mmap_sem);

This is just whitespace churn? Drop it..

Jason


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Jason Gunthorpe
On Mon, Nov 11, 2019 at 04:06:45PM -0800, John Hubbard wrote:
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments accordingly, and update the VFIO caller
> to take advantage of this, fixing a bug as a result: the VFIO caller
> is logically a FOLL_LONGTERM user.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this.
> 
> Suggested-by: Jason Gunthorpe 
> Cc: Jerome Glisse 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 30 +-
>  mm/gup.c| 13 -
>  2 files changed, 21 insertions(+), 22 deletions(-)

This matches what I thought, but I think DanW should check it too, and
the vfio users should test..

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d864277ea16f..017689b7c32b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -348,24 +348,20 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>   flags |= FOLL_WRITE;
>  
>   down_read(>mmap_sem);
> - if (mm == current->mm) {
> - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> -  vmas);
> - } else {
> - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - vmas, NULL);
> - /*
> -  * The lifetime of a vaddr_get_pfn() page pin is
> -  * userspace-controlled. In the fs-dax case this could
> -  * lead to indefinite stalls in filesystem operations.
> -  * Disallow attempts to pin fs-dax pages via this
> -  * interface.
> -  */
> - if (ret > 0 && vma_is_fsdax(vmas[0])) {
> - ret = -EOPNOTSUPP;
> - put_page(page[0]);
> - }
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> + page, vmas, NULL);
> + /*
> +  * The lifetime of a vaddr_get_pfn() page pin is
> +  * userspace-controlled. In the fs-dax case this could
> +  * lead to indefinite stalls in filesystem operations.
> +  * Disallow attempts to pin fs-dax pages via this
> +  * interface.
> +  */
> + if (ret > 0 && vma_is_fsdax(vmas[0])) {
> + ret = -EOPNOTSUPP;
> + put_page(page[0]);
>   }

AFAIK this chunk is redundant now as it is some hack to emulate
FOLL_LONGTERM? So vmas can be deleted too.

Also unclear why this function has this:

up_read(>mmap_sem);

if (ret == 1) {
*pfn = page_to_pfn(page[0]);
return 0;
}

down_read(>mmap_sem);

Jason


Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-12 Thread Jason Gunthorpe
On Mon, Nov 11, 2019 at 04:06:37PM -0800, John Hubbard wrote:
> Hi,
> 
> The cover letter is long, so the more important stuff is first:
> 
> * Jason, if you or someone could look at the the VFIO cleanup (patch 8)
>   and conversion to FOLL_PIN (patch 18), to make sure it's use of
>   remote and longterm gup matches what we discussed during the review
>   of v2, I'd appreciate it.
> 
> * Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
>   converting from put_user_pages() to release_pages().

Why are we doing this? I think things got confused here someplace, as
the comment still says:

/**
 * put_user_page() - release a gup-pinned page
 * @page:pointer to page to be released
 *
 * Pages that were pinned via get_user_pages*() must be released via
 * either put_user_page(), or one of the put_user_pages*() routines
 * below.

I feel like if put_user_pages() is not the correct way to undo
get_user_pages() then it needs to be deleted.

Jason


Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 02:03:43PM -0800, John Hubbard wrote:
> On 11/4/19 12:57 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2019 at 12:48:13PM -0800, John Hubbard wrote:
> >> On 11/4/19 12:33 PM, Jason Gunthorpe wrote:
> >> ...
> >>>> diff --git a/drivers/infiniband/core/umem.c 
> >>>> b/drivers/infiniband/core/umem.c
> >>>> index 24244a2f68cc..c5a78d3e674b 100644
> >>>> +++ b/drivers/infiniband/core/umem.c
> >>>> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata 
> >>>> *udata, unsigned long addr,
> >>>>  
> >>>>  while (npages) {
> >>>>  down_read(>mmap_sem);
> >>>> -ret = get_user_pages(cur_base,
> >>>> +ret = pin_longterm_pages(cur_base,
> >>>>   min_t(unsigned long, npages,
> >>>> PAGE_SIZE / sizeof (struct 
> >>>> page *)),
> >>>> - gup_flags | FOLL_LONGTERM,
> >>>> - page_list, NULL);
> >>>> + gup_flags, page_list, NULL);
> >>>
> >>> FWIW, this one should be converted to fast as well, I think we finally
> >>> got rid of all the blockers for that?
> >>>
> >>
> >> I'm not aware of any blockers on the gup.c end, anyway. The only broken 
> >> thing we
> >> have there is "gup remote + FOLL_LONGTERM". But we can do "gup fast + 
> >> LONGTERM". 
> > 
> > I mean the use of the mmap_sem here is finally in a way where we can
> > just delete the mmap_sem and use _fast
> >  
> > ie, AFAIK there is no need for the mmap_sem to be held during
> > ib_umem_add_sg_table()
> > 
> > This should probably be a standalone patch however
> > 
> 
> Yes. Oh, actually I guess the patch flow should be: change to 
> get_user_pages_fast() and remove the mmap_sem calls, as one patch. And then 
> change 
> to pin_longterm_pages_fast() as the next patch. Otherwise, the internal 
> fallback
> from _fast to slow gup would attempt to take the mmap_sem (again) in the same
> thread, which is not good. :)
> 
> Or just defer the change until after this series. Either way is fine, let me
> know if you prefer one over the other.
> 
> The patch itself is trivial, but runtime testing to gain confidence that
> it's solid is much harder. Is there a stress test you would recommend for 
> that?
> (I'm not promising I can quickly run it yet--my local IB setup is still 
> nascent 
> at best.)

If you make a patch we can probably get it tested, it is something
we should do I keep forgetting about.

Jason


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 12:57:59PM -0800, John Hubbard wrote:
> On 11/4/19 12:37 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> >>> Note for Jason: the (a) or (b) items are talking about the vfio case, 
> >>> which is
> >>> one of the two call sites that now use pin_longterm_pages_remote(), and 
> >>> the
> >>> other one is infiniband:
> >>>
> >>> drivers/infiniband/core/umem_odp.c:646: npages = 
> >>> pin_longterm_pages_remote(owning_process, owning_mm,
> >>> drivers/vfio/vfio_iommu_type1.c:353:ret = 
> >>> pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> >>
> >> vfio should be reverted until it can be properly implemented.
> >> The issue is that when you fix the implementation you might
> >> break vfio existing user and thus regress the kernel from user
> >> point of view. So i rather have the change to vfio reverted,
> >> i believe it was not well understood when it got upstream,
> >> between in my 5.4 tree it is still gup_remote not longterm.
> > 
> > It is clearly a bug, vfio must use LONGTERM, and does right above this
> > remote call:
> > 
> > if (mm == current->mm) {
> > ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> >  vmas);
> > } else {
> > ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> > vmas, NULL);
> > 
> > 
> > I'm not even sure that it really makes any sense to build a 'if' like
> > that, surely just always call remote??
> > 
> 
> 
> Right, and I thought about this when converting, and realized that the above 
> code is working around the current gup.c limitations, which are "cannot 
> support
> gup remote with FOLL_LONGTERM".

But AFAICT it doesn't have a problem, the protection test is just too
strict, and I guess the control flow needs a bit of fixing..

The issue is this:

static __always_inline long __get_user_pages_locked():
{
if (locked) {
/* if VM_FAULT_RETRY can be returned, vmas become invalid */
BUG_ON(vmas);
/* check caller initialized locked */
BUG_ON(*locked != 1);
}


so remote could be written as:

if (gup_flags & FOLL_LONGTERM) {
   if (WARN_ON_ONCE(locked))
return -EINVAL;
   return __gup_longterm_locked(...)
}

return __get_user_pages_locked(...)

??

Jason


Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 12:48:13PM -0800, John Hubbard wrote:
> On 11/4/19 12:33 PM, Jason Gunthorpe wrote:
> ...
> >> diff --git a/drivers/infiniband/core/umem.c 
> >> b/drivers/infiniband/core/umem.c
> >> index 24244a2f68cc..c5a78d3e674b 100644
> >> +++ b/drivers/infiniband/core/umem.c
> >> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
> >> unsigned long addr,
> >>  
> >>while (npages) {
> >>down_read(>mmap_sem);
> >> -  ret = get_user_pages(cur_base,
> >> +  ret = pin_longterm_pages(cur_base,
> >> min_t(unsigned long, npages,
> >>   PAGE_SIZE / sizeof (struct page *)),
> >> -   gup_flags | FOLL_LONGTERM,
> >> -   page_list, NULL);
> >> +   gup_flags, page_list, NULL);
> > 
> > FWIW, this one should be converted to fast as well, I think we finally
> > got rid of all the blockers for that?
> > 
> 
> I'm not aware of any blockers on the gup.c end, anyway. The only broken thing 
> we
> have there is "gup remote + FOLL_LONGTERM". But we can do "gup fast + 
> LONGTERM". 

I mean the use of the mmap_sem here is finally in a way where we can
just delete the mmap_sem and use _fast
 
ie, AFAIK there is no need for the mmap_sem to be held during
ib_umem_add_sg_table()

This should probably be a standalone patch however

Jason


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> > Note for Jason: the (a) or (b) items are talking about the vfio case, which 
> > is
> > one of the two call sites that now use pin_longterm_pages_remote(), and the
> > other one is infiniband:
> > 
> > drivers/infiniband/core/umem_odp.c:646: npages = 
> > pin_longterm_pages_remote(owning_process, owning_mm,
> > drivers/vfio/vfio_iommu_type1.c:353:ret = 
> > pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> 
> vfio should be reverted until it can be properly implemented.
> The issue is that when you fix the implementation you might
> break vfio existing user and thus regress the kernel from user
> point of view. So i rather have the change to vfio reverted,
> i believe it was not well understood when it got upstream,
> between in my 5.4 tree it is still gup_remote not longterm.

It is clearly a bug, vfio must use LONGTERM, and does right above this
remote call:

if (mm == current->mm) {
ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
 vmas);
} else {
ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
vmas, NULL);


I'm not even sure that it really makes any sense to build a 'if' like
that, surely just always call remote??

Jason


Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()

2019-11-04 Thread Jason Gunthorpe
On Sun, Nov 03, 2019 at 01:18:02PM -0800, John Hubbard wrote:
> Convert infiniband to use the new wrapper calls, and stop
> explicitly setting FOLL_LONGTERM at the call sites.
> 
> The new pin_longterm_*() calls replace get_user_pages*()
> calls, and set both FOLL_LONGTERM and a new FOLL_PIN
> flag. The FOLL_PIN flag requires that the caller must
> return the pages via put_user_page*() calls, but
> infiniband was already doing that as part of an earlier
> commit.
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 
>  drivers/infiniband/core/umem.c  |  5 ++---
>  drivers/infiniband/core/umem_odp.c  | 10 +-
>  drivers/infiniband/hw/hfi1/user_pages.c |  4 ++--
>  drivers/infiniband/hw/mthca/mthca_memfree.c |  3 +--
>  drivers/infiniband/hw/qib/qib_user_pages.c  |  8 
>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
>  drivers/infiniband/hw/usnic/usnic_uiom.c|  9 -
>  drivers/infiniband/sw/siw/siw_mem.c |  5 ++---
>  8 files changed, 21 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 24244a2f68cc..c5a78d3e674b 100644
> +++ b/drivers/infiniband/core/umem.c
> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
> unsigned long addr,
>  
>   while (npages) {
>   down_read(>mmap_sem);
> - ret = get_user_pages(cur_base,
> + ret = pin_longterm_pages(cur_base,
>min_t(unsigned long, npages,
>  PAGE_SIZE / sizeof (struct page *)),
> -  gup_flags | FOLL_LONGTERM,
> -  page_list, NULL);
> +  gup_flags, page_list, NULL);

FWIW, this one should be converted to fast as well, I think we finally
got rid of all the blockers for that?

Jason


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:

> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646: npages = 
> pin_longterm_pages_remote(owning_process, owning_mm,

This is a mistake, it is not a longterm pin and does not need FOLL_PIN
semantics

> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 

I think since you don't need to change this site things should be
fine?

Jason


Re: [PATCH v9 0/8] KVM: PPC: Driver to manage pages of secure guest

2019-09-25 Thread Jason Gunthorpe
On Wed, Sep 25, 2019 at 10:36:41AM +0530, Bharata B Rao wrote:
> [The main change in this version is the introduction of new
> locking to prevent concurrent page-in and page-out calls. More
> details about this are present in patch 2/8]
> 
> Hi,
> 
> A pseries guest can be run as a secure guest on Ultravisor-enabled
> POWER platforms. On such platforms, this driver will be used to manage
> the movement of guest pages between the normal memory managed by
> hypervisor(HV) and secure memory managed by Ultravisor(UV).
> 
> Private ZONE_DEVICE memory equal to the amount of secure memory
> available in the platform for running secure guests is created.
> Whenever a page belonging to the guest becomes secure, a page from
> this private device memory is used to represent and track that secure
> page on the HV side. The movement of pages between normal and secure
> memory is done via migrate_vma_pages(). The reverse movement is driven
> via pagemap_ops.migrate_to_ram().
> 
> The page-in or page-out requests from UV will come to HV as hcalls and
> HV will call back into UV via uvcalls to satisfy these page requests.
> 
> These patches are against hmm.git
> (https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm)
> 
> plus
> 
> Claudio Carvalho's base ultravisor enablement patches that are present
> in Michael Ellerman's tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=topic/ppc-kvm)
> 
> These patches along with Claudio's above patches are required to
> run secure pseries guests on KVM. This patchset is based on hmm.git
> because hmm.git has migrate_vma cleanup and not-device memremap_pages
> patchsets that are required by this patchset.

FWIW this is all merged to Linus now and will be in rc1

Jason


Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-09-25 Thread Jason Gunthorpe
On Wed, Sep 25, 2019 at 10:36:43AM +0530, Bharata B Rao wrote:
> Manage migration of pages betwen normal and secure memory of secure
> guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> 
> H_SVM_PAGE_IN: Move the content of a normal page to secure page
> H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> 
> Private ZONE_DEVICE memory equal to the amount of secure memory
> available in the platform for running secure guests is created.
> Whenever a page belonging to the guest becomes secure, a page from
> this private device memory is used to represent and track that secure
> page on the HV side. The movement of pages between normal and secure
> memory is done via migrate_vma_pages() using UV_PAGE_IN and
> UV_PAGE_OUT ucalls.
> 
> Signed-off-by: Bharata B Rao 
>  arch/powerpc/include/asm/hvcall.h   |   4 +
>  arch/powerpc/include/asm/kvm_book3s_uvmem.h |  29 ++
>  arch/powerpc/include/asm/kvm_host.h |  13 +
>  arch/powerpc/include/asm/ultravisor-api.h   |   2 +
>  arch/powerpc/include/asm/ultravisor.h   |  14 +
>  arch/powerpc/kvm/Makefile   |   3 +
>  arch/powerpc/kvm/book3s_hv.c|  20 +
>  arch/powerpc/kvm/book3s_hv_uvmem.c  | 481 
>  8 files changed, 566 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
>  create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index 2023e327..2595d0144958 100644
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -342,6 +342,10 @@
>  #define H_TLB_INVALIDATE 0xF808
>  #define H_COPY_TOFROM_GUEST  0xF80C
>  
> +/* Platform-specific hcalls used by the Ultravisor */
> +#define H_SVM_PAGE_IN0xEF00
> +#define H_SVM_PAGE_OUT   0xEF04
> +
>  /* Values for 2nd argument to H_SET_MODE */
>  #define H_SET_MODE_RESOURCE_SET_CIABR1
>  #define H_SET_MODE_RESOURCE_SET_DAWR 2
> diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
> b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> new file mode 100644
> index ..9603c2b48d67
> +++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __POWERPC_KVM_PPC_HMM_H__
> +#define __POWERPC_KVM_PPC_HMM_H__

This is a strange sentinal for a header called kvm_book3s_uvmem.h

Jason


Re: [PATCH v5 7/7] KVM: PPC: Ultravisor: Add PPC_UV config option

2019-07-10 Thread Jason Gunthorpe
On Wed, Jul 10, 2019 at 08:24:56AM -0500, janani wrote:
> On 2019-07-09 05:25, Bharata B Rao wrote:
> > From: Anshuman Khandual 
> > 
> > CONFIG_PPC_UV adds support for ultravisor.
> > 
> > Signed-off-by: Anshuman Khandual 
> > Signed-off-by: Bharata B Rao 
> > Signed-off-by: Ram Pai 
> > [ Update config help and commit message ]
> > Signed-off-by: Claudio Carvalho 
>  Reviewed-by: Janani Janakiraman 
> >  arch/powerpc/Kconfig | 20 
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index f0e5b38d52e8..20c6c213d2be 100644
> > +++ b/arch/powerpc/Kconfig
> > @@ -440,6 +440,26 @@ config PPC_TRANSACTIONAL_MEM
> >   Support user-mode Transactional Memory on POWERPC.
> > 
> > +config PPC_UV
> > +   bool "Ultravisor support"
> > +   depends on KVM_BOOK3S_HV_POSSIBLE
> > +   select HMM_MIRROR
> > +   select HMM
> > +   select ZONE_DEVICE

These configs have also been changed lately, I didn't see any calls to
hmm_mirror in this patchset, so most likely the two HMM selects should
be dropped and all you'll need is ZONE_DEVICE..

Jason


Re: [PATCH v5 1/7] kvmppc: HMM backend driver to manage pages of secure guest

2019-07-10 Thread Jason Gunthorpe
On Tue, Jul 09, 2019 at 01:55:28PM -0500, janani wrote:

> > +int kvmppc_hmm_init(void)
> > +{
> > +   int ret = 0;
> > +   unsigned long size;
> > +
> > +   size = kvmppc_get_secmem_size();
> > +   if (!size) {
> > +   ret = -ENODEV;
> > +   goto out;
> > +   }
> > +
> > +   kvmppc_hmm.device = hmm_device_new(NULL);
> > +   if (IS_ERR(kvmppc_hmm.device)) {
> > +   ret = PTR_ERR(kvmppc_hmm.device);
> > +   goto out;
> > +   }
> > +
> > +   kvmppc_hmm.devmem = hmm_devmem_add(_hmm_devmem_ops,
> > +  _hmm.device->device, size);
> > +   if (IS_ERR(kvmppc_hmm.devmem)) {
> > +   ret = PTR_ERR(kvmppc_hmm.devmem);
> > +   goto out_device;
> > +   }

This 'hmm_device' API family was recently deleted from hmm:

commit 07ec38917e68f0114b9c8aeeb1c584b5e73e4dd6
Author: Christoph Hellwig 
Date:   Wed Jun 26 14:27:01 2019 +0200

mm: remove the struct hmm_device infrastructure

This code is a trivial wrapper around device model helpers, which
should have been integrated into the driver device model usage from
the start.  Assuming it actually had users, which it never had since
the code was added more than 1 1/2 years ago.

This patch should use the driver core directly instead.

Regards,
Jason


Re: [PATCH v2] tpm: tpm_ibm_vtpm: Fix unallocated banks

2019-07-08 Thread Jason Gunthorpe
On Mon, Jul 08, 2019 at 06:24:04PM -0400, Mimi Zohar wrote:
> Hi Jarkko,
> 
> On Mon, 2019-07-08 at 18:11 +0300, Jarkko Sakkinen wrote:
> > On Sat, 2019-07-06 at 20:18 -0400, Nayna Jain wrote:
> > > +/*
> > > + * tpm_get_pcr_allocation() - initialize the chip allocated banks for 
> > > PCRs
> > > + * @chip: TPM chip to use.
> > > + */
> > > +static int tpm_get_pcr_allocation(struct tpm_chip *chip)
> > > +{
> > > + int rc;
> > > +
> > > + if (chip->flags & TPM_CHIP_FLAG_TPM2)
> > > + rc = tpm2_get_pcr_allocation(chip);
> > > + else
> > > + rc = tpm1_get_pcr_allocation(chip);
> > > +
> > > + return rc;
> > > +}
> > 
> > It is just a trivial static function, which means that kdoc comment is
> > not required and neither it is useful. Please remove that. I would
> > rewrite the function like:
> > 
> > static int tpm_get_pcr_allocation(struct tpm_chip *chip)
> > {
> > int rc;
> > 
> > rc = (chip->flags & TPM_CHIP_FLAG_TPM2) ?
> >  tpm2_get_pcr_allocation(chip) :
> >  tpm1_get_pcr_allocation(chip);
> 
> > 
> > return rc > 0 ? -ENODEV : rc;
> > }
> > 
> > This addresses the issue that Stefan also pointed out. You have to
> > deal with the TPM error codes.
> 
> Hm, in the past I was told by Christoph not to use the ternary
> operator.  Have things changed?  Other than removing the comment, the
> only other difference is the return.

I also dislike most use of ternary expressions..

Also, it is not so nice that TPM error codes and errno are multiplexed
on the same 'int' type - very hard to get this right. I'm not sure
anything actually needs the tpm error, and if it does maybe we should
be mapping those special cases to errno instead.

Jason


Re: [PATCH 11/16] mm: consolidate the get_user_pages* implementations

2019-06-25 Thread Jason Gunthorpe
On Tue, Jun 25, 2019 at 09:56:50AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 21, 2019 at 11:41:31AM -0300, Jason Gunthorpe wrote:
> > >  static bool gup_fast_permitted(unsigned long start, unsigned long end)
> > >  {
> > > - return true;
> > > + return IS_ENABLED(CONFIG_HAVE_FAST_GUP) ? true : false;
> > 
> > The ?: is needed with IS_ENABLED?
> 
> It shouldn't, I'll fix it up.
> 
> > I'd suggest to revise this block a tiny bit:
> > 
> > -#ifndef gup_fast_permitted
> > +#if !IS_ENABLED(CONFIG_HAVE_FAST_GUP) || !defined(gup_fast_permitted)
> >  /*
> >   * Check if it's allowed to use __get_user_pages_fast() for the range, or
> >   * we need to fall back to the slow version:
> >   */
> > -bool gup_fast_permitted(unsigned long start, int nr_pages)
> > +static bool gup_fast_permitted(unsigned long start, int nr_pages)
> >  {
> > 
> > Just in case some future arch code mismatches the header and kconfig..
> 
> IS_ENABLED outside a function doesn't really make sense.  But I'll
> just life the IS_ENABLED(CONFIG_HAVE_FAST_GUP) checks into the two
> callers.

I often see '#if IS_ENABLED(CONFIG_X)', IIRC last I looked at that, it
was needed because the usual #ifdef CONFIG_X didn't work if the value
was =m?

Would be interested to know if that is not the right way to use
kconfig

Jason


Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Jason Gunthorpe
On Fri, Jun 21, 2019 at 09:35:11AM -0600, Khalid Aziz wrote:
> On 6/21/19 7:39 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
> >> This will allow sparc64 to override its ADI tags for
> >> get_user_pages and get_user_pages_fast.
> >>
> >> Signed-off-by: Christoph Hellwig 
> >>  mm/gup.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index ddde097cf9e4..6bb521db67ec 100644
> >> +++ b/mm/gup.c
> >> @@ -2146,7 +2146,7 @@ int __get_user_pages_fast(unsigned long start, int 
> >> nr_pages, int write,
> >>unsigned long flags;
> >>int nr = 0;
> >>  
> >> -  start &= PAGE_MASK;
> >> +  start = untagged_addr(start) & PAGE_MASK;
> >>len = (unsigned long) nr_pages << PAGE_SHIFT;
> >>end = start + len;
> > 
> > Hmm, this function, and the other, goes on to do:
> > 
> > if (unlikely(!access_ok((void __user *)start, len)))
> > return 0;
> > 
> > and I thought that access_ok takes in the tagged pointer?
> > 
> > How about re-order it a bit?
> 
> access_ok() can handle tagged or untagged pointers. It just strips the
> tag bits from the top bits. Current order doesn't really matter from
> functionality point of view. There might be minor gain in delaying
> untagging in __get_user_pages_fast() but I could go either way.

I understand the current ARM and SPARC implementations don't do much
with the tags, but it feels like a really big assumption for the core
code that all future uses of tags will be fine to have them stripped
out of 'void __user *' pointers. IMHO that is something we should not
be doing in the core kernel..

Jason


Re: switch the remaining architectures to use generic GUP v3

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:46PM +0200, Christoph Hellwig wrote:
> Hi Linus and maintainers,
> 
> below is a series to switch mips, sh and sparc64 to use the generic
> GUP code so that we only have one codebase to touch for further
> improvements to this code.  I don't have hardware for any of these
> architectures, and generally no clue about their page table
> management, so handle with care.

I like this series, ther are alot of people talking about work for GUP
and this will make any of that so much easier to do.

Jason


Re: [PATCH 11/16] mm: consolidate the get_user_pages* implementations

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:57PM +0200, Christoph Hellwig wrote:
> @@ -2168,7 +2221,7 @@ static void gup_pgd_range(unsigned long addr, unsigned 
> long end,
>   */
>  static bool gup_fast_permitted(unsigned long start, unsigned long end)
>  {
> - return true;
> + return IS_ENABLED(CONFIG_HAVE_FAST_GUP) ? true : false;

The ?: is needed with IS_ENABLED?

>  }
>  #endif

Oh, you fixed the util.c this way instead of the headerfile
#ifdef..

I'd suggest to revise this block a tiny bit:

-#ifndef gup_fast_permitted
+#if !IS_ENABLED(CONFIG_HAVE_FAST_GUP) || !defined(gup_fast_permitted)
 /*
  * Check if it's allowed to use __get_user_pages_fast() for the range, or
  * we need to fall back to the slow version:
  */
-bool gup_fast_permitted(unsigned long start, int nr_pages)
+static bool gup_fast_permitted(unsigned long start, int nr_pages)
 {

Just in case some future arch code mismatches the header and kconfig..

Regards,
Jason


Re: [PATCH 10/16] mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:56PM +0200, Christoph Hellwig wrote:
> We only support the generic GUP now, so rename the config option to
> be more clear, and always use the mm/Kconfig definition of the
> symbol and select it from the arch Kconfigs.

Looks OK to me

Reviewed-by: Jason Gunthorpe 

But could you also roll something like this in to the series? There is
no longer any reason for the special __weak stuff that I can see -
just follow the normal pattern for stubbing config controlled
functions through the header file.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b76c..13b1cb573383d5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1561,8 +1561,17 @@ long get_user_pages_locked(unsigned long start, unsigned 
long nr_pages,
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
 
+#ifdef CONFIG_HAVE_FAST_GUP
 int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
+#else
+static inline int get_user_pages_fast(unsigned long start, int nr_pages,
+ unsigned int gup_flags,
+ struct page **pages)
+{
+   return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
+}
+#endif
 
 /* Container for pinned pfns / pages */
 struct frame_vector {
@@ -1668,8 +1677,17 @@ extern int mprotect_fixup(struct vm_area_struct *vma,
 /*
  * doesn't attempt to fault and will return short.
  */
+#ifdef CONFIG_HAVE_FAST_GUP
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages);
+#else
+static inline int __get_user_pages_fast(unsigned long start, int nr_pages,
+   int write, struct page **pages)
+{
+   return 0;
+}
+#endif
+
 /*
  * per-process(per-mm_struct) statistics.
  */
diff --git a/mm/util.c b/mm/util.c
index 9834c4ab7d8e86..68575a315dc5ad 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -300,53 +300,6 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
rlimit *rlim_stack)
 }
 #endif
 
-/*
- * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
- * back to the regular GUP.
- * Note a difference with get_user_pages_fast: this always returns the
- * number of pages pinned, 0 if no pages were pinned.
- * If the architecture does not support this function, simply return with no
- * pages pinned.
- */
-int __weak __get_user_pages_fast(unsigned long start,
-int nr_pages, int write, struct page **pages)
-{
-   return 0;
-}
-EXPORT_SYMBOL_GPL(__get_user_pages_fast);
-
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start: starting user address
- * @nr_pages:  number of pages from start to pin
- * @gup_flags: flags modifying pin behaviour
- * @pages: array that receives pointers to the pages pinned.
- * Should be at least nr_pages long.
- *
- * get_user_pages_fast provides equivalent functionality to get_user_pages,
- * operating on current and current->mm, with force=0 and vma=NULL. However
- * unlike get_user_pages, it must be called without mmap_sem held.
- *
- * get_user_pages_fast may take mmap_sem and page table locks, so no
- * assumptions can be made about lack of locking. get_user_pages_fast is to be
- * implemented in a way that is advantageous (vs get_user_pages()) when the
- * user memory area is already faulted in and present in ptes. However if the
- * pages have to be faulted in, it may turn out to be slightly slower so
- * callers need to carefully consider what to use. On many architectures,
- * get_user_pages_fast simply falls back to get_user_pages.
- *
- * Return: number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int __weak get_user_pages_fast(unsigned long start,
-   int nr_pages, unsigned int gup_flags,
-   struct page **pages)
-{
-   return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
-}
-EXPORT_SYMBOL_GPL(get_user_pages_fast);
-
 unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)


Re: [PATCH 04/16] MIPS: use the generic get_user_pages_fast code

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:50PM +0200, Christoph Hellwig wrote:
> diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
> index 4ccb465ef3f2..7d27194e3b45 100644
> +++ b/arch/mips/include/asm/pgtable.h
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct mm_struct;
>  struct vm_area_struct;
> @@ -626,6 +627,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
> mm_struct *mm,
>  
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +#define gup_fast_permitted(start, end)   (!cpu_has_dc_aliases)
> +

Today this check is only being done on the get_user_pages_fast() -
after this patch it is also done for __get_user_pages_fast().

Which means __get_user_pages_fast is now non-functional on a range of
MIPS CPUs, but that seems OK as far as I can tell, so:

Reviewed-by: Jason Gunthorpe 

However, looks to me like this patch is also a bug fix for this:

commit 5b167c123b3c3582f62cf1896465019bc40fe526
Author: Kamal Dasu 
Date:   Fri Jun 14 17:10:03 2013 +

MIPS: Fix get_user_page_fast() for mips with cache alias

get_user_pages_fast() is missing cache flushes for MIPS platforms with
cache aliases.  Filesystem failures observed with DirectIO operations due
to missing flush_anon_page() that use page coloring logic to work with
cache aliases. This fix falls through to take slow_irqon path that calls
get_user_pages() that has required logic for platforms where
cpu_has_dc_aliases is true.

> - pgdp = pgd_offset(mm, addr);
> - do {
> - pgd_t pgd = *pgdp;
> -
> - next = pgd_addr_end(addr, end);
> - if (pgd_none(pgd))
> - goto slow;
> - if (!gup_pud_range(pgd, addr, next, gup_flags & FOLL_WRITE,
> -pages, ))

This is different too, the core code has a p4d layer, but I see that
whole thing gets NOP'd by the compiler as mips uses pgtable-nop4d.h -
right?

Jason


Re: [PATCH 03/16] mm: lift the x86_32 PAE version of gup_get_pte to common code

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:49PM +0200, Christoph Hellwig wrote:
> The split low/high access is the only non-READ_ONCE version of
> gup_get_pte that did show up in the various arch implemenations.
> Lift it to common code and drop the ifdef based arch override.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/x86/Kconfig  |  1 +
>  arch/x86/include/asm/pgtable-3level.h | 47 
>  arch/x86/kvm/mmu.c|  2 +-
>  mm/Kconfig|  3 ++
>  mm/gup.c  | 51 ---
>  5 files changed, 52 insertions(+), 52 deletions(-)

Yep, the sh and mips conversions look right too.

Reviewed-by: Jason Gunthorpe 
 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index f0c76ba47695..fe51f104a9e0 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -762,6 +762,9 @@ config GUP_BENCHMARK
>  
> See tools/testing/selftests/vm/gup_benchmark.c
>
> +config GUP_GET_PTE_LOW_HIGH
> + bool
> +

The config name seems a bit out of place though, should it be prefixed
with GENERIC_ or ARCH_?

Jason


Re: [PATCH 02/16] mm: simplify gup_fast_permitted

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:48PM +0200, Christoph Hellwig wrote:
> Pass in the already calculated end value instead of recomputing it, and
> leave the end > start check in the callers instead of duplicating them
> in the arch code.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/s390/include/asm/pgtable.h   |  8 +---
>  arch/x86/include/asm/pgtable_64.h |  8 +---
>  mm/gup.c  | 17 +++--
>  3 files changed, 9 insertions(+), 24 deletions(-)

Much cleaner

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
> This will allow sparc64 to override its ADI tags for
> get_user_pages and get_user_pages_fast.
> 
> Signed-off-by: Christoph Hellwig 
>  mm/gup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e4..6bb521db67ec 100644
> +++ b/mm/gup.c
> @@ -2146,7 +2146,7 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>   unsigned long flags;
>   int nr = 0;
>  
> - start &= PAGE_MASK;
> + start = untagged_addr(start) & PAGE_MASK;
>   len = (unsigned long) nr_pages << PAGE_SHIFT;
>   end = start + len;

Hmm, this function, and the other, goes on to do:

if (unlikely(!access_ok((void __user *)start, len)))
return 0;

and I thought that access_ok takes in the tagged pointer?

How about re-order it a bit?

diff --git a/mm/gup.c b/mm/gup.c
index ddde097cf9e410..f48747ced4723b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2148,11 +2148,12 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
 
start &= PAGE_MASK;
len = (unsigned long) nr_pages << PAGE_SHIFT;
-   end = start + len;
-
if (unlikely(!access_ok((void __user *)start, len)))
return 0;
 
+   start = untagged_ptr(start);
+   end = start + len;
+
/*
 * Disable interrupts.  We use the nested form as we can already have
 * interrupts disabled by get_futex_key.


Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
> This will allow sparc64 to override its ADI tags for
> get_user_pages and get_user_pages_fast.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/gup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 


Re: [PATCH] mm: add account_locked_vm utility function

2019-05-03 Thread Jason Gunthorpe
On Fri, May 03, 2019 at 01:16:30PM -0700, Daniel Jordan wrote:
> locked_vm accounting is done roughly the same way in five places, so
> unify them in a helper.  Standardize the debug prints, which vary
> slightly.  Error codes stay the same, so user-visible behavior does too.
> 
> Signed-off-by: Daniel Jordan 
> Cc: Alan Tull 
> Cc: Alexey Kardashevskiy 
> Cc: Alex Williamson 
> Cc: Andrew Morton 
> Cc: Benjamin Herrenschmidt 
> Cc: Christoph Lameter 
> Cc: Christophe Leroy 
> Cc: Davidlohr Bueso 
> Cc: Jason Gunthorpe 
> Cc: Mark Rutland 
> Cc: Michael Ellerman 
> Cc: Moritz Fischer 
> Cc: Paul Mackerras 
> Cc: Steve Sistare 
> Cc: Wu Hao 
> Cc: linux...@kvack.org
> Cc: k...@vger.kernel.org
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-f...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> 
> Based on v5.1-rc7.  Tested with the vfio type1 driver.  Any feedback
> welcome.
> 
> Andrew, this one patch replaces these six from [1]:
> 
> mm-change-locked_vms-type-from-unsigned-long-to-atomic64_t.patch
> vfio-type1-drop-mmap_sem-now-that-locked_vm-is-atomic.patch
> vfio-spapr_tce-drop-mmap_sem-now-that-locked_vm-is-atomic.patch
> fpga-dlf-afu-drop-mmap_sem-now-that-locked_vm-is-atomic.patch
> kvm-book3s-drop-mmap_sem-now-that-locked_vm-is-atomic.patch
> powerpc-mmu-drop-mmap_sem-now-that-locked_vm-is-atomic.patch
> 
> That series converts locked_vm to an atomic, but on closer inspection causes 
> at
> least one accounting race in mremap, and fixing it just for this type
> conversion came with too much ugly in the core mm to justify, especially when
> the right long-term fix is making these drivers use pinned_vm instead.

Did we ever decide what to do here? Should all these drivers be
switched to pinned_vm anyhow?

Jason


Re: [PATCH 5/6] powerpc/mmu: drop mmap_sem now that locked_vm is atomic

2019-04-24 Thread Jason Gunthorpe
On Tue, Apr 23, 2019 at 07:15:44PM -0700, Davidlohr Bueso wrote:
> On Wed, 03 Apr 2019, Daniel Jordan wrote:
> 
> > On Wed, Apr 03, 2019 at 06:58:45AM +0200, Christophe Leroy wrote:
> > > Le 02/04/2019 à 22:41, Daniel Jordan a écrit :
> > > > With locked_vm now an atomic, there is no need to take mmap_sem as
> > > > writer.  Delete and refactor accordingly.
> > > 
> > > Could you please detail the change ?
> > 
> > Ok, I'll be more specific in the next version, using some of your language 
> > in
> > fact.  :)
> > 
> > > It looks like this is not the only
> > > change. I'm wondering what the consequences are.
> > > 
> > > Before we did:
> > > - lock
> > > - calculate future value
> > > - check the future value is acceptable
> > > - update value if future value acceptable
> > > - return error if future value non acceptable
> > > - unlock
> > > 
> > > Now we do:
> > > - atomic update with future (possibly too high) value
> > > - check the new value is acceptable
> > > - atomic update back with older value if new value not acceptable and 
> > > return
> > > error
> > > 
> > > So if a concurrent action wants to increase locked_vm with an acceptable
> > > step while another one has temporarily set it too high, it will now fail.
> > > 
> > > I think we should keep the previous approach and do a cmpxchg after
> > > validating the new value.
> 
> Wouldn't the cmpxchg alternative also be exposed the locked_vm changing 
> between
> validating the new value and the cmpxchg() and we'd bogusly fail even when 
> there
> is still just because the value changed (I'm assuming we don't hold any locks,
> otherwise all this is pointless).
> 
>   current_locked = atomic_read(>locked_vm);
>   new_locked = current_locked + npages;
>   if (new_locked < lock_limit)
>  if (cmpxchg(>locked_vm, current_locked, new_locked) == 
> current_locked)
>/* ENOMEM */

Well it needs a loop..

again:
   current_locked = atomic_read(>locked_vm);
   new_locked = current_locked + npages;
   if (new_locked < lock_limit)
  if (cmpxchg(>locked_vm, current_locked, new_locked) != current_locked)
goto again;

So it won't have bogus failures as there is no unwind after
error. Basically this is a load locked/store conditional style of
locking pattern.

> > That's a good idea, and especially worth doing considering that an arbitrary
> > number of threads that charge a low amount of locked_vm can fail just 
> > because
> > one thread charges lots of it.
> 
> Yeah but the window for this is quite small, I doubt it would be a real issue.
> 
> What if before doing the atomic_add_return(), we first did the racy new_locked
> check for ENOMEM, then do the speculative add and cleanup, if necessary. This
> would further reduce the scope of the window where false ENOMEM can occur.
> 
> > pinned_vm appears to be broken the same way, so I can fix it too unless 
> > someone
> > beats me to it.
> 
> This should not be a surprise for the rdma folks. Cc'ing Jason nonetheless.

I think we accepted this tiny race as a side effect of removing the
lock, which was very beneficial. Really the time window between the
atomic failing and unwind is very small, and there are enough other
ways a hostile user could DOS locked_vm that I don't think it really
matters in practice..

However, the cmpxchg seems better, so a helper to implement that would
probably be the best thing to do.

Jason


Re: [RESEND 4/7] mm/gup: Add FOLL_LONGTERM capability to GUP fast

2019-03-25 Thread Jason Gunthorpe
On Mon, Mar 25, 2019 at 02:23:15AM -0700, Ira Weiny wrote:
> > > Unfortunately holding the lock is required to support FOLL_LONGTERM (to 
> > > check
> > > the VMAs) but we don't want to hold the lock to be optimal (specifically 
> > > allow
> > > FAULT_FOLL_ALLOW_RETRY).  So I'm maintaining the optimization for *_fast 
> > > users
> > > who do not specify FOLL_LONGTERM.
> > > 
> > > Another way to do this would have been to define __gup_longterm_unlocked 
> > > with
> > > the above logic, but that seemed overkill at this point.
> > 
> > get_user_pages_unlocked() is an exported symbol, shouldn't it work
> > with the FOLL_LONGTERM flag?
> > 
> > I think it should even though we have no user..
> > 
> > Otherwise the GUP API just gets more confusing.
> 
> I agree WRT to the API.  But I think callers of get_user_pages_unlocked() are
> not going to get the behavior they want if they specify FOLL_LONGTERM.

Oh? Isn't the only thing FOLL_LONGTERM does is block the call on DAX?
Why does the locking mode matter to this test?

> What I could do is BUG_ON (or just WARN_ON) if unlocked is called with
> FOLL_LONGTERM similar to the code in get_user_pages_locked() which does not
> allow locked and vmas to be passed together:

The GUP call should fail if you are doing something like this. But I'd
rather not see confusing specialc cases in code without a clear
comment explaining why it has to be there.

Jason


Re: [RESEND 4/7] mm/gup: Add FOLL_LONGTERM capability to GUP fast

2019-03-25 Thread Jason Gunthorpe
On Mon, Mar 25, 2019 at 01:42:26AM -0700, Ira Weiny wrote:
> On Fri, Mar 22, 2019 at 03:12:55PM -0700, Dan Williams wrote:
> > On Sun, Mar 17, 2019 at 7:36 PM  wrote:
> > >
> > > From: Ira Weiny 
> > >
> > > DAX pages were previously unprotected from longterm pins when users
> > > called get_user_pages_fast().
> > >
> > > Use the new FOLL_LONGTERM flag to check for DEVMAP pages and fall
> > > back to regular GUP processing if a DEVMAP page is encountered.
> > >
> > > Signed-off-by: Ira Weiny 
> > >  mm/gup.c | 29 +
> > >  1 file changed, 25 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 0684a9536207..173db0c44678 100644
> > > +++ b/mm/gup.c
> > > @@ -1600,6 +1600,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
> > > addr, unsigned long end,
> > > goto pte_unmap;
> > >
> > > if (pte_devmap(pte)) {
> > > +   if (unlikely(flags & FOLL_LONGTERM))
> > > +   goto pte_unmap;
> > > +
> > > pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
> > > if (unlikely(!pgmap)) {
> > > undo_dev_pagemap(nr, nr_start, pages);
> > > @@ -1739,8 +1742,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> > > unsigned long addr,
> > > if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
> > > return 0;
> > >
> > > -   if (pmd_devmap(orig))
> > > +   if (pmd_devmap(orig)) {
> > > +   if (unlikely(flags & FOLL_LONGTERM))
> > > +   return 0;
> > > return __gup_device_huge_pmd(orig, pmdp, addr, end, 
> > > pages, nr);
> > > +   }
> > >
> > > refs = 0;
> > > page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> > > @@ -1777,8 +1783,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> > > unsigned long addr,
> > > if (!pud_access_permitted(orig, flags & FOLL_WRITE))
> > > return 0;
> > >
> > > -   if (pud_devmap(orig))
> > > +   if (pud_devmap(orig)) {
> > > +   if (unlikely(flags & FOLL_LONGTERM))
> > > +   return 0;
> > > return __gup_device_huge_pud(orig, pudp, addr, end, 
> > > pages, nr);
> > > +   }
> > >
> > > refs = 0;
> > > page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> > > @@ -2066,8 +2075,20 @@ int get_user_pages_fast(unsigned long start, int 
> > > nr_pages,
> > > start += nr << PAGE_SHIFT;
> > > pages += nr;
> > >
> > > -   ret = get_user_pages_unlocked(start, nr_pages - nr, pages,
> > > - gup_flags);
> > > +   if (gup_flags & FOLL_LONGTERM) {
> > > +   down_read(>mm->mmap_sem);
> > > +   ret = __gup_longterm_locked(current, current->mm,
> > > +   start, nr_pages - nr,
> > > +   pages, NULL, 
> > > gup_flags);
> > > +   up_read(>mm->mmap_sem);
> > > +   } else {
> > > +   /*
> > > +* retain FAULT_FOLL_ALLOW_RETRY optimization if
> > > +* possible
> > > +*/
> > > +   ret = get_user_pages_unlocked(start, nr_pages - 
> > > nr,
> > > + pages, gup_flags);
> > 
> > I couldn't immediately grok why this path needs to branch on
> > FOLL_LONGTERM? Won't get_user_pages_unlocked(..., FOLL_LONGTERM) do
> > the right thing?
> 
> Unfortunately holding the lock is required to support FOLL_LONGTERM (to check
> the VMAs) but we don't want to hold the lock to be optimal (specifically allow
> FAULT_FOLL_ALLOW_RETRY).  So I'm maintaining the optimization for *_fast users
> who do not specify FOLL_LONGTERM.
> 
> Another way to do this would have been to define __gup_longterm_unlocked with
> the above logic, but that seemed overkill at this point.

get_user_pages_unlocked() is an exported symbol, shouldn't it work
with the FOLL_LONGTERM flag?

I think it should even though we have no user..

Otherwise the GUP API just gets more confusing.

Jason


Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages

2019-02-14 Thread Jason Gunthorpe
On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:

> > > > Really unclear how to fix this. The pinned/locked split with two
> > > > buckets may be the right way.
> > > 
> > > Are you suggesting that we have 2 user limits?
> > 
> > This is what RDMA has done since CL's patch.
> 
> I don't understand?  What is the other _user_ limit (other than
> RLIMIT_MEMLOCK)?

With todays implementation RLIMIT_MEMLOCK covers two user limits,
total number of pinned pages and total number of mlocked pages. The
two are different buckets and not summed.

Jason


Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages

2019-02-14 Thread Jason Gunthorpe
On Thu, Feb 14, 2019 at 11:33:53AM -0800, Ira Weiny wrote:

> > I think it had to do with double accounting pinned and mlocked pages
> > and thus delivering a lower than expected limit to userspace.
> > 
> > vfio has this bug, RDMA does not. RDMA has a bug where it can
> > overallocate locked memory, vfio doesn't.
> 
> Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?

Yes
 
> I think the problem is that if the user calls mlock on a large range then both
> vfio and RDMA could potentially overallocate even with this fix.  This was 
> your
> initial email to Daniel, I think...  And Alex's concern.

Here are the possibilities
- mlock and pin on the same pages - RDMA respects the limit, VFIO halfs it.
- mlock and pin on different pages - RDMA doubles the limit, VFIO
  respects it
- VFIO and RDMA in the same process, the limit is halfed or doubled, depending.

IHMO we should make VFIO & RDMA the same, and then decide what to do
about case #2.

> > Really unclear how to fix this. The pinned/locked split with two
> > buckets may be the right way.
> 
> Are you suggesting that we have 2 user limits?

This is what RDMA has done since CL's patch.

It is very hard to fix as you need to track how many pages are mlocked
*AND* pinned.

Jason


Re: [PATCH V2 3/7] mm/gup: Change GUP fast to use flags rather than a write 'bool'

2019-02-13 Thread Jason Gunthorpe
On Wed, Feb 13, 2019 at 03:04:51PM -0800, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> To facilitate additional options to get_user_pages_fast() change the
> singular write parameter to be gup_flags.

So now we have:

long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);

and 

int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)

Does this make any sense? At least the arguments should be in the same
order, I think.

Also this comment:
/*
 * get_user_pages_unlocked() is suitable to replace the form:
 *
 *  down_read(>mmap_sem);
 *  get_user_pages(tsk, mm, ..., pages, NULL);
 *  up_read(>mmap_sem);
 *
 *  with:
 *
 *  get_user_pages_unlocked(tsk, mm, ..., pages);
 *
 * It is functionally equivalent to get_user_pages_fast so
 * get_user_pages_fast should be used instead if specific gup_flags
 * (e.g. FOLL_FORCE) are not required.
 */

Needs some attention as the recommendation is now nonsense.

Honestly a proper explanation of why two functions exist would be
great at this point :)

Jason


Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-13 Thread Jason Gunthorpe
On Wed, Feb 13, 2019 at 01:03:30PM -0700, Alex Williamson wrote:
> > PeterZ posted an RFC that addresses this point[1].  It kept pinned_vm and
> > locked_vm accounting separate, but allowed the two to be added safely to be
> > compared against RLIMIT_MEMLOCK.
> 
> Unless I'm incorrect in the concerns above, I don't see how we can
> convert vfio before this occurs.

RDMA was converted to this pinned_vm scheme a long time ago, arguably
it is a mistake that VFIO did something different... This was to fix
some other bug where reporting of pages was wrong.

You are not wrong that this approach doesn't entirely make sense
though. :)

Jason


Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Jason Gunthorpe
On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:
> Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> pages"), locked and pinned pages are accounted separately.  Type1
> accounts pinned pages to locked_vm; use pinned_vm instead.
> 
> pinned_vm recently became atomic and so no longer relies on mmap_sem
> held as writer: delete.
> 
> Signed-off-by: Daniel Jordan 
>  drivers/vfio/vfio_iommu_type1.c | 31 ---
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 73652e21efec..a56cc341813f 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -257,7 +257,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, 
> struct vfio_pfn *vpfn)
>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  {
>   struct mm_struct *mm;
> - int ret;
> + s64 pinned_vm;
> + int ret = 0;
>  
>   if (!npage)
>   return 0;
> @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
> npage, bool async)
>   if (!mm)
>   return -ESRCH; /* process exited */
>  
> - ret = down_write_killable(>mmap_sem);
> - if (!ret) {
> - if (npage > 0) {
> - if (!dma->lock_cap) {
> - unsigned long limit;
> -
> - limit = task_rlimit(dma->task,
> - RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + pinned_vm = atomic64_add_return(npage, >pinned_vm);
>  
> - if (mm->locked_vm + npage > limit)
> - ret = -ENOMEM;
> - }
> + if (npage > 0 && !dma->lock_cap) {
> + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> +
> - PAGE_SHIFT;

I haven't looked at this super closely, but how does this stuff work?

do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...

Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?

Otherwise MEMLOCK is really doubled..

Jason


Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Jason Gunthorpe
On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> Hi,
> 
> This series converts users that account pinned pages with locked_vm to
> account with pinned_vm instead, pinned_vm being the correct counter to
> use.  It's based on a similar patch I posted recently[0].
> 
> The patches are based on rdma/for-next to build on Davidlohr Bueso's
> recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
> sense for these to be routed the same way, despite lack of rdma content?

Oy.. I'd be willing to accumulate a branch with acks to send to Linus
*separately* from RDMA to Linus, but this is very abnormal.

Better to wait a few weeks for -rc1 and send patches through the
subsystem trees.

> All five of these places, and probably some of Davidlohr's conversions,
> probably want to be collapsed into a common helper in the core mm for
> accounting pinned pages.  I tried, and there are several details that
> likely need discussion, so this can be done as a follow-on.

I've wondered the same..

Jason


Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-09 Thread Jason Gunthorpe
On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:

> > POWER 8 firmware is good? If the link does eventually come back, is
> > the POWER8's D3 resumption timeout long enough?
> > 
> > If this doesn't lead to an obvious conclusion you'll probably need to
> > connect to IBM's Mellanox support team to get more information from
> > the card side.
> 
> We are IBM :-) So far, it seems to be that the card is doing something
> not quite right, but we don't know what. We might need to engage
> Mellanox themselves.

Sorry, it was unclear, I ment the support team for IBM inside Mellanox
..

There might be internal debugging available that can show if the card
is detecting the beacon, how far it gets in renegotiation, etc.

>From all the mails it really has the feel of a PCI-E interop problem between
these two specific chips..

Jason


Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-07 Thread Jason Gunthorpe
On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> > 
> > > Interesting.  I've investigated this further, though I don't have as
> > > many new clues as I'd like.  The problem occurs reliably, at least on
> > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > > I don't yet know if it occurs with other machines, I'm having trouble
> > > getting access to other machines with a suitable card.  I didn't
> > > manage to reproduce it on a different POWER8 machine with a
> > > ConnectX-5, but I don't know if it's the difference in machine or
> > > difference in card revision that's important.
> > 
> > Make sure the card has the latest firmware is always good advice..
> > 
> > > So possibilities that occur to me:
> > >   * It's something specific about how the vfio-pci driver uses D3
> > > state - have you tried rebinding your device to vfio-pci?
> > >   * It's something specific about POWER, either the kernel or the PCI
> > > bridge hardware
> > >   * It's something specific about this particular type of machine
> > 
> > Does the EEH indicate what happend to actually trigger it?
> 
> In a very cryptic way that requires manual parsing using non-public
> docs sadly but yes. From the look of it, it's a completion timeout.
> 
> Looks to me like we don't get a response to a config space access
> during the change of D state. I don't know if it's the write of the D3
> state itself or the read back though (it's probably detected on the
> read back or a subsequent read, but that doesn't tell me which specific
> one failed).

If it is just one card doing it (again, check you have latest
firmware) I wonder if it is a sketchy PCI-E electrical link that is
causing a long re-training cycle? Can you tell if the PCI-E link is
permanently gone or does it eventually return?

Does the card work in Gen 3 when it starts? Is there any indication of
PCI-E link errors?

Everytime or sometimes?

POWER 8 firmware is good? If the link does eventually come back, is
the POWER8's D3 resumption timeout long enough?

If this doesn't lead to an obvious conclusion you'll probably need to
connect to IBM's Mellanox support team to get more information from
the card side.

Jason


Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-05 Thread Jason Gunthorpe
On Fri, Jan 04, 2019 at 02:44:01PM +1100, David Gibson wrote:
> On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> > On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> > > unbound from their regular driver and attached to vfio-pci in order to 
> > > pass
> > > them through to a guest.
> > >
> > > This goes away if the disable_idle_d3 option is used, so it looks like a
> > > problem with the hardware handling D3 state.  To fix that more 
> > > permanently,
> > > use a device quirk to disable D3 state for these devices.
> > >
> > > We do this by renaming the existing quirk_no_ata_d3() more generally and
> > > attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
> > >
> > > Signed-off-by: David Gibson 
> > >  drivers/pci/quirks.c | 17 +++--
> > >  1 file changed, 11 insertions(+), 6 deletions(-)
> > >
> > 
> > Hi David,
> > 
> > Thank for your patch,
> > 
> > I would like to reproduce the calltrace before moving forward,
> > but have trouble to reproduce the original issue.
> > 
> > I'm working with vfio-pci and CX-4/5 cards on daily basis,
> > tried manually enter into D3 state now, and it worked for me.
> 
> Interesting.  I've investigated this further, though I don't have as
> many new clues as I'd like.  The problem occurs reliably, at least on
> one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> I don't yet know if it occurs with other machines, I'm having trouble
> getting access to other machines with a suitable card.  I didn't
> manage to reproduce it on a different POWER8 machine with a
> ConnectX-5, but I don't know if it's the difference in machine or
> difference in card revision that's important.

Make sure the card has the latest firmware is always good advice..

> So possibilities that occur to me:
>   * It's something specific about how the vfio-pci driver uses D3
> state - have you tried rebinding your device to vfio-pci?
>   * It's something specific about POWER, either the kernel or the PCI
> bridge hardware
>   * It's something specific about this particular type of machine

Does the EEH indicate what happend to actually trigger it?

Jason


Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-24 Thread Jason Gunthorpe
On Mon, Sep 24, 2018 at 10:18:52PM +0200, Arnd Bergmann wrote:
> On Tue, Sep 18, 2018 at 7:59 PM Jason Gunthorpe  wrote:
> >
> > On Tue, Sep 18, 2018 at 10:51:08AM -0700, Darren Hart wrote:
> > > On Fri, Sep 14, 2018 at 09:57:48PM +0100, Al Viro wrote:
> > > > On Fri, Sep 14, 2018 at 01:35:06PM -0700, Darren Hart wrote:
> > > >
> > > > > Acked-by: Darren Hart (VMware) 
> > > > >
> > > > > As for a longer term solution, would it be possible to init fops in 
> > > > > such
> > > > > a way that the compat_ioctl call defaults to 
> > > > > generic_compat_ioctl_ptrarg
> > > > > so we don't have to duplicate this boilerplate for every ioctl fops
> > > > > structure?
> > > >
> > > > Bad idea, that...  Because several years down the road somebody 
> > > > will add
> > > > an ioctl that takes an unsigned int for argument.  Without so much as 
> > > > looking
> > > > at your magical mystery macro being used to initialize file_operations.
> > >
> > > Fair, being explicit in the declaration as it is currently may be
> > > preferable then.
> >
> > It would be much cleaner and safer if you could arrange things to add
> > something like this to struct file_operations:
> >
> >   long (*ptr_ioctl) (struct file *, unsigned int, void __user *);
> >
> > Where the core code automatically converts the unsigned long to the
> > void __user * as appropriate.
> >
> > Then it just works right always and the compiler will help address
> > Al's concern down the road.
> 
> I think if we wanted to do this with a new file operation, the best
> way would be to do the copy_from_user()/copy_to_user() in the caller
> as well.
>
> We already do this inside of some subsystems, notably drivers/media/,
> and it simplifies the implementation of the ioctl handler function
> significantly. We obviously cannot do this in general, both because of
> traditional drivers that have 16-bit command codes (drivers/tty and others)
> and also because of drivers that by accident defined the commands
> incorrectly and use the wrong type or the wrong direction in the
> definition.

That could work well, but the first idea could be done globally and
mechanically, while this would require very careful per-driver
investigation. 

Particularly if the core code has worse performance.. ie due to
kmalloc calls or something.

I think it would make more sense to start by having the core do the
case to __user and then add another entry point to have the core do
the copy_from_user, and so on.

Jason


Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-18 Thread Jason Gunthorpe
On Tue, Sep 18, 2018 at 10:51:08AM -0700, Darren Hart wrote:
> On Fri, Sep 14, 2018 at 09:57:48PM +0100, Al Viro wrote:
> > On Fri, Sep 14, 2018 at 01:35:06PM -0700, Darren Hart wrote:
> >  
> > > Acked-by: Darren Hart (VMware) 
> > > 
> > > As for a longer term solution, would it be possible to init fops in such
> > > a way that the compat_ioctl call defaults to generic_compat_ioctl_ptrarg
> > > so we don't have to duplicate this boilerplate for every ioctl fops
> > > structure?
> > 
> > Bad idea, that...  Because several years down the road somebody will add
> > an ioctl that takes an unsigned int for argument.  Without so much as 
> > looking
> > at your magical mystery macro being used to initialize file_operations.
> 
> Fair, being explicit in the declaration as it is currently may be
> preferable then.

It would be much cleaner and safer if you could arrange things to add
something like this to struct file_operations:

  long (*ptr_ioctl) (struct file *, unsigned int, void __user *);

Where the core code automatically converts the unsigned long to the
void __user * as appropriate.

Then it just works right always and the compiler will help address
Al's concern down the road.

Cheers,
Jason


Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-12 Thread Jason Gunthorpe
On Wed, Sep 12, 2018 at 05:08:52PM +0200, Arnd Bergmann wrote:
> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
> 
> One exception is the s390 architecture, where a compat pointer is only
> 31 bit wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will ever run in s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
> 
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
> 
> Signed-off-by: Arnd Bergmann 
>  drivers/android/binder.c| 2 +-
>  drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 +-
>  drivers/dma-buf/dma-buf.c   | 4 +---
>  drivers/dma-buf/sw_sync.c   | 2 +-
>  drivers/dma-buf/sync_file.c | 2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c| 2 +-
>  drivers/hid/hidraw.c| 4 +---
>  drivers/iio/industrialio-core.c | 2 +-
>  drivers/infiniband/core/uverbs_main.c   | 4 ++--
>  drivers/media/rc/lirc_dev.c | 4 +---
>  drivers/mfd/cros_ec_dev.c   | 4 +---
>  drivers/misc/vmw_vmci/vmci_host.c   | 2 +-
>  drivers/nvdimm/bus.c| 4 ++--
>  drivers/nvme/host/core.c| 2 +-
>  drivers/pci/switch/switchtec.c  | 2 +-
>  drivers/platform/x86/wmi.c  | 2 +-
>  drivers/rpmsg/rpmsg_char.c  | 4 ++--
>  drivers/sbus/char/display7seg.c | 2 +-
>  drivers/sbus/char/envctrl.c | 4 +---
>  drivers/scsi/3w-.c  | 4 +---
>  drivers/scsi/cxlflash/main.c| 2 +-
>  drivers/scsi/esas2r/esas2r_main.c   | 2 +-
>  drivers/scsi/pmcraid.c  | 4 +---
>  drivers/staging/android/ion/ion.c   | 4 +---
>  drivers/staging/vme/devices/vme_user.c  | 2 +-
>  drivers/tee/tee_core.c  | 2 +-
>  drivers/usb/class/cdc-wdm.c | 2 +-
>  drivers/usb/class/usbtmc.c  | 4 +---
>  drivers/video/fbdev/ps3fb.c | 2 +-
>  drivers/virt/fsl_hypervisor.c   | 2 +-
>  fs/btrfs/super.c| 2 +-
>  fs/ceph/dir.c   | 2 +-
>  fs/ceph/file.c  | 2 +-
>  fs/fuse/dev.c   | 2 +-
>  fs/notify/fanotify/fanotify_user.c  | 2 +-
>  fs/userfaultfd.c| 2 +-
>  net/rfkill/core.c   | 2 +-
>  37 files changed, 40 insertions(+), 58 deletions(-)

> diff --git a/drivers/infiniband/core/uverbs_main.c 
> b/drivers/infiniband/core/uverbs_main.c
> index 823beca448e1..f4755c1c9cfa 100644
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -930,7 +930,7 @@ static const struct file_operations uverbs_fops = {
>   .release = ib_uverbs_close,
>   .llseek  = no_llseek,
>   .unlocked_ioctl = ib_uverbs_ioctl,
> - .compat_ioctl = ib_uverbs_ioctl,
> + .compat_ioctl = generic_compat_ioctl_ptrarg,
>  };
>  
>  static const struct file_operations uverbs_mmap_fops = {
> @@ -941,7 +941,7 @@ static const struct file_operations uverbs_mmap_fops = {
>   .release = ib_uverbs_close,
>   .llseek  = no_llseek,
>   .unlocked_ioctl = ib_uverbs_ioctl,
> - .compat_ioctl = ib_uverbs_ioctl,
> + .compat_ioctl = generic_compat_ioctl_ptrarg,
>  };
>  
>  static struct ib_client uverbs_client = {

For uverbs:

Acked-by: Jason Gunthorpe 

It is very strange, this patch did not appear in the RDMA patchworks,
I almost missed it  :|

Jason


Re: [PATCH 0/3] tty: hvc: latency break regression fixes

2018-09-05 Thread Jason Gunthorpe
On Wed, Sep 05, 2018 at 10:14:36PM +1000, Nicholas Piggin wrote:
> Hi Greg,
> 
> Here are fixes for a few regressions that came in with my
> carelessness with the irq latency work for the hvc subsystem.
> These were independently reported in 2 configurations, and I
> confirmed with another.
> 
> I think those went upstream via Michael's tree, but he's away
> at the moment so if you would be able to consider them for
> the tty tree that would be appreciated.

Series works for me too, thanks.

Tested-by: Jason Gunthorpe 

Jason


Re: Regression from patch 'tty: hvc: hvc_poll() break hv read loop'

2018-09-04 Thread Jason Gunthorpe
On Wed, Sep 05, 2018 at 07:15:29AM +1000, Nicholas Piggin wrote:
> On Tue, 4 Sep 2018 11:48:08 -0600
> Jason Gunthorpe  wrote:
> 
> > Hi Nicholas,
> > 
> > I am testing 4.19-rc2 and I see bad behavior with my qemu hvc0
> > console..
> > 
> > Running interactive with qemu (qemu-2.11.2-1.fc28) on the console
> > providing hvc0, using options like:
> > 
> > -nographic
> > -chardev stdio,id=stdio,mux=on,signal=off
> > -mon chardev=stdio
> > -device isa-serial,chardev=stdio
> > -device virtio-serial-pci
> > -device virtconsole,chardev=stdio
> > 
> > I see the hvc0 console hang regularly, ie doing something like 'up
> > arrow' in bash causes the hvc0 console to hang. Prior kernels worked
> > OK.
> > 
> > Any ideas? I'm not familiar with this code.. Thanks!
> 
> Yes I have had another report, I'm working on a fix. Sorry it has taken
> a while and thank you for the report.

Okay, let me know when you have a fix and I will be able to test it
for you!

Thanks,
Jason


Regression from patch 'tty: hvc: hvc_poll() break hv read loop'

2018-09-04 Thread Jason Gunthorpe
Hi Nicholas,

I am testing 4.19-rc2 and I see bad behavior with my qemu hvc0
console..

Running interactive with qemu (qemu-2.11.2-1.fc28) on the console
providing hvc0, using options like:

-nographic
-chardev stdio,id=stdio,mux=on,signal=off
-mon chardev=stdio
-device isa-serial,chardev=stdio
-device virtio-serial-pci
-device virtconsole,chardev=stdio

I see the hvc0 console hang regularly, ie doing something like 'up
arrow' in bash causes the hvc0 console to hang. Prior kernels worked
OK.

Any ideas? I'm not familiar with this code.. Thanks!

git bisect says this patch is to blame:

# bad: [5b394b2ddf0347bef56e50c69a58773c94343ff3] Linux 4.19-rc1
# good: [845b397a7771f2d3504beff5521f452be0d22eec] IB/ucm: fix UCM link error
# bad: [60c1f89241d49bacf71035470684a8d7b4bb46ea] Merge tag 
'dma-mapping-4.19-2' of git://git.infradead.org/users/hch/dma-mapping
# bad: [1290290c922fdcefbce8984e6e44b8f4e3a169b5] Merge tag 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
# bad: [6eaac34ff30e189fda28110298ca9fbfb2f51e28] Merge tag 
'linux-watchdog-4.19-rc1' of git://www.linux-watchdog.org/linux-watchdog
# good: [e4f6a44c4aeca9eda153302abb0c14d053914f72] staging:rtl8192u: Remove 
unused macro definitions - Style
# bad: [5695d5d1970f975de059bb6dec76941440f62488] Merge tag 'usb-4.19-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
# bad: [f80a71b0c4111e26433744721ad68f05c169ad39] Merge tag 
'drm-next-2018-08-17' of git://anongit.freedesktop.org/drm/drm
# good: [cf175dc315f90185128fb061dc05b6fbb211aa2f] powerpc/64: Disable the 
speculation barrier from the command line
# bad: [a2dc009afa9ae8b92305be7728676562a104cb40] powerpc/mm/book3s/radix: Add 
mapping statistics
# good: [d6690b1a9b0dd95a6fbd166d9657e0cc80afbf99] powerpc: Allow CPU selection 
of e300core variants
# good: [f2c6d0d1092e3da2bd36c768777e883fa3908548] powerpc/32: Include setup.h 
header file to fix warnings
# good: [b3124ec2f9970c7374d34b00843d9791fca66afc] Merge branch 'fixes' into 
next
# bad: [70b5c4ee8e3bf7ce107d6926d9c7d8ebb68578c2] Merge branch 'topic/hvc' into 
next
# bad: [cfb5946b55f1dfd19e042feae1fbff6041e25a98] tty: hvc: hvc_poll() may sleep
# bad: [ec97eaad1383ab2500fcf9a07ade6044fbcc67f5] tty: hvc: hvc_poll() break hv 
read loop
# good: [a9bf5c8a271b9a954709b7ada1bd258f5cadf7ff] tty: hvc: use mutex instead 
of spinlock for hvc_structs lock
# first bad commit: [ec97eaad1383ab2500fcf9a07ade6044fbcc67f5] tty: hvc: 
hvc_poll() break hv read loop

>From ec97eaad1383ab2500fcf9a07ade6044fbcc67f5 Mon Sep 17 00:00:00 2001
From: Nicholas Piggin 
Date: Tue, 1 May 2018 00:55:54 +1000
Subject: [PATCH] tty: hvc: hvc_poll() break hv read loop

Avoid looping with the spinlock held while there is read data
being returned from the hv driver. Instead note if the entire
size returned by tty_buffer_request_room was read, and request
another read poll.

This limits the critical section lengths, and provides more
even service to other consoles in case there is a pathological
condition.

Signed-off-by: Nicholas Piggin 
Signed-off-by: Michael Ellerman 
---
 drivers/tty/hvc/hvc_console.c | 88 ++-
 1 file changed, 45 insertions(+), 43 deletions(-)

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index fddb63322c6786..745ac220fce84c 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -592,7 +592,7 @@ static u32 timeout = MIN_TIMEOUT;
 int hvc_poll(struct hvc_struct *hp)
 {
struct tty_struct *tty;
-   int i, n, poll_mask = 0;
+   int i, n, count, poll_mask = 0;
char buf[N_INBUF] __ALIGNED__;
unsigned long flags;
int read_total = 0;
@@ -618,7 +618,7 @@ int hvc_poll(struct hvc_struct *hp)
 
/* Now check if we can get data (are we throttled ?) */
if (tty_throttled(tty))
-   goto throttled;
+   goto out;
 
/* If we aren't notifier driven and aren't throttled, we always
 * request a reschedule
@@ -627,56 +627,58 @@ int hvc_poll(struct hvc_struct *hp)
poll_mask |= HVC_POLL_READ;
 
/* Read data if any */
-   for (;;) {
-   int count = tty_buffer_request_room(>port, N_INBUF);
 
-   /* If flip is full, just reschedule a later read */
-   if (count == 0) {
+   count = tty_buffer_request_room(>port, N_INBUF);
+
+   /* If flip is full, just reschedule a later read */
+   if (count == 0) {
+   poll_mask |= HVC_POLL_READ;
+   goto out;
+   }
+
+   n = hp->ops->get_chars(hp->vtermno, buf, count);
+   if (n <= 0) {
+   /* Hangup the tty when disconnected from host */
+   if (n == -EPIPE) {
+   spin_unlock_irqrestore(>lock, flags);
+   tty_hangup(tty);
+   spin_lock_irqsave(>lock, flags);
+   } else if ( n == 

Re: RFC on writel and writel_relaxed

2018-03-29 Thread Jason Gunthorpe
On Thu, Mar 29, 2018 at 02:58:34PM +, David Laight wrote:
> From: Jason Gunthorpe
> > Sent: 29 March 2018 15:45
> ...
> > > > When talking about ordering between the devices, the relevant question
> > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > > from the DEVICE_FOO. 'ordered' means that in this case
> > > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > > by BAR.
> > >
> > > Yes, and that isn't the case for arm because the writes can still be
> > > buffered.
> > 
> > The statement is not about buffering, or temporal completion order, or
> > the order of acks returning to the CPU. It is about pure transaction
> > ordering inside the interconnect.
> > 
> > Can write BAR -> FOO pass write CPU -> FOO?
> 
> Almost certainly.
> The first cpu write can almost certainly be 'stalled' at the shared PCIe 
> bridge.
> The second cpu write then completes (to a different target).
> That target then issues a peer to peer transfer that reaches the shared 
> bridge.
> I doubt the order of the transactions is guaranteed when it becomes 
> 'un-stalled'.

The PCI spec has very strong wording on ordering that covers this
case. Stalled bridges have to follow the ordering rules, and posted
writes cannot pass other posted writes.

Since in PCI all three transactions:
 CPU -> FOO
 CPU -> BAR
 BAR -> FOO

Must traverse a shared bus segment, they must be placed on that bus in
the above order, and the bridge(s) toward FOO must preserve this
order.

ARM's AXI has similar rules, I just can't recall the tiny details
right now :)

> Of course, these are peer to peer transfers, and strange ones at that.
> Normally you'd not be doing peer to peer transfers that access 'memory'
> the cpu has just written to.

It is the best situation I can think of where order of completion to
different devices would matter to a generic Linux driver..

.. And there are patches circulating right now for NVMe that enable
exactly this kind of transfer, and rely on these kind of semantics, so
it is a relevant detail :)

Jason


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Jason Gunthorpe
On Thu, Mar 29, 2018 at 10:19:41AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > > powerpc and ARM can't quite make them synchronous I think, but at 
> > > > > > least
> > > > > > they should have the same semantics as writel.
> > > > > 
> > > > > One thing that ARM does IIRC is that it only guarantees to order 
> > > > > writel() within
> > > > > one device, and the memory mapped PCI I/O space window almost 
> > > > > certainly
> > > > > counts as a separate device to the CPU.
> > > > 
> > > > That sounds bogus.
> > > 
> > > To elaborate, if you do the following on arm:
> > > 
> > >   writel(DEVICE_FOO);
> > >   writel(DEVICE_BAR);
> > > 
> > > we generally cannot guarantee in which order those accesses will hit the
> > > devices even if we add every barrier under the sun. You'd need something
> > > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > > the first write out. This doesn't sound like it would be that uncommon to
> > > me.
> > 
> > The PCI posted write does not require the above to execute 'in order'
> > only that any bus segment shared by the two devices have the writes
> > issued in CPU order. ie at a shared PCI root port for instance.
> > 
> > If I recall this is very similar to the ordering that ARM's on-chip
> > AXI interconnect is supposed to provide.. So I'd be very surprised if
> > a modern ARM64 has an meaningful difference from x86 here.
> 
> From the architectural perspective, writes to different "peripherals" are
> not ordered with respect to each other. The first writel will complete once
> it gets its write acknowledgement, but this may not necessarily come from
> the endpoint -- it could come from an intermediate buffer past the point of
> serialisation (i.e. the write will then be ordered with respect to other
> accesses to that same endpoint). The PCI root port would look like one
> peripheral here.

That is basically the same as PCI - PCI has no write ACK, so all
writes are buffered by the PCI interconnect and complete in some
undefined temporal order when multiple end points are involved.

This does not seem very different from what happens in x86..

> > When talking about ordering between the devices, the relevant question
> > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > from the DEVICE_FOO. 'ordered' means that in this case
> > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > by BAR.
> 
> Yes, and that isn't the case for arm because the writes can still be
> buffered.

The statement is not about buffering, or temporal completion order, or
the order of acks returning to the CPU. It is about pure transaction
ordering inside the interconnect.

Can write BAR -> FOO pass write CPU -> FOO?

Jason


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Jason Gunthorpe
On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > they should have the same semantics as writel.
> > > 
> > > One thing that ARM does IIRC is that it only guarantees to order writel() 
> > > within
> > > one device, and the memory mapped PCI I/O space window almost certainly
> > > counts as a separate device to the CPU.
> > 
> > That sounds bogus.
> 
> To elaborate, if you do the following on arm:
> 
>   writel(DEVICE_FOO);
>   writel(DEVICE_BAR);
> 
> we generally cannot guarantee in which order those accesses will hit the
> devices even if we add every barrier under the sun. You'd need something
> in between, specific to DEVICE_FOO (probably a read-back) to really push
> the first write out. This doesn't sound like it would be that uncommon to
> me.

The PCI posted write does not require the above to execute 'in order'
only that any bus segment shared by the two devices have the writes
issued in CPU order. ie at a shared PCI root port for instance.

If I recall this is very similar to the ordering that ARM's on-chip
AXI interconnect is supposed to provide.. So I'd be very surprised if
a modern ARM64 has an meaningful difference from x86 here.

When talking about ordering between the devices, the relevant question
is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
from the DEVICE_FOO. 'ordered' means that in this case
writel(DEVICE_FOO) must be presented to FOO before anything generated
by BAR.

Jason


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 10:42:00AM +0100, Will Deacon wrote:
> On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <j...@ziepe.ca> wrote:
> > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > >
> > > I even see patches adding wmb() based on actual observed memory
> > > corruption during testing on Intel:
> > >
> > > https://patchwork.kernel.org/patch/10177207/
> > >
> > > So you think all of this is unnecessary and writel is totally strongly
> > > ordered, even on multi-socket Intel?
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> If this is just about ordering accesses to coherent DMA, then using
> dma_wmb() instead will be much better performance on arm/arm64.

dma_wmb() is a NOP on x86, it was tested anyhow and didn't help
this case.. Confusing, but probably not relevant to this discussion.

Jason


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <j...@ziepe.ca> wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> >
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> >
> > https://patchwork.kernel.org/patch/10177207/
> >
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there.

Ah! So it is, too many things called 'db' in that driver.. One of the
'db' is also WC BAR memory..

Jason


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 08:22:55AM -0400, ok...@codeaurora.org wrote:
> On 2018-03-27 07:23, Benjamin Herrenschmidt wrote:
> >On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
> >>> The interesting thing is that we do seem to have a whole LOT of these
> >>> spurrious wmb before writel all over the tree, I suspect because of
> >>> that incorrect recommendation in memory-barriers.txt.
> >>>
> >>> We should fix that.
> >>
> >>Maybe the problem is just that it's so counter-intuitive that we don't
> >>need that barrier in Linux, when the hardware does need one on some
> >>architectures.
> >>
> >>How about we define a barrier type instruction specifically for this
> >>purpose, something like wmb_before_mmio() and have all architectures
> >>define that to an empty macro?
> >
> >This is exactly what wmb() is about and exactly what Linux rejected
> >back in the day (and in hindsight I agree with him).
> >
> >>That way, having correct code using wmb_before_mmio() will not
> >>trigger an incorrect review comment that leads to extra wmb(). ;-)
> >
> >Ah, you mean have an empty macro that will always be empty on all
> >architectures just to fool people ? :-)
> >
> >Not sure that will fly ... I think we just need to be documenting that
> >stuff better and not have incorrect examples. Also a sweep to remove
> >some useless ones like the one in e1000e would help.
> 
> I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches)
>
> I will have to just remove the wmb and keep writel, then repost.

Okay, but before you do that, can we get a statement how this works
for WC?

Some of these writels are to WC memory, do they need the wmb()?!?

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 10:59:40AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote:
> > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > > > Otherwise almost all drivers out there are broken which I very much
> > > > > doubt :-)
> > > > 
> > > > But.. Sinan is right, you look anywhere in the driver tree and you
> > > > find stuff like this:
> > > > 
> > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > > > 
> > > > /* Force memory writes to complete before letting h/w
> > > >  * know there are new descriptors to fetch.
> > > >  */
> > > > wmb();
> > > > 
> > > > 
> > > > It is *systemic*
> > > 
> > > Yes, because they all copied e1000e :-) If you look at the comment in
> > > there, it does say it's only for weakly ordered archs such as ia64, and
> > > even then, probably predates Linus strong statement on the matter.
> > 
> > Hahah, sure I'll buy that..
> > 
> > But still, if this really is the case, a *strong* statement in
> > barriers.txt to that effect (and not an example demanding the wmb()!)
> > would be very helpful for those of us that have to review driver code!
> 
> I agree, and that Mellanox bug you pointed me to seems to indicate that
> this may not even be true on x86 anymore ...

However, with bugs like that it is hard to know what is going on.. It
could be a CPU bug instead.

> I think we might need to revisit this properly...

I would love to hear a definitive statement from Intel on what wmb();
writel(); does on x86..

Sinan's patches are backwards if writel is ordered, instead of
using writel_relaxed, they should be eliminating the wmb().

But there is no way patches like that could go ahead until
barriers.txt is updated..

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > Otherwise almost all drivers out there are broken which I very much
> > > doubt :-)
> > 
> > But.. Sinan is right, you look anywhere in the driver tree and you
> > find stuff like this:
> > 
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > 
> > /* Force memory writes to complete before letting h/w
> >  * know there are new descriptors to fetch.
> >  */
> > wmb();
> > 
> > 
> > It is *systemic*
> 
> Yes, because they all copied e1000e :-) If you look at the comment in
> there, it does say it's only for weakly ordered archs such as ia64, and
> even then, probably predates Linus strong statement on the matter.

Hahah, sure I'll buy that..

But still, if this really is the case, a *strong* statement in
barriers.txt to that effect (and not an example demanding the wmb()!)
would be very helpful for those of us that have to review driver code!

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > > But that was never a requirement of writel(),
> > > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > > the wmb() before writel() for ordering system memory against writel.
> > > 
> > > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > > Adding Alexander Duyck to Cc, he added that section as part of
> > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > > dma_wmb()"). Also adding the other people that were involved with that.
> > > 
> > 
> > ARM developers can get away with not including wmb() in their code and use
> > writel() to observe memory writes due to implicit barriers.
> > 
> > However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.
> 
> > writel() has a compiler barrier in it for x86.
> > wmb() has a sync operation in it for x86. 
> > 
> > Unless wmb() is called, PCIe device won't observe memory updates from the 
> > CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 
> Otherwise almost all drivers out there are broken which I very much
> doubt :-)

But.. Sinan is right, you look anywhere in the driver tree and you
find stuff like this:

drivers/net/ethernet/intel/i40e/i40e_txrx.c

/* Force memory writes to complete before letting h/w
 * know there are new descriptors to fetch.
 */
wmb();


It is *systemic*

I even see patches adding wmb() based on actual observed memory
corruption during testing on Intel:

https://patchwork.kernel.org/patch/10177207/

So you think all of this is unnecessary and writel is totally strongly
ordered, even on multi-socket Intel?

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <j...@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <j...@ziepe.ca> wrote:
> >> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
> >> >> > > This is a super performance critical operation for most drivers and
> >> >> > > directly impacts network performance.
> >> >>
> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> >> any barriers at all.
> >> >> This might mean that they are always just the memory operation,
> >> >> but it would make it more obvious what the driver was doing.
> >> >
> >> > I think that is what writel_relaxed is supposed to be.
> >> >
> >> > The only restriction it has is that the writes to a single device
> >> > using UC memory must be kept in program order..
> >>
> >> Not sure about whether we have ever defined what happens to
> >> writel_relaxed() on WC memory though: On ARM, we disallow
> >> the compiler to combine writes, but the CPU still might.
> >
> > If the driver uses WC memory then I think it should not expect
> > anything in terms of how writes map to TLPs other than nothing
> > combines across mmiowb() and mmiowb() is fully globally ordered when
> > enclosed in a spinlock.
> >
> > The entire point of using WC memory is usually to get combining :) If
> > the driver doesn't want that then it should map UC..
> 
> Usually, WC memory is used with memcpy_toio() though, which
> by definition doesn't have any barriers between accesses, and
> is required to get the correct byte ordering on writes to memory buffers.

memcpy_toio is too expensive to actually use for anything performance
though. It is too pessimistic. What the drivers usually want is a
unwound block of 4 or 8 8-byte copies. No function calls, no
branching. Everything is already known to be aligned.

Most of the drivers have a unwound loop with writeq() or something to
do it.

> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
> I was always puzzled by this: The intention of _relaxed() on ARM
> (where it originates) was to skip the barrier that serializes DMA
> with MMIO, not to skip the serialization between MMIO and locks.

But that was never a requirement of writel(),
Documentation/memory-barriers.txt gives an explicit example demanding
the wmb() before writel() for ordering system memory against writel.

I actually have no idea why ARM had that barrier, I always assumed it
was to give program ordering to the accesses and that _relaxed allowed
re-ordering (the usual meaning of relaxed)..

But the barrier document makes it pretty clear that the only
difference between the two is spinlock containment, and WillD wrote
this text, so I belive it is accurate for ARM.

Very confusing.

> I never fully understood the part about the locks, but from what
> I remember, ARM is still serialized without the barrier here, but
> dropping the barrier on powerpc writel_relaxed() would not
> serialize against locks or DMA.

WC is usually the problem here.. I've been told it is necessary on ARM
as well..

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <j...@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
> >> > > This is a super performance critical operation for most drivers and
> >> > > directly impacts network performance.
> >>
> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> any barriers at all.
> >> This might mean that they are always just the memory operation,
> >> but it would make it more obvious what the driver was doing.
> >
> > I think that is what writel_relaxed is supposed to be.
> >
> > The only restriction it has is that the writes to a single device
> > using UC memory must be kept in program order..
> 
> Not sure about whether we have ever defined what happens to
> writel_relaxed() on WC memory though: On ARM, we disallow
> the compiler to combine writes, but the CPU still might.

If the driver uses WC memory then I think it should not expect
anything in terms of how writes map to TLPs other than nothing
combines across mmiowb() and mmiowb() is fully globally ordered when
enclosed in a spinlock.

The entire point of using WC memory is usually to get combining :) If
the driver doesn't want that then it should map UC..

> It's also not entirely clear to me what we want writel() inside a
> spinlock to mean: should the spinlock guarantee that two writel()
> calls on different CPUs that are protected by spinlocks are
> serialized by those locks, or not?

Yes for writel, I think that is already defined by the barriers
document

The same document says that _relaxed() does not give that guarentee.

The lwn articule on this went into some depth on the interaction with
spinlocks.

As far as I can see, containment in a spinlock seems to be the only
different between writel and writel_relaxed..

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
> > > This is a super performance critical operation for most drivers and
> > > directly impacts network performance.
> 
> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> any barriers at all.
> This might mean that they are always just the memory operation,
> but it would make it more obvious what the driver was doing.

I think that is what writel_relaxed is supposed to be.

The only restriction it has is that the writes to a single device
using UC memory must be kept in program order..

Jason


Re: RFC on writel and writel_relaxed

2018-03-23 Thread Jason Gunthorpe
On Fri, Mar 23, 2018 at 12:52:02AM +1100, Benjamin Herrenschmidt wrote:

> > >  - Make writel_relaxed() be a simple store without barriers, and
> > > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > > to happen between successive writel_relaxed on WC space (no change on
> > > normal NC space) while maintaining the ordering between relaxed reads
> > > and writes. The flip side is a (slight) increased overhead of
> > > readl_relaxed.
> > 
> > Are there many drivers that actually do writeX() on WC space?
> > memory-barriers.txt
> > pretty much says that all bets are off and no ordering guarantees can be 
> > assumed
> > when using readX/writeX on prefetchable IO memory. It seems sketchy enough 
> > to
> > give me some pause, but maybe it works fine elsewhere.
> 
> I don't know whether any does it, but I want to provide a way for a
> driver to somewhat reliably obtain write combine semantics without
> having to hand code endian swap and other horrors involved with using
> __raw_* accessors.

Many of the drivers in drivers/infiniband work with write combining
memory.

The usual pattern is a desire to push 32 or 64 bytes to the WC BAR as
efficiently as possible, ideally in a single PCI-E TLP.

A memcpy_to_wc primitive could probably cover these use cases, no need
to redesign the IO accessors..

The WC memory is never read, so read/write order is not important to
any infiniband driver.

What is very important is keeping the WC behavior isolated within the
spinlock. WC to the same addresses cannot be permitted in this pattern:

   writel(addr = 0);
   mmiowmb();
   spin_unlock();
   spin_lock()
   writel(addr = 0);

The CPU must always generate two PCI-E TLPs to the device.

This is a super performance critical operation for most drivers and
directly impacts network performance.

Jason


Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

2018-03-18 Thread Jason Gunthorpe
On Sat, Mar 17, 2018 at 02:30:10PM -0400, Sinan Kaya wrote:

> Somebody also has to take a task and work very hard to get rid of 
> __raw_writeX()
> APIs in drivers/net directory. It looked like a very common practice though
> it clearly violates multiarch portability concerns Jason and Deve highlighted.

When you posted your list I thought most of the hits were in what I'd
think of 'one-arch drivers', eg an IRQ controller or clock driver or
something.. Some might have a reason for it (eg avoiding the swap, for
instance), maybe it is a hold over from before writel_relaxed, or
maybe it is just a cargo-cult behavior..

It is the obviously multi-arch drivers that probably need some
attention..

Jason


Re: [PATCH] tpm: vtpm: constify vio_device_id

2017-08-18 Thread Jason Gunthorpe
On Fri, Aug 18, 2017 at 09:32:46PM +1000, Michael Ellerman wrote:

> >>  drivers/char/tpm/tpm_ibmvtpm.c | 2 +-
> 
> Who merges changes for this driver? I assume it's Jarkko?

Yes

Jason


Re: [PATCH] tpm: vtpm: constify vio_device_id

2017-08-17 Thread Jason Gunthorpe
On Thu, Aug 17, 2017 at 11:04:21PM +0530, Arvind Yadav wrote:
> vio_device_id are not supposed to change at runtime. All functions
> working with vio_device_id provided by  work with
> const vio_device_id. So mark the non-const structs as const.
> 
> Signed-off-by: Arvind Yadav <arvind.yadav...@gmail.com>

Reviewed-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>

>  drivers/char/tpm/tpm_ibmvtpm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/char/tpm/tpm_ibmvtpm.c b/drivers/char/tpm/tpm_ibmvtpm.c
> index f01d083..d2ce46b 100644
> +++ b/drivers/char/tpm/tpm_ibmvtpm.c
> @@ -32,7 +32,7 @@
>  
>  static const char tpm_ibmvtpm_driver_name[] = "tpm_ibmvtpm";
>  
> -static struct vio_device_id tpm_ibmvtpm_device_table[] = {
> +static const struct vio_device_id tpm_ibmvtpm_device_table[] = {
>   { "IBM,vtpm", "IBM,vtpm"},
>   { "", "" }
>  };


Re: ibmvtpm byteswapping inconsistency

2017-01-26 Thread Jason Gunthorpe
On Thu, Jan 26, 2017 at 09:22:48PM +0100, Michal Such??nek wrote:

> This is repeated a few times in the driver so I added memset to quiet
> gcc and make behavior deterministic in case the unused fields get some
> meaning in the future.

Yep, reserved certainly needs to be zeroed.. Can you send a patch?
memset is overkill...

> However, in tpm_ibmvtpm_send the structure is initialized as
> 
>   struct ibmvtpm_crq crq;
> __be64 *word = (__be64 *)
> ...
> crq.valid = (u8)IBMVTPM_VALID_CMD;
> crq.msg = (u8)VTPM_TPM_COMMAND;
> crq.len = cpu_to_be16(count);
> crq.data = cpu_to_be32(ibmvtpm->rtce_dma_handle);
> 
> and submitted with
> 
>   rc = ibmvtpm_send_crq(ibmvtpm->vdev, be64_to_cpu(word[0]),
>   be64_to_cpu(word[1]));
> meaning it is swapped twice.

No idea, Nayna may know.

My guess is that '__be64 *word' should be 'u64 *word'...

Jason


Re: [PATCH v7] powerpc: Do not make the entire heap executable

2016-12-12 Thread Jason Gunthorpe
On Wed, Dec 07, 2016 at 02:15:27PM -0800, Kees Cook wrote:
> Can you resend this patch with /proc/$pid/maps output showing the
> before/after effects of this change also included here? Showing that
> reduction in executable mapping should illustrate the benefit of
> avoiding having the execute bit set on the brk area (which is a clear
> security weakness, having writable/executable code in the process's
> memory segment, especially when it was _not_ requested by the ELF
> headers).

Denys, I'll leave it to you to re-sumbit again..

I suggest this reworded commit message:

powerpc: Do not make the entire heap executable

gcc has a configure option to generate ELF files that are W^X:
--enable-secureplt

However the PPC32 kernel is hardwired to add PROT_EXEC to the heap and
BSS segments, totally defeating this security protection.

This is done because the common ELF loader does not properly respect
the load header permissions when mapping a region that is not file
backed.

For example, non-secure PPC creates these segments:

  [21] .data PROGBITS10061254 051254 001868 00  WA  0   0  4
  [22] .got  PROGBITS10062abc 052abc 14 04 WAX  0   0  4
  [23] .sdataPROGBITS10062ad0 052ad0 58 00  WA  0   0  4
  [24] .sbss NOBITS  10062b28 052b28 40 00  WA  0   0  8
  [25] .plt  NOBITS  10062b68 052b28 000d38 00 WAX  0   0  4
  [26] .bss  NOBITS  100638a0 052b28 006424 00  WA  0   0 16

Which results in an ELF load header covering those segments:

  Type   Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD   0x051160 0x10061160 0x10061160 0x019c8 0x08b64 RWE 0x1

If we just remove the PROT_EXEC PPC32 hack then the kernel will do
something like:

10061000-10063000 rwxp 00051000 07:00 3208784/app/bin/busybox.dynamic
10063000-1006a000 rw-p  00:00 0
104a6000-104c7000 rw-p  00:00 0  [heap]

The 2nd mapping is the anonymous 0 filled pages requested by FileSiz <
MemSiz, but it is not executable. This causes instant crashes of
normal PPC32 ELFs as their PLT and GOTs must be executable.

This patches fixes the above bug in the ELF loader and removes the
forced PROT_EXEC from PPC32.

The improvement is seen in /proc/PID/maps. After the patch a normal
ELF looks like:

10061000-10063000 rwxp 00051000 07:00 3208784/app/bin/busybox.dynamic
10063000-1006a000 rwxp  00:00 0
104a6000-104c7000 rw-p  00:00 0  [heap]

while a secure-PLT ELF looks like:

1005-10052000 rw-p 0005 07:00 3208780/app/bin/busybox.dynamic
10052000-10059000 rw-p  00:00 0 
1074-10761000 rw-p  00:00 0  [heap]

Critically, with this patch applied all of the maps within a
secure-plt ELF are W^X.

While before this patch we see undeseriable W

normal ELF:

10061000-10063000 rwxp 00051000 07:00 3208784/app/bin/busybox.dynamic
10063000-1006a000 rwxp  00:00 0
104a6000-104c7000 rwxp  00:00 0  [heap]

secure-plt ELF:

1005-10052000 rw-p 0005 07:00 3208780/app/bin/busybox.dynamic
10052000-10059000 rwxp  00:00 0 
104c7000-104e8000 rwxp  00:00 0  [heap]

-

Andrew/Kees/Denys:

Here are full dumps from my PPC405 system.

Before patch (secure-plt):

0010-00103000 r-xp  00:00 0  [vdso]
0fe25000-0fe3 r-xp  07:00 6557504/app/lib/libnss_files-2.23.so
0fe3-0fe44000 ---p b000 07:00 6557504/app/lib/libnss_files-2.23.so
0fe44000-0fe45000 r--p f000 07:00 6557504/app/lib/libnss_files-2.23.so
0fe45000-0fe46000 rw-p 0001 07:00 6557504/app/lib/libnss_files-2.23.so
0fe46000-0fe4c000 rw-p  00:00 0 
0fe5c000-0ffd2000 r-xp  07:00 4094672/app/lib/libc-2.23.so
0ffd2000-0ffe8000 ---p 00176000 07:00 4094672/app/lib/libc-2.23.so
0ffe8000-0ffec000 r--p 0017c000 07:00 4094672/app/lib/libc-2.23.so
0ffec000-0ffed000 rw-p 0018 07:00 4094672/app/lib/libc-2.23.so
0ffed000-0fff rw-p  00:00 0 
1000-1004e000 r-xp  07:00 3208780/app/bin/busybox.dynamic
1005-10052000 rw-p 0005 07:00 3208780/app/bin/busybox.dynamic
10052000-10059000 rwxp  00:00 0 
104c7000-104e8000 rwxp  00:00 0  [heap]
b7ab1000-b7ad2000 r-xp  07:00 4009940/app/lib/ld-2.23.so
b7aee000-b7af rw-p  00:00 0 
b7af-b7af1000 r--p 0002f000 07:00 4009940/app/lib/ld-2.23.so
b7af1000-b7af2000 rw-p 0003 07:00 4009940/app/lib/ld-2.23.so
bfee1000-bff02000 rw-p  00:00 0  [stack]

After Patch (secure-plt):

0010-00103000 r-xp  00:00 0  [vdso]
0fe25000-0fe3 r-xp  07:00 6557504/app/lib/libnss_files-2.23.so
0fe3-0fe44000 ---p b000 07:00 6557504/app/lib/libnss_files-2.23.so
0fe44000-0fe45000 r--p f000 07:00 6557504/app/lib/libnss_files-2.23.so
0fe45000-0fe46000 

Re: [PATCH v6] powerpc: Do not make the entire heap executable

2016-10-20 Thread Jason Gunthorpe
On Tue, Oct 04, 2016 at 09:54:12AM -0700, Kees Cook wrote:
> On Mon, Oct 3, 2016 at 5:18 PM, Michael Ellerman <m...@ellerman.id.au> wrote:
> > Kees Cook <keesc...@chromium.org> writes:
> >
> >> On Mon, Oct 3, 2016 at 9:13 AM, Denys Vlasenko <dvlas...@redhat.com> wrote:
> >>> On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt,
> >>> or with a toolchain which defaults to it) look like this:
> > ...
> >>>
> >>> Signed-off-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>
> >>> Signed-off-by: Denys Vlasenko <dvlas...@redhat.com>
> >>> Acked-by: Kees Cook <keesc...@chromium.org>
> >>> Acked-by: Michael Ellerman <m...@ellerman.id.au>
> >>> CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> >>> CC: Paul Mackerras <pau...@samba.org>
> >>> CC: "Aneesh Kumar K.V" <aneesh.ku...@linux.vnet.ibm.com>
> >>> CC: Kees Cook <keesc...@chromium.org>
> >>> CC: Oleg Nesterov <o...@redhat.com>
> >>> CC: Michael Ellerman <m...@ellerman.id.au>
> >>> CC: Florian Weimer <fwei...@redhat.com>
> >>> CC: linux...@kvack.org
> >>> CC: linuxppc-dev@lists.ozlabs.org
> >>> CC: linux-ker...@vger.kernel.org
> >>> Changes since v5:
> >>> * made do_brk_flags() error out if any bits other than VM_EXEC are set.
> >>>   (Kees Cook: "With this, I'd be happy to Ack.")
> >>>   See https://patchwork.ozlabs.org/patch/661595/
> >>
> >> Excellent, thanks for the v6! Should this go via the ppc tree or the -mm 
> >> tree?
> >
> > -mm would be best, given the diffstat I think it's less likely to
> >  conflict if it goes via -mm.
> 
> Okay, excellent. Andrew, do you have this already in email? I think
> you weren't on the explicit CC from the v6...

FWIW (and ping),

Tested-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>

On ARM32 (kirkwood) and PPC32 (405)

For reference, here is the patchwork URL:

https://patchwork.ozlabs.org/patch/677753/

Jason


Re: [PATCH v5] powerpc: Do not make the entire heap executable

2016-09-27 Thread Jason Gunthorpe
On Wed, Sep 28, 2016 at 11:42:11AM +1000, Michael Ellerman wrote:

> But this is not really a powerpc patch, and I'm not an ELF expert. So
> I'm not comfortable merging it via the powerpc tree. It doesn't look
> like we really have a maintainer for binfmt_elf.c, so I'm not sure who
> should be acking that part.

Thanks a bunch for looking at this Michael.

> I've added Al Viro to Cc, he maintains fs/ and might be interested.

> I've also added Andrew Morton who might be happy to put this in his
> tree, and see if anyone complains?

For those added to the CC, I would re-state my original commit message
more clearly.

My research showed that the ELF loader bug fixed in this patch is the
root cause bug fix required to implement this hunk:

> > -#define VM_DATA_DEFAULT_FLAGS32(VM_READ | VM_WRITE | VM_EXEC | \
> > +#define VM_DATA_DEFAULT_FLAGS32 \
> > +   (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0) | \
> > +VM_READ | VM_WRITE | \
> >  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)

Eg that 32 bit powerpc currently unconditionally injects writable,
executable pages into a user space process.

This critically undermines all the W^X security work that has been
done in the tool chain and user space by the PPC community.

I would encourage people to view this as an important security patch
for 32 bit powerpc environments.

Regards,
Jason


Re: [PATCH v4] powerpc: Do not make the entire heap executable

2016-08-22 Thread Jason Gunthorpe
On Mon, Aug 22, 2016 at 08:37:05PM +0200, Denys Vlasenko wrote:

> >Is this going to break any application ? I am asking because you
> >mentioned the patch is lightly tested.
> 
> I booted powerpc64 machine with RHEL7 installation,
> it did not catch fire.

When I authored the original patch my concern was if it had unforseen
impacts on other platforms. I know PPC32 and ARM32 work OK.

It would good to test other platforms as well.

Jason


Re: [PATCH] powerpc: Do not make the entire heap executable

2016-08-02 Thread Jason Gunthorpe
On Mon, Aug 01, 2016 at 11:07:21PM +0200, Denys Vlasenko wrote:
> On 32-bit powerps the ELF PLT sections of binaries (built with --bss-plt,
> or with a toolchain which defaults to it) look like this:

Hi Denys,

Thanks for resurrecting this, I am still interested here, we have been
using this patch in our PPC & ARM32 systems since it was posted.

Benjamin gave some feedback that some small style issues might need
changing, but I never agitated enough to get feedback from mm or core,
which is really what is needed for this patch..

Jason
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V2] powerpc/infiniband: Use cache inhibitted and guarded mapping on powerpc

2016-04-22 Thread Jason Gunthorpe
On Fri, Apr 22, 2016 at 08:47:56AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2016-04-20 at 03:58 -0400, Aneesh Kumar K.V wrote:
> > The driver was requesting for a writethrough mapping. But with thoses
> > flags we will end up with a SAO mapping because we now have memory
> > conherence always enabled. ie, the existing mapping will end up with
> > a WIMG value 0b1110 which is Strong Access Order.
> > 
> > Update this to use cache inhibitted guarded mapping
> 
> Why guarded ? If it's performance sensitive (and the driver has
> appropriate barriers where needed), you will get write combining
> without guarded, you won't with it.

This driver uses uncached write combining on x86

Jason
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] net: mv643xx_eth: Add missing phy_addr_set in DT mode

2013-11-04 Thread Jason Gunthorpe
Commit cc9d4598 'net: mv643xx_eth: use of_phy_connect if phy_node
present' made the call to phy_scan optional, if the DT has a link to
the phy node.

However phy_scan has the side effect of calling phy_addr_set, which
writes the phy MDIO address to the ethernet controller. If phy_addr_set
is not called, and the bootloader has not set the correct address then
the driver will fail to function.

Tested on Kirkwood.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
Cc: David Miller da...@davemloft.net
Cc: Lennert Buytenhek buyt...@wantstofly.org
Cc: Jason Cooper ja...@lakedaemon.net
Cc: Andrew Lunn and...@lunn.ch
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: net...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
---
 drivers/net/ethernet/marvell/mv643xx_eth.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c 
b/drivers/net/ethernet/marvell/mv643xx_eth.c
index 2c210ec..00e43b5 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -2890,6 +2890,7 @@ static int mv643xx_eth_probe(struct platform_device *pdev)
 PHY_INTERFACE_MODE_GMII);
if (!mp-phy)
err = -ENODEV;
+   phy_addr_set(mp, mp-phy-addr);
} else if (pd-phy_addr != MV643XX_ETH_PHY_NONE) {
mp-phy = phy_scan(mp, pd-phy_addr);
 
-- 
1.8.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [tpmdd-devel] Has anyone a ATMEL TPM Chip on PPC64 (CONFIG_TCG_ATMEL)?

2013-10-28 Thread Jason Gunthorpe
On Mon, Oct 28, 2013 at 01:03:43PM -0500, Joel Schopp wrote:
 On 10/27/2013 05:06 PM, Peter H?we wrote:
  Hi,
  
  I was wondering if anyone here on this list still has a machine with an old 
  ATMEL TPM (trusted platform module) lying around?
  
 From the kconfig entry it becomes evident that it was only supported on 
 ppc64 
  machines.
  
  config TCG_ATMEL
  tristate Atmel TPM Interface
  depends on PPC64 || HAS_IOPORT

Hurm, that is crazy, because tpm_atmel.h contains an #else block for
!CONFIG_PPC64. The single major source of complexity in this driver is
that else block.

The driver would be fine, and straightforward if it was a standard,
modern DT enabled driver, which is very easy if PPC64 is the only
supported implementation.

 I reccomend removing the driver.  If the stars align and a user actually
 appears who wants to use one I'll clean up the driver and resubmit it
 for inclusion.  I just don't think that will happen.

The needed clean up is really easy actually, replace everything below
'tpm_vendor_specific tpm_atmel' with approximately this:

static int atml_probe(struct platform_device *pdev)
{
struct tpm_chip *chip;
struct resrouce *res;

res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!r)
return -EIO;

if (!(chip = tpm_register_hardware(dev, tpm_atmel)))
   return -ENODEV;

chip-iobase = devm_request_and_ioremap(pdev-dev, res);
if (!chip-iobase)
return -ENOMEM;

return 0;
}

static void atml_plat_remove(struct platform_device *pdev)
{
struct tpm_chip *chip = dev_get_drvdata(pdev-dev);
tpm_remove_hardware(chip-dev); 
};

static const struct of_device_id platform_match[] = {
{.compatible = AT97SC3201},
{},
};
MODULE_DEVICE_TABLE(of, platform_match);

static SIMPLE_DEV_PM_OPS(tpm_atml_pm, tpm_pm_suspend, tpm_pm_resume);

static struct platform_driver atml_drv = {
.probe = atml_probe,
.remove = atml_plat_remove,
.driver = {
.name = tpm_atmel,
.owner  = THIS_MODULE,
.pm = tpm_atml_pm,
.of_match_table = of_match_ptr(platform_match),
},
};

module_platform_driver(atml_drv);

MODULE_AUTHOR(Leendert van Doorn (leend...@watson.ibm.com));
MODULE_DESCRIPTION(TPM Driver);
MODULE_VERSION(2.0);
MODULE_LICENSE(GPL);

If you guys can convice yourselves that doesn't obviously break anything I can
probably send a proper patch.

Jason
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 2/2] net: mv643xx_eth: proper initialization for Kirkwood SoCs

2013-05-24 Thread Jason Gunthorpe
On Fri, May 24, 2013 at 12:46:36PM -0400, Jason Cooper wrote:

  Why are you not keen on this? It seems like normal device driver
  practice, that is what the data field of of_device_id is typically
  used for..
 
 I'm not keen on it because we don't have a document saying All kirkwood
 SoCs need PSC1 set to X after reset.  We know it, but have we tested
 the 6282?

I disagree. The manul is very clear how PSC1 must be set for proper
operation. Clk125BypassEn bit is used only for loopback testing, it
should never set for driver operation. Similarly PortReset must be
cleared for driver operation.

It is always safe for the driver to clear these bits, if it knows they
are available.  In fact, I would argue the driver should always clear
these bits so that the bootloader isn't relied on to do it. It doesn't
matter if the SOC errantly sets the bit or not, it can *always* be
safely cleared.

Further, I compared my 88F6282/88F6283 manual against the public
88F6180/88F619x/88F6281 spec and confirmed that the PSC1 layout is the
same.

So all of these SOC's can share a driver compatible string.

AFAICT the only reason the driver doesn't touch PSC1 today is because
the PSC1 was introduced in a later IP revision and its presence isn't
auto-detectable.

The last bit of the puzzle to get bootloader independence on kirkwood
is to encode the phy interface type (GMII/RGMII/BASE-X) in the DT so
the entire PSC1 can be set by the driver..

Jason
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 2/2] net: mv643xx_eth: proper initialization for Kirkwood SoCs

2013-05-23 Thread Jason Gunthorpe
On Thu, May 23, 2013 at 12:01:11PM -0400, Jason Cooper wrote:
   + /* Kirkwood resets some registers on gated clocks. Especially
   +  * CLK125_BYPASS_EN must be cleared but is not available on
   +  * all other SoCs/System Controllers using this driver.
   +  */
   + if (of_machine_is_compatible(marvell,kirkwood))
   + wrlp(mp, PORT_SERIAL_CONTROL1,
   +  rdlp(mp, PORT_SERIAL_CONTROL1)  ~CLK125_BYPASS_EN);
  
  of_machine_is_compatible seems heavy handed, I would expect this to be
  based on the compatible string of the ethernet node itself, not the
  machine??
 
 Is there a model number variation between IP that needs this and IP that
 doesn't?  If not, I'm fine with of_machine_is_compatible().

Well the name 'mv643xx' is a family of system controller SOC's
from ages ago, it seems reasonble to continue the trend and label the
IP variations with the SOC name:

 compatible = marvell,kirwood,ethernet, marvell,mv643xx_eth

Jason
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


  1   2   >