Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-18 Thread Dan Williams
On Mon, Dec 18, 2017 at 3:54 AM, Peter Zijlstra  wrote:
> On Fri, Dec 15, 2017 at 08:38:02AM -0800, Dan Williams wrote:
>
>> The motivation was that I noticed that get_user_pages_fast() was doing
>> a full pud_access_permitted() check, but the get_user_pages() slow
>> path was only doing a pud_write() check. That was inconsistent so I
>> went to go resolve that across all the pte types and ended up making a
>> mess of things,
>
>> I'm fine if the answer is that we should have went the
>> other way to only do write checks. However, when I was investigating
>> which way to go the aspect that persuaded me to start sprinkling
>> p??_access_permitted checks around was that the application behavior
>> changed between mmap access and direct-i/o access to the same buffer.
>
>> I assumed that different access behavior between those would be an
>> inconsistent surprise to userspace. Although, infinitely looping in
>> handle_mm_fault is an even worse surprise, apologies for that.
>
> Well, we all make a mess of things at time. I'm certainly guilty of
> that, so no worries there. But it really helps if your Changelogs at
> least describe what you're trying to do and why.

Yes, agreed. Unfortunately in this case all those details were
included in the lead in patch, and should have been duplicated to the
follow on cleanups. See:

1501899a898d mm: fix device-dax pud write-faults triggered by get_user_pages()

"For now this just implements a simple check for the _PAGE_RW bit similar
to pmd_write.  However, this implies that the gup-slow-path check is
missing the extra checks that the gup-fast-path performs with
pud_access_permitted.  Later patches will align all checks to use the
'access_permitted' helper if the architecture provides it."

...and that paragraph would have saved you some wondering.

> So I think I covered what you set out to do. In any case, Linus took the
> whole lot back out, so we can look at this afresh.

Thanks Peter, appreciate it.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-18 Thread Dan Williams
On Mon, Dec 18, 2017 at 3:54 AM, Peter Zijlstra  wrote:
> On Fri, Dec 15, 2017 at 08:38:02AM -0800, Dan Williams wrote:
>
>> The motivation was that I noticed that get_user_pages_fast() was doing
>> a full pud_access_permitted() check, but the get_user_pages() slow
>> path was only doing a pud_write() check. That was inconsistent so I
>> went to go resolve that across all the pte types and ended up making a
>> mess of things,
>
>> I'm fine if the answer is that we should have went the
>> other way to only do write checks. However, when I was investigating
>> which way to go the aspect that persuaded me to start sprinkling
>> p??_access_permitted checks around was that the application behavior
>> changed between mmap access and direct-i/o access to the same buffer.
>
>> I assumed that different access behavior between those would be an
>> inconsistent surprise to userspace. Although, infinitely looping in
>> handle_mm_fault is an even worse surprise, apologies for that.
>
> Well, we all make a mess of things at time. I'm certainly guilty of
> that, so no worries there. But it really helps if your Changelogs at
> least describe what you're trying to do and why.

Yes, agreed. Unfortunately in this case all those details were
included in the lead in patch, and should have been duplicated to the
follow on cleanups. See:

1501899a898d mm: fix device-dax pud write-faults triggered by get_user_pages()

"For now this just implements a simple check for the _PAGE_RW bit similar
to pmd_write.  However, this implies that the gup-slow-path check is
missing the extra checks that the gup-fast-path performs with
pud_access_permitted.  Later patches will align all checks to use the
'access_permitted' helper if the architecture provides it."

...and that paragraph would have saved you some wondering.

> So I think I covered what you set out to do. In any case, Linus took the
> whole lot back out, so we can look at this afresh.

Thanks Peter, appreciate it.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-18 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 08:38:02AM -0800, Dan Williams wrote:

> The motivation was that I noticed that get_user_pages_fast() was doing
> a full pud_access_permitted() check, but the get_user_pages() slow
> path was only doing a pud_write() check. That was inconsistent so I
> went to go resolve that across all the pte types and ended up making a
> mess of things,

> I'm fine if the answer is that we should have went the
> other way to only do write checks. However, when I was investigating
> which way to go the aspect that persuaded me to start sprinkling
> p??_access_permitted checks around was that the application behavior
> changed between mmap access and direct-i/o access to the same buffer.

> I assumed that different access behavior between those would be an
> inconsistent surprise to userspace. Although, infinitely looping in
> handle_mm_fault is an even worse surprise, apologies for that.

Well, we all make a mess of things at time. I'm certainly guilty of
that, so no worries there. But it really helps if your Changelogs at
least describe what you're trying to do and why.

So I think I covered what you set out to do. In any case, Linus took the
whole lot back out, so we can look at this afresh.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-18 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 08:38:02AM -0800, Dan Williams wrote:

> The motivation was that I noticed that get_user_pages_fast() was doing
> a full pud_access_permitted() check, but the get_user_pages() slow
> path was only doing a pud_write() check. That was inconsistent so I
> went to go resolve that across all the pte types and ended up making a
> mess of things,

> I'm fine if the answer is that we should have went the
> other way to only do write checks. However, when I was investigating
> which way to go the aspect that persuaded me to start sprinkling
> p??_access_permitted checks around was that the application behavior
> changed between mmap access and direct-i/o access to the same buffer.

> I assumed that different access behavior between those would be an
> inconsistent surprise to userspace. Although, infinitely looping in
> handle_mm_fault is an even worse surprise, apologies for that.

Well, we all make a mess of things at time. I'm certainly guilty of
that, so no worries there. But it really helps if your Changelogs at
least describe what you're trying to do and why.

So I think I covered what you set out to do. In any case, Linus took the
whole lot back out, so we can look at this afresh.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dave Hansen
On 12/15/2017 06:52 PM, Linus Torvalds wrote:
> On Fri, Dec 15, 2017 at 6:48 PM, Al Viro  wrote:
>> Treating protection key bits as "escalate to page fault and let that
>> deal with the checks" should be fine
> 
> Well, it's *semantically* fine and I think it's the right model from
> that standpoint.

It's _close_ to fine.  :)

Practically, we're going to have two classes of things in the world:
1. Things that are protected with protection keys and have non-zero bits
   in the pkey PTE bits.
2. Things that are _not_ protected will have zeros in there.

But, in the hardware, *everything* has a pkey.  0 is the default,
obviously, but the hardware treats it the same as all the other values.
So, if we go checking for the "pkey bits being set", and have behavior
diverge when they are set, we end up with pkey=0 being even more special
compared to the rest.

This might be OK, but it's going to be interesting to document and write
tests for it.  I'm already dreading the manpage updates.

> However, since the main use case of protection keys is probably
> databases (Dave?) and since those also might be performance-sensitive
> about direct-IO doing page table lookups, it might not be great in
> practice.

Yeah, databases are definitely the heavy-hitters that care about it.

But, these PKRU checks are cheap.  I forget the actual cycle counts, but
I remember thinking that it's pretty darn cheap to read PKRU.  In the
grand scheme of doing a page table walk and incrementing an atomic, it's
surely in the noise for direct I/O to large pages, which is basically
guaranteed for the database guys.

I did some get_user_pages() torture tests (on small pages IIRC) before I
put the code in and could not detect a delta from the code being there
or not.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dave Hansen
On 12/15/2017 06:52 PM, Linus Torvalds wrote:
> On Fri, Dec 15, 2017 at 6:48 PM, Al Viro  wrote:
>> Treating protection key bits as "escalate to page fault and let that
>> deal with the checks" should be fine
> 
> Well, it's *semantically* fine and I think it's the right model from
> that standpoint.

It's _close_ to fine.  :)

Practically, we're going to have two classes of things in the world:
1. Things that are protected with protection keys and have non-zero bits
   in the pkey PTE bits.
2. Things that are _not_ protected will have zeros in there.

But, in the hardware, *everything* has a pkey.  0 is the default,
obviously, but the hardware treats it the same as all the other values.
So, if we go checking for the "pkey bits being set", and have behavior
diverge when they are set, we end up with pkey=0 being even more special
compared to the rest.

This might be OK, but it's going to be interesting to document and write
tests for it.  I'm already dreading the manpage updates.

> However, since the main use case of protection keys is probably
> databases (Dave?) and since those also might be performance-sensitive
> about direct-IO doing page table lookups, it might not be great in
> practice.

Yeah, databases are definitely the heavy-hitters that care about it.

But, these PKRU checks are cheap.  I forget the actual cycle counts, but
I remember thinking that it's pretty darn cheap to read PKRU.  In the
grand scheme of doing a page table walk and incrementing an atomic, it's
surely in the noise for direct I/O to large pages, which is basically
guaranteed for the database guys.

I did some get_user_pages() torture tests (on small pages IIRC) before I
put the code in and could not detect a delta from the code being there
or not.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 6:52 PM, Linus Torvalds
 wrote:
>
> However, since the main use case of protection keys is probably
> databases (Dave?) and since those also might be performance-sensitive
> about direct-IO doing page table lookups, it might not be great in
> practice.

Anyway, I reverted the three commits Dan pointed to, and we can always
revisit the exact path forward later

Because regardless of what the eventual future checks will be, for
4.15 I think we'll just go back to doing what we used to do.

If somebody sees problems, holler.

  Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 6:52 PM, Linus Torvalds
 wrote:
>
> However, since the main use case of protection keys is probably
> databases (Dave?) and since those also might be performance-sensitive
> about direct-IO doing page table lookups, it might not be great in
> practice.

Anyway, I reverted the three commits Dan pointed to, and we can always
revisit the exact path forward later

Because regardless of what the eventual future checks will be, for
4.15 I think we'll just go back to doing what we used to do.

If somebody sees problems, holler.

  Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 6:48 PM, Al Viro  wrote:
>
> Treating protection key bits as "escalate to page fault and let that
> deal with the checks" should be fine

Well, it's *semantically* fine and I think it's the right model from
that standpoint.

However, since the main use case of protection keys is probably
databases (Dave?) and since those also might be performance-sensitive
about direct-IO doing page table lookups, it might not be great in
practice.

 Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 6:48 PM, Al Viro  wrote:
>
> Treating protection key bits as "escalate to page fault and let that
> deal with the checks" should be fine

Well, it's *semantically* fine and I think it's the right model from
that standpoint.

However, since the main use case of protection keys is probably
databases (Dave?) and since those also might be performance-sensitive
about direct-IO doing page table lookups, it might not be great in
practice.

 Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Al Viro
On Fri, Dec 15, 2017 at 06:28:36PM -0800, Linus Torvalds wrote:
> On Fri, Dec 15, 2017 at 5:25 PM, Dave Hansen  wrote:
> >
> > I think the reason we needed VMA and PTE checks was the
> > get_user_pages_fast() path not having a VMA.
> 
> That is indeed the point of get_user_pages_fast(): no vma lookup, no
> locking, just "do the default case as streamlined as possible".
> 
> But part of it is also that we should fall back to the slow case if
> the fast case doesn't work (eg because the page isn't there or
> whatever).
> 
> So what we could do - perhaps - is to just make get_user_pages_fast()
> check whether any of the protection key bits are set, and fail for
> that case.

FWIW, a good description of fast path in get_user_pages_fast() is
"simulate a TLB miss", the slow path being "... and go for simulated
page fault if TLB miss would have escalated to #PF".

Treating protection key bits as "escalate to page fault and let that
deal with the checks" should be fine - page fault handler must
cope with the page actually being present in page tables anyway, for
obvious reasons...


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Al Viro
On Fri, Dec 15, 2017 at 06:28:36PM -0800, Linus Torvalds wrote:
> On Fri, Dec 15, 2017 at 5:25 PM, Dave Hansen  wrote:
> >
> > I think the reason we needed VMA and PTE checks was the
> > get_user_pages_fast() path not having a VMA.
> 
> That is indeed the point of get_user_pages_fast(): no vma lookup, no
> locking, just "do the default case as streamlined as possible".
> 
> But part of it is also that we should fall back to the slow case if
> the fast case doesn't work (eg because the page isn't there or
> whatever).
> 
> So what we could do - perhaps - is to just make get_user_pages_fast()
> check whether any of the protection key bits are set, and fail for
> that case.

FWIW, a good description of fast path in get_user_pages_fast() is
"simulate a TLB miss", the slow path being "... and go for simulated
page fault if TLB miss would have escalated to #PF".

Treating protection key bits as "escalate to page fault and let that
deal with the checks" should be fine - page fault handler must
cope with the page actually being present in page tables anyway, for
obvious reasons...


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 5:25 PM, Dave Hansen  wrote:
>
> I think the reason we needed VMA and PTE checks was the
> get_user_pages_fast() path not having a VMA.

That is indeed the point of get_user_pages_fast(): no vma lookup, no
locking, just "do the default case as streamlined as possible".

But part of it is also that we should fall back to the slow case if
the fast case doesn't work (eg because the page isn't there or
whatever).

So what we could do - perhaps - is to just make get_user_pages_fast()
check whether any of the protection key bits are set, and fail for
that case.

If we care, that is.

   Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 5:25 PM, Dave Hansen  wrote:
>
> I think the reason we needed VMA and PTE checks was the
> get_user_pages_fast() path not having a VMA.

That is indeed the point of get_user_pages_fast(): no vma lookup, no
locking, just "do the default case as streamlined as possible".

But part of it is also that we should fall back to the slow case if
the fast case doesn't work (eg because the page isn't there or
whatever).

So what we could do - perhaps - is to just make get_user_pages_fast()
check whether any of the protection key bits are set, and fail for
that case.

If we care, that is.

   Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dan Williams
On Fri, Dec 15, 2017 at 5:10 PM, Linus Torvalds
 wrote:
> On Fri, Dec 15, 2017 at 4:29 PM, Dan Williams  
> wrote:
>> So do you want to do a straight revert of these that went in for 4.15:
>
> I think that's the right thing to do, but would want to verify that
> there are no *other* issues than just the attempt at PKRU.
>
> The commit message does talk about PAGE_USER, and as mentioned I do
> think that's a good thing to check, I just don't think it should be
> done this way,
>
> Was there something else going behind these commits? Because if not,
> let's revert and then perhaps later introduce a more targeted thing?

Yes, these three can be safely reverted.

5c9d2d5c269c mm: replace pte_write with pte_access_permitted...
c7da82b894e9 mm: replace pmd_write with pmd_access_permitted...
e7fe7b5cae90 mm: replace pud_write with pud_access_permitted...

They were part of a 4 patch series where this lead one below is the
one fix we actually need.

1501899a898d mm: fix device-dax pud write-faults triggered by...

---

Now, the original access permitted was born out of a cleanup to
introduce pte_allows_gup(), this is where the PAGE_USER check came
from:

1874f6895c92 x86/mm/gup: Simplify get_user_pages() PTE bit handling

...and that helper later grew pkey check support here:

   33a709b25a76 mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys

...sometime later it was all renamed and made kernel-global here when
the x86 gup implementation was converted to use the common
implementation:

e7884f8ead4a mm/gup: Move permission checks into helpers

All this to say that these are not revert candidates and need
incremental patches if we want to back out the pkey checking for the
gup-fast path and re-work the PAGE_USER checking.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dan Williams
On Fri, Dec 15, 2017 at 5:10 PM, Linus Torvalds
 wrote:
> On Fri, Dec 15, 2017 at 4:29 PM, Dan Williams  
> wrote:
>> So do you want to do a straight revert of these that went in for 4.15:
>
> I think that's the right thing to do, but would want to verify that
> there are no *other* issues than just the attempt at PKRU.
>
> The commit message does talk about PAGE_USER, and as mentioned I do
> think that's a good thing to check, I just don't think it should be
> done this way,
>
> Was there something else going behind these commits? Because if not,
> let's revert and then perhaps later introduce a more targeted thing?

Yes, these three can be safely reverted.

5c9d2d5c269c mm: replace pte_write with pte_access_permitted...
c7da82b894e9 mm: replace pmd_write with pmd_access_permitted...
e7fe7b5cae90 mm: replace pud_write with pud_access_permitted...

They were part of a 4 patch series where this lead one below is the
one fix we actually need.

1501899a898d mm: fix device-dax pud write-faults triggered by...

---

Now, the original access permitted was born out of a cleanup to
introduce pte_allows_gup(), this is where the PAGE_USER check came
from:

1874f6895c92 x86/mm/gup: Simplify get_user_pages() PTE bit handling

...and that helper later grew pkey check support here:

   33a709b25a76 mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys

...sometime later it was all renamed and made kernel-global here when
the x86 gup implementation was converted to use the common
implementation:

e7884f8ead4a mm/gup: Move permission checks into helpers

All this to say that these are not revert candidates and need
incremental patches if we want to back out the pkey checking for the
gup-fast path and re-work the PAGE_USER checking.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dave Hansen
On 12/15/2017 05:10 PM, Linus Torvalds wrote:
> Because *if* we want to check protection keys, I think we should do
> that at the vma layer, partly exactly because the exact implementation
> of protection keys is so architecture-specific, and partly because I
> don't think it makes sense to check them for every page anyway.

So, there are VMA checks against protection keys.  The problem _here_ is
that we are checking against the VMA (and correctly skipping the PKRU
checks) and then _mistakenly_ applying the PTE checks against PKRU.

I think the reason we needed VMA and PTE checks was the
get_user_pages_fast() path not having a VMA.

I need to go re-read the commits, though.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dave Hansen
On 12/15/2017 05:10 PM, Linus Torvalds wrote:
> Because *if* we want to check protection keys, I think we should do
> that at the vma layer, partly exactly because the exact implementation
> of protection keys is so architecture-specific, and partly because I
> don't think it makes sense to check them for every page anyway.

So, there are VMA checks against protection keys.  The problem _here_ is
that we are checking against the VMA (and correctly skipping the PKRU
checks) and then _mistakenly_ applying the PTE checks against PKRU.

I think the reason we needed VMA and PTE checks was the
get_user_pages_fast() path not having a VMA.

I need to go re-read the commits, though.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 4:29 PM, Dan Williams  wrote:
> So do you want to do a straight revert of these that went in for 4.15:

I think that's the right thing to do, but would want to verify that
there are no *other* issues than just the attempt at PKRU.

The commit message does talk about PAGE_USER, and as mentioned I do
think that's a good thing to check, I just don't think it should be
done this way,

Was there something else going behind these commits? Because if not,
let's revert and then perhaps later introduce a more targeted thing?

Also, aren't the protection keys encoded in the vma?

Because *if* we want to check protection keys, I think we should do
that at the vma layer, partly exactly because the exact implementation
of protection keys is so architecture-specific, and partly because I
don't think it makes sense to check them for every page anyway.

   Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 4:29 PM, Dan Williams  wrote:
> So do you want to do a straight revert of these that went in for 4.15:

I think that's the right thing to do, but would want to verify that
there are no *other* issues than just the attempt at PKRU.

The commit message does talk about PAGE_USER, and as mentioned I do
think that's a good thing to check, I just don't think it should be
done this way,

Was there something else going behind these commits? Because if not,
let's revert and then perhaps later introduce a more targeted thing?

Also, aren't the protection keys encoded in the vma?

Because *if* we want to check protection keys, I think we should do
that at the vma layer, partly exactly because the exact implementation
of protection keys is so architecture-specific, and partly because I
don't think it makes sense to check them for every page anyway.

   Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 4:31 PM, Al Viro  wrote:
>>
>> The fact is, if we have non-user mappings in the user part of the
>> address space, we _need_ to teach access_ok() about them, because
>> fundamentally any "get_user()/put_user()" will happily ignore the lack
>> of PAGE_USER (since those happen from kernel space).
>
> Details, please - how *can* access_ok() be taught of that?

We'd have to do something like put the !PAGE_USER mapping at the top
of the user address space, and then simply make user_addr_max()
smaller than the actual user page table size.

Or some other silly hack.

I do not believe there is any sane way to have !PAGE_USER in
_general_, if you actually want to limit access to it.

(We _could_ use !PAGE_USER for things that aren't really strictly
about security - ie  we could have used it for the NUMA balancing
instead of using the P bit, and just let put_user/get_user blow
through them).

 Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Fri, Dec 15, 2017 at 4:31 PM, Al Viro  wrote:
>>
>> The fact is, if we have non-user mappings in the user part of the
>> address space, we _need_ to teach access_ok() about them, because
>> fundamentally any "get_user()/put_user()" will happily ignore the lack
>> of PAGE_USER (since those happen from kernel space).
>
> Details, please - how *can* access_ok() be taught of that?

We'd have to do something like put the !PAGE_USER mapping at the top
of the user address space, and then simply make user_addr_max()
smaller than the actual user page table size.

Or some other silly hack.

I do not believe there is any sane way to have !PAGE_USER in
_general_, if you actually want to limit access to it.

(We _could_ use !PAGE_USER for things that aren't really strictly
about security - ie  we could have used it for the NUMA balancing
instead of using the P bit, and just let put_user/get_user blow
through them).

 Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Al Viro
On Fri, Dec 15, 2017 at 04:20:31PM -0800, Linus Torvalds wrote:
> On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra  wrote:
> >
> > So we actually need the pte_access_permitted() stuff if we want to
> > ensure we're not stepping on !PAGE_USER things.
> 
> We really don't. Not in that complex and broken format, and not for every 
> level.
> 
> Also, while I think we *should* check the PAGE_USER bit when walking
> the page tables, like we used to, we should
> 
>  (a) do it much more simply, not with that broken interface that takes
> insane and pointless flags
> 
>  (b) not tie it together with this issue at all, since the PAGE_USER
> thing really is largely immaterial.
> 
> The fact is, if we have non-user mappings in the user part of the
> address space, we _need_ to teach access_ok() about them, because
> fundamentally any "get_user()/put_user()" will happily ignore the lack
> of PAGE_USER (since those happen from kernel space).

Details, please - how *can* access_ok() be taught of that?


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Al Viro
On Fri, Dec 15, 2017 at 04:20:31PM -0800, Linus Torvalds wrote:
> On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra  wrote:
> >
> > So we actually need the pte_access_permitted() stuff if we want to
> > ensure we're not stepping on !PAGE_USER things.
> 
> We really don't. Not in that complex and broken format, and not for every 
> level.
> 
> Also, while I think we *should* check the PAGE_USER bit when walking
> the page tables, like we used to, we should
> 
>  (a) do it much more simply, not with that broken interface that takes
> insane and pointless flags
> 
>  (b) not tie it together with this issue at all, since the PAGE_USER
> thing really is largely immaterial.
> 
> The fact is, if we have non-user mappings in the user part of the
> address space, we _need_ to teach access_ok() about them, because
> fundamentally any "get_user()/put_user()" will happily ignore the lack
> of PAGE_USER (since those happen from kernel space).

Details, please - how *can* access_ok() be taught of that?


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dan Williams
On Fri, Dec 15, 2017 at 4:20 PM, Linus Torvalds
 wrote:
>
> On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra  wrote:
> >
> > So we actually need the pte_access_permitted() stuff if we want to
> > ensure we're not stepping on !PAGE_USER things.
>
> We really don't. Not in that complex and broken format, and not for every 
> level.
>
> Also, while I think we *should* check the PAGE_USER bit when walking
> the page tables, like we used to, we should
>
>  (a) do it much more simply, not with that broken interface that takes
> insane and pointless flags
>
>  (b) not tie it together with this issue at all, since the PAGE_USER
> thing really is largely immaterial.
>
> The fact is, if we have non-user mappings in the user part of the
> address space, we _need_ to teach access_ok() about them, because
> fundamentally any "get_user()/put_user()" will happily ignore the lack
> of PAGE_USER (since those happen from kernel space).
>
> So I'd like to check PAGE_USER in GUP simply because it's a simple
> sanity check, not because it is important.
>
> And that whole "p??_access_permitted() checks against the current
> PKRU" is just incredible shit. It's currently broken, exactly because
> "current PKRU" isn't even well-defined when you do it across different
> threads, much less different address spaces.
>
> This is why I'm 100% convinced that the current
> "p??_access_permitted()" is just pure and utter garbage. And it's
> garbage at a _fundamental_ level, not because of some small
> implementation detail.

So do you want to do a straight revert of these that went in for 4.15:

5c9d2d5c269c mm: replace pte_write with pte_access_permitted in fault
+ gup paths
c7da82b894e9 mm: replace pmd_write with pmd_access_permitted in fault
+ gup paths
e7fe7b5cae90 mm: replace pud_write with pud_access_permitted in fault
+ gup paths

...or take Peter's patches that are trying to fix up the damage?


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dan Williams
On Fri, Dec 15, 2017 at 4:20 PM, Linus Torvalds
 wrote:
>
> On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra  wrote:
> >
> > So we actually need the pte_access_permitted() stuff if we want to
> > ensure we're not stepping on !PAGE_USER things.
>
> We really don't. Not in that complex and broken format, and not for every 
> level.
>
> Also, while I think we *should* check the PAGE_USER bit when walking
> the page tables, like we used to, we should
>
>  (a) do it much more simply, not with that broken interface that takes
> insane and pointless flags
>
>  (b) not tie it together with this issue at all, since the PAGE_USER
> thing really is largely immaterial.
>
> The fact is, if we have non-user mappings in the user part of the
> address space, we _need_ to teach access_ok() about them, because
> fundamentally any "get_user()/put_user()" will happily ignore the lack
> of PAGE_USER (since those happen from kernel space).
>
> So I'd like to check PAGE_USER in GUP simply because it's a simple
> sanity check, not because it is important.
>
> And that whole "p??_access_permitted() checks against the current
> PKRU" is just incredible shit. It's currently broken, exactly because
> "current PKRU" isn't even well-defined when you do it across different
> threads, much less different address spaces.
>
> This is why I'm 100% convinced that the current
> "p??_access_permitted()" is just pure and utter garbage. And it's
> garbage at a _fundamental_ level, not because of some small
> implementation detail.

So do you want to do a straight revert of these that went in for 4.15:

5c9d2d5c269c mm: replace pte_write with pte_access_permitted in fault
+ gup paths
c7da82b894e9 mm: replace pmd_write with pmd_access_permitted in fault
+ gup paths
e7fe7b5cae90 mm: replace pud_write with pud_access_permitted in fault
+ gup paths

...or take Peter's patches that are trying to fix up the damage?


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra  wrote:
>
> So we actually need the pte_access_permitted() stuff if we want to
> ensure we're not stepping on !PAGE_USER things.

We really don't. Not in that complex and broken format, and not for every level.

Also, while I think we *should* check the PAGE_USER bit when walking
the page tables, like we used to, we should

 (a) do it much more simply, not with that broken interface that takes
insane and pointless flags

 (b) not tie it together with this issue at all, since the PAGE_USER
thing really is largely immaterial.

The fact is, if we have non-user mappings in the user part of the
address space, we _need_ to teach access_ok() about them, because
fundamentally any "get_user()/put_user()" will happily ignore the lack
of PAGE_USER (since those happen from kernel space).

So I'd like to check PAGE_USER in GUP simply because it's a simple
sanity check, not because it is important.

And that whole "p??_access_permitted() checks against the current
PKRU" is just incredible shit. It's currently broken, exactly because
"current PKRU" isn't even well-defined when you do it across different
threads, much less different address spaces.

This is why I'm 100% convinced that the current
"p??_access_permitted()" is just pure and utter garbage. And it's
garbage at a _fundamental_ level, not because of some small
implementation detail.

Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Linus Torvalds
On Thu, Dec 14, 2017 at 11:51 PM, Peter Zijlstra  wrote:
>
> So we actually need the pte_access_permitted() stuff if we want to
> ensure we're not stepping on !PAGE_USER things.

We really don't. Not in that complex and broken format, and not for every level.

Also, while I think we *should* check the PAGE_USER bit when walking
the page tables, like we used to, we should

 (a) do it much more simply, not with that broken interface that takes
insane and pointless flags

 (b) not tie it together with this issue at all, since the PAGE_USER
thing really is largely immaterial.

The fact is, if we have non-user mappings in the user part of the
address space, we _need_ to teach access_ok() about them, because
fundamentally any "get_user()/put_user()" will happily ignore the lack
of PAGE_USER (since those happen from kernel space).

So I'd like to check PAGE_USER in GUP simply because it's a simple
sanity check, not because it is important.

And that whole "p??_access_permitted() checks against the current
PKRU" is just incredible shit. It's currently broken, exactly because
"current PKRU" isn't even well-defined when you do it across different
threads, much less different address spaces.

This is why I'm 100% convinced that the current
"p??_access_permitted()" is just pure and utter garbage. And it's
garbage at a _fundamental_ level, not because of some small
implementation detail.

Linus


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dan Williams
On Fri, Dec 15, 2017 at 3:38 AM, Peter Zijlstra  wrote:
> On Fri, Dec 15, 2017 at 11:25:29AM +0100, Peter Zijlstra wrote:
>> The memory one is also clearly wrong, not having access does not a write
>> fault make. If we have pte_write() set we should not do_wp_page() just
>> because we don't have access. This falls under the "doing anything other
>> than hard failure for !access is crazy" header.
>
> So per the very same reasoning I think the below is warranted too; also
> rename that @dirty variable, because its also wrong.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5eb3d2524bdc..0d43b347eb0a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3987,7 +3987,7 @@ static int __handle_mm_fault(struct vm_area_struct 
> *vma, unsigned long address,
> .pgoff = linear_page_index(vma, address),
> .gfp_mask = __get_fault_gfp_mask(vma),
> };
> -   unsigned int dirty = flags & FAULT_FLAG_WRITE;
> +   unsigned int write = flags & FAULT_FLAG_WRITE;
> struct mm_struct *mm = vma->vm_mm;
> pgd_t *pgd;
> p4d_t *p4d;
> @@ -4013,7 +4013,7 @@ static int __handle_mm_fault(struct vm_area_struct 
> *vma, unsigned long address,
>
> /* NUMA case for anonymous PUDs would go here */
>
> -   if (dirty && !pud_access_permitted(orig_pud, WRITE)) {
> +   if (write && !pud_write(orig_pud)) {
> ret = wp_huge_pud(, orig_pud);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> @@ -4046,7 +4046,7 @@ static int __handle_mm_fault(struct vm_area_struct 
> *vma, unsigned long address,
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(, orig_pmd);
>
> -   if (dirty && !pmd_access_permitted(orig_pmd, WRITE)) {
> +   if (write && !pmd_write(orig_pmd)) {
> ret = wp_huge_pmd(, orig_pmd);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
>
>
> I still cannot make sense of what the intention behind these changes
> were, the Changelog that went with them is utter crap, it doesn't
> explain anything.

The motivation was that I noticed that get_user_pages_fast() was doing
a full pud_access_permitted() check, but the get_user_pages() slow
path was only doing a pud_write() check. That was inconsistent so I
went to go resolve that across all the pte types and ended up making a
mess of things, I'm fine if the answer is that we should have went the
other way to only do write checks. However, when I was investigating
which way to go the aspect that persuaded me to start sprinkling
p??_access_permitted checks around was that the application behavior
changed between mmap access and direct-i/o access to the same buffer.
I assumed that different access behavior between those would be an
inconsistent surprise to userspace. Although, infinitely looping in
handle_mm_fault is an even worse surprise, apologies for that.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Dan Williams
On Fri, Dec 15, 2017 at 3:38 AM, Peter Zijlstra  wrote:
> On Fri, Dec 15, 2017 at 11:25:29AM +0100, Peter Zijlstra wrote:
>> The memory one is also clearly wrong, not having access does not a write
>> fault make. If we have pte_write() set we should not do_wp_page() just
>> because we don't have access. This falls under the "doing anything other
>> than hard failure for !access is crazy" header.
>
> So per the very same reasoning I think the below is warranted too; also
> rename that @dirty variable, because its also wrong.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5eb3d2524bdc..0d43b347eb0a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3987,7 +3987,7 @@ static int __handle_mm_fault(struct vm_area_struct 
> *vma, unsigned long address,
> .pgoff = linear_page_index(vma, address),
> .gfp_mask = __get_fault_gfp_mask(vma),
> };
> -   unsigned int dirty = flags & FAULT_FLAG_WRITE;
> +   unsigned int write = flags & FAULT_FLAG_WRITE;
> struct mm_struct *mm = vma->vm_mm;
> pgd_t *pgd;
> p4d_t *p4d;
> @@ -4013,7 +4013,7 @@ static int __handle_mm_fault(struct vm_area_struct 
> *vma, unsigned long address,
>
> /* NUMA case for anonymous PUDs would go here */
>
> -   if (dirty && !pud_access_permitted(orig_pud, WRITE)) {
> +   if (write && !pud_write(orig_pud)) {
> ret = wp_huge_pud(, orig_pud);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> @@ -4046,7 +4046,7 @@ static int __handle_mm_fault(struct vm_area_struct 
> *vma, unsigned long address,
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(, orig_pmd);
>
> -   if (dirty && !pmd_access_permitted(orig_pmd, WRITE)) {
> +   if (write && !pmd_write(orig_pmd)) {
> ret = wp_huge_pmd(, orig_pmd);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
>
>
> I still cannot make sense of what the intention behind these changes
> were, the Changelog that went with them is utter crap, it doesn't
> explain anything.

The motivation was that I noticed that get_user_pages_fast() was doing
a full pud_access_permitted() check, but the get_user_pages() slow
path was only doing a pud_write() check. That was inconsistent so I
went to go resolve that across all the pte types and ended up making a
mess of things, I'm fine if the answer is that we should have went the
other way to only do write checks. However, when I was investigating
which way to go the aspect that persuaded me to start sprinkling
p??_access_permitted checks around was that the application behavior
changed between mmap access and direct-i/o access to the same buffer.
I assumed that different access behavior between those would be an
inconsistent surprise to userspace. Although, infinitely looping in
handle_mm_fault is an even worse surprise, apologies for that.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 03:37:30PM +0100, Peter Zijlstra wrote:

> Kirill did point out that my patch(es) break FOLL_DUMP in that it would
> now exclude pkey protected pages from core-dumps.
> 
> My counter argument is that it will now properly exclude !_PAGE_USER
> pages.
> 
> If we change p??_access_permitted() to pass the full follow flags
> instead of just the write part we could fix that.

Something like the completely untested below would do that.

Then we'd need this on top:

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1232,7 +1232,10 @@ __pte_access_permitted(unsigned long pte
need_pte_bits |= _PAGE_RW;
 
if ((pteval & need_pte_bits) != need_pte_bits)
-   return 0;
+   return false;
+
+   if (foll_flags & FOLL_DUMP)
+   return true;
 
return __pkru_allows_pkey(pte_flags_pkey(pteval), write);
 }

But it is rather ugly... :/


---

--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -11,6 +11,7 @@
 #define _ASMARM_PGTABLE_H
 
 #include 
+#include 
 #include 
 
 #ifndef CONFIG_MMU
@@ -232,12 +233,12 @@ static inline pte_t *pmd_page_vaddr(pmd_
 #define pte_valid_user(pte)\
(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
 
-static inline bool pte_access_permitted(pte_t pte, bool write)
+static inline bool pte_access_permitted(pte_t pte, unsigned int foll_flags)
 {
pteval_t mask = L_PTE_PRESENT | L_PTE_USER;
pteval_t needed = mask;
 
-   if (write)
+   if (foll_flags & FOLL_WRITE)
mask |= L_PTE_RDONLY;
 
return (pte_val(pte) & mask) == needed;
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -16,6 +16,8 @@
 #ifndef __ASM_PGTABLE_H
 #define __ASM_PGTABLE_H
 
+#include 
+
 #include 
 #include 
 
@@ -114,12 +116,23 @@ extern unsigned long empty_zero_page[PAG
  * write permission check) other than user execute-only which do not have the
  * PTE_USER bit set. PROT_NONE mappings do not have the PTE_VALID bit set.
  */
-#define pte_access_permitted(pte, write) \
-   (pte_valid_user(pte) && (!(write) || pte_write(pte)))
-#define pmd_access_permitted(pmd, write) \
-   (pte_access_permitted(pmd_pte(pmd), (write)))
-#define pud_access_permitted(pud, write) \
-   (pte_access_permitted(pud_pte(pud), (write)))
+static inline bool __pte_access_permitted(pte_t pte, unsigned int foll_flags)
+{
+   if (!pte_valid_user(pte))
+   return false;
+
+   if (foll_flags & FOLL_WRITE)
+   return !!pte_write(pte);
+
+   return true;
+}
+
+#define pte_access_permitted(pte, foll_flags) \
+   (__pte_access_permitted((pte), (foll_flags))
+#define pmd_access_permitted(pmd, foll_flags) \
+   (pte_access_permitted(pmd_pte(pmd), (foll_flags)))
+#define pud_access_permitted(pud, foll_flags) \
+   (pte_access_permitted(pud_pte(pud), (foll_flags)))
 
 static inline pte_t clear_pte_bit(pte_t pte, pgprot_t prot)
 {
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -75,7 +75,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd
if (!(pmd_val(pmd) & _PAGE_VALID))
return 0;
 
-   if (!pmd_access_permitted(pmd, write))
+   if (!pmd_access_permitted(pmd, !!write * FOLL_WRITE))
return 0;
 
refs = 0;
@@ -114,7 +114,7 @@ static int gup_huge_pud(pud_t *pudp, pud
if (!(pud_val(pud) & _PAGE_VALID))
return 0;
 
-   if (!pud_access_permitted(pud, write))
+   if (!pud_access_permitted(pud, !!write * FOLL_WRITE))
return 0;
 
refs = 0;
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_PGTABLE_H
 #define _ASM_X86_PGTABLE_H
 
+#include 
 #include 
 #include 
 #include 
@@ -1221,9 +1222,11 @@ static inline bool __pkru_allows_pkey(u1
  * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
  * same value on all 3 types.
  */
-static inline bool __pte_access_permitted(unsigned long pteval, bool write)
+static inline bool
+__pte_access_permitted(unsigned long pteval, unsigned int foll_flags)
 {
unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
+   bool write = foll_flags & FOLL_WRITE;
 
if (write)
need_pte_bits |= _PAGE_RW;
@@ -1235,21 +1238,21 @@ static inline bool __pte_access_permitte
 }
 
 #define pte_access_permitted pte_access_permitted
-static inline bool pte_access_permitted(pte_t pte, bool write)
+static inline bool pte_access_permitted(pte_t pte, bool foll_flags)
 {
-   return __pte_access_permitted(pte_val(pte), write);
+   return __pte_access_permitted(pte_val(pte), foll_flags);
 }
 
 #define pmd_access_permitted pmd_access_permitted
-static inline bool pmd_access_permitted(pmd_t pmd, bool write)
+static inline bool pmd_access_permitted(pmd_t pmd, bool foll_flags)
 {
-   return __pte_access_permitted(pmd_val(pmd), write);
+   

Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 03:37:30PM +0100, Peter Zijlstra wrote:

> Kirill did point out that my patch(es) break FOLL_DUMP in that it would
> now exclude pkey protected pages from core-dumps.
> 
> My counter argument is that it will now properly exclude !_PAGE_USER
> pages.
> 
> If we change p??_access_permitted() to pass the full follow flags
> instead of just the write part we could fix that.

Something like the completely untested below would do that.

Then we'd need this on top:

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1232,7 +1232,10 @@ __pte_access_permitted(unsigned long pte
need_pte_bits |= _PAGE_RW;
 
if ((pteval & need_pte_bits) != need_pte_bits)
-   return 0;
+   return false;
+
+   if (foll_flags & FOLL_DUMP)
+   return true;
 
return __pkru_allows_pkey(pte_flags_pkey(pteval), write);
 }

But it is rather ugly... :/


---

--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -11,6 +11,7 @@
 #define _ASMARM_PGTABLE_H
 
 #include 
+#include 
 #include 
 
 #ifndef CONFIG_MMU
@@ -232,12 +233,12 @@ static inline pte_t *pmd_page_vaddr(pmd_
 #define pte_valid_user(pte)\
(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
 
-static inline bool pte_access_permitted(pte_t pte, bool write)
+static inline bool pte_access_permitted(pte_t pte, unsigned int foll_flags)
 {
pteval_t mask = L_PTE_PRESENT | L_PTE_USER;
pteval_t needed = mask;
 
-   if (write)
+   if (foll_flags & FOLL_WRITE)
mask |= L_PTE_RDONLY;
 
return (pte_val(pte) & mask) == needed;
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -16,6 +16,8 @@
 #ifndef __ASM_PGTABLE_H
 #define __ASM_PGTABLE_H
 
+#include 
+
 #include 
 #include 
 
@@ -114,12 +116,23 @@ extern unsigned long empty_zero_page[PAG
  * write permission check) other than user execute-only which do not have the
  * PTE_USER bit set. PROT_NONE mappings do not have the PTE_VALID bit set.
  */
-#define pte_access_permitted(pte, write) \
-   (pte_valid_user(pte) && (!(write) || pte_write(pte)))
-#define pmd_access_permitted(pmd, write) \
-   (pte_access_permitted(pmd_pte(pmd), (write)))
-#define pud_access_permitted(pud, write) \
-   (pte_access_permitted(pud_pte(pud), (write)))
+static inline bool __pte_access_permitted(pte_t pte, unsigned int foll_flags)
+{
+   if (!pte_valid_user(pte))
+   return false;
+
+   if (foll_flags & FOLL_WRITE)
+   return !!pte_write(pte);
+
+   return true;
+}
+
+#define pte_access_permitted(pte, foll_flags) \
+   (__pte_access_permitted((pte), (foll_flags))
+#define pmd_access_permitted(pmd, foll_flags) \
+   (pte_access_permitted(pmd_pte(pmd), (foll_flags)))
+#define pud_access_permitted(pud, foll_flags) \
+   (pte_access_permitted(pud_pte(pud), (foll_flags)))
 
 static inline pte_t clear_pte_bit(pte_t pte, pgprot_t prot)
 {
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -75,7 +75,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd
if (!(pmd_val(pmd) & _PAGE_VALID))
return 0;
 
-   if (!pmd_access_permitted(pmd, write))
+   if (!pmd_access_permitted(pmd, !!write * FOLL_WRITE))
return 0;
 
refs = 0;
@@ -114,7 +114,7 @@ static int gup_huge_pud(pud_t *pudp, pud
if (!(pud_val(pud) & _PAGE_VALID))
return 0;
 
-   if (!pud_access_permitted(pud, write))
+   if (!pud_access_permitted(pud, !!write * FOLL_WRITE))
return 0;
 
refs = 0;
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_PGTABLE_H
 #define _ASM_X86_PGTABLE_H
 
+#include 
 #include 
 #include 
 #include 
@@ -1221,9 +1222,11 @@ static inline bool __pkru_allows_pkey(u1
  * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
  * same value on all 3 types.
  */
-static inline bool __pte_access_permitted(unsigned long pteval, bool write)
+static inline bool
+__pte_access_permitted(unsigned long pteval, unsigned int foll_flags)
 {
unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
+   bool write = foll_flags & FOLL_WRITE;
 
if (write)
need_pte_bits |= _PAGE_RW;
@@ -1235,21 +1238,21 @@ static inline bool __pte_access_permitte
 }
 
 #define pte_access_permitted pte_access_permitted
-static inline bool pte_access_permitted(pte_t pte, bool write)
+static inline bool pte_access_permitted(pte_t pte, bool foll_flags)
 {
-   return __pte_access_permitted(pte_val(pte), write);
+   return __pte_access_permitted(pte_val(pte), foll_flags);
 }
 
 #define pmd_access_permitted pmd_access_permitted
-static inline bool pmd_access_permitted(pmd_t pmd, bool write)
+static inline bool pmd_access_permitted(pmd_t pmd, bool foll_flags)
 {
-   return __pte_access_permitted(pmd_val(pmd), write);
+   

Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 11:25:29AM +0100, Peter Zijlstra wrote:
> The memory one is also clearly wrong, not having access does not a write
> fault make. If we have pte_write() set we should not do_wp_page() just
> because we don't have access. This falls under the "doing anything other
> than hard failure for !access is crazy" header.

So per the very same reasoning I think the below is warranted too; also
rename that @dirty variable, because its also wrong.

diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..0d43b347eb0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3987,7 +3987,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
.pgoff = linear_page_index(vma, address),
.gfp_mask = __get_fault_gfp_mask(vma),
};
-   unsigned int dirty = flags & FAULT_FLAG_WRITE;
+   unsigned int write = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
@@ -4013,7 +4013,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
 
/* NUMA case for anonymous PUDs would go here */
 
-   if (dirty && !pud_access_permitted(orig_pud, WRITE)) {
+   if (write && !pud_write(orig_pud)) {
ret = wp_huge_pud(, orig_pud);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -4046,7 +4046,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(, orig_pmd);
 
-   if (dirty && !pmd_access_permitted(orig_pmd, WRITE)) {
+   if (write && !pmd_write(orig_pmd)) {
ret = wp_huge_pmd(, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;


I still cannot make sense of what the intention behind these changes
were, the Changelog that went with them is utter crap, it doesn't
explain anything.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 11:25:29AM +0100, Peter Zijlstra wrote:
> The memory one is also clearly wrong, not having access does not a write
> fault make. If we have pte_write() set we should not do_wp_page() just
> because we don't have access. This falls under the "doing anything other
> than hard failure for !access is crazy" header.

So per the very same reasoning I think the below is warranted too; also
rename that @dirty variable, because its also wrong.

diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..0d43b347eb0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3987,7 +3987,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
.pgoff = linear_page_index(vma, address),
.gfp_mask = __get_fault_gfp_mask(vma),
};
-   unsigned int dirty = flags & FAULT_FLAG_WRITE;
+   unsigned int write = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
@@ -4013,7 +4013,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
 
/* NUMA case for anonymous PUDs would go here */
 
-   if (dirty && !pud_access_permitted(orig_pud, WRITE)) {
+   if (write && !pud_write(orig_pud)) {
ret = wp_huge_pud(, orig_pud);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -4046,7 +4046,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(, orig_pmd);
 
-   if (dirty && !pmd_access_permitted(orig_pmd, WRITE)) {
+   if (write && !pmd_write(orig_pmd)) {
ret = wp_huge_pmd(, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;


I still cannot make sense of what the intention behind these changes
were, the Changelog that went with them is utter crap, it doesn't
explain anything.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 09:00:41AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:04:56PM -0800, Dave Hansen wrote:
> > 
> > I've got some additions to the selftests and a fix where we pass FOLL_*
> > flags around a bit more instead of just 'write'.  I'll get those out as
> > soon as I do a bit more testing.
> 
> Try the below; I have more in the works, but this already fixes a whole
> bunch of obvious fail and should fix the case I described.
> 
> The thing is, you should _never_ return NULL for an access error, that's
> complete crap.
> 
> You should also not blindly change every pte_write() test to
> pte_access_permitted(), that's also wrong, because then you're missing
> the read-access tests.
> 
> Basically you need to very carefully audit each and every
> p??_access_permitted() call; they're currently mostly wrong.

I think we also want this:

diff --git a/fs/dax.c b/fs/dax.c
index 78b72c48374e..95981591977a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -627,8 +627,7 @@ static void dax_mapping_entry_mkclean(struct address_space 
*mapping,
 
if (pfn != pmd_pfn(*pmdp))
goto unlock_pmd;
-   if (!pmd_dirty(*pmdp)
-   && !pmd_access_permitted(*pmdp, WRITE))
+   if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
goto unlock_pmd;
 
flush_cache_page(vma, address, pfn);
diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..6ce3ba12e07d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3948,7 +3948,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
if (vmf->flags & FAULT_FLAG_WRITE) {
-   if (!pte_access_permitted(entry, WRITE))
+   if (!pte_write(entry))
return do_wp_page(vmf);
entry = pte_mkdirty(entry);
}


the DAX one is both inconsistent (only the PMD case, not also the PTE
case) and just wrong; you don't want PKEYs to avoid cleaning pages,
that's crazy.

The memory one is also clearly wrong, not having access does not a write
fault make. If we have pte_write() set we should not do_wp_page() just
because we don't have access. This falls under the "doing anything other
than hard failure for !access is crazy" header.


Still looking at __handle_mm_fault(), they smell bad, but I can't get my
brain started today :/


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 09:00:41AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:04:56PM -0800, Dave Hansen wrote:
> > 
> > I've got some additions to the selftests and a fix where we pass FOLL_*
> > flags around a bit more instead of just 'write'.  I'll get those out as
> > soon as I do a bit more testing.
> 
> Try the below; I have more in the works, but this already fixes a whole
> bunch of obvious fail and should fix the case I described.
> 
> The thing is, you should _never_ return NULL for an access error, that's
> complete crap.
> 
> You should also not blindly change every pte_write() test to
> pte_access_permitted(), that's also wrong, because then you're missing
> the read-access tests.
> 
> Basically you need to very carefully audit each and every
> p??_access_permitted() call; they're currently mostly wrong.

I think we also want this:

diff --git a/fs/dax.c b/fs/dax.c
index 78b72c48374e..95981591977a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -627,8 +627,7 @@ static void dax_mapping_entry_mkclean(struct address_space 
*mapping,
 
if (pfn != pmd_pfn(*pmdp))
goto unlock_pmd;
-   if (!pmd_dirty(*pmdp)
-   && !pmd_access_permitted(*pmdp, WRITE))
+   if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
goto unlock_pmd;
 
flush_cache_page(vma, address, pfn);
diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..6ce3ba12e07d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3948,7 +3948,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
if (vmf->flags & FAULT_FLAG_WRITE) {
-   if (!pte_access_permitted(entry, WRITE))
+   if (!pte_write(entry))
return do_wp_page(vmf);
entry = pte_mkdirty(entry);
}


the DAX one is both inconsistent (only the PMD case, not also the PTE
case) and just wrong; you don't want PKEYs to avoid cleaning pages,
that's crazy.

The memory one is also clearly wrong, not having access does not a write
fault make. If we have pte_write() set we should not do_wp_page() just
because we don't have access. This falls under the "doing anything other
than hard failure for !access is crazy" header.


Still looking at __handle_mm_fault(), they smell bad, but I can't get my
brain started today :/


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 09:04:56PM -0800, Dave Hansen wrote:
> 
> I've got some additions to the selftests and a fix where we pass FOLL_*
> flags around a bit more instead of just 'write'.  I'll get those out as
> soon as I do a bit more testing.

Try the below; I have more in the works, but this already fixes a whole
bunch of obvious fail and should fix the case I described.

The thing is, you should _never_ return NULL for an access error, that's
complete crap.

You should also not blindly change every pte_write() test to
pte_access_permitted(), that's also wrong, because then you're missing
the read-access tests.

Basically you need to very carefully audit each and every
p??_access_permitted() call; they're currently mostly wrong.

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -66,7 +66,7 @@ static int follow_pfn_pte(struct vm_area
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-   return pte_access_permitted(pte, WRITE) ||
+   return pte_write(pte) ||
((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
 }
 
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(stru
}
 
if (flags & FOLL_GET) {
+   if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   goto out;
+   }
+
get_page(page);
 
/* drop the pgmap reference now that we hold the page */
@@ -244,6 +249,15 @@ static struct page *follow_pmd_mask(stru
pmd_migration_entry_wait(mm, pmd);
goto retry;
}
+
+   if (flags & FOLL_GET) {
+   if (!pmd_access_permitted(*pmd, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -326,6 +340,15 @@ static struct page *follow_pud_mask(stru
return page;
return no_page_table(vma, flags);
}
+
+   if (flags & FOLL_GET) {
+   if (!pud_access_permitted(*pud, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags);
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,9 +870,6 @@ struct page *follow_devmap_pmd(struct vm
 */
WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW 
set");
 
-   if (!pmd_access_permitted(*pmd, flags & FOLL_WRITE))
-   return NULL;
-
if (pmd_present(*pmd) && pmd_devmap(*pmd))
/* pass */;
else
@@ -1012,9 +1009,6 @@ struct page *follow_devmap_pud(struct vm
 
assert_spin_locked(pud_lockptr(mm, pud));
 
-   if (!pud_access_permitted(*pud, flags & FOLL_WRITE))
-   return NULL;
-
if (pud_present(*pud) && pud_devmap(*pud))
/* pass */;
else
@@ -1386,7 +1380,7 @@ int do_huge_pmd_wp_page(struct vm_fault
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-   return pmd_access_permitted(pmd, WRITE) ||
+   return pmd_write(pmd) ||
   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
 }
 


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-15 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 09:04:56PM -0800, Dave Hansen wrote:
> 
> I've got some additions to the selftests and a fix where we pass FOLL_*
> flags around a bit more instead of just 'write'.  I'll get those out as
> soon as I do a bit more testing.

Try the below; I have more in the works, but this already fixes a whole
bunch of obvious fail and should fix the case I described.

The thing is, you should _never_ return NULL for an access error, that's
complete crap.

You should also not blindly change every pte_write() test to
pte_access_permitted(), that's also wrong, because then you're missing
the read-access tests.

Basically you need to very carefully audit each and every
p??_access_permitted() call; they're currently mostly wrong.

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -66,7 +66,7 @@ static int follow_pfn_pte(struct vm_area
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-   return pte_access_permitted(pte, WRITE) ||
+   return pte_write(pte) ||
((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
 }
 
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(stru
}
 
if (flags & FOLL_GET) {
+   if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   goto out;
+   }
+
get_page(page);
 
/* drop the pgmap reference now that we hold the page */
@@ -244,6 +249,15 @@ static struct page *follow_pmd_mask(stru
pmd_migration_entry_wait(mm, pmd);
goto retry;
}
+
+   if (flags & FOLL_GET) {
+   if (!pmd_access_permitted(*pmd, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -326,6 +340,15 @@ static struct page *follow_pud_mask(stru
return page;
return no_page_table(vma, flags);
}
+
+   if (flags & FOLL_GET) {
+   if (!pud_access_permitted(*pud, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags);
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,9 +870,6 @@ struct page *follow_devmap_pmd(struct vm
 */
WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW 
set");
 
-   if (!pmd_access_permitted(*pmd, flags & FOLL_WRITE))
-   return NULL;
-
if (pmd_present(*pmd) && pmd_devmap(*pmd))
/* pass */;
else
@@ -1012,9 +1009,6 @@ struct page *follow_devmap_pud(struct vm
 
assert_spin_locked(pud_lockptr(mm, pud));
 
-   if (!pud_access_permitted(*pud, flags & FOLL_WRITE))
-   return NULL;
-
if (pud_present(*pud) && pud_devmap(*pud))
/* pass */;
else
@@ -1386,7 +1380,7 @@ int do_huge_pmd_wp_page(struct vm_fault
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-   return pmd_access_permitted(pmd, WRITE) ||
+   return pmd_write(pmd) ||
   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
 }
 


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 10:09:24PM -0800, Linus Torvalds wrote:
> On Dec 14, 2017 21:04, "Dave Hansen"  wrote:
> Can we please just undo that broken crap instead of trying to "fix" it?
> 
> It was wrong. We absolutely do not want to complicate the gup path.
> 
> Let's fet rid of those broken p??_access_permited() things.

So we actually need the pte_access_permitted() stuff if we want to
ensure we're not stepping on !PAGE_USER things.




Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 10:09:24PM -0800, Linus Torvalds wrote:
> On Dec 14, 2017 21:04, "Dave Hansen"  wrote:
> Can we please just undo that broken crap instead of trying to "fix" it?
> 
> It was wrong. We absolutely do not want to complicate the gup path.
> 
> Let's fet rid of those broken p??_access_permited() things.

So we actually need the pte_access_permitted() stuff if we want to
ensure we're not stepping on !PAGE_USER things.




Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Dave Hansen
On 12/14/2017 12:54 PM, Peter Zijlstra wrote:
>> That short-circuits the page fault pretty quickly.  So, basically, the
>> rule is: if the hardware says you tripped over pkey permissions, you
>> die.  We don't try to do anything to the underlying page *before* saying
>> that you die.
> That only works when you trip the fault from hardware. Not if you do a
> software fault using gup().
> 
> AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
> indefinitely on the case I described.

So, the underlying bug here is that we now a get_user_pages_remote() and
then go ahead and do the p*_access_permitted() checks against the
current PKRU.  This was introduced recently with the addition of the new
p??_access_permitted() calls.

We have checks in the VMA path for the "remote" gups and we avoid
consulting PKRU for them.  This got missed in the pkeys selftests
because I did a ptrace read, but not a *write*.  I also didn't
explicitly test it against something where a COW needed to be done.

I've got some additions to the selftests and a fix where we pass FOLL_*
flags around a bit more instead of just 'write'.  I'll get those out as
soon as I do a bit more testing.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Dave Hansen
On 12/14/2017 12:54 PM, Peter Zijlstra wrote:
>> That short-circuits the page fault pretty quickly.  So, basically, the
>> rule is: if the hardware says you tripped over pkey permissions, you
>> die.  We don't try to do anything to the underlying page *before* saying
>> that you die.
> That only works when you trip the fault from hardware. Not if you do a
> software fault using gup().
> 
> AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
> indefinitely on the case I described.

So, the underlying bug here is that we now a get_user_pages_remote() and
then go ahead and do the p*_access_permitted() checks against the
current PKRU.  This was introduced recently with the addition of the new
p??_access_permitted() calls.

We have checks in the VMA path for the "remote" gups and we avoid
consulting PKRU for them.  This got missed in the pkeys selftests
because I did a ptrace read, but not a *write*.  I also didn't
explicitly test it against something where a COW needed to be done.

I've got some additions to the selftests and a fix where we pass FOLL_*
flags around a bit more instead of just 'write'.  I'll get those out as
soon as I do a bit more testing.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 09:54:50PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:44:58PM -0800, Dave Hansen wrote:
> > On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> > > I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> > > looks very dodgy to me. How does that not result in endlessly CoW'ing
> > > the same page over and over when we have a PKEY disallowing write access
> > > on that page?
> > 
> > I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
> > assume that's something you added in this series.
> 
> No, Dan did in 5c9d2d5c269c4.
> 
> > But, one of the ways that we keep pkeys from causing these kinds of
> > repeating loops when interacting with other things is this hunk in the
> > page fault code:
> > 
> > > static inline int
> > > access_error(unsigned long error_code, struct vm_area_struct *vma)
> > > {
> > ...
> > > /*
> > >  * Read or write was blocked by protection keys.  This is
> > >  * always an unconditional error and can never result in
> > >  * a follow-up action to resolve the fault, like a COW.
> > >  */
> > > if (error_code & PF_PK)
> > > return 1;
> > 
> > That short-circuits the page fault pretty quickly.  So, basically, the
> > rule is: if the hardware says you tripped over pkey permissions, you
> > die.  We don't try to do anything to the underlying page *before* saying
> > that you die.
> 
> That only works when you trip the fault from hardware. Not if you do a
> software fault using gup().
> 
> AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
> indefinitely on the case I described.

Note that my patch actually fixes this by making can_follow_write_pte()
not return NULL (we'll take the CoW fault irrespective of PKEYs) and
then on the second go-around, we'll find a writable PTE but return
-EFAULT from follow_page_mask() because of PKEY and terminate.

But as is, follow_page_mask() will return NULL because either !write or
PKEY, faultin_page()->handle_mm_fault() will see !write because of PKEY
go into the CoW path, we rety follow_page_mask() it will _still_ return
NULL because PKEY, again to the fault, again retry, again 


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 09:54:50PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:44:58PM -0800, Dave Hansen wrote:
> > On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> > > I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> > > looks very dodgy to me. How does that not result in endlessly CoW'ing
> > > the same page over and over when we have a PKEY disallowing write access
> > > on that page?
> > 
> > I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
> > assume that's something you added in this series.
> 
> No, Dan did in 5c9d2d5c269c4.
> 
> > But, one of the ways that we keep pkeys from causing these kinds of
> > repeating loops when interacting with other things is this hunk in the
> > page fault code:
> > 
> > > static inline int
> > > access_error(unsigned long error_code, struct vm_area_struct *vma)
> > > {
> > ...
> > > /*
> > >  * Read or write was blocked by protection keys.  This is
> > >  * always an unconditional error and can never result in
> > >  * a follow-up action to resolve the fault, like a COW.
> > >  */
> > > if (error_code & PF_PK)
> > > return 1;
> > 
> > That short-circuits the page fault pretty quickly.  So, basically, the
> > rule is: if the hardware says you tripped over pkey permissions, you
> > die.  We don't try to do anything to the underlying page *before* saying
> > that you die.
> 
> That only works when you trip the fault from hardware. Not if you do a
> software fault using gup().
> 
> AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
> indefinitely on the case I described.

Note that my patch actually fixes this by making can_follow_write_pte()
not return NULL (we'll take the CoW fault irrespective of PKEYs) and
then on the second go-around, we'll find a writable PTE but return
-EFAULT from follow_page_mask() because of PKEY and terminate.

But as is, follow_page_mask() will return NULL because either !write or
PKEY, faultin_page()->handle_mm_fault() will see !write because of PKEY
go into the CoW path, we rety follow_page_mask() it will _still_ return
NULL because PKEY, again to the fault, again retry, again 


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 12:44:58PM -0800, Dave Hansen wrote:
> On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> > I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> > looks very dodgy to me. How does that not result in endlessly CoW'ing
> > the same page over and over when we have a PKEY disallowing write access
> > on that page?
> 
> I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
> assume that's something you added in this series.

No, Dan did in 5c9d2d5c269c4.

> But, one of the ways that we keep pkeys from causing these kinds of
> repeating loops when interacting with other things is this hunk in the
> page fault code:
> 
> > static inline int
> > access_error(unsigned long error_code, struct vm_area_struct *vma)
> > {
> ...
> > /*
> >  * Read or write was blocked by protection keys.  This is
> >  * always an unconditional error and can never result in
> >  * a follow-up action to resolve the fault, like a COW.
> >  */
> > if (error_code & PF_PK)
> > return 1;
> 
> That short-circuits the page fault pretty quickly.  So, basically, the
> rule is: if the hardware says you tripped over pkey permissions, you
> die.  We don't try to do anything to the underlying page *before* saying
> that you die.

That only works when you trip the fault from hardware. Not if you do a
software fault using gup().

AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
indefinitely on the case I described.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 12:44:58PM -0800, Dave Hansen wrote:
> On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> > I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> > looks very dodgy to me. How does that not result in endlessly CoW'ing
> > the same page over and over when we have a PKEY disallowing write access
> > on that page?
> 
> I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
> assume that's something you added in this series.

No, Dan did in 5c9d2d5c269c4.

> But, one of the ways that we keep pkeys from causing these kinds of
> repeating loops when interacting with other things is this hunk in the
> page fault code:
> 
> > static inline int
> > access_error(unsigned long error_code, struct vm_area_struct *vma)
> > {
> ...
> > /*
> >  * Read or write was blocked by protection keys.  This is
> >  * always an unconditional error and can never result in
> >  * a follow-up action to resolve the fault, like a COW.
> >  */
> > if (error_code & PF_PK)
> > return 1;
> 
> That short-circuits the page fault pretty quickly.  So, basically, the
> rule is: if the hardware says you tripped over pkey permissions, you
> die.  We don't try to do anything to the underlying page *before* saying
> that you die.

That only works when you trip the fault from hardware. Not if you do a
software fault using gup().

AFAIK __get_user_pages(FOLL_FORCE|FOLL_WRITE|FOLL_GET) will loop
indefinitely on the case I described.


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Dave Hansen
On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> looks very dodgy to me. How does that not result in endlessly CoW'ing
> the same page over and over when we have a PKEY disallowing write access
> on that page?

I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
assume that's something you added in this series.

But, one of the ways that we keep pkeys from causing these kinds of
repeating loops when interacting with other things is this hunk in the
page fault code:

> static inline int
> access_error(unsigned long error_code, struct vm_area_struct *vma)
> {
...
> /*
>  * Read or write was blocked by protection keys.  This is
>  * always an unconditional error and can never result in
>  * a follow-up action to resolve the fault, like a COW.
>  */
> if (error_code & PF_PK)
> return 1;

That short-circuits the page fault pretty quickly.  So, basically, the
rule is: if the hardware says you tripped over pkey permissions, you
die.  We don't try to do anything to the underlying page *before* saying
that you die.



Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Dave Hansen
On 12/14/2017 06:37 AM, Peter Zijlstra wrote:
> I'm also looking at pte_access_permitted() in handle_pte_fault(); that
> looks very dodgy to me. How does that not result in endlessly CoW'ing
> the same page over and over when we have a PKEY disallowing write access
> on that page?

I'm not seeing the pte_access_permitted() in handle_pte_fault().  I
assume that's something you added in this series.

But, one of the ways that we keep pkeys from causing these kinds of
repeating loops when interacting with other things is this hunk in the
page fault code:

> static inline int
> access_error(unsigned long error_code, struct vm_area_struct *vma)
> {
...
> /*
>  * Read or write was blocked by protection keys.  This is
>  * always an unconditional error and can never result in
>  * a follow-up action to resolve the fault, like a COW.
>  */
> if (error_code & PF_PK)
> return 1;

That short-circuits the page fault pretty quickly.  So, basically, the
rule is: if the hardware says you tripped over pkey permissions, you
die.  We don't try to do anything to the underlying page *before* saying
that you die.



Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 01:41:17PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:27:27PM +0100, Peter Zijlstra wrote:
> > The gup_*_range() functions which implement __get_user_pages_fast() do
> > a p*_access_permitted() test to see if the memory is at all accessible
> > (tests both _PAGE_USER|_PAGE_RW as well as architectural things like
> > pkeys).
> > 
> > But the follow_*() functions which implement __get_user_pages() do not
> > have this test. Recently, commit:
> > 
> >   5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + 
> > gup paths")
> > 
> > added it to a few specific write paths, but it failed to consistently
> > apply it (I've not audited anything outside of gup).
> > 
> > Revert the change from that patch and insert the tests in the right
> > locations such that they cover all READ / WRITE accesses for all
> > pte/pmd/pud levels.
> > 
> > In particular I care about the _PAGE_USER test, we should not ever,
> > allow access to pages not marked with it, but it also makes the pkey
> > accesses more consistent.
> 
> This should probably go on top. These are now all superfluous and
> slightly wrong.

I also cannot explain dax_mapping_entry_mkclean(), why would we not make
clean those pages that are not pkey writable (but clearly are writable
and dirty)? That doesn't make any sense at all.

Kirill did point out that my patch(es) break FOLL_DUMP in that it would
now exclude pkey protected pages from core-dumps.

My counter argument is that it will now properly exclude !_PAGE_USER
pages.

If we change p??_access_permitted() to pass the full follow flags
instead of just the write part we could fix that.

I'm also looking at pte_access_permitted() in handle_pte_fault(); that
looks very dodgy to me. How does that not result in endlessly CoW'ing
the same page over and over when we have a PKEY disallowing write access
on that page?

Bah... /me grumpy



Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 01:41:17PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:27:27PM +0100, Peter Zijlstra wrote:
> > The gup_*_range() functions which implement __get_user_pages_fast() do
> > a p*_access_permitted() test to see if the memory is at all accessible
> > (tests both _PAGE_USER|_PAGE_RW as well as architectural things like
> > pkeys).
> > 
> > But the follow_*() functions which implement __get_user_pages() do not
> > have this test. Recently, commit:
> > 
> >   5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + 
> > gup paths")
> > 
> > added it to a few specific write paths, but it failed to consistently
> > apply it (I've not audited anything outside of gup).
> > 
> > Revert the change from that patch and insert the tests in the right
> > locations such that they cover all READ / WRITE accesses for all
> > pte/pmd/pud levels.
> > 
> > In particular I care about the _PAGE_USER test, we should not ever,
> > allow access to pages not marked with it, but it also makes the pkey
> > accesses more consistent.
> 
> This should probably go on top. These are now all superfluous and
> slightly wrong.

I also cannot explain dax_mapping_entry_mkclean(), why would we not make
clean those pages that are not pkey writable (but clearly are writable
and dirty)? That doesn't make any sense at all.

Kirill did point out that my patch(es) break FOLL_DUMP in that it would
now exclude pkey protected pages from core-dumps.

My counter argument is that it will now properly exclude !_PAGE_USER
pages.

If we change p??_access_permitted() to pass the full follow flags
instead of just the write part we could fix that.

I'm also looking at pte_access_permitted() in handle_pte_fault(); that
looks very dodgy to me. How does that not result in endlessly CoW'ing
the same page over and over when we have a PKEY disallowing write access
on that page?

Bah... /me grumpy



Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 12:27:27PM +0100, Peter Zijlstra wrote:
> The gup_*_range() functions which implement __get_user_pages_fast() do
> a p*_access_permitted() test to see if the memory is at all accessible
> (tests both _PAGE_USER|_PAGE_RW as well as architectural things like
> pkeys).
> 
> But the follow_*() functions which implement __get_user_pages() do not
> have this test. Recently, commit:
> 
>   5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + 
> gup paths")
> 
> added it to a few specific write paths, but it failed to consistently
> apply it (I've not audited anything outside of gup).
> 
> Revert the change from that patch and insert the tests in the right
> locations such that they cover all READ / WRITE accesses for all
> pte/pmd/pud levels.
> 
> In particular I care about the _PAGE_USER test, we should not ever,
> allow access to pages not marked with it, but it also makes the pkey
> accesses more consistent.

This should probably go on top. These are now all superfluous and
slightly wrong.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..1797368cc83a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,9 +870,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 */
WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW 
set");
 
-   if (!pmd_access_permitted(*pmd, flags & FOLL_WRITE))
-   return NULL;
-
if (pmd_present(*pmd) && pmd_devmap(*pmd))
/* pass */;
else
@@ -1012,9 +1009,6 @@ struct page *follow_devmap_pud(struct vm_area_struct 
*vma, unsigned long addr,
 
assert_spin_locked(pud_lockptr(mm, pud));
 
-   if (!pud_access_permitted(*pud, flags & FOLL_WRITE))
-   return NULL;
-
if (pud_present(*pud) && pud_devmap(*pud))
/* pass */;
else
@@ -1386,7 +1380,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-   return pmd_access_permitted(pmd, WRITE) ||
+   return pmd_write(pmd) ||
   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
 }
 


Re: [PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 12:27:27PM +0100, Peter Zijlstra wrote:
> The gup_*_range() functions which implement __get_user_pages_fast() do
> a p*_access_permitted() test to see if the memory is at all accessible
> (tests both _PAGE_USER|_PAGE_RW as well as architectural things like
> pkeys).
> 
> But the follow_*() functions which implement __get_user_pages() do not
> have this test. Recently, commit:
> 
>   5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + 
> gup paths")
> 
> added it to a few specific write paths, but it failed to consistently
> apply it (I've not audited anything outside of gup).
> 
> Revert the change from that patch and insert the tests in the right
> locations such that they cover all READ / WRITE accesses for all
> pte/pmd/pud levels.
> 
> In particular I care about the _PAGE_USER test, we should not ever,
> allow access to pages not marked with it, but it also makes the pkey
> accesses more consistent.

This should probably go on top. These are now all superfluous and
slightly wrong.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..1797368cc83a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,9 +870,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 */
WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW 
set");
 
-   if (!pmd_access_permitted(*pmd, flags & FOLL_WRITE))
-   return NULL;
-
if (pmd_present(*pmd) && pmd_devmap(*pmd))
/* pass */;
else
@@ -1012,9 +1009,6 @@ struct page *follow_devmap_pud(struct vm_area_struct 
*vma, unsigned long addr,
 
assert_spin_locked(pud_lockptr(mm, pud));
 
-   if (!pud_access_permitted(*pud, flags & FOLL_WRITE))
-   return NULL;
-
if (pud_present(*pud) && pud_devmap(*pud))
/* pass */;
else
@@ -1386,7 +1380,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-   return pmd_access_permitted(pmd, WRITE) ||
+   return pmd_write(pmd) ||
   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
 }
 


[PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
The gup_*_range() functions which implement __get_user_pages_fast() do
a p*_access_permitted() test to see if the memory is at all accessible
(tests both _PAGE_USER|_PAGE_RW as well as architectural things like
pkeys).

But the follow_*() functions which implement __get_user_pages() do not
have this test. Recently, commit:

  5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup 
paths")

added it to a few specific write paths, but it failed to consistently
apply it (I've not audited anything outside of gup).

Revert the change from that patch and insert the tests in the right
locations such that they cover all READ / WRITE accesses for all
pte/pmd/pud levels.

In particular I care about the _PAGE_USER test, we should not ever,
allow access to pages not marked with it, but it also makes the pkey
accesses more consistent.

Signed-off-by: Peter Zijlstra (Intel) 
---
 mm/gup.c |   25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -66,7 +66,7 @@ static int follow_pfn_pte(struct vm_area
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-   return pte_access_permitted(pte, WRITE) ||
+   return pte_write(pte) ||
((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
 }
 
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(stru
}
 
if (flags & FOLL_GET) {
+   if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   goto out;
+   }
+
get_page(page);
 
/* drop the pgmap reference now that we hold the page */
@@ -244,6 +249,15 @@ static struct page *follow_pmd_mask(stru
pmd_migration_entry_wait(mm, pmd);
goto retry;
}
+
+   if (flags & FOLL_GET) {
+   if (!pmd_access_permitted(*pmd, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -326,6 +340,15 @@ static struct page *follow_pud_mask(stru
return page;
return no_page_table(vma, flags);
}
+
+   if (flags & FOLL_GET) {
+   if (!pud_access_permitted(*pud, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags);




[PATCH v2 01/17] mm/gup: Fixup p*_access_permitted()

2017-12-14 Thread Peter Zijlstra
The gup_*_range() functions which implement __get_user_pages_fast() do
a p*_access_permitted() test to see if the memory is at all accessible
(tests both _PAGE_USER|_PAGE_RW as well as architectural things like
pkeys).

But the follow_*() functions which implement __get_user_pages() do not
have this test. Recently, commit:

  5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup 
paths")

added it to a few specific write paths, but it failed to consistently
apply it (I've not audited anything outside of gup).

Revert the change from that patch and insert the tests in the right
locations such that they cover all READ / WRITE accesses for all
pte/pmd/pud levels.

In particular I care about the _PAGE_USER test, we should not ever,
allow access to pages not marked with it, but it also makes the pkey
accesses more consistent.

Signed-off-by: Peter Zijlstra (Intel) 
---
 mm/gup.c |   25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -66,7 +66,7 @@ static int follow_pfn_pte(struct vm_area
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-   return pte_access_permitted(pte, WRITE) ||
+   return pte_write(pte) ||
((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
 }
 
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(stru
}
 
if (flags & FOLL_GET) {
+   if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   goto out;
+   }
+
get_page(page);
 
/* drop the pgmap reference now that we hold the page */
@@ -244,6 +249,15 @@ static struct page *follow_pmd_mask(stru
pmd_migration_entry_wait(mm, pmd);
goto retry;
}
+
+   if (flags & FOLL_GET) {
+   if (!pmd_access_permitted(*pmd, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -326,6 +340,15 @@ static struct page *follow_pud_mask(stru
return page;
return no_page_table(vma, flags);
}
+
+   if (flags & FOLL_GET) {
+   if (!pud_access_permitted(*pud, !!(flags & FOLL_WRITE))) {
+   page = ERR_PTR(-EFAULT);
+   spin_unlock(ptr);
+   return page;
+   }
+   }
+
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags);