Re: Linux 5.1-rc5

2019-05-20 Thread Greg KH
On Thu, May 02, 2019 at 05:10:55PM +0200, Martin Schwidefsky wrote:
> On Thu, 2 May 2019 16:31:10 +0200
> Greg KH  wrote:
> 
> > On Thu, May 02, 2019 at 04:17:58PM +0200, Martin Schwidefsky wrote:
> > > On Thu, 2 May 2019 14:21:28 +0200
> > > Greg KH  wrote:
> > >   
> > > > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote:  
> > > > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig 
> > > > >  wrote:
> > > > > >
> > > > > > Can we please have the page refcount overflow fixes out on the list
> > > > > > for review, even if it is after the fact?
> > > > > 
> > > > > They were actually on a list for review long before the fact, but it
> > > > > was the security mailing list. The issue actually got discussed back
> > > > > in January along with early versions of the patches, but then we
> > > > > dropped the ball because it just wasn't on anybody's radar and it got
> > > > > resurrected late March. Willy wrote a rather bigger patch-series, and
> > > > > review of that is what then resulted in those commits. So they may
> > > > > look recent, but that's just because the original patches got
> > > > > seriously edited down and rewritten.
> > > > > 
> > > > > That said, powerpc and s390 should at least look at maybe adding a
> > > > > check for the page ref in their gup paths too. Powerpc has the special
> > > > > gup_hugepte() case, and s390 has its own version of gup entirely. I
> > > > > was actually hoping the s390 guys would look at using the generic gup
> > > > > code.
> > > > > 
> > > > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
> > > > > largely irrelevant, partly since even theoretically this whole issue
> > > > > needs a _lot_ of memory.
> > > > > 
> > > > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
> > > > > (page ref overflow)"). You may or may not really care.
> > > > 
> > > > I've now queued these patches up for the next round of stable releases,
> > > > as some people seem to care about these.
> > > > 
> > > > I didn't see any follow-on patches for s390 or ppc64 hit the tree for
> > > > these changes, am I just missing them and should also queue up a few
> > > > more to handle this issue on those platforms?  
> > > 
> > > I fixed that with a different approach. The following two patches are
> > > queued for the next merge window:
> > > 
> > > d1874a0c2805 "s390/mm: make the pxd_offset functions more robust"
> > > 1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code"
> > > 
> > > With these two s390 now uses the generic gup code in mm/gup.c  
> > 
> > Nice!  Do you want me to queue those up for the stable backports once
> > they hit a public -rc release?
> 
> Yes please!

Now queued up to 5.0 and 5.1, but did not apply to 4.19 or older :(

thanks,

greg k-h


Re: Linux 5.1-rc5

2019-05-03 Thread Michael Ellerman
Greg KH  writes:
> On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote:
>> On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  
>> wrote:
>> >
>> > Can we please have the page refcount overflow fixes out on the list
>> > for review, even if it is after the fact?
>> 
>> They were actually on a list for review long before the fact, but it
>> was the security mailing list. The issue actually got discussed back
>> in January along with early versions of the patches, but then we
>> dropped the ball because it just wasn't on anybody's radar and it got
>> resurrected late March. Willy wrote a rather bigger patch-series, and
>> review of that is what then resulted in those commits. So they may
>> look recent, but that's just because the original patches got
>> seriously edited down and rewritten.
>> 
>> That said, powerpc and s390 should at least look at maybe adding a
>> check for the page ref in their gup paths too. Powerpc has the special
>> gup_hugepte() case, and s390 has its own version of gup entirely. I
>> was actually hoping the s390 guys would look at using the generic gup
>> code.
>> 
>> I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
>> largely irrelevant, partly since even theoretically this whole issue
>> needs a _lot_ of memory.
>> 
>> Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
>> (page ref overflow)"). You may or may not really care.
>
> I've now queued these patches up for the next round of stable releases,
> as some people seem to care about these.
>
> I didn't see any follow-on patches for s390 or ppc64 hit the tree for
> these changes, am I just missing them and should also queue up a few
> more to handle this issue on those platforms?

No you haven't missed them for powerpc. It's on my list.

cheers


Re: Linux 5.1-rc5

2019-05-02 Thread Christoph Hellwig
On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote:
> I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
> largely irrelevant, partly since even theoretically this whole issue
> needs a _lot_ of memory.

Adding the relevant people - while the might be irrelevant, at least
mips and sparc have some giant memory systems.  And I'd really like
to see the arch-specific GUP implementations to go away for other
reasons, as we have a few issues to sort out with GUP usage now
(we just had discussions at LSF/MM), and the less implementations we
have to deal with the better.


Re: Linux 5.1-rc5

2019-05-02 Thread Martin Schwidefsky
On Thu, 2 May 2019 16:31:10 +0200
Greg KH  wrote:

> On Thu, May 02, 2019 at 04:17:58PM +0200, Martin Schwidefsky wrote:
> > On Thu, 2 May 2019 14:21:28 +0200
> > Greg KH  wrote:
> >   
> > > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote:  
> > > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  
> > > > wrote:
> > > > >
> > > > > Can we please have the page refcount overflow fixes out on the list
> > > > > for review, even if it is after the fact?
> > > > 
> > > > They were actually on a list for review long before the fact, but it
> > > > was the security mailing list. The issue actually got discussed back
> > > > in January along with early versions of the patches, but then we
> > > > dropped the ball because it just wasn't on anybody's radar and it got
> > > > resurrected late March. Willy wrote a rather bigger patch-series, and
> > > > review of that is what then resulted in those commits. So they may
> > > > look recent, but that's just because the original patches got
> > > > seriously edited down and rewritten.
> > > > 
> > > > That said, powerpc and s390 should at least look at maybe adding a
> > > > check for the page ref in their gup paths too. Powerpc has the special
> > > > gup_hugepte() case, and s390 has its own version of gup entirely. I
> > > > was actually hoping the s390 guys would look at using the generic gup
> > > > code.
> > > > 
> > > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
> > > > largely irrelevant, partly since even theoretically this whole issue
> > > > needs a _lot_ of memory.
> > > > 
> > > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
> > > > (page ref overflow)"). You may or may not really care.
> > > 
> > > I've now queued these patches up for the next round of stable releases,
> > > as some people seem to care about these.
> > > 
> > > I didn't see any follow-on patches for s390 or ppc64 hit the tree for
> > > these changes, am I just missing them and should also queue up a few
> > > more to handle this issue on those platforms?  
> > 
> > I fixed that with a different approach. The following two patches are
> > queued for the next merge window:
> > 
> > d1874a0c2805 "s390/mm: make the pxd_offset functions more robust"
> > 1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code"
> > 
> > With these two s390 now uses the generic gup code in mm/gup.c  
> 
> Nice!  Do you want me to queue those up for the stable backports once
> they hit a public -rc release?

Yes please!

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-05-02 Thread Greg KH
On Thu, May 02, 2019 at 04:17:58PM +0200, Martin Schwidefsky wrote:
> On Thu, 2 May 2019 14:21:28 +0200
> Greg KH  wrote:
> 
> > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote:
> > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  
> > > wrote:  
> > > >
> > > > Can we please have the page refcount overflow fixes out on the list
> > > > for review, even if it is after the fact?  
> > > 
> > > They were actually on a list for review long before the fact, but it
> > > was the security mailing list. The issue actually got discussed back
> > > in January along with early versions of the patches, but then we
> > > dropped the ball because it just wasn't on anybody's radar and it got
> > > resurrected late March. Willy wrote a rather bigger patch-series, and
> > > review of that is what then resulted in those commits. So they may
> > > look recent, but that's just because the original patches got
> > > seriously edited down and rewritten.
> > > 
> > > That said, powerpc and s390 should at least look at maybe adding a
> > > check for the page ref in their gup paths too. Powerpc has the special
> > > gup_hugepte() case, and s390 has its own version of gup entirely. I
> > > was actually hoping the s390 guys would look at using the generic gup
> > > code.
> > > 
> > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
> > > largely irrelevant, partly since even theoretically this whole issue
> > > needs a _lot_ of memory.
> > > 
> > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
> > > (page ref overflow)"). You may or may not really care.  
> > 
> > I've now queued these patches up for the next round of stable releases,
> > as some people seem to care about these.
> > 
> > I didn't see any follow-on patches for s390 or ppc64 hit the tree for
> > these changes, am I just missing them and should also queue up a few
> > more to handle this issue on those platforms?
> 
> I fixed that with a different approach. The following two patches are
> queued for the next merge window:
> 
> d1874a0c2805 "s390/mm: make the pxd_offset functions more robust"
> 1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code"
> 
> With these two s390 now uses the generic gup code in mm/gup.c

Nice!  Do you want me to queue those up for the stable backports once
they hit a public -rc release?

thanks,

greg k-h


Re: Linux 5.1-rc5

2019-05-02 Thread Martin Schwidefsky
On Thu, 2 May 2019 14:21:28 +0200
Greg KH  wrote:

> On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote:
> > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  
> > wrote:  
> > >
> > > Can we please have the page refcount overflow fixes out on the list
> > > for review, even if it is after the fact?  
> > 
> > They were actually on a list for review long before the fact, but it
> > was the security mailing list. The issue actually got discussed back
> > in January along with early versions of the patches, but then we
> > dropped the ball because it just wasn't on anybody's radar and it got
> > resurrected late March. Willy wrote a rather bigger patch-series, and
> > review of that is what then resulted in those commits. So they may
> > look recent, but that's just because the original patches got
> > seriously edited down and rewritten.
> > 
> > That said, powerpc and s390 should at least look at maybe adding a
> > check for the page ref in their gup paths too. Powerpc has the special
> > gup_hugepte() case, and s390 has its own version of gup entirely. I
> > was actually hoping the s390 guys would look at using the generic gup
> > code.
> > 
> > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
> > largely irrelevant, partly since even theoretically this whole issue
> > needs a _lot_ of memory.
> > 
> > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
> > (page ref overflow)"). You may or may not really care.  
> 
> I've now queued these patches up for the next round of stable releases,
> as some people seem to care about these.
> 
> I didn't see any follow-on patches for s390 or ppc64 hit the tree for
> these changes, am I just missing them and should also queue up a few
> more to handle this issue on those platforms?

I fixed that with a different approach. The following two patches are
queued for the next merge window:

d1874a0c2805 "s390/mm: make the pxd_offset functions more robust"
1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code"

With these two s390 now uses the generic gup code in mm/gup.c

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-05-02 Thread Greg KH
On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote:
> On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  wrote:
> >
> > Can we please have the page refcount overflow fixes out on the list
> > for review, even if it is after the fact?
> 
> They were actually on a list for review long before the fact, but it
> was the security mailing list. The issue actually got discussed back
> in January along with early versions of the patches, but then we
> dropped the ball because it just wasn't on anybody's radar and it got
> resurrected late March. Willy wrote a rather bigger patch-series, and
> review of that is what then resulted in those commits. So they may
> look recent, but that's just because the original patches got
> seriously edited down and rewritten.
> 
> That said, powerpc and s390 should at least look at maybe adding a
> check for the page ref in their gup paths too. Powerpc has the special
> gup_hugepte() case, and s390 has its own version of gup entirely. I
> was actually hoping the s390 guys would look at using the generic gup
> code.
> 
> I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
> largely irrelevant, partly since even theoretically this whole issue
> needs a _lot_ of memory.
> 
> Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
> (page ref overflow)"). You may or may not really care.

I've now queued these patches up for the next round of stable releases,
as some people seem to care about these.

I didn't see any follow-on patches for s390 or ppc64 hit the tree for
these changes, am I just missing them and should also queue up a few
more to handle this issue on those platforms?

thanks,

greg k-h


Re: Linux 5.1-rc5

2019-04-23 Thread Linus Torvalds
On Tue, Apr 23, 2019 at 8:39 AM Martin Schwidefsky
 wrote:
>
> Ok, I added two patches for my s390/linux:features branch
>
> Martin Schwidefsky (2):
>   s390/mm: make the pxd_offset functions more robust
>   s390/mm: convert to the generic get_user_pages_fast code
>
> All code changes are inside arch/s390, I plan to include these patches with
> the next merge window. That gives us a little bit of time to run our tests.

Sounds good. Thanks for looking into this all.

Now I slightly wonder about all the other random architectures that
don't use the HAVE_GENERIC_GUP config option, but at least we'll have
all of arm, powerpc, x86 and s390 using the generic code..

Linus


Re: Linux 5.1-rc5

2019-04-23 Thread Martin Schwidefsky
On Fri, 19 Apr 2019 10:27:17 -0700
Linus Torvalds  wrote:

> On Fri, Apr 19, 2019 at 6:33 AM Martin Schwidefsky
>  wrote:
> >
> > That problem got stuck in my head and I thought more about it. Why not
> > emulate the static folding sequence in the s390 page table code?  
> 
> So this model seems much closer to what x86 does in its folding, where
> the pattern is basically
> 
> > static inline pX-1d_t *pXd_offset(pXd_t *pXd, unsigned long address)
> > {
> > if (pXd_folded(pXd)
> > return (pX-1d_t *) pXd;
> > return (pX-1d_t *) pXd_deref(*pXd) + pXd_index(address);
> > }  
> 
> which is really how the code is designed to work (ie the folded entry
> doesn't actually do anything to the page directory pointer, it just
> says "ok, we'll use this exact page directory pointer for the next
> lower level instead".
> 
> And that's very much what allows the generic gup code to load the
> entry once, and use a temporary, and as you walk down the chain, if it
> is folded it just then uses that (previous) temporary value for the
> next level instead. IOW, the lower level page table is hidden inside
> the upper level one, and folding just means "don't do any offsets,
> don't change any values, just use the entry as-is for the next lower
> level".
> 
> So I think that's the right thing to do.

Ok, I added two patches for my s390/linux:features branch

Martin Schwidefsky (2):
  s390/mm: make the pxd_offset functions more robust
  s390/mm: convert to the generic get_user_pages_fast code

All code changes are inside arch/s390, I plan to include these patches with
the next merge window. That gives us a little bit of time to run our tests.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-19 Thread Linus Torvalds
On Fri, Apr 19, 2019 at 6:33 AM Martin Schwidefsky
 wrote:
>
> That problem got stuck in my head and I thought more about it. Why not
> emulate the static folding sequence in the s390 page table code?

So this model seems much closer to what x86 does in its folding, where
the pattern is basically

> static inline pX-1d_t *pXd_offset(pXd_t *pXd, unsigned long address)
> {
> if (pXd_folded(pXd)
> return (pX-1d_t *) pXd;
> return (pX-1d_t *) pXd_deref(*pXd) + pXd_index(address);
> }

which is really how the code is designed to work (ie the folded entry
doesn't actually do anything to the page directory pointer, it just
says "ok, we'll use this exact page directory pointer for the next
lower level instead".

And that's very much what allows the generic gup code to load the
entry once, and use a temporary, and as you walk down the chain, if it
is folded it just then uses that (previous) temporary value for the
next level instead. IOW, the lower level page table is hidden inside
the upper level one, and folding just means "don't do any offsets,
don't change any values, just use the entry as-is for the next lower
level".

So I think that's the right thing to do.

Looking at the s390 code, it seems to fold things the other way,
conceptually hiding the upper level inside the lower one, and always
doing the offset thing (but just avoiding the dereference).

Maybe there's some reason why the s390 code does it that way, but I
think your new model is the right one, and hopefully means you can use
the generic page table walking more easily.

Of course, the s390 folding is very different from the x86 one (or the
generic fixed 3-level of 4-level cases). The x86 folding doesn't
depend on the contents of the page tables, it's just entirely static
(well, the 5th level is conditional, but it's conditional on a static
key, not on what is in the page tables). So maybe the old model of
s390 made more sense in that context, but I look at your new suggested
pXd_offset() functions and I go "yeah, that's the way it's supposed to
work".

 Linus


Re: Linux 5.1-rc5

2019-04-19 Thread Martin Schwidefsky
On Thu, 18 Apr 2019 20:41:44 +0200
Martin Schwidefsky  wrote:

> On Thu, 18 Apr 2019 08:49:32 -0700
> Linus Torvalds  wrote:
> 
> > On Thu, Apr 18, 2019 at 1:02 AM Martin Schwidefsky
> >  wrote:  
> > >
> > > The problematic lines in the generic gup code are these three:
> > >
> > > 1845:   pmdp = pmd_offset(, addr);
> > > 1888:   pudp = pud_offset(, addr);
> > > 1916:   p4dp = p4d_offset(, addr);
> > >
> > > Passing the pointer of a *copy* of a page table entry to pxd_offset() does
> > > not work with the page table folding on s390.
> > 
> > Hmm. I wonder why. x86 too does the folding thing for the p4d and pud case.
> > 
> > The folding works with the local copy just the same way it works with
> > the orignal value.  
> 
> The difference is that with the static page table folding pgd_offset()
> does the index calculation of the actual hardware top-level table. With
> dynamic page table folding as s390 is doing it, if the task does not use
> a 5-level page table pgd_offset() will see a pgd_index() of 0, the indexing
> of the actual top-level table is done later with p4d_offset(), pud_offset()
> or pmd_offset(). 
> 
> As an example, with a three level page table we have three indexes x/y/z.
> The common code "thinks" 5 indexing steps, with static folding the index
> sequence is x 0 0 y z. With dynamic folding the sequence is 0 0 x y z.
> By moving the first indexing operation to pgd_offset the static sequence
> does not add an index to a non-dereferenced pointer to a stack variable,
> the dynamic sequence does.

That problem got stuck in my head and I thought more about it. Why not
emulate the static folding sequence in the s390 page table code?

As the table type is encoded in every entry for the region and segment
tables, pgd_offset() can look at the first entry to find the table type
and then do the correct index calculation for the given top-level table.
Like this:

static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
{
unsigned long rste;
unsigned int shift;

/* Get the first entry of the top level table */
rste = pgd_val(*pgd);
/* Pick up the shift from the table type of the first entry */
shift = ((rste & _REGION_ENTRY_TYPE_MASK) >> 2) * 11 + 20;
return pgd + ((address >> shift) & (PTRS_PER_PGD - 1));
}

#define pgd_offset(mm, address) pgd_offset_raw((mm)->pgd, address)
#define pgd_offset_k(address) pgd_offset(_mm, address)

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R1)
return (p4d_t *) pgd;
return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
}

static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
{
if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R2)
return (pud_t *) p4d;
return (pud_t *) p4d_deref(*p4d) + pud_index(address);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R3)
return (pmd_t *) pud;
return (pmd_t *) pud_deref(*pud) + pmd_index(address);
}

This needs more thorough testing but in principle it does work. The kernel
boots and survives a kernel compile. The only things that is slightly off is
that pgd_offset() now has to look at the first table entry to do its job.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-18 Thread Martin Schwidefsky
On Thu, 18 Apr 2019 08:49:32 -0700
Linus Torvalds  wrote:

> On Thu, Apr 18, 2019 at 1:02 AM Martin Schwidefsky
>  wrote:
> >
> > The problematic lines in the generic gup code are these three:
> >
> > 1845:   pmdp = pmd_offset(, addr);
> > 1888:   pudp = pud_offset(, addr);
> > 1916:   p4dp = p4d_offset(, addr);
> >
> > Passing the pointer of a *copy* of a page table entry to pxd_offset() does
> > not work with the page table folding on s390.  
> 
> Hmm. I wonder why. x86 too does the folding thing for the p4d and pud case.
> 
> The folding works with the local copy just the same way it works with
> the orignal value.

The difference is that with the static page table folding pgd_offset()
does the index calculation of the actual hardware top-level table. With
dynamic page table folding as s390 is doing it, if the task does not use
a 5-level page table pgd_offset() will see a pgd_index() of 0, the indexing
of the actual top-level table is done later with p4d_offset(), pud_offset()
or pmd_offset(). 

As an example, with a three level page table we have three indexes x/y/z.
The common code "thinks" 5 indexing steps, with static folding the index
sequence is x 0 0 y z. With dynamic folding the sequence is 0 0 x y z.
By moving the first indexing operation to pgd_offset the static sequence
does not add an index to a non-dereferenced pointer to a stack variable,
the dynamic sequence does.

> But I see that s390 does some other kind of folding and does that
> addition of the p*d_index() unconditionally.
> 
> I guess that does mean that s390 will just have to have its own walker.
> 
> For the issue of the page refcount overflow it really isn't a huge
> deal. Adding the refcount checking is simple (see the example patch I
> gave for powerpc - you'll just have a couple of extra cases since you
> do it all, rather than just the special hugetlb cases).
> 
> Obviously in general it would have been nicer to share as much code as
> possible, but let's not make things unnecessarily complex if s390 is
> just fundamentally different..

It would have been nice to use the generic code (less bugs) but not at
the price of over-complicating things. And that page table folding thing
always makes my head hurt.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-18 Thread Linus Torvalds
On Thu, Apr 18, 2019 at 1:02 AM Martin Schwidefsky
 wrote:
>
> The problematic lines in the generic gup code are these three:
>
> 1845:   pmdp = pmd_offset(, addr);
> 1888:   pudp = pud_offset(, addr);
> 1916:   p4dp = p4d_offset(, addr);
>
> Passing the pointer of a *copy* of a page table entry to pxd_offset() does
> not work with the page table folding on s390.

Hmm. I wonder why. x86 too does the folding thing for the p4d and pud case.

The folding works with the local copy just the same way it works with
the orignal value.

But I see that s390 does some other kind of folding and does that
addition of the p*d_index() unconditionally.

I guess that does mean that s390 will just have to have its own walker.

For the issue of the page refcount overflow it really isn't a huge
deal. Adding the refcount checking is simple (see the example patch I
gave for powerpc - you'll just have a couple of extra cases since you
do it all, rather than just the special hugetlb cases).

Obviously in general it would have been nicer to share as much code as
possible, but let's not make things unnecessarily complex if s390 is
just fundamentally different..

   Linus


Re: Linux 5.1-rc5

2019-04-18 Thread Martin Schwidefsky
On Wed, 17 Apr 2019 09:57:01 -0700
Linus Torvalds  wrote:

> On Wed, Apr 17, 2019 at 1:02 AM Martin Schwidefsky
>  wrote:
> >
> > Grumpf, that does *not* work. For gup the table entries may be read only
> > once. Now I remember why I open-coded p4d_offset, pud_offset and pmd_offset
> > in arch/s390/mm/gup.c, to avoid to read the table entries twice.
> > It will be hard to use the common gup code after all.  
> 
> Hmm. The common gup code generally should do the "read only once"
> thing too (since by definition the gup-fast case is done without
> locking), although it's probably the case that most architectures
> simply don't care.
> 
> What would it require for the generic code to work for s390?

The problematic lines in the generic gup code are these three:

1845:   pmdp = pmd_offset(, addr);
1888:   pudp = pud_offset(, addr);
1916:   p4dp = p4d_offset(, addr);

Passing the pointer of a *copy* of a page table entry to pxd_offset() does
not work with the page table folding on s390. The pxd_offset() function
on s390 have to make a choice, either return the dereferenced value behind
the passed pointer (that works) or return the original page table pointer
if the table level is folded (that does not work).

To fix this we would need three new helpers pmd_offset_orig, pud_offset_orig
and p4d_offset_orig, their generic definition would look like this:

#define p4d_offset_orig(pgdp, pgd, address)p4d_offset(, address)
#define pud_offset_orig(p4dp, p4d, address)pud_offset(, address)
#define pmd_offset_orig(pudp, pud, address)pmd_offset(, address)

For the s390 definition see the following branch:

git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git generic-gup

A quick test with this branch shows everything working normally.
Keeping my fingers crossed that I did not miss anything.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-17 Thread Linus Torvalds
On Wed, Apr 17, 2019 at 1:02 AM Martin Schwidefsky
 wrote:
>
> Grumpf, that does *not* work. For gup the table entries may be read only
> once. Now I remember why I open-coded p4d_offset, pud_offset and pmd_offset
> in arch/s390/mm/gup.c, to avoid to read the table entries twice.
> It will be hard to use the common gup code after all.

Hmm. The common gup code generally should do the "read only once"
thing too (since by definition the gup-fast case is done without
locking), although it's probably the case that most architectures
simply don't care.

What would it require for the generic code to work for s390?

Linus


Re: Linux 5.1-rc5

2019-04-17 Thread Martin Schwidefsky
On Wed, 17 Apr 2019 09:46:37 +0200
Martin Schwidefsky  wrote:

> On Tue, 16 Apr 2019 09:49:46 -0700
> Linus Torvalds  wrote:
> 
> > On Tue, Apr 16, 2019 at 9:16 AM Linus Torvalds
> >  wrote:  
> > >
> > > We actually already *have* this function.
> > >
> > > It's called "gup_fast_permitted()" and it's used by x86-64 to verify
> > > the proper address range. Exactly like s390 needs..
> > >
> > > Could you please use that instead?
> > 
> > IOW, something like the attached.
> > 
> > Obviously untested. And maybe 'current' isn't declared in
> > , in which case you'd need to modify it to instead make
> > the inline function be "s390_gup_fast_permitted()" that takes a
> > pointer to the mm, and do something like
> > 
> >   #define gup_fast_permitted(start, pages) \
> >  s390_gup_fast_permitted(current->mm, start, pages)
> > 
> > instead.
> > 
> > But I think you get the idea..  
> 
> Nice, I did not realize that gup_fast_permitted is a platform
> override-able function. So that part is doable in arch/s390. But I
> spoke to soon, I got my first crash and realized that the common gup code
> is not usable as it is. The reason is this e.g. this sequence:
> 
>   pgdp = pgd_offset(current->mm, addr);
>   pgd_t pgd = READ_ONCE(*pgdp);
>   /* some checking on pgd */
>   gup_p4d_range(pgd, addr, next, write, pages, nr);
> 
>   p4dp = p4d_offset(, addr);
>   p4d_t p4d = READ_ONCE(*p4dp);
>   /* some checking on p4d */
>   gup_pud_range(p4d, addr, next, write, pages, nr);
> 
>   pudp = pud_offset(, addr);
>   pud_t pud = READ_ONCE(*pudp);
>   /* some checking on pud */
>   gup_pmd_range(pud, addr, next, write, pages, nr;
> 
> Each step along the way will read the page table entry and pass the
> table entry to the next function. This clashes with the page table
> folding on s390. The s390 gup code looks more like this:
> 
>   pgdp = pgd_offset(current->mm, addr);
>   /* some checking on pgd */
>   pgd_t pgd = READ_ONCE(*pgdp);
>   gup_p4d_range(pgdp, pgd, addr, next, write, pages, );
> 
>   p4dp = p4d_offset(pgdp, addr);
>   p4d_t p4d = READ_ONCE(*p4dp);
>   /* some checking on p4d */
>   gup_pud_range(p4dp, p4d, addr, next, write, pages, nr);
> 
>   pudp = pud_offset(p4dp, addr);
>   pud_t pud = READ_ONCE(*pudp);
>   /* some checking on pud */
>   gup_pmd_range(pudp, pud, addr, next, write, pages, nr;
> 
> There are magic dereferences in the s390 versions of p4d_offset,
> pud_offset and pmd_offset functions. To make this work the pointer
> passed to these functions may not be the local copy of the already
> dereferenced table entry. I'll cook up a patch for the common code.

Grumpf, that does *not* work. For gup the table entries may be read only
once. Now I remember why I open-coded p4d_offset, pud_offset and pmd_offset
in arch/s390/mm/gup.c, to avoid to read the table entries twice.
It will be hard to use the common gup code after all.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-17 Thread Martin Schwidefsky
On Tue, 16 Apr 2019 09:49:46 -0700
Linus Torvalds  wrote:

> On Tue, Apr 16, 2019 at 9:16 AM Linus Torvalds
>  wrote:
> >
> > We actually already *have* this function.
> >
> > It's called "gup_fast_permitted()" and it's used by x86-64 to verify
> > the proper address range. Exactly like s390 needs..
> >
> > Could you please use that instead?  
> 
> IOW, something like the attached.
> 
> Obviously untested. And maybe 'current' isn't declared in
> , in which case you'd need to modify it to instead make
> the inline function be "s390_gup_fast_permitted()" that takes a
> pointer to the mm, and do something like
> 
>   #define gup_fast_permitted(start, pages) \
>  s390_gup_fast_permitted(current->mm, start, pages)
> 
> instead.
> 
> But I think you get the idea..

Nice, I did not realize that gup_fast_permitted is a platform
override-able function. So that part is doable in arch/s390. But I
spoke to soon, I got my first crash and realized that the common gup code
is not usable as it is. The reason is this e.g. this sequence:

pgdp = pgd_offset(current->mm, addr);
pgd_t pgd = READ_ONCE(*pgdp);
/* some checking on pgd */
gup_p4d_range(pgd, addr, next, write, pages, nr);

p4dp = p4d_offset(, addr);
p4d_t p4d = READ_ONCE(*p4dp);
/* some checking on p4d */
gup_pud_range(p4d, addr, next, write, pages, nr);

pudp = pud_offset(, addr);
pud_t pud = READ_ONCE(*pudp);
/* some checking on pud */
gup_pmd_range(pud, addr, next, write, pages, nr;

Each step along the way will read the page table entry and pass the
table entry to the next function. This clashes with the page table
folding on s390. The s390 gup code looks more like this:

pgdp = pgd_offset(current->mm, addr);
/* some checking on pgd */
pgd_t pgd = READ_ONCE(*pgdp);
gup_p4d_range(pgdp, pgd, addr, next, write, pages, );

p4dp = p4d_offset(pgdp, addr);
p4d_t p4d = READ_ONCE(*p4dp);
/* some checking on p4d */
gup_pud_range(p4dp, p4d, addr, next, write, pages, nr);

pudp = pud_offset(p4dp, addr);
pud_t pud = READ_ONCE(*pudp);
/* some checking on pud */
gup_pmd_range(pudp, pud, addr, next, write, pages, nr;

There are magic dereferences in the s390 versions of p4d_offset,
pud_offset and pmd_offset functions. To make this work the pointer
passed to these functions may not be the local copy of the already
dereferenced table entry. I'll cook up a patch for the common code.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-16 Thread Linus Torvalds
On Tue, Apr 16, 2019 at 8:38 PM Michael Ellerman  wrote:
>
> > That said, powerpc and s390 should at least look at maybe adding a
> > check for the page ref in their gup paths too. Powerpc has the special
> > gup_hugepte() case
>
> Which uses page_cache_add_speculative(), which handles the case of the
> refcount being zero but not overflow. So that looks like it needs
> fixing.

Note that unlike the zero check, the "too many refs" check does _not_
need to be atomic.

Because it's not a correctness issue right at some magical exact
point, it's a much more ambiguous a "the refcount is now so large that
I'm not going to do GUP on this page any more". Being off by a number
of pages in case there's a race is just fine.

So you could do something like this (TOTALLY UNTESTED, and
whitespace-damaged on purpose - I don't want you to apply it blindly)
appended patch.

> And we have a few uses of bare get_page() in KVM code which might be
> subject to the same attack.

Note that you really have to have not just a get_page(), but some way
of lining up *billions* of them. Which really tends to be pretty hard.

Linus



diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9e732bb2c84a..52db7ff7c756 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -523,7 +523,8 @@ struct page *follow_huge_pd(struct vm_area_struct *vma,
page = pte_page(*ptep);
page += ((address & mask) >> PAGE_SHIFT);
if (flags & FOLL_GET)
-   get_page(page);
+   if (!try_get_page(page))
+   page = NULL;
} else {
if (is_hugetlb_entry_migration(*ptep)) {
spin_unlock(ptl);
@@ -883,6 +884,8 @@ int gup_hugepte(pte_t *ptep, unsigned long sz,
unsigned long addr,

refs = 0;
head = pte_page(pte);
+   if (page_ref_count(head) < 0)
+   return 0;

page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
do {


Re: Linux 5.1-rc5

2019-04-16 Thread Michael Ellerman
[ Cc += Nick & Aneesh & Paul ]

Linus Torvalds  writes:
> On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  wrote:
>>
>> Can we please have the page refcount overflow fixes out on the list
>> for review, even if it is after the fact?
>
> They were actually on a list for review long before the fact, but it
> was the security mailing list. The issue actually got discussed back
> in January along with early versions of the patches, but then we
> dropped the ball because it just wasn't on anybody's radar and it got
> resurrected late March. Willy wrote a rather bigger patch-series, and
> review of that is what then resulted in those commits. So they may
> look recent, but that's just because the original patches got
> seriously edited down and rewritten.
>
> That said, powerpc and s390 should at least look at maybe adding a
> check for the page ref in their gup paths too. Powerpc has the special
> gup_hugepte() case

Which uses page_cache_add_speculative(), which handles the case of the
refcount being zero but not overflow. So that looks like it needs
fixing.

We also have follow_huge_pd() that should use try_get_page().

And we have a few uses of bare get_page() in KVM code which might be
subject to the same attack.

cheers


Re: Linux 5.1-rc5

2019-04-16 Thread Linus Torvalds
On Tue, Apr 16, 2019 at 9:16 AM Linus Torvalds
 wrote:
>
> We actually already *have* this function.
>
> It's called "gup_fast_permitted()" and it's used by x86-64 to verify
> the proper address range. Exactly like s390 needs..
>
> Could you please use that instead?

IOW, something like the attached.

Obviously untested. And maybe 'current' isn't declared in
, in which case you'd need to modify it to instead make
the inline function be "s390_gup_fast_permitted()" that takes a
pointer to the mm, and do something like

  #define gup_fast_permitted(start, pages) \
 s390_gup_fast_permitted(current->mm, start, pages)

instead.

But I think you get the idea..

Linus
 arch/s390/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 76dc344edb8c..a08248995f50 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1659,4 +1659,16 @@ static inline void check_pgt_cache(void) { }
 
 #include 
 
+static inline bool gup_fast_permitted(unsigned long start, int nr_pages)
+{
+	unsigned long len, end;
+
+	len = (unsigned long)nr_pages << PAGE_SHIFT;
+	end = start + len;
+	if (end < start)
+		return false;
+	return end <= current->mm->context.asce_limit;
+}
+#define gup_fast_permitted gup_fast_permitted
+
 #endif /* _S390_PAGE_H */


Re: Linux 5.1-rc5

2019-04-16 Thread Linus Torvalds
On Tue, Apr 16, 2019 at 5:08 AM Martin Schwidefsky
 wrote:
>
> This is not nice, would a patch like the following be acceptable?

Umm.

We actually already *have* this function.

It's called "gup_fast_permitted()" and it's used by x86-64 to verify
the proper address range. Exactly like s390 needs..

Could you please use that instead?

Linus


Re: Linux 5.1-rc5

2019-04-16 Thread Martin Schwidefsky
On Tue, 16 Apr 2019 11:09:06 +0200
Martin Schwidefsky  wrote:

> On Mon, 15 Apr 2019 09:17:10 -0700
> Linus Torvalds  wrote:
> 
> > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  
> > wrote:  
> > >
> > > Can we please have the page refcount overflow fixes out on the list
> > > for review, even if it is after the fact?
> > 
> > They were actually on a list for review long before the fact, but it
> > was the security mailing list. The issue actually got discussed back
> > in January along with early versions of the patches, but then we
> > dropped the ball because it just wasn't on anybody's radar and it got
> > resurrected late March. Willy wrote a rather bigger patch-series, and
> > review of that is what then resulted in those commits. So they may
> > look recent, but that's just because the original patches got
> > seriously edited down and rewritten.  
> 
> First time I hear about this, thanks for the heads up.
>  
> > That said, powerpc and s390 should at least look at maybe adding a
> > check for the page ref in their gup paths too. Powerpc has the special
> > gup_hugepte() case, and s390 has its own version of gup entirely. I
> > was actually hoping the s390 guys would look at using the generic gup
> > code.  
> 
> We did look at converting the s390 gup code to CONFIG_HAVE_GENERIC_GUP,
> there are some details that need careful consideration. The top one
> is access_ok(), for s390 we always return true. The generic gup code
> relies on the fact that a page table walk with a specific address is
> doable if access_ok() returned true, the s390 specific check is slightly
> different:
> 
> if ((end <= start) || (end > mm->context.asce_limit))
> return 0;
> 
> The obvious approach would be to modify access_ok() to check against
> the asce_limit. I will try and see if anything breaks, e.g. the automatic
> page table upgrade.

I tested the waters in regard to access_ok() and the generic gup code.
The good news is that mm/gup.c with CONFIG_HAVE_GENERIC_GUP=y seems to
work just fine if the access_ok() issue is taken care of. But..

Bloat-o-meter with a non-empty uaccess_ok() that checks against
current->mm->context.asce_limit:

add/remove: 8/2 grow/shrink: 611/11 up/down: 61352/-1914 (59438)

with CONFIG_HAVE_GENERIC_GUP on top of that

add/remove: 10/2 grow/shrink: 612/12 up/down: 63568/-3280 (60288)

This is not nice, would a patch like the following be acceptable?
--
Subject: [PATCH] mm: introduce mm_pgd_walk_ok

Add the architecture overrideable function mm_pgd_walk_ok() to check
if a block of memory is inside the limits of the page table hierarchy
of a given mm struct.

Signed-off-by: Martin Schwidefsky 
---
 include/asm-generic/pgtable.h | 4 
 mm/gup.c  | 4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fa782fba51ee..7d2a8a58f1c1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1186,4 +1186,8 @@ static inline bool arch_has_pfn_modify_check(void)
 #define mm_pmd_folded(mm)  __is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef mm_pgd_walk_ok
+#define mm_pgd_walk_ok(mm, addr, size) access_ok(addr, size)
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff --git a/mm/gup.c b/mm/gup.c
index 91819b8ad9cc..b3eb3f45d237 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1990,7 +1990,7 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;
 
-   if (unlikely(!access_ok((void __user *)start, len)))
+   if (unlikely(!mm_pgd_walk_ok(current->mm, (void __user *)start, len)))
return 0;
 
/*
@@ -2044,7 +2044,7 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
if (nr_pages <= 0)
return 0;
 
-   if (unlikely(!access_ok((void __user *)start, len)))
+   if (unlikely(!mm_pgd_walk_ok(current->mm, (void __user *)start, len)))
return -EFAULT;
 
if (gup_fast_permitted(start, nr_pages)) {
-- 
2.16.4

With an empty access_ok() but a "real" mm_pgd_walk_ok() the results are
much more reasonable:

add/remove: 2/0 grow/shrink: 2/1 up/down: 2186/-1382 (804)

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-16 Thread Martin Schwidefsky
On Mon, 15 Apr 2019 09:17:10 -0700
Linus Torvalds  wrote:

> On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  wrote:
> >
> > Can we please have the page refcount overflow fixes out on the list
> > for review, even if it is after the fact?  
> 
> They were actually on a list for review long before the fact, but it
> was the security mailing list. The issue actually got discussed back
> in January along with early versions of the patches, but then we
> dropped the ball because it just wasn't on anybody's radar and it got
> resurrected late March. Willy wrote a rather bigger patch-series, and
> review of that is what then resulted in those commits. So they may
> look recent, but that's just because the original patches got
> seriously edited down and rewritten.

First time I hear about this, thanks for the heads up.
 
> That said, powerpc and s390 should at least look at maybe adding a
> check for the page ref in their gup paths too. Powerpc has the special
> gup_hugepte() case, and s390 has its own version of gup entirely. I
> was actually hoping the s390 guys would look at using the generic gup
> code.

We did look at converting the s390 gup code to CONFIG_HAVE_GENERIC_GUP,
there are some details that need careful consideration. The top one
is access_ok(), for s390 we always return true. The generic gup code
relies on the fact that a page table walk with a specific address is
doable if access_ok() returned true, the s390 specific check is slightly
different:

if ((end <= start) || (end > mm->context.asce_limit))
return 0;

The obvious approach would be to modify access_ok() to check against
the asce_limit. I will try and see if anything breaks, e.g. the automatic
page table upgrade.

> I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
> largely irrelevant, partly since even theoretically this whole issue
> needs a _lot_ of memory.
> 
> Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
> (page ref overflow)"). You may or may not really care.

On s390 we can have up to 16TB of memory in a single LPAR. So yes, I do
care about it.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: Linux 5.1-rc5

2019-04-15 Thread Linus Torvalds
On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig  wrote:
>
> Can we please have the page refcount overflow fixes out on the list
> for review, even if it is after the fact?

They were actually on a list for review long before the fact, but it
was the security mailing list. The issue actually got discussed back
in January along with early versions of the patches, but then we
dropped the ball because it just wasn't on anybody's radar and it got
resurrected late March. Willy wrote a rather bigger patch-series, and
review of that is what then resulted in those commits. So they may
look recent, but that's just because the original patches got
seriously edited down and rewritten.

That said, powerpc and s390 should at least look at maybe adding a
check for the page ref in their gup paths too. Powerpc has the special
gup_hugepte() case, and s390 has its own version of gup entirely. I
was actually hoping the s390 guys would look at using the generic gup
code.

I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem
largely irrelevant, partly since even theoretically this whole issue
needs a _lot_ of memory.

Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs'
(page ref overflow)"). You may or may not really care.

 Linus