Re: Linux 5.1-rc5
On Thu, May 02, 2019 at 05:10:55PM +0200, Martin Schwidefsky wrote: > On Thu, 2 May 2019 16:31:10 +0200 > Greg KH wrote: > > > On Thu, May 02, 2019 at 04:17:58PM +0200, Martin Schwidefsky wrote: > > > On Thu, 2 May 2019 14:21:28 +0200 > > > Greg KH wrote: > > > > > > > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote: > > > > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig > > > > > wrote: > > > > > > > > > > > > Can we please have the page refcount overflow fixes out on the list > > > > > > for review, even if it is after the fact? > > > > > > > > > > They were actually on a list for review long before the fact, but it > > > > > was the security mailing list. The issue actually got discussed back > > > > > in January along with early versions of the patches, but then we > > > > > dropped the ball because it just wasn't on anybody's radar and it got > > > > > resurrected late March. Willy wrote a rather bigger patch-series, and > > > > > review of that is what then resulted in those commits. So they may > > > > > look recent, but that's just because the original patches got > > > > > seriously edited down and rewritten. > > > > > > > > > > That said, powerpc and s390 should at least look at maybe adding a > > > > > check for the page ref in their gup paths too. Powerpc has the special > > > > > gup_hugepte() case, and s390 has its own version of gup entirely. I > > > > > was actually hoping the s390 guys would look at using the generic gup > > > > > code. > > > > > > > > > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem > > > > > largely irrelevant, partly since even theoretically this whole issue > > > > > needs a _lot_ of memory. > > > > > > > > > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' > > > > > (page ref overflow)"). You may or may not really care. > > > > > > > > I've now queued these patches up for the next round of stable releases, > > > > as some people seem to care about these. > > > > > > > > I didn't see any follow-on patches for s390 or ppc64 hit the tree for > > > > these changes, am I just missing them and should also queue up a few > > > > more to handle this issue on those platforms? > > > > > > I fixed that with a different approach. The following two patches are > > > queued for the next merge window: > > > > > > d1874a0c2805 "s390/mm: make the pxd_offset functions more robust" > > > 1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code" > > > > > > With these two s390 now uses the generic gup code in mm/gup.c > > > > Nice! Do you want me to queue those up for the stable backports once > > they hit a public -rc release? > > Yes please! Now queued up to 5.0 and 5.1, but did not apply to 4.19 or older :( thanks, greg k-h
Re: Linux 5.1-rc5
Greg KH writes: > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote: >> On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig >> wrote: >> > >> > Can we please have the page refcount overflow fixes out on the list >> > for review, even if it is after the fact? >> >> They were actually on a list for review long before the fact, but it >> was the security mailing list. The issue actually got discussed back >> in January along with early versions of the patches, but then we >> dropped the ball because it just wasn't on anybody's radar and it got >> resurrected late March. Willy wrote a rather bigger patch-series, and >> review of that is what then resulted in those commits. So they may >> look recent, but that's just because the original patches got >> seriously edited down and rewritten. >> >> That said, powerpc and s390 should at least look at maybe adding a >> check for the page ref in their gup paths too. Powerpc has the special >> gup_hugepte() case, and s390 has its own version of gup entirely. I >> was actually hoping the s390 guys would look at using the generic gup >> code. >> >> I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem >> largely irrelevant, partly since even theoretically this whole issue >> needs a _lot_ of memory. >> >> Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' >> (page ref overflow)"). You may or may not really care. > > I've now queued these patches up for the next round of stable releases, > as some people seem to care about these. > > I didn't see any follow-on patches for s390 or ppc64 hit the tree for > these changes, am I just missing them and should also queue up a few > more to handle this issue on those platforms? No you haven't missed them for powerpc. It's on my list. cheers
Re: Linux 5.1-rc5
On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote: > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem > largely irrelevant, partly since even theoretically this whole issue > needs a _lot_ of memory. Adding the relevant people - while the might be irrelevant, at least mips and sparc have some giant memory systems. And I'd really like to see the arch-specific GUP implementations to go away for other reasons, as we have a few issues to sort out with GUP usage now (we just had discussions at LSF/MM), and the less implementations we have to deal with the better.
Re: Linux 5.1-rc5
On Thu, 2 May 2019 16:31:10 +0200 Greg KH wrote: > On Thu, May 02, 2019 at 04:17:58PM +0200, Martin Schwidefsky wrote: > > On Thu, 2 May 2019 14:21:28 +0200 > > Greg KH wrote: > > > > > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote: > > > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig > > > > wrote: > > > > > > > > > > Can we please have the page refcount overflow fixes out on the list > > > > > for review, even if it is after the fact? > > > > > > > > They were actually on a list for review long before the fact, but it > > > > was the security mailing list. The issue actually got discussed back > > > > in January along with early versions of the patches, but then we > > > > dropped the ball because it just wasn't on anybody's radar and it got > > > > resurrected late March. Willy wrote a rather bigger patch-series, and > > > > review of that is what then resulted in those commits. So they may > > > > look recent, but that's just because the original patches got > > > > seriously edited down and rewritten. > > > > > > > > That said, powerpc and s390 should at least look at maybe adding a > > > > check for the page ref in their gup paths too. Powerpc has the special > > > > gup_hugepte() case, and s390 has its own version of gup entirely. I > > > > was actually hoping the s390 guys would look at using the generic gup > > > > code. > > > > > > > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem > > > > largely irrelevant, partly since even theoretically this whole issue > > > > needs a _lot_ of memory. > > > > > > > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' > > > > (page ref overflow)"). You may or may not really care. > > > > > > I've now queued these patches up for the next round of stable releases, > > > as some people seem to care about these. > > > > > > I didn't see any follow-on patches for s390 or ppc64 hit the tree for > > > these changes, am I just missing them and should also queue up a few > > > more to handle this issue on those platforms? > > > > I fixed that with a different approach. The following two patches are > > queued for the next merge window: > > > > d1874a0c2805 "s390/mm: make the pxd_offset functions more robust" > > 1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code" > > > > With these two s390 now uses the generic gup code in mm/gup.c > > Nice! Do you want me to queue those up for the stable backports once > they hit a public -rc release? Yes please! -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Thu, May 02, 2019 at 04:17:58PM +0200, Martin Schwidefsky wrote: > On Thu, 2 May 2019 14:21:28 +0200 > Greg KH wrote: > > > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote: > > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig > > > wrote: > > > > > > > > Can we please have the page refcount overflow fixes out on the list > > > > for review, even if it is after the fact? > > > > > > They were actually on a list for review long before the fact, but it > > > was the security mailing list. The issue actually got discussed back > > > in January along with early versions of the patches, but then we > > > dropped the ball because it just wasn't on anybody's radar and it got > > > resurrected late March. Willy wrote a rather bigger patch-series, and > > > review of that is what then resulted in those commits. So they may > > > look recent, but that's just because the original patches got > > > seriously edited down and rewritten. > > > > > > That said, powerpc and s390 should at least look at maybe adding a > > > check for the page ref in their gup paths too. Powerpc has the special > > > gup_hugepte() case, and s390 has its own version of gup entirely. I > > > was actually hoping the s390 guys would look at using the generic gup > > > code. > > > > > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem > > > largely irrelevant, partly since even theoretically this whole issue > > > needs a _lot_ of memory. > > > > > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' > > > (page ref overflow)"). You may or may not really care. > > > > I've now queued these patches up for the next round of stable releases, > > as some people seem to care about these. > > > > I didn't see any follow-on patches for s390 or ppc64 hit the tree for > > these changes, am I just missing them and should also queue up a few > > more to handle this issue on those platforms? > > I fixed that with a different approach. The following two patches are > queued for the next merge window: > > d1874a0c2805 "s390/mm: make the pxd_offset functions more robust" > 1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code" > > With these two s390 now uses the generic gup code in mm/gup.c Nice! Do you want me to queue those up for the stable backports once they hit a public -rc release? thanks, greg k-h
Re: Linux 5.1-rc5
On Thu, 2 May 2019 14:21:28 +0200 Greg KH wrote: > On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote: > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig > > wrote: > > > > > > Can we please have the page refcount overflow fixes out on the list > > > for review, even if it is after the fact? > > > > They were actually on a list for review long before the fact, but it > > was the security mailing list. The issue actually got discussed back > > in January along with early versions of the patches, but then we > > dropped the ball because it just wasn't on anybody's radar and it got > > resurrected late March. Willy wrote a rather bigger patch-series, and > > review of that is what then resulted in those commits. So they may > > look recent, but that's just because the original patches got > > seriously edited down and rewritten. > > > > That said, powerpc and s390 should at least look at maybe adding a > > check for the page ref in their gup paths too. Powerpc has the special > > gup_hugepte() case, and s390 has its own version of gup entirely. I > > was actually hoping the s390 guys would look at using the generic gup > > code. > > > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem > > largely irrelevant, partly since even theoretically this whole issue > > needs a _lot_ of memory. > > > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' > > (page ref overflow)"). You may or may not really care. > > I've now queued these patches up for the next round of stable releases, > as some people seem to care about these. > > I didn't see any follow-on patches for s390 or ppc64 hit the tree for > these changes, am I just missing them and should also queue up a few > more to handle this issue on those platforms? I fixed that with a different approach. The following two patches are queued for the next merge window: d1874a0c2805 "s390/mm: make the pxd_offset functions more robust" 1a42010cdc26 "s390/mm: convert to the generic get_user_pages_fast code" With these two s390 now uses the generic gup code in mm/gup.c -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Mon, Apr 15, 2019 at 09:17:10AM -0700, Linus Torvalds wrote: > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig wrote: > > > > Can we please have the page refcount overflow fixes out on the list > > for review, even if it is after the fact? > > They were actually on a list for review long before the fact, but it > was the security mailing list. The issue actually got discussed back > in January along with early versions of the patches, but then we > dropped the ball because it just wasn't on anybody's radar and it got > resurrected late March. Willy wrote a rather bigger patch-series, and > review of that is what then resulted in those commits. So they may > look recent, but that's just because the original patches got > seriously edited down and rewritten. > > That said, powerpc and s390 should at least look at maybe adding a > check for the page ref in their gup paths too. Powerpc has the special > gup_hugepte() case, and s390 has its own version of gup entirely. I > was actually hoping the s390 guys would look at using the generic gup > code. > > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem > largely irrelevant, partly since even theoretically this whole issue > needs a _lot_ of memory. > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' > (page ref overflow)"). You may or may not really care. I've now queued these patches up for the next round of stable releases, as some people seem to care about these. I didn't see any follow-on patches for s390 or ppc64 hit the tree for these changes, am I just missing them and should also queue up a few more to handle this issue on those platforms? thanks, greg k-h
Re: Linux 5.1-rc5
On Tue, Apr 23, 2019 at 8:39 AM Martin Schwidefsky wrote: > > Ok, I added two patches for my s390/linux:features branch > > Martin Schwidefsky (2): > s390/mm: make the pxd_offset functions more robust > s390/mm: convert to the generic get_user_pages_fast code > > All code changes are inside arch/s390, I plan to include these patches with > the next merge window. That gives us a little bit of time to run our tests. Sounds good. Thanks for looking into this all. Now I slightly wonder about all the other random architectures that don't use the HAVE_GENERIC_GUP config option, but at least we'll have all of arm, powerpc, x86 and s390 using the generic code.. Linus
Re: Linux 5.1-rc5
On Fri, 19 Apr 2019 10:27:17 -0700 Linus Torvalds wrote: > On Fri, Apr 19, 2019 at 6:33 AM Martin Schwidefsky > wrote: > > > > That problem got stuck in my head and I thought more about it. Why not > > emulate the static folding sequence in the s390 page table code? > > So this model seems much closer to what x86 does in its folding, where > the pattern is basically > > > static inline pX-1d_t *pXd_offset(pXd_t *pXd, unsigned long address) > > { > > if (pXd_folded(pXd) > > return (pX-1d_t *) pXd; > > return (pX-1d_t *) pXd_deref(*pXd) + pXd_index(address); > > } > > which is really how the code is designed to work (ie the folded entry > doesn't actually do anything to the page directory pointer, it just > says "ok, we'll use this exact page directory pointer for the next > lower level instead". > > And that's very much what allows the generic gup code to load the > entry once, and use a temporary, and as you walk down the chain, if it > is folded it just then uses that (previous) temporary value for the > next level instead. IOW, the lower level page table is hidden inside > the upper level one, and folding just means "don't do any offsets, > don't change any values, just use the entry as-is for the next lower > level". > > So I think that's the right thing to do. Ok, I added two patches for my s390/linux:features branch Martin Schwidefsky (2): s390/mm: make the pxd_offset functions more robust s390/mm: convert to the generic get_user_pages_fast code All code changes are inside arch/s390, I plan to include these patches with the next merge window. That gives us a little bit of time to run our tests. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Fri, Apr 19, 2019 at 6:33 AM Martin Schwidefsky wrote: > > That problem got stuck in my head and I thought more about it. Why not > emulate the static folding sequence in the s390 page table code? So this model seems much closer to what x86 does in its folding, where the pattern is basically > static inline pX-1d_t *pXd_offset(pXd_t *pXd, unsigned long address) > { > if (pXd_folded(pXd) > return (pX-1d_t *) pXd; > return (pX-1d_t *) pXd_deref(*pXd) + pXd_index(address); > } which is really how the code is designed to work (ie the folded entry doesn't actually do anything to the page directory pointer, it just says "ok, we'll use this exact page directory pointer for the next lower level instead". And that's very much what allows the generic gup code to load the entry once, and use a temporary, and as you walk down the chain, if it is folded it just then uses that (previous) temporary value for the next level instead. IOW, the lower level page table is hidden inside the upper level one, and folding just means "don't do any offsets, don't change any values, just use the entry as-is for the next lower level". So I think that's the right thing to do. Looking at the s390 code, it seems to fold things the other way, conceptually hiding the upper level inside the lower one, and always doing the offset thing (but just avoiding the dereference). Maybe there's some reason why the s390 code does it that way, but I think your new model is the right one, and hopefully means you can use the generic page table walking more easily. Of course, the s390 folding is very different from the x86 one (or the generic fixed 3-level of 4-level cases). The x86 folding doesn't depend on the contents of the page tables, it's just entirely static (well, the 5th level is conditional, but it's conditional on a static key, not on what is in the page tables). So maybe the old model of s390 made more sense in that context, but I look at your new suggested pXd_offset() functions and I go "yeah, that's the way it's supposed to work". Linus
Re: Linux 5.1-rc5
On Thu, 18 Apr 2019 20:41:44 +0200 Martin Schwidefsky wrote: > On Thu, 18 Apr 2019 08:49:32 -0700 > Linus Torvalds wrote: > > > On Thu, Apr 18, 2019 at 1:02 AM Martin Schwidefsky > > wrote: > > > > > > The problematic lines in the generic gup code are these three: > > > > > > 1845: pmdp = pmd_offset(, addr); > > > 1888: pudp = pud_offset(, addr); > > > 1916: p4dp = p4d_offset(, addr); > > > > > > Passing the pointer of a *copy* of a page table entry to pxd_offset() does > > > not work with the page table folding on s390. > > > > Hmm. I wonder why. x86 too does the folding thing for the p4d and pud case. > > > > The folding works with the local copy just the same way it works with > > the orignal value. > > The difference is that with the static page table folding pgd_offset() > does the index calculation of the actual hardware top-level table. With > dynamic page table folding as s390 is doing it, if the task does not use > a 5-level page table pgd_offset() will see a pgd_index() of 0, the indexing > of the actual top-level table is done later with p4d_offset(), pud_offset() > or pmd_offset(). > > As an example, with a three level page table we have three indexes x/y/z. > The common code "thinks" 5 indexing steps, with static folding the index > sequence is x 0 0 y z. With dynamic folding the sequence is 0 0 x y z. > By moving the first indexing operation to pgd_offset the static sequence > does not add an index to a non-dereferenced pointer to a stack variable, > the dynamic sequence does. That problem got stuck in my head and I thought more about it. Why not emulate the static folding sequence in the s390 page table code? As the table type is encoded in every entry for the region and segment tables, pgd_offset() can look at the first entry to find the table type and then do the correct index calculation for the given top-level table. Like this: static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address) { unsigned long rste; unsigned int shift; /* Get the first entry of the top level table */ rste = pgd_val(*pgd); /* Pick up the shift from the table type of the first entry */ shift = ((rste & _REGION_ENTRY_TYPE_MASK) >> 2) * 11 + 20; return pgd + ((address >> shift) & (PTRS_PER_PGD - 1)); } #define pgd_offset(mm, address) pgd_offset_raw((mm)->pgd, address) #define pgd_offset_k(address) pgd_offset(_mm, address) static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) { if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R1) return (p4d_t *) pgd; return (p4d_t *) pgd_deref(*pgd) + p4d_index(address); } static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) { if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R2) return (pud_t *) p4d; return (pud_t *) p4d_deref(*p4d) + pud_index(address); } static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) { if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R3) return (pmd_t *) pud; return (pmd_t *) pud_deref(*pud) + pmd_index(address); } This needs more thorough testing but in principle it does work. The kernel boots and survives a kernel compile. The only things that is slightly off is that pgd_offset() now has to look at the first table entry to do its job. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Thu, 18 Apr 2019 08:49:32 -0700 Linus Torvalds wrote: > On Thu, Apr 18, 2019 at 1:02 AM Martin Schwidefsky > wrote: > > > > The problematic lines in the generic gup code are these three: > > > > 1845: pmdp = pmd_offset(, addr); > > 1888: pudp = pud_offset(, addr); > > 1916: p4dp = p4d_offset(, addr); > > > > Passing the pointer of a *copy* of a page table entry to pxd_offset() does > > not work with the page table folding on s390. > > Hmm. I wonder why. x86 too does the folding thing for the p4d and pud case. > > The folding works with the local copy just the same way it works with > the orignal value. The difference is that with the static page table folding pgd_offset() does the index calculation of the actual hardware top-level table. With dynamic page table folding as s390 is doing it, if the task does not use a 5-level page table pgd_offset() will see a pgd_index() of 0, the indexing of the actual top-level table is done later with p4d_offset(), pud_offset() or pmd_offset(). As an example, with a three level page table we have three indexes x/y/z. The common code "thinks" 5 indexing steps, with static folding the index sequence is x 0 0 y z. With dynamic folding the sequence is 0 0 x y z. By moving the first indexing operation to pgd_offset the static sequence does not add an index to a non-dereferenced pointer to a stack variable, the dynamic sequence does. > But I see that s390 does some other kind of folding and does that > addition of the p*d_index() unconditionally. > > I guess that does mean that s390 will just have to have its own walker. > > For the issue of the page refcount overflow it really isn't a huge > deal. Adding the refcount checking is simple (see the example patch I > gave for powerpc - you'll just have a couple of extra cases since you > do it all, rather than just the special hugetlb cases). > > Obviously in general it would have been nicer to share as much code as > possible, but let's not make things unnecessarily complex if s390 is > just fundamentally different.. It would have been nice to use the generic code (less bugs) but not at the price of over-complicating things. And that page table folding thing always makes my head hurt. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Thu, Apr 18, 2019 at 1:02 AM Martin Schwidefsky wrote: > > The problematic lines in the generic gup code are these three: > > 1845: pmdp = pmd_offset(, addr); > 1888: pudp = pud_offset(, addr); > 1916: p4dp = p4d_offset(, addr); > > Passing the pointer of a *copy* of a page table entry to pxd_offset() does > not work with the page table folding on s390. Hmm. I wonder why. x86 too does the folding thing for the p4d and pud case. The folding works with the local copy just the same way it works with the orignal value. But I see that s390 does some other kind of folding and does that addition of the p*d_index() unconditionally. I guess that does mean that s390 will just have to have its own walker. For the issue of the page refcount overflow it really isn't a huge deal. Adding the refcount checking is simple (see the example patch I gave for powerpc - you'll just have a couple of extra cases since you do it all, rather than just the special hugetlb cases). Obviously in general it would have been nicer to share as much code as possible, but let's not make things unnecessarily complex if s390 is just fundamentally different.. Linus
Re: Linux 5.1-rc5
On Wed, 17 Apr 2019 09:57:01 -0700 Linus Torvalds wrote: > On Wed, Apr 17, 2019 at 1:02 AM Martin Schwidefsky > wrote: > > > > Grumpf, that does *not* work. For gup the table entries may be read only > > once. Now I remember why I open-coded p4d_offset, pud_offset and pmd_offset > > in arch/s390/mm/gup.c, to avoid to read the table entries twice. > > It will be hard to use the common gup code after all. > > Hmm. The common gup code generally should do the "read only once" > thing too (since by definition the gup-fast case is done without > locking), although it's probably the case that most architectures > simply don't care. > > What would it require for the generic code to work for s390? The problematic lines in the generic gup code are these three: 1845: pmdp = pmd_offset(, addr); 1888: pudp = pud_offset(, addr); 1916: p4dp = p4d_offset(, addr); Passing the pointer of a *copy* of a page table entry to pxd_offset() does not work with the page table folding on s390. The pxd_offset() function on s390 have to make a choice, either return the dereferenced value behind the passed pointer (that works) or return the original page table pointer if the table level is folded (that does not work). To fix this we would need three new helpers pmd_offset_orig, pud_offset_orig and p4d_offset_orig, their generic definition would look like this: #define p4d_offset_orig(pgdp, pgd, address)p4d_offset(, address) #define pud_offset_orig(p4dp, p4d, address)pud_offset(, address) #define pmd_offset_orig(pudp, pud, address)pmd_offset(, address) For the s390 definition see the following branch: git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git generic-gup A quick test with this branch shows everything working normally. Keeping my fingers crossed that I did not miss anything. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Wed, Apr 17, 2019 at 1:02 AM Martin Schwidefsky wrote: > > Grumpf, that does *not* work. For gup the table entries may be read only > once. Now I remember why I open-coded p4d_offset, pud_offset and pmd_offset > in arch/s390/mm/gup.c, to avoid to read the table entries twice. > It will be hard to use the common gup code after all. Hmm. The common gup code generally should do the "read only once" thing too (since by definition the gup-fast case is done without locking), although it's probably the case that most architectures simply don't care. What would it require for the generic code to work for s390? Linus
Re: Linux 5.1-rc5
On Wed, 17 Apr 2019 09:46:37 +0200 Martin Schwidefsky wrote: > On Tue, 16 Apr 2019 09:49:46 -0700 > Linus Torvalds wrote: > > > On Tue, Apr 16, 2019 at 9:16 AM Linus Torvalds > > wrote: > > > > > > We actually already *have* this function. > > > > > > It's called "gup_fast_permitted()" and it's used by x86-64 to verify > > > the proper address range. Exactly like s390 needs.. > > > > > > Could you please use that instead? > > > > IOW, something like the attached. > > > > Obviously untested. And maybe 'current' isn't declared in > > , in which case you'd need to modify it to instead make > > the inline function be "s390_gup_fast_permitted()" that takes a > > pointer to the mm, and do something like > > > > #define gup_fast_permitted(start, pages) \ > > s390_gup_fast_permitted(current->mm, start, pages) > > > > instead. > > > > But I think you get the idea.. > > Nice, I did not realize that gup_fast_permitted is a platform > override-able function. So that part is doable in arch/s390. But I > spoke to soon, I got my first crash and realized that the common gup code > is not usable as it is. The reason is this e.g. this sequence: > > pgdp = pgd_offset(current->mm, addr); > pgd_t pgd = READ_ONCE(*pgdp); > /* some checking on pgd */ > gup_p4d_range(pgd, addr, next, write, pages, nr); > > p4dp = p4d_offset(, addr); > p4d_t p4d = READ_ONCE(*p4dp); > /* some checking on p4d */ > gup_pud_range(p4d, addr, next, write, pages, nr); > > pudp = pud_offset(, addr); > pud_t pud = READ_ONCE(*pudp); > /* some checking on pud */ > gup_pmd_range(pud, addr, next, write, pages, nr; > > Each step along the way will read the page table entry and pass the > table entry to the next function. This clashes with the page table > folding on s390. The s390 gup code looks more like this: > > pgdp = pgd_offset(current->mm, addr); > /* some checking on pgd */ > pgd_t pgd = READ_ONCE(*pgdp); > gup_p4d_range(pgdp, pgd, addr, next, write, pages, ); > > p4dp = p4d_offset(pgdp, addr); > p4d_t p4d = READ_ONCE(*p4dp); > /* some checking on p4d */ > gup_pud_range(p4dp, p4d, addr, next, write, pages, nr); > > pudp = pud_offset(p4dp, addr); > pud_t pud = READ_ONCE(*pudp); > /* some checking on pud */ > gup_pmd_range(pudp, pud, addr, next, write, pages, nr; > > There are magic dereferences in the s390 versions of p4d_offset, > pud_offset and pmd_offset functions. To make this work the pointer > passed to these functions may not be the local copy of the already > dereferenced table entry. I'll cook up a patch for the common code. Grumpf, that does *not* work. For gup the table entries may be read only once. Now I remember why I open-coded p4d_offset, pud_offset and pmd_offset in arch/s390/mm/gup.c, to avoid to read the table entries twice. It will be hard to use the common gup code after all. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Tue, 16 Apr 2019 09:49:46 -0700 Linus Torvalds wrote: > On Tue, Apr 16, 2019 at 9:16 AM Linus Torvalds > wrote: > > > > We actually already *have* this function. > > > > It's called "gup_fast_permitted()" and it's used by x86-64 to verify > > the proper address range. Exactly like s390 needs.. > > > > Could you please use that instead? > > IOW, something like the attached. > > Obviously untested. And maybe 'current' isn't declared in > , in which case you'd need to modify it to instead make > the inline function be "s390_gup_fast_permitted()" that takes a > pointer to the mm, and do something like > > #define gup_fast_permitted(start, pages) \ > s390_gup_fast_permitted(current->mm, start, pages) > > instead. > > But I think you get the idea.. Nice, I did not realize that gup_fast_permitted is a platform override-able function. So that part is doable in arch/s390. But I spoke to soon, I got my first crash and realized that the common gup code is not usable as it is. The reason is this e.g. this sequence: pgdp = pgd_offset(current->mm, addr); pgd_t pgd = READ_ONCE(*pgdp); /* some checking on pgd */ gup_p4d_range(pgd, addr, next, write, pages, nr); p4dp = p4d_offset(, addr); p4d_t p4d = READ_ONCE(*p4dp); /* some checking on p4d */ gup_pud_range(p4d, addr, next, write, pages, nr); pudp = pud_offset(, addr); pud_t pud = READ_ONCE(*pudp); /* some checking on pud */ gup_pmd_range(pud, addr, next, write, pages, nr; Each step along the way will read the page table entry and pass the table entry to the next function. This clashes with the page table folding on s390. The s390 gup code looks more like this: pgdp = pgd_offset(current->mm, addr); /* some checking on pgd */ pgd_t pgd = READ_ONCE(*pgdp); gup_p4d_range(pgdp, pgd, addr, next, write, pages, ); p4dp = p4d_offset(pgdp, addr); p4d_t p4d = READ_ONCE(*p4dp); /* some checking on p4d */ gup_pud_range(p4dp, p4d, addr, next, write, pages, nr); pudp = pud_offset(p4dp, addr); pud_t pud = READ_ONCE(*pudp); /* some checking on pud */ gup_pmd_range(pudp, pud, addr, next, write, pages, nr; There are magic dereferences in the s390 versions of p4d_offset, pud_offset and pmd_offset functions. To make this work the pointer passed to these functions may not be the local copy of the already dereferenced table entry. I'll cook up a patch for the common code. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Tue, Apr 16, 2019 at 8:38 PM Michael Ellerman wrote: > > > That said, powerpc and s390 should at least look at maybe adding a > > check for the page ref in their gup paths too. Powerpc has the special > > gup_hugepte() case > > Which uses page_cache_add_speculative(), which handles the case of the > refcount being zero but not overflow. So that looks like it needs > fixing. Note that unlike the zero check, the "too many refs" check does _not_ need to be atomic. Because it's not a correctness issue right at some magical exact point, it's a much more ambiguous a "the refcount is now so large that I'm not going to do GUP on this page any more". Being off by a number of pages in case there's a race is just fine. So you could do something like this (TOTALLY UNTESTED, and whitespace-damaged on purpose - I don't want you to apply it blindly) appended patch. > And we have a few uses of bare get_page() in KVM code which might be > subject to the same attack. Note that you really have to have not just a get_page(), but some way of lining up *billions* of them. Which really tends to be pretty hard. Linus diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 9e732bb2c84a..52db7ff7c756 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -523,7 +523,8 @@ struct page *follow_huge_pd(struct vm_area_struct *vma, page = pte_page(*ptep); page += ((address & mask) >> PAGE_SHIFT); if (flags & FOLL_GET) - get_page(page); + if (!try_get_page(page)) + page = NULL; } else { if (is_hugetlb_entry_migration(*ptep)) { spin_unlock(ptl); @@ -883,6 +884,8 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, refs = 0; head = pte_page(pte); + if (page_ref_count(head) < 0) + return 0; page = head + ((addr & (sz-1)) >> PAGE_SHIFT); do {
Re: Linux 5.1-rc5
[ Cc += Nick & Aneesh & Paul ] Linus Torvalds writes: > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig wrote: >> >> Can we please have the page refcount overflow fixes out on the list >> for review, even if it is after the fact? > > They were actually on a list for review long before the fact, but it > was the security mailing list. The issue actually got discussed back > in January along with early versions of the patches, but then we > dropped the ball because it just wasn't on anybody's radar and it got > resurrected late March. Willy wrote a rather bigger patch-series, and > review of that is what then resulted in those commits. So they may > look recent, but that's just because the original patches got > seriously edited down and rewritten. > > That said, powerpc and s390 should at least look at maybe adding a > check for the page ref in their gup paths too. Powerpc has the special > gup_hugepte() case Which uses page_cache_add_speculative(), which handles the case of the refcount being zero but not overflow. So that looks like it needs fixing. We also have follow_huge_pd() that should use try_get_page(). And we have a few uses of bare get_page() in KVM code which might be subject to the same attack. cheers
Re: Linux 5.1-rc5
On Tue, Apr 16, 2019 at 9:16 AM Linus Torvalds wrote: > > We actually already *have* this function. > > It's called "gup_fast_permitted()" and it's used by x86-64 to verify > the proper address range. Exactly like s390 needs.. > > Could you please use that instead? IOW, something like the attached. Obviously untested. And maybe 'current' isn't declared in , in which case you'd need to modify it to instead make the inline function be "s390_gup_fast_permitted()" that takes a pointer to the mm, and do something like #define gup_fast_permitted(start, pages) \ s390_gup_fast_permitted(current->mm, start, pages) instead. But I think you get the idea.. Linus arch/s390/include/asm/pgtable.h | 12 1 file changed, 12 insertions(+) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 76dc344edb8c..a08248995f50 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1659,4 +1659,16 @@ static inline void check_pgt_cache(void) { } #include +static inline bool gup_fast_permitted(unsigned long start, int nr_pages) +{ + unsigned long len, end; + + len = (unsigned long)nr_pages << PAGE_SHIFT; + end = start + len; + if (end < start) + return false; + return end <= current->mm->context.asce_limit; +} +#define gup_fast_permitted gup_fast_permitted + #endif /* _S390_PAGE_H */
Re: Linux 5.1-rc5
On Tue, Apr 16, 2019 at 5:08 AM Martin Schwidefsky wrote: > > This is not nice, would a patch like the following be acceptable? Umm. We actually already *have* this function. It's called "gup_fast_permitted()" and it's used by x86-64 to verify the proper address range. Exactly like s390 needs.. Could you please use that instead? Linus
Re: Linux 5.1-rc5
On Tue, 16 Apr 2019 11:09:06 +0200 Martin Schwidefsky wrote: > On Mon, 15 Apr 2019 09:17:10 -0700 > Linus Torvalds wrote: > > > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig > > wrote: > > > > > > Can we please have the page refcount overflow fixes out on the list > > > for review, even if it is after the fact? > > > > They were actually on a list for review long before the fact, but it > > was the security mailing list. The issue actually got discussed back > > in January along with early versions of the patches, but then we > > dropped the ball because it just wasn't on anybody's radar and it got > > resurrected late March. Willy wrote a rather bigger patch-series, and > > review of that is what then resulted in those commits. So they may > > look recent, but that's just because the original patches got > > seriously edited down and rewritten. > > First time I hear about this, thanks for the heads up. > > > That said, powerpc and s390 should at least look at maybe adding a > > check for the page ref in their gup paths too. Powerpc has the special > > gup_hugepte() case, and s390 has its own version of gup entirely. I > > was actually hoping the s390 guys would look at using the generic gup > > code. > > We did look at converting the s390 gup code to CONFIG_HAVE_GENERIC_GUP, > there are some details that need careful consideration. The top one > is access_ok(), for s390 we always return true. The generic gup code > relies on the fact that a page table walk with a specific address is > doable if access_ok() returned true, the s390 specific check is slightly > different: > > if ((end <= start) || (end > mm->context.asce_limit)) > return 0; > > The obvious approach would be to modify access_ok() to check against > the asce_limit. I will try and see if anything breaks, e.g. the automatic > page table upgrade. I tested the waters in regard to access_ok() and the generic gup code. The good news is that mm/gup.c with CONFIG_HAVE_GENERIC_GUP=y seems to work just fine if the access_ok() issue is taken care of. But.. Bloat-o-meter with a non-empty uaccess_ok() that checks against current->mm->context.asce_limit: add/remove: 8/2 grow/shrink: 611/11 up/down: 61352/-1914 (59438) with CONFIG_HAVE_GENERIC_GUP on top of that add/remove: 10/2 grow/shrink: 612/12 up/down: 63568/-3280 (60288) This is not nice, would a patch like the following be acceptable? -- Subject: [PATCH] mm: introduce mm_pgd_walk_ok Add the architecture overrideable function mm_pgd_walk_ok() to check if a block of memory is inside the limits of the page table hierarchy of a given mm struct. Signed-off-by: Martin Schwidefsky --- include/asm-generic/pgtable.h | 4 mm/gup.c | 4 ++-- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index fa782fba51ee..7d2a8a58f1c1 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -1186,4 +1186,8 @@ static inline bool arch_has_pfn_modify_check(void) #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) #endif +#ifndef mm_pgd_walk_ok +#define mm_pgd_walk_ok(mm, addr, size) access_ok(addr, size) +#endif + #endif /* _ASM_GENERIC_PGTABLE_H */ diff --git a/mm/gup.c b/mm/gup.c index 91819b8ad9cc..b3eb3f45d237 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1990,7 +1990,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len; - if (unlikely(!access_ok((void __user *)start, len))) + if (unlikely(!mm_pgd_walk_ok(current->mm, (void __user *)start, len))) return 0; /* @@ -2044,7 +2044,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, if (nr_pages <= 0) return 0; - if (unlikely(!access_ok((void __user *)start, len))) + if (unlikely(!mm_pgd_walk_ok(current->mm, (void __user *)start, len))) return -EFAULT; if (gup_fast_permitted(start, nr_pages)) { -- 2.16.4 With an empty access_ok() but a "real" mm_pgd_walk_ok() the results are much more reasonable: add/remove: 2/0 grow/shrink: 2/1 up/down: 2186/-1382 (804) -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Mon, 15 Apr 2019 09:17:10 -0700 Linus Torvalds wrote: > On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig wrote: > > > > Can we please have the page refcount overflow fixes out on the list > > for review, even if it is after the fact? > > They were actually on a list for review long before the fact, but it > was the security mailing list. The issue actually got discussed back > in January along with early versions of the patches, but then we > dropped the ball because it just wasn't on anybody's radar and it got > resurrected late March. Willy wrote a rather bigger patch-series, and > review of that is what then resulted in those commits. So they may > look recent, but that's just because the original patches got > seriously edited down and rewritten. First time I hear about this, thanks for the heads up. > That said, powerpc and s390 should at least look at maybe adding a > check for the page ref in their gup paths too. Powerpc has the special > gup_hugepte() case, and s390 has its own version of gup entirely. I > was actually hoping the s390 guys would look at using the generic gup > code. We did look at converting the s390 gup code to CONFIG_HAVE_GENERIC_GUP, there are some details that need careful consideration. The top one is access_ok(), for s390 we always return true. The generic gup code relies on the fact that a page table walk with a specific address is doable if access_ok() returned true, the s390 specific check is slightly different: if ((end <= start) || (end > mm->context.asce_limit)) return 0; The obvious approach would be to modify access_ok() to check against the asce_limit. I will try and see if anything breaks, e.g. the automatic page table upgrade. > I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem > largely irrelevant, partly since even theoretically this whole issue > needs a _lot_ of memory. > > Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' > (page ref overflow)"). You may or may not really care. On s390 we can have up to 16TB of memory in a single LPAR. So yes, I do care about it. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: Linux 5.1-rc5
On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig wrote: > > Can we please have the page refcount overflow fixes out on the list > for review, even if it is after the fact? They were actually on a list for review long before the fact, but it was the security mailing list. The issue actually got discussed back in January along with early versions of the patches, but then we dropped the ball because it just wasn't on anybody's radar and it got resurrected late March. Willy wrote a rather bigger patch-series, and review of that is what then resulted in those commits. So they may look recent, but that's just because the original patches got seriously edited down and rewritten. That said, powerpc and s390 should at least look at maybe adding a check for the page ref in their gup paths too. Powerpc has the special gup_hugepte() case, and s390 has its own version of gup entirely. I was actually hoping the s390 guys would look at using the generic gup code. I ruthlessly also entirely ignored MIPS, SH and sparc, since they seem largely irrelevant, partly since even theoretically this whole issue needs a _lot_ of memory. Michael, Martin, see commit 6b3a70773630 ("Merge branch 'page-refs' (page ref overflow)"). You may or may not really care. Linus