Re: Access to non-RAM pages
On Sun, Sep 02, 2018 at 07:10:46PM -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt > wrote: > > > > Still, I can potentially see an issue with DEBUG_PAGEALLOC > > An unmapped page isn't a problem. That's what the whole > load_unaligned_zeropad() is about: it's ok to take a fault on the part > that crosses a page, and we'll just fill the value with zeroes (that's > the "zeropad" part). > > So as long as it's rare (and it is), it's all fine. > > That said, I think we turn off for DEBUG_PAGEALLOC simply because it's > not rare _enough_. > > And vmalloc() should actually be safe too, simply because I think we > strive for a guard page between vmalloc areas. > > So only a *mapped* page after the page that matters, and only if it's > something you can't read without side effects. > > Which basically doesn't happen on x86 in reality. BIOSes just don't > put MMIO right after the last page of RAM. I think this is why it only > triggered on Xen, due to some crazy "Xen reacts badly" case where we > do the speculation into a balloon address. > > So _practically_ this is just a Xen bug, nothing more. > > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory. On arm64, ioremap() gives you a guard page because it allocates out of the vmalloc area. The only way I think we could get MMIO in the middle of the linear map would be if firmware has reserved something there. In this case, the region should be treated as NOMAP, meaning we won't map the area at all in the kernel and our pfn_valid() implementation will return false for the corresponding memmap entries. For userspace, we did consider putting out a guard page for non-fixed mmap() calls, but it's not something we've really looked into. Will
Re: Access to non-RAM pages
On Sun, Sep 02, 2018 at 07:10:46PM -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt > wrote: > > > > Still, I can potentially see an issue with DEBUG_PAGEALLOC > > An unmapped page isn't a problem. That's what the whole > load_unaligned_zeropad() is about: it's ok to take a fault on the part > that crosses a page, and we'll just fill the value with zeroes (that's > the "zeropad" part). > > So as long as it's rare (and it is), it's all fine. > > That said, I think we turn off for DEBUG_PAGEALLOC simply because it's > not rare _enough_. > > And vmalloc() should actually be safe too, simply because I think we > strive for a guard page between vmalloc areas. > > So only a *mapped* page after the page that matters, and only if it's > something you can't read without side effects. > > Which basically doesn't happen on x86 in reality. BIOSes just don't > put MMIO right after the last page of RAM. I think this is why it only > triggered on Xen, due to some crazy "Xen reacts badly" case where we > do the speculation into a balloon address. > > So _practically_ this is just a Xen bug, nothing more. > > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory. On arm64, ioremap() gives you a guard page because it allocates out of the vmalloc area. The only way I think we could get MMIO in the middle of the linear map would be if firmware has reserved something there. In this case, the region should be treated as NOMAP, meaning we won't map the area at all in the kernel and our pfn_valid() implementation will return false for the corresponding memmap entries. For userspace, we did consider putting out a guard page for non-fixed mmap() calls, but it's not something we've really looked into. Will
Re: Access to non-RAM pages
On Sun, 2 Sep 2018, Linus Torvalds wrote: > Which basically doesn't happen on x86 in reality. BIOSes just don't put > MMIO right after the last page of RAM. I think this is why it only > triggered on Xen, due to some crazy "Xen reacts badly" case where we do > the speculation into a balloon address. > > So _practically_ this is just a Xen bug, nothing more. Yeah, and Xen guys are already working on fixing that AFAIK. > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory. Well, we've been hit by similar/related issue in practice, on x86 machines where GART aperture is being mapped over physical RAM. For the curious -- see commit 2a3e83c6f ("x86/gart: Exclude GART aperture from vmcore"). Thanks, -- Jiri Kosina SUSE Labs
Re: Access to non-RAM pages
On Sun, 2 Sep 2018, Linus Torvalds wrote: > Which basically doesn't happen on x86 in reality. BIOSes just don't put > MMIO right after the last page of RAM. I think this is why it only > triggered on Xen, due to some crazy "Xen reacts badly" case where we do > the speculation into a balloon address. > > So _practically_ this is just a Xen bug, nothing more. Yeah, and Xen guys are already working on fixing that AFAIK. > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory. Well, we've been hit by similar/related issue in practice, on x86 machines where GART aperture is being mapped over physical RAM. For the curious -- see commit 2a3e83c6f ("x86/gart: Exclude GART aperture from vmcore"). Thanks, -- Jiri Kosina SUSE Labs
Re: Access to non-RAM pages
On 03/09/18 04:10, Linus Torvalds wrote: > So _practically_ this is just a Xen bug, nothing more. > > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory. And that was exactly the reason I brought it up. :-) Juergen
Re: Access to non-RAM pages
On 03/09/18 04:10, Linus Torvalds wrote: > So _practically_ this is just a Xen bug, nothing more. > > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory. And that was exactly the reason I brought it up. :-) Juergen
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 19:52 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds > wrote: > > > > The comment actually does talk about it, although the comment also > > claims that the cs read would use load_unaligned_zeropad(), which it > > no longer does (now it only does the read_word_at_a_time). > > IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and > dentry_cmp() interfaces") for why the zeropad went away for the cs > access (but the comment wasn't updated). > > And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in > dentry_string_cmp()") did the "let's make KASAN happy thing. > > And yes, the word-at-a-time code actually matters a lot for certain > loads. The "copy-and-hash" thing for path components ends up being > pretty critical in all the pathname handling. Yup, makes sense. Thanks ! Cheers, Ben.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 19:52 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds > wrote: > > > > The comment actually does talk about it, although the comment also > > claims that the cs read would use load_unaligned_zeropad(), which it > > no longer does (now it only does the read_word_at_a_time). > > IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and > dentry_cmp() interfaces") for why the zeropad went away for the cs > access (but the comment wasn't updated). > > And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in > dentry_string_cmp()") did the "let's make KASAN happy thing. > > And yes, the word-at-a-time code actually matters a lot for certain > loads. The "copy-and-hash" thing for path components ends up being > pretty critical in all the pathname handling. Yup, makes sense. Thanks ! Cheers, Ben.
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds wrote: > > The comment actually does talk about it, although the comment also > claims that the cs read would use load_unaligned_zeropad(), which it > no longer does (now it only does the read_word_at_a_time). IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces") for why the zeropad went away for the cs access (but the comment wasn't updated). And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()") did the "let's make KASAN happy thing. And yes, the word-at-a-time code actually matters a lot for certain loads. The "copy-and-hash" thing for path components ends up being pretty critical in all the pathname handling. Linus
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds wrote: > > The comment actually does talk about it, although the comment also > claims that the cs read would use load_unaligned_zeropad(), which it > no longer does (now it only does the read_word_at_a_time). IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces") for why the zeropad went away for the cs access (but the comment wasn't updated). And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()") did the "let's make KASAN happy thing. And yes, the word-at-a-time code actually matters a lot for certain loads. The "copy-and-hash" thing for path components ends up being pretty critical in all the pathname handling. Linus
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 19:10 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt > wrote: > > > > Still, I can potentially see an issue with DEBUG_PAGEALLOC > > An unmapped page isn't a problem. That's what the whole > load_unaligned_zeropad() is about: it's ok to take a fault on the part > that crosses a page, and we'll just fill the value with zeroes (that's > the "zeropad" part). Ah, my bad reading, I was looking at read_word_at_a_time() instead of load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr stuff, I assume this is safe ? > So as long as it's rare (and it is), it's all fine. > > That said, I think we turn off for DEBUG_PAGEALLOC simply because it's > not rare _enough_. > > And vmalloc() should actually be safe too, simply because I think we > strive for a guard page between vmalloc areas. > > So only a *mapped* page after the page that matters, and only if it's > something you can't read without side effects. > > Which basically doesn't happen on x86 in reality. BIOSes just don't > put MMIO right after the last page of RAM. I think this is why it only > triggered on Xen, due to some crazy "Xen reacts badly" case where we > do the speculation into a balloon address. > > So _practically_ this is just a Xen bug, nothing more. > > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 19:10 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt > wrote: > > > > Still, I can potentially see an issue with DEBUG_PAGEALLOC > > An unmapped page isn't a problem. That's what the whole > load_unaligned_zeropad() is about: it's ok to take a fault on the part > that crosses a page, and we'll just fill the value with zeroes (that's > the "zeropad" part). Ah, my bad reading, I was looking at read_word_at_a_time() instead of load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr stuff, I assume this is safe ? > So as long as it's rare (and it is), it's all fine. > > That said, I think we turn off for DEBUG_PAGEALLOC simply because it's > not rare _enough_. > > And vmalloc() should actually be safe too, simply because I think we > strive for a guard page between vmalloc areas. > > So only a *mapped* page after the page that matters, and only if it's > something you can't read without side effects. > > Which basically doesn't happen on x86 in reality. BIOSes just don't > put MMIO right after the last page of RAM. I think this is why it only > triggered on Xen, due to some crazy "Xen reacts badly" case where we > do the speculation into a balloon address. > > So _practically_ this is just a Xen bug, nothing more. > > But since in _theory_ you could have MMIO abut regular RAM directly, > it's worth maybe making sure it's purely theory.
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 7:25 PM Benjamin Herrenschmidt wrote: > Ah, my bad reading, I was looking at read_word_at_a_time() instead of > load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr > stuff, I assume this is safe ? The dentry qstr should always be 8-byte aligned because it's a kernel name allocation. So it's the path component in the actual pathname string that can be unaligned (ct/tcount in dentry_string_cmp). The comment actually does talk about it, although the comment also claims that the cs read would use load_unaligned_zeropad(), which it no longer does (now it only does the read_word_at_a_time). And read_word_at_a_time() is purely a KASAN thing. The thing can't fault, but it *can* read uninitialized data past the end of the string, making KASAN unhappy. So that's actually a different issue, where KASAN does byte-level validity testing, and doing word-at-a-time accesses obviously violates that for strings. Linus
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 7:25 PM Benjamin Herrenschmidt wrote: > Ah, my bad reading, I was looking at read_word_at_a_time() instead of > load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr > stuff, I assume this is safe ? The dentry qstr should always be 8-byte aligned because it's a kernel name allocation. So it's the path component in the actual pathname string that can be unaligned (ct/tcount in dentry_string_cmp). The comment actually does talk about it, although the comment also claims that the cs read would use load_unaligned_zeropad(), which it no longer does (now it only does the read_word_at_a_time). And read_word_at_a_time() is purely a KASAN thing. The thing can't fault, but it *can* read uninitialized data past the end of the string, making KASAN unhappy. So that's actually a different issue, where KASAN does byte-level validity testing, and doing word-at-a-time accesses obviously violates that for strings. Linus
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt wrote: > > Still, I can potentially see an issue with DEBUG_PAGEALLOC An unmapped page isn't a problem. That's what the whole load_unaligned_zeropad() is about: it's ok to take a fault on the part that crosses a page, and we'll just fill the value with zeroes (that's the "zeropad" part). So as long as it's rare (and it is), it's all fine. That said, I think we turn off for DEBUG_PAGEALLOC simply because it's not rare _enough_. And vmalloc() should actually be safe too, simply because I think we strive for a guard page between vmalloc areas. So only a *mapped* page after the page that matters, and only if it's something you can't read without side effects. Which basically doesn't happen on x86 in reality. BIOSes just don't put MMIO right after the last page of RAM. I think this is why it only triggered on Xen, due to some crazy "Xen reacts badly" case where we do the speculation into a balloon address. So _practically_ this is just a Xen bug, nothing more. But since in _theory_ you could have MMIO abut regular RAM directly, it's worth maybe making sure it's purely theory. Linus
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt wrote: > > Still, I can potentially see an issue with DEBUG_PAGEALLOC An unmapped page isn't a problem. That's what the whole load_unaligned_zeropad() is about: it's ok to take a fault on the part that crosses a page, and we'll just fill the value with zeroes (that's the "zeropad" part). So as long as it's rare (and it is), it's all fine. That said, I think we turn off for DEBUG_PAGEALLOC simply because it's not rare _enough_. And vmalloc() should actually be safe too, simply because I think we strive for a guard page between vmalloc areas. So only a *mapped* page after the page that matters, and only if it's something you can't read without side effects. Which basically doesn't happen on x86 in reality. BIOSes just don't put MMIO right after the last page of RAM. I think this is why it only triggered on Xen, due to some crazy "Xen reacts badly" case where we do the speculation into a balloon address. So _practically_ this is just a Xen bug, nothing more. But since in _theory_ you could have MMIO abut regular RAM directly, it's worth maybe making sure it's purely theory. Linus
Re: Access to non-RAM pages
On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote: > [ Adding a few new people the the cc. > > The issue is the worry about software-speculative accesses (ie > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw > speculation now) accessing past RAM into possibly contiguous IO ] > > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds > wrote: > > > > If you have a machine with RAM that touches IO, you need to disable > > the last page, exactly the same way we disable and marked reserved the > > first page at zero. So I missed the departure of that train ... stupid question, with CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can), what prevents it from crossing into a non-mapped page (not even IO) and causing an oops ? Looking at a random user in fs/dcache.c its not a uaccess-style read with recovery Or am I missing somethign obvious here ? IE, should we "reserve" the last page of any memory region (maybe mark it read-only) to avoid this along with avoiding leakage into IO space ? > > I thought we already did that. > > We don't seem to do that. > > And it's not just the last page, it's _any_ last page in a region that > bumps up to IO. That's actually much more common in the low 4G area on > PC's, I suspect, although the reserved BIOS ranges always tend to be > there. What makes IO more "wrong" than oopsing due to the page not being mapped ? > I suspect it should be trivial to do - maybe in > e820__memblock_setup()? That's where we already trim partial pages > etc. > > In fact, I think this might be done as an extension of commit > 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into > memblock.reserved"), except making sure that non-RAM regions mark one > page _previous_ as reserved too. > > I assume memory hotplug might have the same issue, and checking > whether ARM64 and powerpc perhaps might have already done something > like this (or might need to add it). > > We discussed long ago the case of user space mapping IO in user space, > and decided we didn't care. But the kernel should probably explicitly > make sure we don't either, even if I can't recall having ever seen a > machine that actually maps IO contiguously to RAM. The layout always > tends to end up having holes anyway. Can't we put the safety in generic memblock ? IE, don't hand out an allocation that contain the last page of a "block" and handle that last page in the memblock->buddy transition rather than in arch specific code ? Cheers, Ben.
Re: Access to non-RAM pages
On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote: > [ Adding a few new people the the cc. > > The issue is the worry about software-speculative accesses (ie > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw > speculation now) accessing past RAM into possibly contiguous IO ] > > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds > wrote: > > > > If you have a machine with RAM that touches IO, you need to disable > > the last page, exactly the same way we disable and marked reserved the > > first page at zero. So I missed the departure of that train ... stupid question, with CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can), what prevents it from crossing into a non-mapped page (not even IO) and causing an oops ? Looking at a random user in fs/dcache.c its not a uaccess-style read with recovery Or am I missing somethign obvious here ? IE, should we "reserve" the last page of any memory region (maybe mark it read-only) to avoid this along with avoiding leakage into IO space ? > > I thought we already did that. > > We don't seem to do that. > > And it's not just the last page, it's _any_ last page in a region that > bumps up to IO. That's actually much more common in the low 4G area on > PC's, I suspect, although the reserved BIOS ranges always tend to be > there. What makes IO more "wrong" than oopsing due to the page not being mapped ? > I suspect it should be trivial to do - maybe in > e820__memblock_setup()? That's where we already trim partial pages > etc. > > In fact, I think this might be done as an extension of commit > 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into > memblock.reserved"), except making sure that non-RAM regions mark one > page _previous_ as reserved too. > > I assume memory hotplug might have the same issue, and checking > whether ARM64 and powerpc perhaps might have already done something > like this (or might need to add it). > > We discussed long ago the case of user space mapping IO in user space, > and decided we didn't care. But the kernel should probably explicitly > make sure we don't either, even if I can't recall having ever seen a > machine that actually maps IO contiguously to RAM. The layout always > tends to end up having holes anyway. Can't we put the safety in generic memblock ? IE, don't hand out an allocation that contain the last page of a "block" and handle that last page in the memblock->buddy transition rather than in arch specific code ? Cheers, Ben.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 18:42 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds > wrote: > > It's not used for vmalloc stuff. It's just regular kmalloc(). > > Just to clarify .. that's true of the dcache stuff. > > The strscpy case actually explicitly limits things to page boundaries > and falls back to the byte-by-byte case after that. Ah ok, that makes sense. Still, I can potentially see an issue with DEBUG_PAGEALLOC Cheers, Ben.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 18:42 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds > wrote: > > It's not used for vmalloc stuff. It's just regular kmalloc(). > > Just to clarify .. that's true of the dcache stuff. > > The strscpy case actually explicitly limits things to page boundaries > and falls back to the byte-by-byte case after that. Ah ok, that makes sense. Still, I can potentially see an issue with DEBUG_PAGEALLOC Cheers, Ben.
Re: Access to non-RAM pages
On Mon, 2018-09-03 at 10:48 +1000, Benjamin Herrenschmidt wrote: > On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote: > > [ Adding a few new people the the cc. > > > > The issue is the worry about software-speculative accesses (ie > > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw > > speculation now) accessing past RAM into possibly contiguous IO ] > > > > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds > > wrote: > > > > > > If you have a machine with RAM that touches IO, you need to disable > > > the last page, exactly the same way we disable and marked reserved the > > > first page at zero. > > So I missed the departure of that train ... stupid question, with > CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can), > what prevents it from crossing into a non-mapped page (not even IO) and > causing an oops ? Looking at a random user in fs/dcache.c its not a > uaccess-style read with recovery Or am I missing somethign obvious > here ? Also, if we cross page boundaries with those guys then we have a bigger problem no ? we could fall off a vmalloc page into the nether or into an ioremap mapping no ? Cheers, Ben.
Re: Access to non-RAM pages
On Mon, 2018-09-03 at 10:48 +1000, Benjamin Herrenschmidt wrote: > On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote: > > [ Adding a few new people the the cc. > > > > The issue is the worry about software-speculative accesses (ie > > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw > > speculation now) accessing past RAM into possibly contiguous IO ] > > > > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds > > wrote: > > > > > > If you have a machine with RAM that touches IO, you need to disable > > > the last page, exactly the same way we disable and marked reserved the > > > first page at zero. > > So I missed the departure of that train ... stupid question, with > CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can), > what prevents it from crossing into a non-mapped page (not even IO) and > causing an oops ? Looking at a random user in fs/dcache.c its not a > uaccess-style read with recovery Or am I missing somethign obvious > here ? Also, if we cross page boundaries with those guys then we have a bigger problem no ? we could fall off a vmalloc page into the nether or into an ioremap mapping no ? Cheers, Ben.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt > wrote: > > > > Also, if we cross page boundaries with those guys then we have a bigger > > problem no ? we could fall off a vmalloc page into the nether or into > > an ioremap mapping no ? > > It's not used for vmalloc stuff. It's just regular kmalloc(). > > So it can cross pages, and it can fall off the end of memory, but it > can't do random stuff. Actually what about DEBUG_PAGEALLOC ? Can't we fall off a page and fault that way too ? Cheers, Ben.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt > wrote: > > > > Also, if we cross page boundaries with those guys then we have a bigger > > problem no ? we could fall off a vmalloc page into the nether or into > > an ioremap mapping no ? > > It's not used for vmalloc stuff. It's just regular kmalloc(). > > So it can cross pages, and it can fall off the end of memory, but it > can't do random stuff. Actually what about DEBUG_PAGEALLOC ? Can't we fall off a page and fault that way too ? Cheers, Ben.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt > wrote: > > > > Also, if we cross page boundaries with those guys then we have a bigger > > problem no ? we could fall off a vmalloc page into the nether or into > > an ioremap mapping no ? > > It's not used for vmalloc stuff. It's just regular kmalloc(). > > So it can cross pages, and it can fall off the end of memory, but it > can't do random stuff. Ok, it might be worth adding a DEBUG_VM based (or similar) warning in case somebody ever thinks of passing a vmalloc pointer to it... As for falling out of the end of memory, yes it could be a real problem though I don't see why IO is any different than just hitting a non- mapped area in that regard. So we should probably keep an unused (readonly if possible) zero page at the end. Cheers, Ben.
Re: Access to non-RAM pages
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote: > On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt > wrote: > > > > Also, if we cross page boundaries with those guys then we have a bigger > > problem no ? we could fall off a vmalloc page into the nether or into > > an ioremap mapping no ? > > It's not used for vmalloc stuff. It's just regular kmalloc(). > > So it can cross pages, and it can fall off the end of memory, but it > can't do random stuff. Ok, it might be worth adding a DEBUG_VM based (or similar) warning in case somebody ever thinks of passing a vmalloc pointer to it... As for falling out of the end of memory, yes it could be a real problem though I don't see why IO is any different than just hitting a non- mapped area in that regard. So we should probably keep an unused (readonly if possible) zero page at the end. Cheers, Ben.
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds wrote: > It's not used for vmalloc stuff. It's just regular kmalloc(). Just to clarify .. that's true of the dcache stuff. The strscpy case actually explicitly limits things to page boundaries and falls back to the byte-by-byte case after that. Linus
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds wrote: > It's not used for vmalloc stuff. It's just regular kmalloc(). Just to clarify .. that's true of the dcache stuff. The strscpy case actually explicitly limits things to page boundaries and falls back to the byte-by-byte case after that. Linus
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt wrote: > > Also, if we cross page boundaries with those guys then we have a bigger > problem no ? we could fall off a vmalloc page into the nether or into > an ioremap mapping no ? It's not used for vmalloc stuff. It's just regular kmalloc(). So it can cross pages, and it can fall off the end of memory, but it can't do random stuff. Linus
Re: Access to non-RAM pages
On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt wrote: > > Also, if we cross page boundaries with those guys then we have a bigger > problem no ? we could fall off a vmalloc page into the nether or into > an ioremap mapping no ? It's not used for vmalloc stuff. It's just regular kmalloc(). So it can cross pages, and it can fall off the end of memory, but it can't do random stuff. Linus
Re: Access to non-RAM pages
On Sat, 1 Sep 2018, Al Viro wrote: > IMO that's crap. In absolute majority of cases there is a guaranteed gap > between the end of accessed object and the next page boundary. So if that's the case, you're absolutely right. But I am unable to find any such guarantee in our current code though. Thanks, -- Jiri Kosina SUSE Labs
Re: Access to non-RAM pages
On Sat, 1 Sep 2018, Al Viro wrote: > IMO that's crap. In absolute majority of cases there is a guaranteed gap > between the end of accessed object and the next page boundary. So if that's the case, you're absolutely right. But I am unable to find any such guarantee in our current code though. Thanks, -- Jiri Kosina SUSE Labs
Re: Access to non-RAM pages
[ Adding a few new people the the cc. The issue is the worry about software-speculative accesses (ie things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw speculation now) accessing past RAM into possibly contiguous IO ] On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds wrote: > > If you have a machine with RAM that touches IO, you need to disable > the last page, exactly the same way we disable and marked reserved the > first page at zero. > > I thought we already did that. We don't seem to do that. And it's not just the last page, it's _any_ last page in a region that bumps up to IO. That's actually much more common in the low 4G area on PC's, I suspect, although the reserved BIOS ranges always tend to be there. I suspect it should be trivial to do - maybe in e820__memblock_setup()? That's where we already trim partial pages etc. In fact, I think this might be done as an extension of commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"), except making sure that non-RAM regions mark one page _previous_ as reserved too. I assume memory hotplug might have the same issue, and checking whether ARM64 and powerpc perhaps might have already done something like this (or might need to add it). We discussed long ago the case of user space mapping IO in user space, and decided we didn't care. But the kernel should probably explicitly make sure we don't either, even if I can't recall having ever seen a machine that actually maps IO contiguously to RAM. The layout always tends to end up having holes anyway. Linus
Re: Access to non-RAM pages
[ Adding a few new people the the cc. The issue is the worry about software-speculative accesses (ie things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw speculation now) accessing past RAM into possibly contiguous IO ] On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds wrote: > > If you have a machine with RAM that touches IO, you need to disable > the last page, exactly the same way we disable and marked reserved the > first page at zero. > > I thought we already did that. We don't seem to do that. And it's not just the last page, it's _any_ last page in a region that bumps up to IO. That's actually much more common in the low 4G area on PC's, I suspect, although the reserved BIOS ranges always tend to be there. I suspect it should be trivial to do - maybe in e820__memblock_setup()? That's where we already trim partial pages etc. In fact, I think this might be done as an extension of commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"), except making sure that non-RAM regions mark one page _previous_ as reserved too. I assume memory hotplug might have the same issue, and checking whether ARM64 and powerpc perhaps might have already done something like this (or might need to add it). We discussed long ago the case of user space mapping IO in user space, and decided we didn't care. But the kernel should probably explicitly make sure we don't either, even if I can't recall having ever seen a machine that actually maps IO contiguously to RAM. The layout always tends to end up having holes anyway. Linus
Re: Access to non-RAM pages
On Fri, Aug 31, 2018 at 2:18 PM Jiri Kosina wrote: > > If noone has any clever idea how to work this around (I don't), I am > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as > it's silently dangerous. No way in hell will I apply such a stupid patch. It is NOT dangerous. If you have a machine with RAM that touches IO, you need to disable the last page, exactly the same way we disable and marked reserved the first page at zero. I thought we already did that. I suspect this is a Xen bug, where the fake BIOS sets up a garbage description of the hardware that is simply not realistic. I don't think I've ever seen a machine that didn't have some reserved memory at the top, but hey, if we don't expressly mark the last page reserved already, doing so should be trivial. No way do we disable the word accesses just because of some crazy corner case that doesn't matter and doesn't happen in reality. Linus
Re: Access to non-RAM pages
On Fri, Aug 31, 2018 at 2:18 PM Jiri Kosina wrote: > > If noone has any clever idea how to work this around (I don't), I am > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as > it's silently dangerous. No way in hell will I apply such a stupid patch. It is NOT dangerous. If you have a machine with RAM that touches IO, you need to disable the last page, exactly the same way we disable and marked reserved the first page at zero. I thought we already did that. I suspect this is a Xen bug, where the fake BIOS sets up a garbage description of the hardware that is simply not realistic. I don't think I've ever seen a machine that didn't have some reserved memory at the top, but hey, if we don't expressly mark the last page reserved already, doing so should be trivial. No way do we disable the word accesses just because of some crazy corner case that doesn't matter and doesn't happen in reality. Linus
Re: Access to non-RAM pages
On Sat, Sep 01, 2018 at 12:47:48PM +0200, Juergen Gross wrote: > On 31/08/18 23:18, Jiri Kosina wrote: > > On Wed, 29 Aug 2018, Juergen Gross wrote: > > > >> While being very unlikely I still believe this is possible. Any > >> thoughts? > > > > So in theory we should somehow test whether the next page is some form of > > mmio/gart/... mapping, but I guess that by itself would kill the > > performance advantage of the whole load_unaligned_zeropad() trick. > > > > And yes, the sideffects of reading from mmio mapping is a real thing -- > > see for example the issue fixed by 2a3e83c6f9. > > > > If noone has any clever idea how to work this around (I don't), I am > > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as > > it's silently dangerous. > > Are there any architectures which can still use DCACHE_WORD_ACCESS? > > x86 shouldn't, what about powerpc, arm and arm64? IMO that's crap. In absolute majority of cases there is a guaranteed gap between the end of accessed object and the next page boundary. Penalizing each syscall that does pathname resolution to deal with something that might only happen when the pathname length is just under 4Kb looks like a bloody bad idea.
Re: Access to non-RAM pages
On Sat, Sep 01, 2018 at 12:47:48PM +0200, Juergen Gross wrote: > On 31/08/18 23:18, Jiri Kosina wrote: > > On Wed, 29 Aug 2018, Juergen Gross wrote: > > > >> While being very unlikely I still believe this is possible. Any > >> thoughts? > > > > So in theory we should somehow test whether the next page is some form of > > mmio/gart/... mapping, but I guess that by itself would kill the > > performance advantage of the whole load_unaligned_zeropad() trick. > > > > And yes, the sideffects of reading from mmio mapping is a real thing -- > > see for example the issue fixed by 2a3e83c6f9. > > > > If noone has any clever idea how to work this around (I don't), I am > > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as > > it's silently dangerous. > > Are there any architectures which can still use DCACHE_WORD_ACCESS? > > x86 shouldn't, what about powerpc, arm and arm64? IMO that's crap. In absolute majority of cases there is a guaranteed gap between the end of accessed object and the next page boundary. Penalizing each syscall that does pathname resolution to deal with something that might only happen when the pathname length is just under 4Kb looks like a bloody bad idea.
Re: Access to non-RAM pages
On 31/08/18 23:18, Jiri Kosina wrote: > On Wed, 29 Aug 2018, Juergen Gross wrote: > >> While being very unlikely I still believe this is possible. Any >> thoughts? > > So in theory we should somehow test whether the next page is some form of > mmio/gart/... mapping, but I guess that by itself would kill the > performance advantage of the whole load_unaligned_zeropad() trick. > > And yes, the sideffects of reading from mmio mapping is a real thing -- > see for example the issue fixed by 2a3e83c6f9. > > If noone has any clever idea how to work this around (I don't), I am > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as > it's silently dangerous. Are there any architectures which can still use DCACHE_WORD_ACCESS? x86 shouldn't, what about powerpc, arm and arm64? I'm happy to send patches to remove selection of DCACHE_WORD_ACCESS and eventually to remove all of the load_unaligned_zeropad() stuff. Juergen
Re: Access to non-RAM pages
On 31/08/18 23:18, Jiri Kosina wrote: > On Wed, 29 Aug 2018, Juergen Gross wrote: > >> While being very unlikely I still believe this is possible. Any >> thoughts? > > So in theory we should somehow test whether the next page is some form of > mmio/gart/... mapping, but I guess that by itself would kill the > performance advantage of the whole load_unaligned_zeropad() trick. > > And yes, the sideffects of reading from mmio mapping is a real thing -- > see for example the issue fixed by 2a3e83c6f9. > > If noone has any clever idea how to work this around (I don't), I am > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as > it's silently dangerous. Are there any architectures which can still use DCACHE_WORD_ACCESS? x86 shouldn't, what about powerpc, arm and arm64? I'm happy to send patches to remove selection of DCACHE_WORD_ACCESS and eventually to remove all of the load_unaligned_zeropad() stuff. Juergen
Re: Access to non-RAM pages
On Wed, 29 Aug 2018, Juergen Gross wrote: > While being very unlikely I still believe this is possible. Any > thoughts? So in theory we should somehow test whether the next page is some form of mmio/gart/... mapping, but I guess that by itself would kill the performance advantage of the whole load_unaligned_zeropad() trick. And yes, the sideffects of reading from mmio mapping is a real thing -- see for example the issue fixed by 2a3e83c6f9. If noone has any clever idea how to work this around (I don't), I am afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as it's silently dangerous. Thanks, -- Jiri Kosina SUSE Labs
Re: Access to non-RAM pages
On Wed, 29 Aug 2018, Juergen Gross wrote: > While being very unlikely I still believe this is possible. Any > thoughts? So in theory we should somehow test whether the next page is some form of mmio/gart/... mapping, but I guess that by itself would kill the performance advantage of the whole load_unaligned_zeropad() trick. And yes, the sideffects of reading from mmio mapping is a real thing -- see for example the issue fixed by 2a3e83c6f9. If noone has any clever idea how to work this around (I don't), I am afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as it's silently dangerous. Thanks, -- Jiri Kosina SUSE Labs
Access to non-RAM pages
Running as a Xen guest a customer's system experienced problems due to load_unaligned_zeropad() hitting a ballooned memory page. The 8 byte read was meant to read the last 4 bytes from a kernel page and should have returned 4 zero bytes for the not allocated page after that. In order not to split huge direct map pages The Xen balloon driver doesn't remove ballooned pages from the direct map. The memory page is removed from the EPT, of course, but a bug in the hypervisor would emulate the 8 byte read wrong by returning all bytes as 0xff, leading to above mentioned problems. While this bug can be repaired in Xen I'm rather sure it is dangerous to rely on load_unaligned_zeropad() doing the right thing when crossing page boundaries. Trying to access an invalid page is handled fine, but what happens if the next page is a MMIO area? Reading from that might have strange side effects. While being very unlikely I still believe this is possible. Any thoughts? Michal asked me to bring this up for a wider audience. Juergen
Access to non-RAM pages
Running as a Xen guest a customer's system experienced problems due to load_unaligned_zeropad() hitting a ballooned memory page. The 8 byte read was meant to read the last 4 bytes from a kernel page and should have returned 4 zero bytes for the not allocated page after that. In order not to split huge direct map pages The Xen balloon driver doesn't remove ballooned pages from the direct map. The memory page is removed from the EPT, of course, but a bug in the hypervisor would emulate the 8 byte read wrong by returning all bytes as 0xff, leading to above mentioned problems. While this bug can be repaired in Xen I'm rather sure it is dangerous to rely on load_unaligned_zeropad() doing the right thing when crossing page boundaries. Trying to access an invalid page is handled fine, but what happens if the next page is a MMIO area? Reading from that might have strange side effects. While being very unlikely I still believe this is possible. Any thoughts? Michal asked me to bring this up for a wider audience. Juergen