Re: Access to non-RAM pages

2018-09-03 Thread Will Deacon
On Sun, Sep 02, 2018 at 07:10:46PM -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
>  wrote:
> >
> > Still, I can potentially see an issue with DEBUG_PAGEALLOC
> 
> An unmapped page isn't a problem. That's what the whole
> load_unaligned_zeropad() is about: it's ok to take a fault on the part
> that crosses a page, and we'll just fill the value with zeroes (that's
> the "zeropad" part).
> 
> So as long as it's rare (and it is), it's all fine.
> 
> That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
> not rare _enough_.
> 
> And vmalloc() should actually be safe too, simply because I think we
> strive for a guard page between vmalloc areas.
> 
> So only a *mapped* page after the page that matters, and only if it's
> something you can't read without side effects.
> 
> Which basically doesn't happen on x86 in reality. BIOSes just don't
> put MMIO right after the last page of RAM. I think this is why it only
> triggered on Xen, due to some crazy "Xen reacts badly" case where we
> do the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.

On arm64, ioremap() gives you a guard page because it allocates out of
the vmalloc area. The only way I think we could get MMIO in the middle
of the linear map would be if firmware has reserved something there. In
this case, the region should be treated as NOMAP, meaning we won't map
the area at all in the kernel and our pfn_valid() implementation will
return false for the corresponding memmap entries.

For userspace, we did consider putting out a guard page for non-fixed
mmap() calls, but it's not something we've really looked into.

Will


Re: Access to non-RAM pages

2018-09-03 Thread Will Deacon
On Sun, Sep 02, 2018 at 07:10:46PM -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
>  wrote:
> >
> > Still, I can potentially see an issue with DEBUG_PAGEALLOC
> 
> An unmapped page isn't a problem. That's what the whole
> load_unaligned_zeropad() is about: it's ok to take a fault on the part
> that crosses a page, and we'll just fill the value with zeroes (that's
> the "zeropad" part).
> 
> So as long as it's rare (and it is), it's all fine.
> 
> That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
> not rare _enough_.
> 
> And vmalloc() should actually be safe too, simply because I think we
> strive for a guard page between vmalloc areas.
> 
> So only a *mapped* page after the page that matters, and only if it's
> something you can't read without side effects.
> 
> Which basically doesn't happen on x86 in reality. BIOSes just don't
> put MMIO right after the last page of RAM. I think this is why it only
> triggered on Xen, due to some crazy "Xen reacts badly" case where we
> do the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.

On arm64, ioremap() gives you a guard page because it allocates out of
the vmalloc area. The only way I think we could get MMIO in the middle
of the linear map would be if firmware has reserved something there. In
this case, the region should be treated as NOMAP, meaning we won't map
the area at all in the kernel and our pfn_valid() implementation will
return false for the corresponding memmap entries.

For userspace, we did consider putting out a guard page for non-fixed
mmap() calls, but it's not something we've really looked into.

Will


Re: Access to non-RAM pages

2018-09-03 Thread Jiri Kosina
On Sun, 2 Sep 2018, Linus Torvalds wrote:

> Which basically doesn't happen on x86 in reality. BIOSes just don't put 
> MMIO right after the last page of RAM. I think this is why it only 
> triggered on Xen, due to some crazy "Xen reacts badly" case where we do 
> the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.

Yeah, and Xen guys are already working on fixing that AFAIK.

> But since in _theory_ you could have MMIO abut regular RAM directly, 
> it's worth maybe making sure it's purely theory.

Well, we've been hit by similar/related issue in practice, on x86 machines 
where GART aperture is being mapped over physical RAM. For the curious -- 
see commit 2a3e83c6f ("x86/gart: Exclude GART aperture from vmcore").

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: Access to non-RAM pages

2018-09-03 Thread Jiri Kosina
On Sun, 2 Sep 2018, Linus Torvalds wrote:

> Which basically doesn't happen on x86 in reality. BIOSes just don't put 
> MMIO right after the last page of RAM. I think this is why it only 
> triggered on Xen, due to some crazy "Xen reacts badly" case where we do 
> the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.

Yeah, and Xen guys are already working on fixing that AFAIK.

> But since in _theory_ you could have MMIO abut regular RAM directly, 
> it's worth maybe making sure it's purely theory.

Well, we've been hit by similar/related issue in practice, on x86 machines 
where GART aperture is being mapped over physical RAM. For the curious -- 
see commit 2a3e83c6f ("x86/gart: Exclude GART aperture from vmcore").

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: Access to non-RAM pages

2018-09-02 Thread Juergen Gross
On 03/09/18 04:10, Linus Torvalds wrote:
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.

And that was exactly the reason I brought it up. :-)


Juergen


Re: Access to non-RAM pages

2018-09-02 Thread Juergen Gross
On 03/09/18 04:10, Linus Torvalds wrote:
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.

And that was exactly the reason I brought it up. :-)


Juergen


Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:52 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds
>  wrote:
> > 
> > The comment actually does talk about it, although the comment also
> > claims that the cs read would use load_unaligned_zeropad(), which it
> > no longer does (now it only does the read_word_at_a_time).
> 
> IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and
> dentry_cmp() interfaces") for why the zeropad went away for the cs
> access (but the comment wasn't updated).
> 
> And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in
> dentry_string_cmp()") did the "let's make KASAN happy thing.
> 
> And yes, the word-at-a-time code actually matters a lot for certain
> loads. The "copy-and-hash" thing for path components ends up being
> pretty critical in all the pathname handling.

Yup, makes sense.

Thanks !

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:52 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds
>  wrote:
> > 
> > The comment actually does talk about it, although the comment also
> > claims that the cs read would use load_unaligned_zeropad(), which it
> > no longer does (now it only does the read_word_at_a_time).
> 
> IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and
> dentry_cmp() interfaces") for why the zeropad went away for the cs
> access (but the comment wasn't updated).
> 
> And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in
> dentry_string_cmp()") did the "let's make KASAN happy thing.
> 
> And yes, the word-at-a-time code actually matters a lot for certain
> loads. The "copy-and-hash" thing for path components ends up being
> pretty critical in all the pathname handling.

Yup, makes sense.

Thanks !

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds
 wrote:
>
> The comment actually does talk about it, although the comment also
> claims that the cs read would use load_unaligned_zeropad(), which it
> no longer does (now it only does the read_word_at_a_time).

IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and
dentry_cmp() interfaces") for why the zeropad went away for the cs
access (but the comment wasn't updated).

And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in
dentry_string_cmp()") did the "let's make KASAN happy thing.

And yes, the word-at-a-time code actually matters a lot for certain
loads. The "copy-and-hash" thing for path components ends up being
pretty critical in all the pathname handling.

Linus


Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds
 wrote:
>
> The comment actually does talk about it, although the comment also
> claims that the cs read would use load_unaligned_zeropad(), which it
> no longer does (now it only does the read_word_at_a_time).

IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and
dentry_cmp() interfaces") for why the zeropad went away for the cs
access (but the comment wasn't updated).

And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in
dentry_string_cmp()") did the "let's make KASAN happy thing.

And yes, the word-at-a-time code actually matters a lot for certain
loads. The "copy-and-hash" thing for path components ends up being
pretty critical in all the pathname handling.

Linus


Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:10 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Still, I can potentially see an issue with DEBUG_PAGEALLOC
> 
> An unmapped page isn't a problem. That's what the whole
> load_unaligned_zeropad() is about: it's ok to take a fault on the part
> that crosses a page, and we'll just fill the value with zeroes (that's
> the "zeropad" part).

Ah, my bad reading, I was looking at read_word_at_a_time() instead of
load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr
stuff, I assume this is safe ?

> So as long as it's rare (and it is), it's all fine.
> 
> That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
> not rare _enough_.
> 
> And vmalloc() should actually be safe too, simply because I think we
> strive for a guard page between vmalloc areas.
> 
> So only a *mapped* page after the page that matters, and only if it's
> something you can't read without side effects.
> 
> Which basically doesn't happen on x86 in reality. BIOSes just don't
> put MMIO right after the last page of RAM. I think this is why it only
> triggered on Xen, due to some crazy "Xen reacts badly" case where we
> do the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.




Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:10 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Still, I can potentially see an issue with DEBUG_PAGEALLOC
> 
> An unmapped page isn't a problem. That's what the whole
> load_unaligned_zeropad() is about: it's ok to take a fault on the part
> that crosses a page, and we'll just fill the value with zeroes (that's
> the "zeropad" part).

Ah, my bad reading, I was looking at read_word_at_a_time() instead of
load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr
stuff, I assume this is safe ?

> So as long as it's rare (and it is), it's all fine.
> 
> That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
> not rare _enough_.
> 
> And vmalloc() should actually be safe too, simply because I think we
> strive for a guard page between vmalloc areas.
> 
> So only a *mapped* page after the page that matters, and only if it's
> something you can't read without side effects.
> 
> Which basically doesn't happen on x86 in reality. BIOSes just don't
> put MMIO right after the last page of RAM. I think this is why it only
> triggered on Xen, due to some crazy "Xen reacts badly" case where we
> do the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.




Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 7:25 PM Benjamin Herrenschmidt
 wrote:
> Ah, my bad reading, I was looking at read_word_at_a_time() instead of
> load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr
> stuff, I assume this is safe ?

The dentry qstr should always be 8-byte aligned because it's a kernel
name allocation.

So it's the path component in the actual pathname string that can be
unaligned (ct/tcount in dentry_string_cmp).

The comment actually does talk about it, although the comment also
claims that the cs read would use load_unaligned_zeropad(), which it
no longer does (now it only does the read_word_at_a_time).

And read_word_at_a_time() is purely a KASAN thing. The thing can't
fault, but it *can* read uninitialized data past the end of the
string, making KASAN unhappy.

So that's actually a different issue, where KASAN does byte-level
validity testing, and doing word-at-a-time accesses obviously violates
that for strings.

   Linus


Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 7:25 PM Benjamin Herrenschmidt
 wrote:
> Ah, my bad reading, I was looking at read_word_at_a_time() instead of
> load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr
> stuff, I assume this is safe ?

The dentry qstr should always be 8-byte aligned because it's a kernel
name allocation.

So it's the path component in the actual pathname string that can be
unaligned (ct/tcount in dentry_string_cmp).

The comment actually does talk about it, although the comment also
claims that the cs read would use load_unaligned_zeropad(), which it
no longer does (now it only does the read_word_at_a_time).

And read_word_at_a_time() is purely a KASAN thing. The thing can't
fault, but it *can* read uninitialized data past the end of the
string, making KASAN unhappy.

So that's actually a different issue, where KASAN does byte-level
validity testing, and doing word-at-a-time accesses obviously violates
that for strings.

   Linus


Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
 wrote:
>
> Still, I can potentially see an issue with DEBUG_PAGEALLOC

An unmapped page isn't a problem. That's what the whole
load_unaligned_zeropad() is about: it's ok to take a fault on the part
that crosses a page, and we'll just fill the value with zeroes (that's
the "zeropad" part).

So as long as it's rare (and it is), it's all fine.

That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
not rare _enough_.

And vmalloc() should actually be safe too, simply because I think we
strive for a guard page between vmalloc areas.

So only a *mapped* page after the page that matters, and only if it's
something you can't read without side effects.

Which basically doesn't happen on x86 in reality. BIOSes just don't
put MMIO right after the last page of RAM. I think this is why it only
triggered on Xen, due to some crazy "Xen reacts badly" case where we
do the speculation into a balloon address.

So _practically_ this is just a Xen bug, nothing more.

But since in _theory_ you could have MMIO abut regular RAM directly,
it's worth maybe making sure it's purely theory.

Linus


Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
 wrote:
>
> Still, I can potentially see an issue with DEBUG_PAGEALLOC

An unmapped page isn't a problem. That's what the whole
load_unaligned_zeropad() is about: it's ok to take a fault on the part
that crosses a page, and we'll just fill the value with zeroes (that's
the "zeropad" part).

So as long as it's rare (and it is), it's all fine.

That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
not rare _enough_.

And vmalloc() should actually be safe too, simply because I think we
strive for a guard page between vmalloc areas.

So only a *mapped* page after the page that matters, and only if it's
something you can't read without side effects.

Which basically doesn't happen on x86 in reality. BIOSes just don't
put MMIO right after the last page of RAM. I think this is why it only
triggered on Xen, due to some crazy "Xen reacts badly" case where we
do the speculation into a balloon address.

So _practically_ this is just a Xen bug, nothing more.

But since in _theory_ you could have MMIO abut regular RAM directly,
it's worth maybe making sure it's purely theory.

Linus


Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> [ Adding a few new people the the cc.
> 
>   The issue is the worry about software-speculative accesses (ie
> things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> speculation now) accessing past RAM into possibly contiguous IO ]
> 
> On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
>  wrote:
> > 
> > If you have a machine with RAM that touches IO, you need to disable
> > the last page, exactly the same way we disable and marked reserved the
> > first page at zero.

So I missed the departure of that train ... stupid question, with
CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
what prevents it from crossing into a non-mapped page (not even IO) and
causing an oops ? Looking at a random user in fs/dcache.c its not a
uaccess-style read with recovery Or am I missing somethign obvious
here ?

IE, should we "reserve" the last page of any memory region (maybe mark
it read-only) to avoid this along with avoiding leakage into IO space ?

> > I thought we already did that.
> 
> We don't seem to do that.
> 
> And it's not just the last page, it's _any_ last page in a region that
> bumps up to IO. That's actually much more common in the low 4G area on
> PC's, I suspect, although the reserved BIOS ranges always tend to be
> there.

What makes IO more "wrong" than oopsing due to the page not being
mapped ?

> I suspect it should be trivial to do - maybe in
> e820__memblock_setup()? That's where we already trim partial pages
> etc.
>
> In fact, I think this might be done as an extension of commit
> 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved"), except making sure that non-RAM regions mark one
> page _previous_ as reserved too.
> 
> I assume memory hotplug might have the same issue, and checking
> whether ARM64 and powerpc perhaps might have already done something
> like this (or might need to add it).
> 
> We discussed long ago the case of user space mapping IO in user space,
> and decided we didn't care. But the kernel should probably explicitly
> make sure we don't either, even if I can't recall having ever seen a
> machine that actually maps IO contiguously to RAM. The layout always
> tends to end up having holes anyway.

Can't we put the safety in generic memblock ? IE, don't hand out an
allocation that contain the last page of a "block" and handle that last
page in the memblock->buddy transition rather than in arch specific
code ?

Cheers,
Ben.




Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> [ Adding a few new people the the cc.
> 
>   The issue is the worry about software-speculative accesses (ie
> things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> speculation now) accessing past RAM into possibly contiguous IO ]
> 
> On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
>  wrote:
> > 
> > If you have a machine with RAM that touches IO, you need to disable
> > the last page, exactly the same way we disable and marked reserved the
> > first page at zero.

So I missed the departure of that train ... stupid question, with
CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
what prevents it from crossing into a non-mapped page (not even IO) and
causing an oops ? Looking at a random user in fs/dcache.c its not a
uaccess-style read with recovery Or am I missing somethign obvious
here ?

IE, should we "reserve" the last page of any memory region (maybe mark
it read-only) to avoid this along with avoiding leakage into IO space ?

> > I thought we already did that.
> 
> We don't seem to do that.
> 
> And it's not just the last page, it's _any_ last page in a region that
> bumps up to IO. That's actually much more common in the low 4G area on
> PC's, I suspect, although the reserved BIOS ranges always tend to be
> there.

What makes IO more "wrong" than oopsing due to the page not being
mapped ?

> I suspect it should be trivial to do - maybe in
> e820__memblock_setup()? That's where we already trim partial pages
> etc.
>
> In fact, I think this might be done as an extension of commit
> 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved"), except making sure that non-RAM regions mark one
> page _previous_ as reserved too.
> 
> I assume memory hotplug might have the same issue, and checking
> whether ARM64 and powerpc perhaps might have already done something
> like this (or might need to add it).
> 
> We discussed long ago the case of user space mapping IO in user space,
> and decided we didn't care. But the kernel should probably explicitly
> make sure we don't either, even if I can't recall having ever seen a
> machine that actually maps IO contiguously to RAM. The layout always
> tends to end up having holes anyway.

Can't we put the safety in generic memblock ? IE, don't hand out an
allocation that contain the last page of a "block" and handle that last
page in the memblock->buddy transition rather than in arch specific
code ?

Cheers,
Ben.




Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:42 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds
>  wrote:
> > It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> Just to clarify .. that's true of the dcache stuff.
> 
> The strscpy case actually explicitly limits things to page boundaries
> and falls back to the byte-by-byte case after that.

Ah ok, that makes sense.

Still, I can potentially see an issue with DEBUG_PAGEALLOC

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:42 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds
>  wrote:
> > It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> Just to clarify .. that's true of the dcache stuff.
> 
> The strscpy case actually explicitly limits things to page boundaries
> and falls back to the byte-by-byte case after that.

Ah ok, that makes sense.

Still, I can potentially see an issue with DEBUG_PAGEALLOC

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Mon, 2018-09-03 at 10:48 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> > [ Adding a few new people the the cc.
> > 
> >   The issue is the worry about software-speculative accesses (ie
> > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> > speculation now) accessing past RAM into possibly contiguous IO ]
> > 
> > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
> >  wrote:
> > > 
> > > If you have a machine with RAM that touches IO, you need to disable
> > > the last page, exactly the same way we disable and marked reserved the
> > > first page at zero.
> 
> So I missed the departure of that train ... stupid question, with
> CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
> what prevents it from crossing into a non-mapped page (not even IO) and
> causing an oops ? Looking at a random user in fs/dcache.c its not a
> uaccess-style read with recovery Or am I missing somethign obvious
> here ?

Also, if we cross page boundaries with those guys then we have a bigger
problem no ? we could fall off a vmalloc page into the nether or into
an ioremap mapping no ?

Cheers,
Ben. 



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Mon, 2018-09-03 at 10:48 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> > [ Adding a few new people the the cc.
> > 
> >   The issue is the worry about software-speculative accesses (ie
> > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> > speculation now) accessing past RAM into possibly contiguous IO ]
> > 
> > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
> >  wrote:
> > > 
> > > If you have a machine with RAM that touches IO, you need to disable
> > > the last page, exactly the same way we disable and marked reserved the
> > > first page at zero.
> 
> So I missed the departure of that train ... stupid question, with
> CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
> what prevents it from crossing into a non-mapped page (not even IO) and
> causing an oops ? Looking at a random user in fs/dcache.c its not a
> uaccess-style read with recovery Or am I missing somethign obvious
> here ?

Also, if we cross page boundaries with those guys then we have a bigger
problem no ? we could fall off a vmalloc page into the nether or into
an ioremap mapping no ?

Cheers,
Ben. 



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem no ? we could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Actually what about DEBUG_PAGEALLOC ? Can't we fall off a page and
fault that way too ?

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem no ? we could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Actually what about DEBUG_PAGEALLOC ? Can't we fall off a page and
fault that way too ?

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem no ? we could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Ok, it might be worth adding a DEBUG_VM based (or similar) warning in
case somebody ever thinks of passing a vmalloc pointer to it...

As for falling out of the end of memory, yes it could be a real problem
though I don't see why IO is any different than just hitting a non-
mapped area in that regard. So we should probably keep an unused
(readonly if possible) zero page at the end.
 
Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem no ? we could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Ok, it might be worth adding a DEBUG_VM based (or similar) warning in
case somebody ever thinks of passing a vmalloc pointer to it...

As for falling out of the end of memory, yes it could be a real problem
though I don't see why IO is any different than just hitting a non-
mapped area in that regard. So we should probably keep an unused
(readonly if possible) zero page at the end.
 
Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds
 wrote:
> It's not used for vmalloc stuff. It's just regular kmalloc().

Just to clarify .. that's true of the dcache stuff.

The strscpy case actually explicitly limits things to page boundaries
and falls back to the byte-by-byte case after that.

 Linus


Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds
 wrote:
> It's not used for vmalloc stuff. It's just regular kmalloc().

Just to clarify .. that's true of the dcache stuff.

The strscpy case actually explicitly limits things to page boundaries
and falls back to the byte-by-byte case after that.

 Linus


Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
 wrote:
>
> Also, if we cross page boundaries with those guys then we have a bigger
> problem no ? we could fall off a vmalloc page into the nether or into
> an ioremap mapping no ?

It's not used for vmalloc stuff. It's just regular kmalloc().

So it can cross pages, and it can fall off the end of memory, but it
can't do random stuff.

   Linus


Re: Access to non-RAM pages

2018-09-02 Thread Linus Torvalds
On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
 wrote:
>
> Also, if we cross page boundaries with those guys then we have a bigger
> problem no ? we could fall off a vmalloc page into the nether or into
> an ioremap mapping no ?

It's not used for vmalloc stuff. It's just regular kmalloc().

So it can cross pages, and it can fall off the end of memory, but it
can't do random stuff.

   Linus


Re: Access to non-RAM pages

2018-09-01 Thread Jiri Kosina
On Sat, 1 Sep 2018, Al Viro wrote:

> IMO that's crap.  In absolute majority of cases there is a guaranteed gap
> between the end of accessed object and the next page boundary.  

So if that's the case, you're absolutely right. But I am unable to find 
any such guarantee in our current code though.

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: Access to non-RAM pages

2018-09-01 Thread Jiri Kosina
On Sat, 1 Sep 2018, Al Viro wrote:

> IMO that's crap.  In absolute majority of cases there is a guaranteed gap
> between the end of accessed object and the next page boundary.  

So if that's the case, you're absolutely right. But I am unable to find 
any such guarantee in our current code though.

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: Access to non-RAM pages

2018-09-01 Thread Linus Torvalds
[ Adding a few new people the the cc.

  The issue is the worry about software-speculative accesses (ie
things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
speculation now) accessing past RAM into possibly contiguous IO ]

On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
 wrote:
>
> If you have a machine with RAM that touches IO, you need to disable
> the last page, exactly the same way we disable and marked reserved the
> first page at zero.
>
> I thought we already did that.

We don't seem to do that.

And it's not just the last page, it's _any_ last page in a region that
bumps up to IO. That's actually much more common in the low 4G area on
PC's, I suspect, although the reserved BIOS ranges always tend to be
there.

I suspect it should be trivial to do - maybe in
e820__memblock_setup()? That's where we already trim partial pages
etc.

In fact, I think this might be done as an extension of commit
124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved"), except making sure that non-RAM regions mark one
page _previous_ as reserved too.

I assume memory hotplug might have the same issue, and checking
whether ARM64 and powerpc perhaps might have already done something
like this (or might need to add it).

We discussed long ago the case of user space mapping IO in user space,
and decided we didn't care. But the kernel should probably explicitly
make sure we don't either, even if I can't recall having ever seen a
machine that actually maps IO contiguously to RAM. The layout always
tends to end up having holes anyway.

  Linus


Re: Access to non-RAM pages

2018-09-01 Thread Linus Torvalds
[ Adding a few new people the the cc.

  The issue is the worry about software-speculative accesses (ie
things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
speculation now) accessing past RAM into possibly contiguous IO ]

On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
 wrote:
>
> If you have a machine with RAM that touches IO, you need to disable
> the last page, exactly the same way we disable and marked reserved the
> first page at zero.
>
> I thought we already did that.

We don't seem to do that.

And it's not just the last page, it's _any_ last page in a region that
bumps up to IO. That's actually much more common in the low 4G area on
PC's, I suspect, although the reserved BIOS ranges always tend to be
there.

I suspect it should be trivial to do - maybe in
e820__memblock_setup()? That's where we already trim partial pages
etc.

In fact, I think this might be done as an extension of commit
124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved"), except making sure that non-RAM regions mark one
page _previous_ as reserved too.

I assume memory hotplug might have the same issue, and checking
whether ARM64 and powerpc perhaps might have already done something
like this (or might need to add it).

We discussed long ago the case of user space mapping IO in user space,
and decided we didn't care. But the kernel should probably explicitly
make sure we don't either, even if I can't recall having ever seen a
machine that actually maps IO contiguously to RAM. The layout always
tends to end up having holes anyway.

  Linus


Re: Access to non-RAM pages

2018-09-01 Thread Linus Torvalds
On Fri, Aug 31, 2018 at 2:18 PM Jiri Kosina  wrote:
>
> If noone has any clever idea how to work this around (I don't), I am
> afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as
> it's silently dangerous.

No way in hell will I apply such a stupid patch.

It is NOT dangerous.

If you have a machine with RAM that touches IO, you need to disable
the last page, exactly the same way we disable and marked reserved the
first page at zero.

I thought we already did that.

I suspect this is a Xen bug, where the fake BIOS sets up a garbage
description of the hardware that is simply not realistic. I don't
think I've ever seen a machine that didn't have some reserved memory
at the top, but hey, if we don't expressly mark the last page reserved
already, doing so should be trivial.

No way do we disable the word accesses just because of some crazy
corner case that doesn't matter and doesn't happen in reality.

   Linus


Re: Access to non-RAM pages

2018-09-01 Thread Linus Torvalds
On Fri, Aug 31, 2018 at 2:18 PM Jiri Kosina  wrote:
>
> If noone has any clever idea how to work this around (I don't), I am
> afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as
> it's silently dangerous.

No way in hell will I apply such a stupid patch.

It is NOT dangerous.

If you have a machine with RAM that touches IO, you need to disable
the last page, exactly the same way we disable and marked reserved the
first page at zero.

I thought we already did that.

I suspect this is a Xen bug, where the fake BIOS sets up a garbage
description of the hardware that is simply not realistic. I don't
think I've ever seen a machine that didn't have some reserved memory
at the top, but hey, if we don't expressly mark the last page reserved
already, doing so should be trivial.

No way do we disable the word accesses just because of some crazy
corner case that doesn't matter and doesn't happen in reality.

   Linus


Re: Access to non-RAM pages

2018-09-01 Thread Al Viro
On Sat, Sep 01, 2018 at 12:47:48PM +0200, Juergen Gross wrote:
> On 31/08/18 23:18, Jiri Kosina wrote:
> > On Wed, 29 Aug 2018, Juergen Gross wrote:
> > 
> >> While being very unlikely I still believe this is possible. Any
> >> thoughts?
> > 
> > So in theory we should somehow test whether the next page is some form of 
> > mmio/gart/... mapping, but I guess that by itself would kill the 
> > performance advantage of the whole load_unaligned_zeropad() trick.
> > 
> > And yes, the sideffects of reading from mmio mapping is a real thing -- 
> > see for example the issue fixed by 2a3e83c6f9.
> > 
> > If noone has any clever idea how to work this around (I don't), I am 
> > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as 
> > it's silently dangerous.
> 
> Are there any architectures which can still use DCACHE_WORD_ACCESS?
> 
> x86 shouldn't, what about powerpc, arm and arm64?

IMO that's crap.  In absolute majority of cases there is a guaranteed gap
between the end of accessed object and the next page boundary.  Penalizing
each syscall that does pathname resolution to deal with something that
might only happen when the pathname length is just under 4Kb looks like
a bloody bad idea.


Re: Access to non-RAM pages

2018-09-01 Thread Al Viro
On Sat, Sep 01, 2018 at 12:47:48PM +0200, Juergen Gross wrote:
> On 31/08/18 23:18, Jiri Kosina wrote:
> > On Wed, 29 Aug 2018, Juergen Gross wrote:
> > 
> >> While being very unlikely I still believe this is possible. Any
> >> thoughts?
> > 
> > So in theory we should somehow test whether the next page is some form of 
> > mmio/gart/... mapping, but I guess that by itself would kill the 
> > performance advantage of the whole load_unaligned_zeropad() trick.
> > 
> > And yes, the sideffects of reading from mmio mapping is a real thing -- 
> > see for example the issue fixed by 2a3e83c6f9.
> > 
> > If noone has any clever idea how to work this around (I don't), I am 
> > afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as 
> > it's silently dangerous.
> 
> Are there any architectures which can still use DCACHE_WORD_ACCESS?
> 
> x86 shouldn't, what about powerpc, arm and arm64?

IMO that's crap.  In absolute majority of cases there is a guaranteed gap
between the end of accessed object and the next page boundary.  Penalizing
each syscall that does pathname resolution to deal with something that
might only happen when the pathname length is just under 4Kb looks like
a bloody bad idea.


Re: Access to non-RAM pages

2018-09-01 Thread Juergen Gross
On 31/08/18 23:18, Jiri Kosina wrote:
> On Wed, 29 Aug 2018, Juergen Gross wrote:
> 
>> While being very unlikely I still believe this is possible. Any
>> thoughts?
> 
> So in theory we should somehow test whether the next page is some form of 
> mmio/gart/... mapping, but I guess that by itself would kill the 
> performance advantage of the whole load_unaligned_zeropad() trick.
> 
> And yes, the sideffects of reading from mmio mapping is a real thing -- 
> see for example the issue fixed by 2a3e83c6f9.
> 
> If noone has any clever idea how to work this around (I don't), I am 
> afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as 
> it's silently dangerous.

Are there any architectures which can still use DCACHE_WORD_ACCESS?

x86 shouldn't, what about powerpc, arm and arm64?

I'm happy to send patches to remove selection of DCACHE_WORD_ACCESS
and eventually to remove all of the load_unaligned_zeropad() stuff.


Juergen


Re: Access to non-RAM pages

2018-09-01 Thread Juergen Gross
On 31/08/18 23:18, Jiri Kosina wrote:
> On Wed, 29 Aug 2018, Juergen Gross wrote:
> 
>> While being very unlikely I still believe this is possible. Any
>> thoughts?
> 
> So in theory we should somehow test whether the next page is some form of 
> mmio/gart/... mapping, but I guess that by itself would kill the 
> performance advantage of the whole load_unaligned_zeropad() trick.
> 
> And yes, the sideffects of reading from mmio mapping is a real thing -- 
> see for example the issue fixed by 2a3e83c6f9.
> 
> If noone has any clever idea how to work this around (I don't), I am 
> afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as 
> it's silently dangerous.

Are there any architectures which can still use DCACHE_WORD_ACCESS?

x86 shouldn't, what about powerpc, arm and arm64?

I'm happy to send patches to remove selection of DCACHE_WORD_ACCESS
and eventually to remove all of the load_unaligned_zeropad() stuff.


Juergen


Re: Access to non-RAM pages

2018-08-31 Thread Jiri Kosina
On Wed, 29 Aug 2018, Juergen Gross wrote:

> While being very unlikely I still believe this is possible. Any
> thoughts?

So in theory we should somehow test whether the next page is some form of 
mmio/gart/... mapping, but I guess that by itself would kill the 
performance advantage of the whole load_unaligned_zeropad() trick.

And yes, the sideffects of reading from mmio mapping is a real thing -- 
see for example the issue fixed by 2a3e83c6f9.

If noone has any clever idea how to work this around (I don't), I am 
afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as 
it's silently dangerous.

Thanks,

-- 
Jiri Kosina
SUSE Labs


Re: Access to non-RAM pages

2018-08-31 Thread Jiri Kosina
On Wed, 29 Aug 2018, Juergen Gross wrote:

> While being very unlikely I still believe this is possible. Any
> thoughts?

So in theory we should somehow test whether the next page is some form of 
mmio/gart/... mapping, but I guess that by itself would kill the 
performance advantage of the whole load_unaligned_zeropad() trick.

And yes, the sideffects of reading from mmio mapping is a real thing -- 
see for example the issue fixed by 2a3e83c6f9.

If noone has any clever idea how to work this around (I don't), I am 
afraid we'd have to ditch the whole DCACHE_WORD_ACCESS optimization, as 
it's silently dangerous.

Thanks,

-- 
Jiri Kosina
SUSE Labs


Access to non-RAM pages

2018-08-29 Thread Juergen Gross
Running as a Xen guest a customer's system experienced problems due
to load_unaligned_zeropad() hitting a ballooned memory page. The
8 byte read was meant to read the last 4 bytes from a kernel page
and should have returned 4 zero bytes for the not allocated page
after that.

In order not to split huge direct map pages The Xen balloon driver
doesn't remove ballooned pages from the direct map. The memory page
is removed from the EPT, of course, but a bug in the hypervisor would
emulate the 8 byte read wrong by returning all bytes as 0xff, leading
to above mentioned problems.

While this bug can be repaired in Xen I'm rather sure it is dangerous
to rely on load_unaligned_zeropad() doing the right thing when crossing
page boundaries. Trying to access an invalid page is handled fine, but
what happens if the next page is a MMIO area? Reading from that might
have strange side effects.

While being very unlikely I still believe this is possible. Any
thoughts?

Michal asked me to bring this up for a wider audience.


Juergen


Access to non-RAM pages

2018-08-29 Thread Juergen Gross
Running as a Xen guest a customer's system experienced problems due
to load_unaligned_zeropad() hitting a ballooned memory page. The
8 byte read was meant to read the last 4 bytes from a kernel page
and should have returned 4 zero bytes for the not allocated page
after that.

In order not to split huge direct map pages The Xen balloon driver
doesn't remove ballooned pages from the direct map. The memory page
is removed from the EPT, of course, but a bug in the hypervisor would
emulate the 8 byte read wrong by returning all bytes as 0xff, leading
to above mentioned problems.

While this bug can be repaired in Xen I'm rather sure it is dangerous
to rely on load_unaligned_zeropad() doing the right thing when crossing
page boundaries. Trying to access an invalid page is handled fine, but
what happens if the next page is a MMIO area? Reading from that might
have strange side effects.

While being very unlikely I still believe this is possible. Any
thoughts?

Michal asked me to bring this up for a wider audience.


Juergen