Re: pool guard

2019-09-17 Thread Maxime Villard

On 12/09/2019 at 08:21, Maxime Villard wrote:

On 06/09/2019 at 15:09, Maxime Villard wrote:

An idea for a feature similar to KMEM_GUARD (which I recently removed because
it was too weak to be useful), but this time at the pool layer, covering certain
specific pools, with no memory consumption or performance cost, and enabled by
default at least on amd64. Note that this is hardening and exploit mitigation,
not bug detection, so it will be of little interest in the context of fuzzing.
Note also that it targets 64-bit arches, because they have nearly unlimited VA.

The idea is that we can use special guard allocators on certain pools to prevent
important kernel data from being close to untrusted data in the VA space.

Suppose the kernel is parsing a packet received from the network, and a buffer
overflow causes it to write beyond the mbuf. The data is in an mbuf cluster of
size 2K (on amd64). This mbuf cluster sits in a 4K page allocated by the
default pool allocator. Critical kernel data could sit right after that 4K page
in memory, and an attacker could overwrite it.

     overflow
     --->
 +------------+------------+----------------------+
 | 2K Cluster | 2K Cluster | Critical Kernel Data |
 +------------+------------+----------------------+
  <- usual 4K pool page --> <- another 4K page -->

This is a scenario that I already encountered when working on NetBSD's network
stack.

Now, we switch the mcl pool to use the new uvm_km_guard API (simple wrappers
to allocate buffers with unmapped pages at the beginning and the end). The pool
layer sees pages of size 128K, and packs 64 2K clusters in them.

     overflow
     --->
 +------------+-------+-------+-------+-------+-------+------------+
 |  Unmapped  | 2K C. | 2K C. | [...] | 2K C. | 2K C. |  Unmapped  |
 +------------+-------+-------+-------+-------+-------+------------+
  <-- 64K ---> <-- 128K pool page with 64 clusters --> <-- 64K --->

The pool page header is off-page, and bitmapped. Therefore, there is strictly
no kernel data in the 128K pool page.

The overflow still occurs, but this time the critical kernel data is far away,
beyond the unmapped pages at the end. At worst, only other clusters get
overwritten; at best, we are close to the end and hit a page fault which stops
the overflow. 64K is chosen because it covers the maximum value of a uint16_t,
so an overflow whose length fits in 16 bits cannot jump past the guard.

No performance cost, because these guarded buffers are allocated only when the
pools grow, which is a rare operation occurring almost exclusively at boot
time. No actual memory consumption either, because unmapped areas consume no
physical memory, only virtual, and on 64-bit arches we have plenty of that
(eg 32TB on amd64, far beyond what we will ever need), so consuming VA is not
a problem.

The code is here [1] for mcl; it is simple and works fine. It is not perfect,
but it can already prevent a lot of trouble. The principle could be applied to
other pools.

[1] https://m00nbsd.net/garbage/pool/guard.diff


If there are no further comments, I will commit it within a week.


Actually, I realized there is a problem. uvm_km_guard must allocate from
uvm_map and not from the kmem arena, because the kmem arena is sized in
proportion to physical memory. The problem is that calling uvm_map is illegal
here, because we are sometimes in interrupt context. This needs to be
revisited.


Re: pool guard

2019-09-12 Thread Maxime Villard

On 06/09/2019 at 15:09, Maxime Villard wrote:

[...]

If there are no further comments, I will commit it within a week.


Re: pool guard

2019-09-08 Thread Jason Thorpe



> On Sep 8, 2019, at 4:48 PM, Maxime Villard  wrote:
> 
> "Fixing" this entails first having a UVM that can scale up to and work
> reasonably well with such gigantic amounts of RAM, which is far from being the
> case currently. We will need to have SMP, NUMA, large pages, and in short, we
> will likely have to rewrite UVM entirely.

Hence "at some point".  I am under no illusions: it will be a large amount
of work.  At the very least, document the limitation so it's easy to find later.

-- thorpej



Re: pool guard

2019-09-08 Thread Maxime Villard

On 08/09/2019 at 08:54, Jason Thorpe wrote:

On Sep 8, 2019, at 9:29 AM, Maxime Villard  wrote:
Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB
of VA, twice the maximum PA. So again, we are fine.


Yes, but obviously we should fix that at some point.


"Fixing" this entails first having a UVM that can scale up to and work
reasonably well with such gigantic amounts of RAM, which is far from being the
case currently. We will need to have SMP, NUMA, large pages, and in short, we
will likely have to rewrite UVM entirely.

By the time we have done all of that, 64-bit CPUs will already ship with
5-level page tables, which multiply the size of the VA space by 512, so again,
we will be largely fine.

Even otherwise, revisiting pool guards to increase the mapped/unmapped ratio
is a very small and easy task.

I see no problem related to VA/PA amounts.


Re: pool guard

2019-09-08 Thread Jason Thorpe


> On Sep 8, 2019, at 9:29 AM, Maxime Villard  wrote:
> 
> Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
> you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB
> of VA, twice the maximum PA. So again, we are fine.

Yes, but obviously we should fix that at some point.

-- thorpej



Re: pool guard

2019-09-08 Thread Maxime Villard

On 07/09/2019 at 23:47, matthew green wrote:

No performance cost, because these guarded buffers are allocated only when the
pools grow, which is a rare operation that occurs almost only at boot time. No
actual memory consumption either, because unmapped areas don't consume physical
memory, only virtual, and on 64bit arches we have plenty of that - eg 32TB on
amd64, far beyond what we will ever need -, so no problem with consuming VA.


i like this idea, but i would like to point out that HPE already
sells a machine with 48TiB ram and nvdimms are going to explode
the apparent memory sizes in the coming years, so "32TiB" is very far
from "plenty".  we have many challenges to get beyond 8TiB tho,
since we count pages in 'int' all over uvm.


Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB
of VA, twice the maximum PA. So again, we are fine.


re: pool guard

2019-09-07 Thread matthew green
> No performance cost, because these guarded buffers are allocated only when the
> pools grow, which is a rare operation that occurs almost only at boot time. No
> actual memory consumption either, because unmapped areas don't consume physical
> memory, only virtual, and on 64bit arches we have plenty of that - eg 32TB on
> amd64, far beyond what we will ever need -, so no problem with consuming VA.

i like this idea, but i would like to point out that HPE already
sells a machine with 48TiB ram and nvdimms are going to explode
the apparent memory sizes in the coming years, so "32TiB" is very far
from "plenty".  we have many challenges to get beyond 8TiB tho,
since we count pages in 'int' all over uvm.


.mrg.