Re: pool guard
On 12/09/2019 at 08:21, Maxime Villard wrote:
> On 06/09/2019 at 15:09, Maxime Villard wrote:
>> An idea for a feature similar to KMEM_GUARD - which I recently removed
>> because it was too weak and useless -, but this time at the pool layer,
>> covering certain specific pools, without memory consumption or
>> performance cost, and enabled by default at least on amd64.
>>
>> Note that this is hardening and exploit mitigation, but not bug
>> detection, so it will be of little interest in the context of fuzzing.
>> Note also that it targets 64bit arches, because they have nearly
>> unlimited VA.
>>
>> The idea is that we can use special guard allocators on certain pools
>> to prevent important kernel data from being close to untrusted data in
>> the VA space.
>>
>> Suppose the kernel is parsing a received packet from the network, and
>> there is a buffer overflow which causes it to write beyond the mbuf.
>> The data is in an mbuf cluster of size 2K (on amd64). This mbuf
>> cluster sits on a 4K page allocated using the default pool allocator.
>> Right after the 4K page in memory, critical kernel data could be
>> sitting, which an attacker could overwrite.
>>
>>                                   overflow
>>                                   ------->
>> +-------------+-------------+----------------------+
>> | 2K Cluster  | 2K Cluster  | Critical Kernel Data |
>> +-------------+-------------+----------------------+
>> <---- usual 4K pool page ---><-- another 4K page -->
>>
>> This is a scenario that I already encountered when working on NetBSD's
>> network stack.
>>
>> Now, we switch the mcl pool to use the new uvm_km_guard API (simple
>> wrappers to allocate buffers with unmapped pages at the beginning and
>> the end). The pool layer sees pages of size 128K, and packs 64 2K
>> clusters in them.
>>
>> overflow ~>
>> +----------+-------+-------+-------+-------+-------+----------+
>> | Unmapped | 2K C. | 2K C. | [...] | 2K C. | 2K C. | Unmapped |
>> +----------+-------+-------+-------+-------+-------+----------+
>> <-- 64K --><--- 128K pool page with 64 clusters ---><-- 64K -->
>>
>> The pool page header is off-page, and bitmapped. Therefore, there is
>> strictly no kernel data in the 128K pool page. The overflow still
>> occurs, but this time the critical kernel data is far away, after the
>> unmapped pages at the end. At worst only other clusters get
>> overwritten; at best we are close to the end and hit a page fault
>> which stops the overflow. 64K is chosen because it is the maximum of a
>> uint16_t.
>>
>> No performance cost, because these guarded buffers are allocated only
>> when the pools grow, which is a rare operation that occurs almost only
>> at boot time. No actual memory consumption either, because unmapped
>> areas don't consume physical memory, only virtual, and on 64bit arches
>> we have plenty of that - e.g. 32TB on amd64, far beyond what we will
>> ever need -, so no problem with consuming VA.
>>
>> The code is here [1] for mcl, it is simple and works fine. It is not
>> perfect but can already prevent a lot of trouble. The principle could
>> be applied to other pools.
>>
>> [1] https://m00nbsd.net/garbage/pool/guard.diff
>
> If there are no further comments, I will commit it within a week.

Actually I realized there is a problem. uvm_km_guard must use uvm_map and
not the kmem arena, because the kmem arena is proportionate to the PA. The
thing is, using uvm_map is illegal, because we're sometimes in interrupt
context. Needs to be revisited.
Re: pool guard
On 06/09/2019 at 15:09, Maxime Villard wrote:
> An idea for a feature similar to KMEM_GUARD - which I recently removed
> because it was too weak and useless -, but this time at the pool layer,
> covering certain specific pools, without memory consumption or
> performance cost, and enabled by default at least on amd64.
>
> Note that this is hardening and exploit mitigation, but not bug
> detection, so it will be of little interest in the context of fuzzing.
> Note also that it targets 64bit arches, because they have nearly
> unlimited VA.
>
> The idea is that we can use special guard allocators on certain pools
> to prevent important kernel data from being close to untrusted data in
> the VA space.
>
> Suppose the kernel is parsing a received packet from the network, and
> there is a buffer overflow which causes it to write beyond the mbuf.
> The data is in an mbuf cluster of size 2K (on amd64). This mbuf cluster
> sits on a 4K page allocated using the default pool allocator. Right
> after the 4K page in memory, critical kernel data could be sitting,
> which an attacker could overwrite.
>
>                                  overflow
>                                  ------->
> +-------------+-------------+----------------------+
> | 2K Cluster  | 2K Cluster  | Critical Kernel Data |
> +-------------+-------------+----------------------+
> <---- usual 4K pool page ---><-- another 4K page -->
>
> This is a scenario that I already encountered when working on NetBSD's
> network stack.
>
> Now, we switch the mcl pool to use the new uvm_km_guard API (simple
> wrappers to allocate buffers with unmapped pages at the beginning and
> the end). The pool layer sees pages of size 128K, and packs 64 2K
> clusters in them.
>
> overflow ~>
> +----------+-------+-------+-------+-------+-------+----------+
> | Unmapped | 2K C. | 2K C. | [...] | 2K C. | 2K C. | Unmapped |
> +----------+-------+-------+-------+-------+-------+----------+
> <-- 64K --><--- 128K pool page with 64 clusters ---><-- 64K -->
>
> The pool page header is off-page, and bitmapped. Therefore, there is
> strictly no kernel data in the 128K pool page. The overflow still
> occurs, but this time the critical kernel data is far away, after the
> unmapped pages at the end. At worst only other clusters get
> overwritten; at best we are close to the end and hit a page fault which
> stops the overflow. 64K is chosen because it is the maximum of a
> uint16_t.
>
> No performance cost, because these guarded buffers are allocated only
> when the pools grow, which is a rare operation that occurs almost only
> at boot time. No actual memory consumption either, because unmapped
> areas don't consume physical memory, only virtual, and on 64bit arches
> we have plenty of that - e.g. 32TB on amd64, far beyond what we will
> ever need -, so no problem with consuming VA.
>
> The code is here [1] for mcl, it is simple and works fine. It is not
> perfect but can already prevent a lot of trouble. The principle could
> be applied to other pools.
>
> [1] https://m00nbsd.net/garbage/pool/guard.diff

If there are no further comments, I will commit it within a week.
Re: pool guard
> On Sep 8, 2019, at 4:48 PM, Maxime Villard wrote:
>
> "Fixing" this entails first having a UVM that can scale up to and work
> reasonably well with such gigantic amounts of RAM, which is far from
> being the case currently. We will need to have SMP, NUMA, large pages,
> and in short, we will likely have to rewrite UVM entirely.

Hence "at some point". I am under no illusions that it won't be a large
amount of work. At the very least, document the limitation so it's easy to
find later.

-- thorpej
Re: pool guard
On 08/09/2019 at 08:54, Jason Thorpe wrote:
>> On Sep 8, 2019, at 9:29 AM, Maxime Villard wrote:
>>
>> Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
>> you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB of
>> VA, twice the maximum PA. So again, we are fine.
>
> Yes, but obviously we should fix that at some point.

"Fixing" this entails first having a UVM that can scale up to and work
reasonably well with such gigantic amounts of RAM, which is far from being
the case currently. We will need to have SMP, NUMA, large pages, and in
short, we will likely have to rewrite UVM entirely. By the time we have
done all of that, 64bit CPUs will already ship with 5-level page tables,
which multiply the size of the VA space by 512, so again, we will be
largely fine. Even otherwise, revisiting pool guards to increase the
mapped/unmapped ratio is an insignificant and easy task. I see no problem
related to VA/PA amounts.
Re: pool guard
> On Sep 8, 2019, at 9:29 AM, Maxime Villard wrote:
>
> Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if
> you have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB of
> VA, twice the maximum PA. So again, we are fine.

Yes, but obviously we should fix that at some point.

-- thorpej
Re: pool guard
On 07/09/2019 at 23:47, matthew green wrote:
>> No performance cost, because these guarded buffers are allocated only
>> when the pools grow, which is a rare operation that occurs almost only
>> at boot time. No actual memory consumption either, because unmapped
>> areas don't consume physical memory, only virtual, and on 64bit arches
>> we have plenty of that - e.g. 32TB on amd64, far beyond what we will
>> ever need -, so no problem with consuming VA.
>
> i like this idea, but i would like to point out that HPE already sells
> a machine with 48TiB of ram, and nvdimms are going to explode the
> apparent memory in the coming years, so "32TiB" is very far from
> "plenty". we have many challenges to get beyond 8TiB tho, since we
> count pages in 'int' all over uvm.

Don't confuse VA and PA. NetBSD-amd64 supports 16TB of PA, so even if you
have 48TB nvdimms it gets truncated to 16TB. Then, we have 32TB of VA,
twice the maximum PA. So again, we are fine.
re: pool guard
> No performance cost, because these guarded buffers are allocated only
> when the pools grow, which is a rare operation that occurs almost only
> at boot time. No actual memory consumption either, because unmapped
> areas don't consume physical memory, only virtual, and on 64bit arches
> we have plenty of that - e.g. 32TB on amd64, far beyond what we will
> ever need -, so no problem with consuming VA.

i like this idea, but i would like to point out that HPE already sells a
machine with 48TiB of ram, and nvdimms are going to explode the apparent
memory in the coming years, so "32TiB" is very far from "plenty". we have
many challenges to get beyond 8TiB tho, since we count pages in 'int' all
over uvm.

.mrg.