Re: [kernel-hardening] Re: [PATCH] fork: make whole stack_canary random
On Mon, 2016-10-31 at 22:10 +0100, Florian Weimer wrote:
> * Daniel Micay:
>
> > It makes a lot of sense on x86_64 where it means the canary is
> > still 56 bits. Also, you want -fstack-check for protecting against
> > stack overflows rather than stack *buffer* overflows. SSP won't
> > really help you in that regard. Sadly, while -fstack-check now
> > works well in GCC 6 with little performance cost, it's not really a
>
> I think GCC still does not treat the return address push on
> architectures which have such a CALL instruction as an implicit stack
> probe.
>
> > complete feature (and Clang impls it as a no-op!).
>
> How many guard pages at the end of the stack does the kernel
> guarantee? I saw some -fstack-check-generated code which seemed to
> jump over a single guard page.
>
> The other thing I've seen which could impact the effectiveness of
> -fstack-check: mmap *without* MAP_FIXED and a hint within stack
> allocation can create a mapping inside the stack. That's rather
> surprising, and I'm not sure if the net result is that there actually
> is a guard page in all cases.

It's not ideal, but userspace can work around it. OpenJDK and ART both do something like walking to the end of the main thread stack during init to install their own guard region. ART then uses all but the last guard page it installed as a red zone to handle out-of-stack conditions (not sure about OpenJDK). -fstack-check is supposed to handle a single guard page by default, and that's all there is for thread stacks by default.

> > Note: talking about userspace after the entropy bit. The kernel
> > doesn't really need -fstack-check, at least in even slightly sane
> > code...
>
> There used to be lots of discussions about kernel stack sizes ...

It should just be banning VLAs, alloca and large stack frames, though, if it's not already. There wasn't even support for guard pages with kernel stacks until recently outside grsecurity, and -fstack-check relies on them, so it doesn't seem like a great solution for the kernel.
Re: [kernel-hardening] Re: [PATCH] fork: make whole stack_canary random
> It makes a lot of sense on x86_64 where it means the canary is still
> 56 bits. Also, you want -fstack-check for protecting against stack
> overflows rather than stack *buffer* overflows. SSP won't really help
> you in that regard. Sadly, while -fstack-check now works well in GCC
> 6 with little performance cost, it's not really a complete feature
> (and Clang impls it as a no-op!).

Note: talking about userspace after the entropy bit. The kernel doesn't really need -fstack-check, at least in even slightly sane code...
Re: [kernel-hardening] Re: [PATCH] fork: make whole stack_canary random
On Mon, 2016-10-31 at 21:45 +0100, Florian Weimer wrote:
> * Jann Horn:
>
> > On Mon, Oct 31, 2016 at 09:04:02AM -0700, Kees Cook wrote:
> > > On Mon, Oct 31, 2016 at 7:04 AM, Jann Horn wrote:
> > > > On machines with sizeof(unsigned long)==8, this ensures that
> > > > the more significant 32 bits of stack_canary are random, too.
> > > > stack_canary is defined as unsigned long, all the architectures
> > > > with stack protector support already pick the stack_canary of
> > > > init as a random unsigned long, and get_random_long() should be
> > > > as fast as get_random_int(), so there seems to be no good
> > > > reason against this.
> > > >
> > > > This should help if someone tries to guess a stack canary with
> > > > brute force.
> > > >
> > > > (This change has been made in PaX already, with a different
> > > > RNG.)
> > > >
> > > > Signed-off-by: Jann Horn
> > >
> > > Acked-by: Kees Cook
> > >
> > > (A separate change might be to make sure that the leading byte is
> > > zeroed. Entropy of the value, I think, is less important than
> > > blocking canary exposures from unbounded str* functions. Brute
> > > forcing kernel stack canaries isn't like brute forcing them in
> > > userspace...)
> >
> > Yeah, makes sense. Especially on 64-bit, 56 bits of entropy ought
> > to be enough anyway.
>
> So you two approve of the way glibc does this currently? (See the
> other thread.)
>
> I was under the impression that the kernel performs far less
> null-terminated string processing than the average user space
> application, especially on the stack. (A lot of userspace code
> assumes large stacks and puts essentially arbitrarily long strings
> into VLAs.)

It makes a lot of sense on x86_64 where it means the canary is still 56 bits. Also, you want -fstack-check for protecting against stack overflows rather than stack *buffer* overflows. SSP won't really help you in that regard.

Sadly, while -fstack-check now works well in GCC 6 with little performance cost, it's not really a complete feature (and Clang impls it as a no-op!).
Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Wed, 2016-10-19 at 07:26 -0300, Arnaldo Carvalho de Melo wrote:
> But self profiling JITs would be useful for non-developers, on
> Android (anywhere, really), and for that it would require being able
> to at least, well, self profile, using sys_perf_event_open() by a
> normal process, limited to profiling itself, no?
>
> This not being possible, self profiling will use some other means,
> it's like sys_perf_event_open() never existed for them.
>
> - Arnaldo

It would defeat the purpose of the security feature if it was exposed to apps on Android. There are other ways for Chromium (including WebView) and the ART JIT to profile. AFAIK, they've never actually considered using perf for their JIT profiling. It wasn't something that anyone cared about when this was implemented.

Chromium would *certainly* never use perf for this. They use a much tighter sandbox than the Android app sandbox. It doesn't have system calls like open(), let alone perf events. Their seccomp-bpf usage is to reduce kernel attack surface, since that has proven to be the easiest way out of a sandbox. They don't even allow futex() without filtering based on the parameters anymore, since that proved to be a major issue.

The only case where untrusted apps declaring the privileges they need actually works out is when the privileges are high-level controls over access to a user's personal information. That way, they can be exposed to the user with the ability to reject access when it's first needed or toggle it off later. The strength of the app sandbox isn't something that end users can be responsible for. Permissions also don't work if apps bypass them with local privilege escalation vulnerabilities.

Even for the base system, no perf access is better than having it used anywhere. The difference is only that the base system can actually be trusted to declare what it needs, but those components can be exploited, so it's still best if they are trusted as little as possible at runtime.

I could see Android completely removing unprivileged access down the road too, not as a form of access control (apps already run as unique uid/gid pairs, etc.) but to remove a non-trivial form of kernel attack surface. The perf API isn't being singled out here. It just happened to be the first case where a kernel API was restricted to developer usage on Android. Expect more of this.

If you want perf exposed, make it secure: stop writing kernel code in a memory unsafe language and start using isolation. That's not going to happen in all likelihood, so kernel code will be increasingly walled off instead, since it's the biggest liability on the system. Fixing bugs on a case by case basis doesn't work for systems that need even basic security.
Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Wed, 2016-10-19 at 10:41 +0100, Mark Rutland wrote:
> On Mon, Oct 17, 2016 at 10:54:33AM -0400, Daniel Micay wrote:
> > On Mon, 2016-10-17 at 14:44 +0100, Mark Rutland wrote:
> > > It's also my understanding that for Android, perf_event_paranoid
> > > is lowered when the user enables developer mode (rather than only
> > > when an external debugger is attached); is that correct?
> >
> > It's exposed as a "system property" marked as writable by the shell
> > user, so the Android Debug Bridge shell can lower it. The debugging
> > tools learned how to toggle it off automatically when they're used.
> > It intentionally isn't a persist.-prefixed property so the setting
> > also goes away on reboot.
> >
> > ADB (incl. the shell user) isn't available until developer mode is
> > enabled + ADB is toggled on in the developer settings, and then it
> > still requires whitelisting keys.
>
> Ah; so I'd misunderstood somewhat.
>
> I was under the (clearly mistaken) impression that this was lowered
> when developer mode was enabled, rather than only when it was both
> enabled and ADB was connected, for example.
>
> Thanks for clearing that up!

ADB provides a shell as the 'shell' user, and that user has the ability to toggle the sysctl, so profiling tools were able to be taught to do that automatically. It's the only way that the 'shell' user is actually exposed. For example, a terminal app just runs in the untrusted_app SELinux domain as a unique unprivileged uid/gid pair, not as the much more privileged ADB 'shell' domain. So the restriction doesn't actually get toggled off when you use ADB to do something else.

ADB itself is pretty much comparable to SSH, but over USB (i.e. a key-based way of getting a shell). The 'shell' user has tools like 'run-as' to be able to run things as various apps (if they are marked debuggable), so in theory it could be finer-grained and act only there, for the app being debugged. It would be really hard to cover all use cases and maybe things other than apps, though (although in an Android 'user' build, the base system itself isn't very debuggable; you really need 'userdebug' or 'eng', which isn't what ships on end user devices).
Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Tue, 2016-10-18 at 13:48 -0700, Kees Cook wrote:
> On Mon, Oct 17, 2016 at 6:44 AM, Mark Rutland wrote:
> > Hi,
> >
> > Attempt to revive discussions below...
> >
> > On Wed, Jul 27, 2016 at 07:45:46AM -0700, Jeff Vander Stoep wrote:
> > > When kernel.perf_event_paranoid is set to 3 (or greater), disallow
> > > all access to performance events by users without CAP_SYS_ADMIN.
> > >
> > > This new level of restriction is intended to reduce the attack
> > > surface of the kernel. Perf is a valuable tool for developers but
> > > is generally unnecessary and unused on production systems. Perf
> > > may open up an attack vector to vulnerable device-specific
> > > drivers as recently demonstrated in CVE-2016-0805, CVE-2016-0819,
> > > CVE-2016-0843, CVE-2016-3768, and CVE-2016-3843. This new level
> > > of restriction allows for a safe default to be set on production
> > > systems while leaving a simple means for developers to grant
> > > access [1].
> > >
> > > This feature is derived from CONFIG_GRKERNSEC_PERF_HARDEN by Brad
> > > Spengler. It is based on a patch by Ben Hutchings [2]. Ben's
> > > patches have been modified and split up to address on-list
> > > feedback.
> > >
> > > kernel.perf_event_paranoid=3 is the default on both Debian [2]
> > > and Android [3].
> >
> > While people weren't particularly happy with this global toggle
> > approach, my understanding from face-to-face discussions at LSS2016
> > was that people were happy with a more scoped restriction (e.g.
> > using capabilities or some other access control mechanism), but
> > no-one had the time to work on that.
> >
> > Does that match everyone's understanding, or am I mistaken?
>
> That's correct: some kind of finer-grain control would be preferred
> to the maintainer, but no one has had time to work on it. (The =3
> sysctl setting present in Android, Debian, and Ubuntu satisfies most
> people.)
>
> -Kees

It's also worth noting that fine-grained control via a scoped mechanism would likely only be used to implement *more restrictions* on Android, not to make the feature less aggressive. It's desirable for perf events to be disabled by default for non-root across the board on Android.

The part that's imperfect is that when a developer uses a profiling tool, unprivileged usage is automatically enabled across the board until reboot. Ideally, it would be enabled only for the scope where it's needed. It would be very tricky to implement though, especially without adding friction, and it would only have value for protecting devices being used for development. It really doesn't seem to be worth the trouble, especially since it doesn't persist on reboot. It's only a temporary security hole and only for developer devices.
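The =3 setting discussed above is applied through the standard sysctl interface; a minimal config sketch (the file name under /etc/sysctl.d is illustrative, and level 3 only exists with the patch applied — unpatched mainline of this era tops out at 2):

```shell
# Deny all unprivileged perf_event_open() immediately (requires root):
sysctl -w kernel.perf_event_paranoid=3

# Make it the boot-time default:
echo 'kernel.perf_event_paranoid = 3' > /etc/sysctl.d/60-perf-harden.conf
```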
Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Mon, 2016-10-17 at 14:44 +0100, Mark Rutland wrote:
> Hi,
>
> Attempt to revive discussions below...
>
> On Wed, Jul 27, 2016 at 07:45:46AM -0700, Jeff Vander Stoep wrote:
> > When kernel.perf_event_paranoid is set to 3 (or greater), disallow
> > all access to performance events by users without CAP_SYS_ADMIN.
> >
> > This new level of restriction is intended to reduce the attack
> > surface of the kernel. Perf is a valuable tool for developers but
> > is generally unnecessary and unused on production systems. Perf may
> > open up an attack vector to vulnerable device-specific drivers as
> > recently demonstrated in CVE-2016-0805, CVE-2016-0819,
> > CVE-2016-0843, CVE-2016-3768, and CVE-2016-3843. This new level of
> > restriction allows for a safe default to be set on production
> > systems while leaving a simple means for developers to grant
> > access [1].
> >
> > This feature is derived from CONFIG_GRKERNSEC_PERF_HARDEN by Brad
> > Spengler. It is based on a patch by Ben Hutchings [2]. Ben's
> > patches have been modified and split up to address on-list
> > feedback.
> >
> > kernel.perf_event_paranoid=3 is the default on both Debian [2] and
> > Android [3].
>
> While people weren't particularly happy with this global toggle
> approach, my understanding from face-to-face discussions at LSS2016
> was that people were happy with a more scoped restriction (e.g. using
> capabilities or some other access control mechanism), but no-one had
> the time to work on that.
>
> Does that match everyone's understanding, or am I mistaken?
>
> It's also my understanding that for Android, perf_event_paranoid is
> lowered when the user enables developer mode (rather than only when
> an external debugger is attached); is that correct?

It's exposed as a "system property" marked as writable by the shell user, so the Android Debug Bridge shell can lower it. The debugging tools learned how to toggle it off automatically when they're used. It intentionally isn't a persist.-prefixed property, so the setting also goes away on reboot.

ADB (incl. the shell user) isn't available until developer mode is enabled + ADB is toggled on in the developer settings, and then it still requires whitelisting keys.
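The toggle described above looks roughly like this from the ADB side. The property name here is the one used by AOSP's profiling tooling; treat it as illustrative, since the exact name is not stated in this thread:

```shell
# From a whitelisted ADB host, as the 'shell' user:
adb shell setprop security.perf_harden 0   # allow unprivileged perf
# ... run the profiling tool ...
adb shell setprop security.perf_harden 1   # restore the default

# The property is intentionally not "persist."-prefixed, so a reboot
# also restores the restricted default.
```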
Re: [kernel-hardening] Re: [PATCH 2/2] arm: apply more __ro_after_init
On Wed, 2016-08-10 at 10:43 +0100, Russell King - ARM Linux wrote:
> On Fri, Jun 03, 2016 at 11:40:24AM -0700, Kees Cook wrote:
> >
> > @@ -1309,16 +1309,11 @@ void __init arm_mm_memblock_reserve(void)
> >   * Any other function or debugging method which may touch any
> >   * device _will_ crash the kernel.
> >   */
> > +static char vectors[PAGE_SIZE * 2] __ro_after_init __aligned(PAGE_SIZE);
> >  static void __init devicemaps_init(const struct machine_desc *mdesc)
> >  {
> >  	struct map_desc map;
> >  	unsigned long addr;
> > -	void *vectors;
> > -
> > -	/*
> > -	 * Allocate the vector page early.
> > -	 */
> > -	vectors = early_alloc(PAGE_SIZE * 2);
>
> This one is not appropriate. We _do_ write to these pages after init
> for FIQ handler updates. See set_fiq_handler().

This is one of the many cases where pax_open_kernel/pax_close_kernel are needed to temporarily lift the read-only protection around a legitimate write. From grsecurity:

@@ -95,7 +95,10 @@ void set_fiq_handler(void *start, unsigned int length)
 	void *base = vectors_page;
 	unsigned offset = FIQ_OFFSET;
 
+	pax_open_kernel();
 	memcpy(base + offset, start, length);
+	pax_close_kernel();
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
> My claim was not that the mainline code was impressively perfect, but
> rather that the vendor code was worse, countering a prior claim
> otherwise. Hence, reality.

You're arguing with a straw man. I was responding to a comment about out-of-tree code, not generic architecture perf drivers vs. alternative versions by SoC vendors. Qualcomm and other vendors landing their drivers in mainline would be nice, but it wouldn't make them inherently higher quality. I don't really see what it has to do with this, which is why I responded...
Re: [kernel-hardening] [PATCH] [RFC] Introduce mmap randomization
On Thu, 2016-08-04 at 16:55 +, Roberts, William C wrote:
> > -Original Message-
> > From: Daniel Micay [mailto:danielmi...@gmail.com]
> > Sent: Thursday, August 4, 2016 9:53 AM
> > To: kernel-harden...@lists.openwall.com; ja...@lakedaemon.net; linux-m...@vger.kernel.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org
> > Cc: keesc...@chromium.org; gre...@linuxfoundation.org; n...@google.com; je...@google.com; saly...@android.com; dcash...@android.com
> > Subject: Re: [kernel-hardening] [PATCH] [RFC] Introduce mmap randomization
> >
> > On Tue, 2016-07-26 at 11:22 -0700, william.c.robe...@intel.com wrote:
> > > The recent get_random_long() change in get_random_range() and then the
> > > subsequent patches Jason put out, all stemmed from my tinkering with
> > > the concept of randomizing mmap.
> > >
> > > Any feedback would be greatly appreciated, including any feedback
> > > indicating that I am idiot.
> >
> > The RAND_THREADSTACK feature in grsecurity makes the gaps the way I think
> > would be ideal, i.e. tracked as part of the appropriate VMA. It would be
> > straightforward to make it more general purpose.
>
> I am not familiar with that, thanks for pointing it out. I'll take a
> look when my time frees up for this again.

I'm actually wrong about that now that I look more closely...
Re: [kernel-hardening] [PATCH] [RFC] Introduce mmap randomization
On Tue, 2016-07-26 at 11:22 -0700, william.c.robe...@intel.com wrote:
> The recent get_random_long() change in get_random_range() and then the
> subsequent patches Jason put out, all stemmed from my tinkering with
> the concept of randomizing mmap.
>
> Any feedback would be greatly appreciated, including any feedback
> indicating that I am idiot.

The RAND_THREADSTACK feature in grsecurity makes the gaps the way I think would be ideal, i.e. tracked as part of the appropriate VMA. It would be straightforward to make it more general purpose.
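The gap idea can be sketched in a few lines: before mapping a thread stack, reserve a randomly sized unmapped gap next to it so the stack's location is not predictable relative to neighbouring mappings. The function name and the 256-page bound below are illustrative, not the values grsecurity uses, and a real implementation tracks the gap as part of the VMA rather than in userspace.

```c
#include <stddef.h>
#include <sys/random.h>
#include <unistd.h>

/* Return a page-aligned gap of 1..max_pages pages, to be left unmapped
 * below a freshly mmap()ed thread stack. */
static size_t random_stack_gap(size_t max_pages)
{
    unsigned int r = 0;
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);

    if (getrandom(&r, sizeof(r), 0) != (ssize_t)sizeof(r))
        return psz; /* fall back to a single guard page's worth */
    return ((size_t)(r % max_pages) + 1) * psz;
}
```

A caller would add this gap to the requested mapping size and leave the extra range PROT_NONE, so a linear overflow off an adjacent mapping faults instead of landing in the stack at a known offset.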
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Thu, 2016-08-04 at 17:10 +0100, Mark Rutland wrote:
> On Thu, Aug 04, 2016 at 11:44:28AM -0400, Daniel Micay wrote:
> > Qualcomm's drivers might be lower quality than core kernel code, but
> > they're way above the baseline set by mainline kernel drivers...
>
> I don't think that's true for the arm/arm64 perf code.

The baseline architecture support is essentially core kernel code. I agree it's much better than the SoC vendor code. You're spending a lot of time auditing, fuzzing and improving the code in general, which is not true for most drivers. They don't get that attention.

> I think we've done a reasonable job of testing and fixing those, along
> with core infrastructure issues. The perf fuzzer runs for a very long
> time on a mainline kernel without issues, while on my Nexus 5x I get a
> hard lockup after ~85 seconds (and prior to the last android update the
> lockup was instantaneous).
>
> From my personal experience (and as above), and talking specifically
> about PMU drivers, I think that the opposite is true. This is not to say
> there aren't issues; I would not be surprised if there are. But it's
> disingenuous to say that mainline code is worse than that which exists
> in a vendor kernel when the latter is demonstrably much easier to break
> than the former.

I wasn't talking specifically about perf.

> If there are issues you are aware of, please report them. If those
> issues only exist in non-upstream code, then the applicable concerns are
> somewhat different (though certainly still exist).

I'm not going to do volunteer work for a corporation. I've learned that lesson after spending years doing it.

> But please, let's frame the argument to match reality.

The argument is framed in reality. Stating that it now often takes a few hours to find a vulnerability with the unaltered, widely known public perf fuzzer is not impressive. It's really an argument for claiming that it's a significant security issue.
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Thu, 2016-08-04 at 16:11 +0200, Peter Zijlstra wrote:
> On Thu, Aug 04, 2016 at 09:45:23AM -0400, Daniel Micay wrote:
> > Qualcomm's perf driver is out-of-tree along with most of their other
> > drivers.
>
> So you're asking us to maim upstream perf for some out of tree junk?
> Srously? *plonk*

This feature doesn't come from Android. The perf events subsystem in the mainline kernel is packed full of vulnerabilities too. The problem is so bad that pointing one of the public fuzzers at it for a short period of time is all that's required to start finding them.

Qualcomm's drivers might be lower quality than core kernel code, but they're way above the baseline set by mainline kernel drivers... Shining the same light on mainline drivers wouldn't be pretty. The work going into hardening the Qualcomm drivers isn't happening upstream to any comparable extent.
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Thu, 2016-08-04 at 11:28 +0100, Mark Rutland wrote:
> On Wed, Aug 03, 2016 at 03:36:16PM -0400, Daniel Micay wrote:
> > There's a lot of architecture and vendor specific perf events code and
> > lots of bleeding edge features. On Android, a lot of the perf events
> > vulnerabilities have been specific to the Qualcomm SoC platform. Other
> > platforms are likely just receiving a lot less attention.
>
> Are the relevant perf drivers for those platforms upstream? I've seen no
> patches addressing security issues in the ARMv7 krait+Scorpion PMU
> driver since it was added, and there's no ARMv8 QCOM PMU driver.
>
> If there are outstanding issues, please report them upstream.
>
> FWIW, I've used Vince Weaver's perf fuzzer to test the ARM PMU code
> (both the framework and drivers), so other platforms are seeing some
> attention. That said, I haven't done that recently.

Qualcomm's perf driver is out-of-tree along with most of their other drivers. Their drivers add up to a LOT of code shared across over a billion mobile devices, leading to the focus on them. It also helps that there are bounties for Nexus devices, so there are multi-thousand-dollar rewards for bugs in the Qualcomm drivers compared to nothing for other platforms / drivers. Now that perf is only available via ADB debugging, further perf bugs no longer technically qualify for their bounties (but they might still pay, I don't know).
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
> One of the strengths of linux is applications of features the authors of
> the software had not imagined. Your proposals seem to be trying to put
> the world a tiny little box where if someone had not imagined and
> preapproved a use of a feature it should not happen. Let's please
> avoid implementing totalitarianism to avoid malicious code exploiting
> bugs in the kernel. I am not interested in that future.

You're describing operating systems like Android, ChromeOS and iOS. That future is already here and the Linux kernel is the major weak point in the attempts to build those systems based on Linux. Even for the very restricted Chrome sandbox, it's the easiest way out. Android similarly allows near zero access to /sys for apps and little access to /proc beyond the /proc/PID directories belonging to an app.

> Especially when dealing with disabling code to reduce attack surface,
> when then are no known attacks what we are actually dealing with
> is a small percentage probability reduction that a malicious attacker
> will be able to exploit the attack.

There are perf events vulnerabilities being exploited in the wild to gain root on Android. It's not a theoretical attack vector. They're used in both malware and rooting tools. Local privilege escalation bugs in the kernel are common so there are a lot of alternatives, but it's one of the major sources for vulnerabilities. There's a lot of architecture and vendor specific perf events code and lots of bleeding edge features. On Android, a lot of the perf events vulnerabilities have been specific to the Qualcomm SoC platform. Other platforms are likely just receiving a lot less attention.

> Remember security is as much about availability as it is about
> integrity. You keep imagining features that are great big denial of
> service attacks on legitimate users.

Only developers care about perf events and they still have access to it. JIT compilers have other ways to do tracing and even if they consider this to be the ideal API, it's not particularly important if they have to settle for something else. In reality, it's a small compromise.

> I vote for sandboxes. Perhaps seccomp. Perhaps a per userns sysctl.
> Perhaps something else.

It's not possible to use the current incarnation of seccomp for this since it can't be dynamically granted/revoked. Perhaps it would be possible to support adding/removing or at least toggling seccomp filters for groups of processes. That would be good enough to take care of user ns, ptrace, perf events, etc.
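For concreteness, here is what the static seccomp-bpf variant looks like: a classic BPF program that makes perf_event_open() fail with EACCES while letting every other syscall through. Only building the filter is shown; installing it would use prctl(PR_SET_NO_NEW_PRIVS) followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog), and once installed it cannot be removed from the process, which is exactly the "can't be dynamically granted/revoked" limitation described above.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>

static struct sock_filter filter[] = {
    /* Load the syscall number from struct seccomp_data. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),
    /* If it is perf_event_open, fall through to ERRNO; else skip to ALLOW. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_perf_event_open, 0, 1),
    /* Fail the syscall with EACCES instead of killing the process. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
    /* Everything else is permitted. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};
```

Note that a real filter should also check seccomp_data.arch before trusting the syscall number; that check is omitted here for brevity.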
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
Having this in Yama would also make it probable that there would be a security-centric default. It would end up wiping out unprivileged perf events access on distributions using Yama for ptrace_scope unless they make the explicit decision to disable it. Having the perf subsystem extend the existing perf_event_paranoid sysctl leaves the control over the upstream default in the hands of the perf subsystem, not LSMs.
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
> The default has no impact on the "it's too coarse and limiting"
> negative property of this patch, which is the show-stopper aspect.
> Please fix that aspect instead of trying to argue around it.

Disabling perf events in the kernel configuration is even more limiting, and is currently the alternative to doing it this way. It makes sense for debugging features to be disabled in production releases of products unless there's a way to control them where risk isn't increased. Having a way to toggle it dynamically will allow it to remain available.

> This isn't some narrow debugging mechanism we can turn on/off globally
> and forget about, this is a wide scope performance measurement and
> event logging infrastructure that is being utilized not just by
> developers but by apps and runtimes as well.

The incredibly wide scope is why it's such a big security problem. If it wasn't such a frequent source of vulnerabilities, it wouldn't have been disabled for unprivileged users in grsecurity, Debian and then Android. It's extended by lots of vendor code specific to platforms too, so it isn't just some core kernel code that's properly reviewed.

I don't think there are runtimes using this for JIT tracing. Perhaps it doesn't actually suit their needs. It's a theoretical use case.

> > Let me take this another way instead. What would be a better way to
> > provide a mechanism for system owners to disable perf without an
> > LSM? (Since far fewer folks run with an enforcing "big" LSM: I'm
> > seeking as wide a coverage as possible.)
>
> Because in practice what will happen is that if the only option is to
> do something drastic for sekjurity, IT departments will do it - while
> if there's a more flexible mechanism that does not throw out the baby
> with the bath water that is going to be used.

It's already done: Android and Debian disable it for unprivileged users by default. It's already done for most of desktop and mobile Linux and perhaps even most servers. Not providing functionality desired by downstream doesn't mean it won't be provided there. They'll keep doing it whether or not this lands. If it doesn't land, it will only mean that mainline kernels aren't usable for making Android devices. They need to pass the compatibility test suite, which means having this. The mechanism could change but I don't see why the actual requirement would.

> This is as if 20 years ago you had submitted a patch to the early
> Linux TCP/IP networking code to be on/off via a global sysctl switch
> and told people that "in developer mode you can have networking, talk
> to your admin".
>
> We'd have told you: "this switch is too coarse and limiting, please
> implement something better, like a list of routes which defines which
> IP ranges are accessible, and a privileged range of listen sockets
> ports and some flexible kernel side filtering mechanism to inhibit
> outgoing/incoming connections".
>
> Global sysctls are way too coarse.

The end result of submitting an LSM hook instead of using this would probably come down to having a global sysctl toggle in Yama. There would be two sysctl controls for perf restrictions rather than one, which is needless complexity for the interface. Android and Debian would be using a fine-grained perf LSM to accomplish the same thing: globally disabling it for unprivileged users when not doing development.

The nice thing about fine-grained control would be implementing a *more* restrictive policy for the case when it's toggled on. It wouldn't necessarily have to be globally enabled, only enabled for the relevant user or even a specific process, but that would be complicated to implement.
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
> > So the problem I have with this is that it will completely inhibit
> > development of things like JITs that self-profile to re-compile
> > frequently used code.
>
> Or reimplement strace with sys_perf_event_open(), speeding it up greatly
> by not using ptrace (see 'perf trace', one such attempt), combining it
> with sys_bpf(), which can run unpriviledged as well, provides lots of
> possibilities for efficient tooling that would be greatly stiffled by
> such big hammer restrictions :-(

The usage on Android wouldn't impact strace. It's a debugging tool used over the debugging shell, so it could be taught to toggle on unprivileged access to perf events as the other tools using the API were.
Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open
On Tue, 2016-08-02 at 11:52 +0200, Peter Zijlstra wrote: > On Wed, Jul 27, 2016 at 07:45:46AM -0700, Jeff Vander Stoep wrote: > > > > When kernel.perf_event_paranoid is set to 3 (or greater), disallow > > all access to performance events by users without CAP_SYS_ADMIN. > > > > This new level of restriction is intended to reduce the attack > > surface of the kernel. Perf is a valuable tool for developers but > > is generally unnecessary and unused on production systems. Perf may > > open up an attack vector to vulnerable device-specific drivers as > > recently demonstrated in CVE-2016-0805, CVE-2016-0819, > > CVE-2016-0843, CVE-2016-3768, and CVE-2016-3843. > > We have bugs we fix them, we don't kill complete infrastructure > because > > of them. It's still accessible to privileged processes either way. Android still allows access from unprivileged processes but it can only be enabled via the debugging shell, which is not enabled by default either. It isn't even possible to disable the perf events infrastructure via kernel configuration for every architecture right now. You're forcing people to have common local privilege escalation and information leak vulnerabilities for something few people actually use. This patch is now a requirement for any Android devices with a security patch level above August 2016. The only thing that not merging it is going to accomplish is preventing a mainline kernel from ever being used on Android devices, unless you provide an alternative it can use for the same use case. https://source.android.com/security/bulletin/2016-08-01.html > > This new level of > > restriction allows for a safe default to be set on production > > systems > > while leaving a simple means for developers to grant access [1]. > > So the problem I have with this is that it will completely inhibit > development of things like JITs that self-profile to re-compile > frequently used code. 
> I would much rather have an LSM hook where the security stuff can
> do more fine-grained control of things, allowing some apps perf
> usage while denying others.

If the only need was controlling access per-process statically, then seccomp-bpf would work fine. It needs to be dynamic, though. I don't think SELinux could be used to provide the functionality, so it would have to be a whole new LSM, and I doubt anyone will implement that when the necessary functionality is already available. It's already exposed only to developers using profiling tools until they reboot, so finer-grained control wouldn't accomplish much.
Re: [kernel-hardening] Re: [PATCH] [RFC] Introduce mmap randomization
> > It's very hard to quantify the benefits of fine-grained
> > randomization,
>
> ? N = # of possible addresses. The bigger N is, the more chances the
> attacker will trip up before finding what they were looking for.

If the attacker is forcing the creation of many objects with a function pointer and then trying to hit one, the only thing that would help is if the heap is very sparse, with random bases within it. They don't need to hit a specific object for an exploit to work.

The details of how the randomization is done and the guarantees that are provided certainly matter. Individual random gaps are low entropy, and they won't add up to much higher entropy randomization even for two objects that are far apart. The entropy has no chance to build up, since the sizes will average out.

I'm not saying it doesn't make sense to do this (it's a feature that I really want), but there are a lot of ways to approach fine-grained mmap randomization, and the design decisions should be justified and their impact analyzed/measured.

> > but there are other useful guarantees you could provide. It would
> > be quite helpful for the kernel to expose the option to force a
> > PROT_NONE mapping after every allocation. The gaps should actually
> > be enforced.
> >
> > So perhaps 3 things, simply exposed as off-by-default sysctl
> > options (no need for special treatment on 32-bit):
>
> I'm certainly not an mm developer, but this looks to me like we're
> pushing the work of creating efficient, random mappings out to
> userspace. :-/

Exposing configuration doesn't push work to userspace. I can't see any way that this would be done by default even on 64-bit, due to the extra VMAs, so it really needs configuration.
> > a) configurable minimum gap size in pages (for protection against
> > linear and small {under,over}flows) b) configurable minimum gap
> > size based on a ratio to allocation size (for making the heap
> > sparse to mitigate heap sprays, especially when mixed with
> > fine-grained randomization - for example 2x would add a 2M gap
> > after a 1M mapping)
>
> mmm, this looks like an information leak. Best to set a range of
> pages and pick a random number within that range for each call.

A minimum gap size provides guarantees not offered by randomization alone. It might not make sense to approach making the heap sparse by forcing it separately from randomization, but doing it isn't leaking information. Obviously the random gap would be chosen by picking a maximum size (n) and choosing a size between [0, n], which was the next potential variable, separate from these non-randomization-related guarantees.

> > c) configurable maximum random gap size (the random gap would be
> > in addition to the enforced minimums)
> >
> > The randomization could just be considered an extra with minor
> > benefits rather than the whole feature. A full fine-grained
> > randomization implementation would need a higher-level form of
> > randomization than gaps in the kernel along with cooperation from
> > userspace allocators. This would make sense as one part of it
> > though.
>
> Ok, so here's an idea. This idea could be used in conjunction with
> random gaps, or on its own. It would be enhanced by userspace
> random load order.
>
> The benefit is that with 32-bit address space, and no random
> gapping, it's still not wasting much space.
>
> Given a memory space, break it up into X bands such that there are
> 2*X possible addresses.
>
>   |A    B|C    D|E    F|G    H| ... |2*X-2    2*X-1|
>   |--> <--|--> <--|--> <--|--> <--| ... |-->    <--|
>   min                                           max
>
> For each call to mmap, we randomly pick a value within [0 - 2*X).
> Assuming A=0 in the diagram above, even values grow up, odd values
> grow down.
> Gradually consuming the single gap in the middle of each band.
>
> How many bands to use would depend on:
> * 32/64-bit
> * average number of mmap calls
> * largest single mmap call usually seen
> * if using random gaps, the range used
>
> If the free gap in a chosen band is too small for the request, pick
> again among the other bands.
>
> Again, I'm not an mm dev, so I might be totally smoking crack on
> this one...
>
> thx,
>
> Jason.

Address space fragmentation matters a lot, not only the wasted space from memory that's explicitly reserved for random gaps. The randomization guarantees under situations like memory exhaustion also matter.

I do think fine-grained randomization would be useful, but I think it's unclear what a good approach would be, along with what the security benefits would be. The malloc implementation is also very relevant. OpenBSD has fine-grained mmap randomization with a fair bit of thought put into how it works, but I don't think there has been much analysis of it. The security properties really aren't clear.
Re: [kernel-hardening] Re: [PATCH] [RFC] Introduce mmap randomization
> > In the Project Zero Stagefright post
> > (http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html),
> > we see that the linear allocation of memory combined with the low
> > number of bits in the initial mmap offset resulted in a much more
> > predictable layout which aided the attacker. The initial random
> > mmap base range was increased by Daniel Cashman in
> > d07e22597d1d355829b7b18ac19afa912cf758d1, but we've done nothing
> > to address page-relative attacks.
> >
> > Inter-mmap randomization will decrease the predictability of later
> > mmap() allocations, which should help make data structures harder
> > to find in memory. In addition, this patch will also introduce
> > unmapped gaps between pages, preventing linear overruns from one
> > mapping to another mapping. I am unable to quantify how much this
> > will improve security, but it should be > 0.
>
> One person calls "unmapped gaps between pages" a feature, others
> call it a mess. ;-)

It's very hard to quantify the benefits of fine-grained randomization, but there are other useful guarantees you could provide. It would be quite helpful for the kernel to expose the option to force a PROT_NONE mapping after every allocation. The gaps should actually be enforced.

So perhaps 3 things, simply exposed as off-by-default sysctl options (no need for special treatment on 32-bit):

a) configurable minimum gap size in pages (for protection against linear and small {under,over}flows)

b) configurable minimum gap size based on a ratio to allocation size (for making the heap sparse to mitigate heap sprays, especially when mixed with fine-grained randomization - for example, 2x would add a 2M gap after a 1M mapping)

c) configurable maximum random gap size (the random gap would be in addition to the enforced minimums)

The randomization could just be considered an extra with minor benefits rather than the whole feature.
A full fine-grained randomization implementation would need a higher-level form of randomization than gaps in the kernel, along with cooperation from userspace allocators. This would make sense as one part of it, though.
Re: [kernel-hardening] Re: [PATCH v2 02/11] mm: Hardened usercopy
> I'd like it to dump stack and be fatal to the process involved, but
> yeah, I guess BUG() would work. Creating an infrastructure for
> handling security-related Oopses can be done separately from this
> (and I'd like to see that added, since it's a nice bit of
> configurable reactivity to possible attacks).

In grsecurity, the oops handling also uses do_group_exit instead of do_exit, but both that change (or at least the option to do it) and the exploit handling could be done separately from this, without actually needing special treatment for USERCOPY. It could be exposed as something like panic_on_oops=2, as a balance between the existing options.
Re: [kernel-hardening] Re: [PATCH v2 02/11] mm: Hardened usercopy
> This could be a BUG, but I'd rather not panic the entire kernel.

It seems unlikely that it will panic without panic_on_oops, and that's an explicit opt-in to taking down the system on kernel logic errors exactly like this. In grsecurity, it calls the kernel exploit handling logic (panic if root, otherwise kill all processes of that user and ban them until reboot), but that same logic is also called for BUG via oops handling, so there's only really a distinction with panic_on_oops=1.

Does it make sense to be less fatal for a fatal assertion that's more likely to be security-related? Maybe you're worried about having some false positives for the whitelisting portion, but I don't think those will lurk around very long with the way this works.
Re: [kernel-hardening] Re: [PATCH v2 2/3] Mark functions with the __nocapture attribute
On Tue, 2016-07-12 at 15:08 -0400, Kees Cook wrote:
> On Mon, Jul 4, 2016 at 7:42 PM, Emese Revfy wrote:
> > The nocapture gcc attribute can be on functions only.
> > The attribute takes one or more unsigned integer constants as
> > parameters that specify the function argument(s) of const char*
> > type to initify. If the marked argument is a vararg then the
> > plugin initifies all vararg arguments.
>
> Why is this called "nocapture"? Not captured by what? It seems like
> it means "initify this if possible". Am I misunderstanding its
> purpose?

It means they don't escape via that function, i.e. they aren't stored anywhere to be used in any way after the call.
Re: [kernel-hardening] [PATCH 2/2] security,perf: Allow further restriction of perf_event_open
On Fri, 2016-06-17 at 17:00 -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Jun 17, 2016 at 12:16:47PM -0400, Daniel Micay escreveu:
> > On Fri, 2016-06-17 at 08:54 +0200, Peter Zijlstra wrote:
> > > This Changelog is completely devoid of information. _WHY_ are
> > > you doing this?
> >
> > Attack surface reduction. It's possible to use seccomp-bpf for
> > some limited cases, but it's not flexible enough. There are lots
> > of information leaks and local privilege escalation
> > vulnerabilities via perf events, yet on most Linux installs it's
> > not ever being used. So turning it off by default on those
> > installs is an easy win. The holes are reduced to root -> kernel
> > (and that's not a meaningful boundary in mainline right now -
> > although, as is the case here, Debian has a bunch of securelevel
> > patches for that).
>
> Is ptrace also disabled on such systems, or any of the other more
> recent syscalls? The same arguments could probably be used to
> disable those: reduce attack surface, possibly the new ones have
> bugs as they are relatively new, and it takes a long time for new
> syscalls to be more generally used; if we go on disabling them in
> such a way, they will probably never get used :-\

ptrace is allowed for third-party apps within their SELinux domain, but they all run as different users (so the kernel attack surface is there). It's now disabled for privileged platform apps and most other domains. A bit backwards, but removing it for third-party apps would break compatibility, so it would have to be done at an API level bump - at least, without deciding it is worth the cost, like hidepid=2 in Android N (which exposes no mechanism for exceptions in third-party apps, only the base system).

If seccomp-bpf didn't imply such high system call overhead outside of x86_64, Android would probably be walling off some new system calls. ARM needs a 2-phase lookup similar to x86.
Android kernels do avoid enabling lots of kernel functionality, from System V IPC to USERNS, though. New features wouldn't end up enabled if they were behind config options without an explicit choice.

> Wouldn't the recent bump in perf_event_paranoid to 2 be enough?
> I.e. only allow profiling of user tasks?

Most of the vulnerabilities are still exposed at 2. It does prevent leaking information about the kernel without vulnerabilities, though, and would be an improvement when perf can't be disabled - but that doesn't really matter much (Android kernel debugging and profiling would also still work with 2).

> Or is there something more specific that we should
> disable/constrain to reduce such surface contact without using such
> a big hammer?

It's desired to have it globally disabled by default. Seccomp-bpf could be used to globally disable it, but that's a bigger hammer, because it can't be globally turned off without a reboot (only newly spawned processes could be profiled after disabling the filter). Since it's only going to be enabled by developers, trying to make it more fine-grained wouldn't really pay off.
Re: [kernel-hardening] [PATCH 2/2] security,perf: Allow further restriction of perf_event_open
On Fri, 2016-06-17 at 08:54 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 03:27:55PM -0700, Kees Cook wrote:
> > Hi guys,
> >
> > This patch wasn't originally CCed to you (I'm fixing that now).
> > Would you consider taking this into the perf tree?
>
> No.
>
> > It's been in active use in both Debian and Android for a while
> > now.
>
> Very nice of you all to finally inform us I suppose :/

It was in Debian a lot longer than Android, although the Android feature came from a downstream variant where it was done much earlier: https://android-review.googlesource.com/#/c/233736/

> > > > > access to performance events by users without CAP_SYS_ADMIN.
> > > > > Add a Kconfig symbol CONFIG_SECURITY_PERF_EVENTS_RESTRICT
> > > > > that makes this value the default.
> > > > >
> > > > > This is based on a similar feature in grsecurity
> > > > > (CONFIG_GRKERNSEC_PERF_HARDEN). This version doesn't include
> > > > > making the variable read-only. It also allows enabling
> > > > > further restriction at run-time regardless of whether the
> > > > > default is changed.
>
> This Changelog is completely devoid of information. _WHY_ are you
> doing this?

Attack surface reduction. It's possible to use seccomp-bpf for some limited cases, but it's not flexible enough. There are lots of information leaks and local privilege escalation vulnerabilities via perf events, yet on most Linux installs it's not ever being used. So turning it off by default on those installs is an easy win. The holes are reduced to root -> kernel (and that's not a meaningful boundary in mainline right now - although, as is the case here, Debian has a bunch of securelevel patches for that).
Re: [kernel-hardening] Re: [PATCH 2/2] security,perf: Allow further restriction of perf_event_open
> As a debian user, is this a good place to complain? Because it does > get > in the way. It would be relevant to whether or not it should be set to 3 by default in the kernel without explicit configuration, but there's no proposal to do that. Debian has to pick a trade-off between security and a tiny roadblock for developers. The trade-off isn't the same everywhere, though. In Android, there's userspace integration allowing it to be toggled by the Android Debug Bridge shell user, so profiling tools are being taught to toggle it automatically. Enabling ADB and then using it for profiling is an implicit opt-in. signature.asc Description: This is a digitally signed message part
Re: [kernel-hardening] Re: [PATCH v4 0/4] SROP Mitigation: Sigreturn Cookies
> Then there's an unanswered question: is this patch acceptable given > that it's an ABI break? Security fixes are sometimes an exception to > the "no ABI breaks" rule, but it's by no means an automatic exception. > > --Andy It seems this could be worked around in general. Processes can have a bit tracking whether this is enabled, and CRIU can save/restore it. It would just leave it off for resuming old saved processes. Should CRIU really be covered by the kernel's ABI guarantee though? It seems like this was meant to be extensible, so it's adding an extra ABI guarantee that wasn't there before. It makes sense to freeze this ABI for CRIU, but a version field should be added first in one final ABI break if it's not too late. signature.asc Description: This is a digitally signed message part
Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
> This feature is already implemented by two distros, and likely wanted > by others. We cannot ignore that. Data point: Arch Linux won't be enabling CONFIG_USERNS until there's a way to disable unprivileged user namespaces. Arch's kernel maintainers are unwilling to carry long-term out-of-tree patches. https://github.com/sandstorm-io/sandstorm/blob/d270755b1b55e5be6c96df2cce7c914f35f0d2a2/install.sh#L464-L474 signature.asc Description: This is a digitally signed message part
Re: [kernel-hardening] Re: [RFC][PATCH 0/7] Sanitization of slabs based on grsecurity/PaX
> I am not sure what the point of this patchset is. We have a similar > effect > to sanitization already in the allocators through two mechanisms: The rationale was covered earlier. Are you responding to that or did you not see it? > 1. Slab poisoning > 2. Allocation with GFP_ZERO > > I do not think we need a third one. You could accomplish your goals > much > easier without this code churn by either > > 1. Improve the existing poisoning mechanism. Ensure that there are no > gaps. Security sensitive kernel slab caches can then be created > with > the POISONING flag set. Maybe add a Kconfig flag that enables > POISONING for each cache? What was the issue when you tried using > posining for sanitization? > > 2. Add a mechanism that ensures that GFP_ZERO is set for each > allocation. > That way every object you retrieve is zeroed and thus you have > implied > sanitization. This also can be done in a rather simple way by > changing > the GFP_KERNEL etc constants to include __GFP_ZERO depending on a > Kconfig option. Or add some runtime setting of the gfp flags > somewhere. > > Generally I would favor option #2 if you must have sanitization > because > that is the only option to really give you a deterministic content of > object on each allocation. Any half way measures would not work I > think. > > Note also that most allocations are already either allocations that > zero > the content or they are immediately initializing the content of the > allocated object. After all the object is not really usable if the > content is random. You may be able to avoid this whole endeavor by > auditing the kernel for locations where the object is not initialized > after allocation. > > Once one recognizes the above it seems that sanitization is pretty > useless. Its just another pass of writing zeroes before the allocator > or > uer of the allocated object sets up deterministic content of the > object or > -- in most cases -- zeroes it again. 
Sanitization isn't just to prevent leaks from usage of uninitialized data in later allocations. It's a mitigation for use-after-free (esp. if it's combined with some form of delayed freeing) and it reduces the lifetime of data. signature.asc Description: This is a digitally signed message part
Re: [PATCH v2 00/13] MADV_FREE support
On 05/12/15 06:10 AM, Pavel Machek wrote: > On Wed 2015-11-04 10:25:54, Minchan Kim wrote: >> MADV_FREE is on linux-next so long time. The reason was two, I think. >> >> 1. MADV_FREE code on reclaim path was really mess. > > Could you explain what MADV_FREE does? > > Comment in code says 'free the page only when there's memory > pressure'. So I mark my caches MADV_FREE, no memory pressure, I can > keep using it? And if there's memory pressure, what happens? I get > zeros? SIGSEGV? You get zeroes. It's not designed for that use case right now. It's for malloc implementations to use internally. There would need to be a new feature like MADV_FREE_UNDO for it to be usable for caches and it may make more sense for that to be a separate feature entirely, i.e. have a different flag for marking too (not sure) since it wouldn't need to worry about whether stuff is touched. signature.asc Description: OpenPGP digital signature
Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)
On 13/11/15 02:03 AM, Minchan Kim wrote: > On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote: >>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO >>> easily when we need it. Maybe, that's what you want. Right? >> >> Yes, but why the access bit instead of the dirty bit for that? It could >> always be made more strict (i.e. access bit) in the future, while going >> the other way won't be possible. So I think the dirty bit is really the >> more conservative choice since if it turns out to be a mistake it can be >> fixed without a backwards incompatible change. > > Absolutely true. That's why I insist on dirty bit until now although > I didn't tell the reason. But I thought you wanted to change for using > access bit for the future, too. It seems MADV_FREE start to bloat > over and over again before knowing real problems and usecases. > It's almost same situation with volatile ranges so I really want to > stop at proper point which maintainer should decide, I hope. > Without it, we will make the feature a lot heavy by just brain storming > and then causes lots of churn in MM code without real bebenfit > It would be very painful for us. Well, I don't think you need more than a good API and an implementation with no known bugs, kernel security concerns or backwards compatibility issues. Configuration and API extensions are something for later (i.e. land a baseline, then submit stuff like sysctl tunables). Just my take on it though... signature.asc Description: OpenPGP digital signature
Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)
> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO > easily when we need it. Maybe, that's what you want. Right? Yes, but why the access bit instead of the dirty bit for that? It could always be made more strict (i.e. access bit) in the future, while going the other way won't be possible. So I think the dirty bit is really the more conservative choice since if it turns out to be a mistake it can be fixed without a backwards incompatible change. signature.asc Description: OpenPGP digital signature
Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)
On 13/11/15 01:15 AM, Minchan Kim wrote: > On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote: >>> I also think that the kernel should commit to either zeroing the page >>> or leaving it unchanged in response to MADV_FREE (even if the decision >>> of which to do is made later on). I think that your patch series does >>> this, but only after a few of the patches are applied (the swap entry >>> freeing), and I think that it should be a real guaranteed part of the >>> semantics and maybe have a test case. >> >> This would be a good thing to test because it would be required to add >> MADV_FREE_UNDO down the road. It would mean the same semantics as the >> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably >> value in that for the sake of migrating existing software too. > > So, do you mean that we could implement MADV_FREE_UNDO with "read" > opearation("just access bit marking) easily in future? > > If so, it would be good reason to change MADV_FREE from dirty bit to > access bit. Okay, I will look at that. I just meant testing that the data is either zero or the old data if it's read before it's written to. Not having it stay around once there is a read. Not sure if that's what Andy meant. signature.asc Description: OpenPGP digital signature
Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)
> I also think that the kernel should commit to either zeroing the page > or leaving it unchanged in response to MADV_FREE (even if the decision > of which to do is made later on). I think that your patch series does > this, but only after a few of the patches are applied (the swap entry > freeing), and I think that it should be a real guaranteed part of the > semantics and maybe have a test case. This would be a good thing to test because it would be required to add MADV_FREE_UNDO down the road. It would mean the same semantics as the MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably value in that for the sake of migrating existing software too. For one example, it could be dropped into Firefox: https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferWindows.cpp And in Chromium: https://code.google.com/p/chromium/codesearch#chromium/src/base/memory/discardable_shared_memory.cc Worth noting that both also support the API for pinning/unpinning that's used by Android's ashmem too. Linux really needs a feature like this for caches. Firefox simply doesn't drop the memory at all on Linux right now: https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp (Lock == pin, Unlock == unpin) For reference: https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> active:clean active:dirty*, sigh. signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> I posted a patch doing exactly iovec madvise. Doesn't support MADV_FREE yet > though, but should be easy to do it. > > http://marc.info/?l=linux-mm=144615663522661=2 I think that would be a great way to deal with this. It keeps the nice property of still being able to drop pages in allocations that have been handed out but not yet touched. The allocator just needs to be designed to do lots of purging in one go (i.e. something like an 8:1 active:clean ratio triggers purging and it goes all the way to 16:1). signature.asc Description: OpenPGP digital signature
Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
> From a user perspective, it doesn't depend on swap. It's just slower > without swap because it does what MADV_DONTNEED does. The current > implementation can be dropped in where MADV_DONTNEED was previously used. It just wouldn't replace existing layers of purging logic until that edge case is fixed and it gains better THP integration. It's already a very useful API with significant performance wins over MADV_DONTNEED. The only risk involved in landing it is that a better feature might come along, the worst-case scenario being that the kernel ends up with a synonym for MADV_DONTNEED (but I think there will still be a use case for this even if a pinning/unpinning API existed, as this is more precise). signature.asc Description: OpenPGP digital signature
Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
>>> It probably makes sense to stop thinking about them as anonymous pages >>> entirely at this point when it comes to aging. They're really not. The >>> LRU lists are split to differentiate access patterns and cost of page >>> stealing (and restoring). From that angle, MADV_FREE pages really have >>> nothing in common with in-use anonymous pages, and so they shouldn't >>> be on the same LRU list. >>> >>> That would also fix the very unfortunate and unexpected consequence of >>> tying the lazy free optimization to the availability of swap space. >>> >>> I would prefer to see this addressed before the code goes upstream. >> >> I don't think it would be ideal for these potentially very hot pages to >> be dropped before very cold pages were swapped out. It's the kind of >> tuning that needs to be informed by lots of real world experience and >> lots of testing. It wouldn't impact the API. > > What about them is hot? They contain garbage, you have to write to > them before you can use them. Granted, you might have to refetch > cachelines if you don't do cacheline-aligned populating writes, but > you can do a lot of them before it's more expensive than doing IO. It's hot because applications churn through memory via the allocator. Drop the pages and the application is now churning through page faults and zeroing rather than simply reusing memory. It's not something that may happen, it *will* happen. A page in the page cache *may* be reused, but often won't be, especially when the I/O patterns don't line up well with the way it works. The whole point of the feature is not requiring the allocator to have elaborate mechanisms for aging pages and throttling purging. That ends up resulting in lots of memory held by userspace where the kernel can't reclaim it under memory pressure. If it's dropped before page cache, it isn't going to be able to replace any of that logic in allocators. The page cache is speculative. Page caching by allocators is not really speculative. 
Using MADV_FREE on the pages at all is speculative. The memory is probably going to be reused fairly soon (unless the process exits, and then it doesn't matter), but purging will end up reducing memory usage for the portions that aren't. It would be a different story for a full unpinning/pinning feature since that would have other use cases (speculative caches), but this is really only useful in allocators. >> Whether MADV_FREE is useful as an API vs. something like a pair of >> system calls for pinning and unpinning memory is what should be worried >> about right now. The internal implementation just needs to be correct >> and useful right now, not perfect. Simpler is probably better than it >> being more well tuned for an initial implementation too. > > Yes, it wouldn't impact the API, but the dependency on swap is very > random from a user experience and severely limits the usefulness of > this. It should probably be addressed before this gets released. As > this involves getting the pages off the anon LRU, we need to figure > out where they should go instead. From a user perspective, it doesn't depend on swap. It's just slower without swap because it does what MADV_DONTNEED does. The current implementation can be dropped in where MADV_DONTNEED was previously used. signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> With enough pages at once, though, munmap would be fine, too. That implies lots of page faults and zeroing though. The zeroing alone is a major performance issue. There are separate issues with munmap since it ends up resulting in a lot more virtual memory fragmentation. It would help if the kernel used first-best-fit for mmap instead of the current naive algorithm (bonus: O(log n) worst-case, not O(n)). Since allocators like jemalloc and PartitionAlloc want 2M aligned spans, mixing them with other allocators can also accelerate the VM fragmentation caused by the dumb mmap algorithm (i.e. they make a 2M aligned mapping, some other mmap user does 4k, now there's a nearly 2M gap when the next 2M region is made and the kernel keeps going rather than reusing it). Anyway, that's a totally separate issue from this. Just felt like complaining :). > Maybe what's really needed is a MADV_FREE variant that takes an iovec. > On an all-cores multithreaded mm, the TLB shootdown broadcast takes > thousands of cycles on each core more or less regardless of how much > of the TLB gets zapped. That would work very well. The allocator ends up having a sequence of dirty spans that it needs to purge in one go. As long as purging is fairly spread out, the cost of a single TLB shootdown isn't that bad. It is extremely bad if it needs to do it over and over to purge a bunch of ranges, which can happen if the memory has ended up being very, very fragmentated despite the efforts to compact it (depends on what the application ends up doing). signature.asc Description: OpenPGP digital signature
Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
> Even if we're wrong about the aging of those MADV_FREE pages, their > contents are invalidated; they can be discarded freely, and restoring > them is a mere GFP_ZERO allocation. All other anonymous pages have to > be written to disk, and potentially be read back. > > [ Arguably, MADV_FREE pages should even be reclaimed before inactive > page cache. It's the same cost to discard both types of pages, but > restoring page cache involves IO. ] Keep in mind that this is memory the kernel wouldn't be getting back at all if the allocator wasn't going out of its way to purge it, and allocators aren't going to go out of their way to purge it if it means the kernel is going to steal the pages when there isn't actually memory pressure. An allocator would be using MADV_DONTNEED if it didn't expect that the pages were going to be used again shortly. MADV_FREE indicates that it has time to inform the kernel that they're unused, but they could still be very hot. > It probably makes sense to stop thinking about them as anonymous pages > entirely at this point when it comes to aging. They're really not. The > LRU lists are split to differentiate access patterns and cost of page > stealing (and restoring). From that angle, MADV_FREE pages really have > nothing in common with in-use anonymous pages, and so they shouldn't > be on the same LRU list. > > That would also fix the very unfortunate and unexpected consequence of > tying the lazy free optimization to the availability of swap space. > > I would prefer to see this addressed before the code goes upstream. I don't think it would be ideal for these potentially very hot pages to be dropped before very cold pages were swapped out. It's the kind of tuning that needs to be informed by lots of real-world experience and lots of testing. It wouldn't impact the API. Whether MADV_FREE is useful as an API vs. something like a pair of system calls for pinning and unpinning memory is what should be worried about right now. 
The internal implementation just needs to be correct and useful right now, not perfect. Simpler is probably better than it being more well tuned for an initial implementation too. signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> That's comparable to Android's pinning / unpinning API for ashmem and I > think it makes sense if it's faster. It's different than the MADV_FREE > API though, because the new allocations that are handed out won't have > the usual lazy commit which MADV_FREE provides. Pages in an allocation > that's handed out can still be dropped until they are actually written > to. It's considered active by jemalloc either way, but only a subset of > the active pages are actually committed. There's probably a use case for > both of these systems. Also, consider that MADV_FREE would allow jemalloc to be extremely aggressive with purging when it actually has to do it. It can start with the largest span of memory and it can mark more than strictly necessary to drop below the ratio as there's no cost to using the memory again (not even a system call). Since the main cost is using the system call at all, there's going to be pressure to mark the largest possible spans in one go. It will mean concentration on memory compaction will improve performance. I think that's the right direction for the kernel to be guiding userspace. It will play better with THP than the allocator trying to be very precise with purging based on aging. signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> Compared to MADV_DONTNEED, MADV_FREE's lazy memory free is a huge win to
> reduce page fault. But there is one issue remaining, the TLB flush. Both
> MADV_DONTNEED and MADV_FREE do TLB flush. TLB flush overhead is quite big
> in contemporary multi-thread applications. In our production workload, we
> observed 80% CPU spending on TLB flush triggered by jemalloc
> madvise(MADV_DONTNEED) sometimes. We haven't tested MADV_FREE yet, but
> the result should be similar. It's hard to avoid the TLB flush issue with
> MADV_FREE, because it helps avoid data corruption.
>
> The new proposal tries to fix the TLB issue. We introduce two madvise verbs:
>
> MARK_FREE. Userspace notifies kernel the memory range can be discarded.
> Kernel just records the range in current stage. Should memory pressure
> happen, page reclaim can free the memory directly regardless the pte state.
>
> MARK_NOFREE. Userspace notifies kernel the memory range will be reused
> soon. Kernel deletes the record and prevents page reclaim discards the
> memory. If the memory isn't reclaimed, userspace will access the old
> memory, otherwise do normal page fault handling.
>
> The point is to let userspace notify kernel if memory can be discarded,
> instead of depending on pte dirty bit used by MADV_FREE. With these, no
> TLB flush is required till page reclaim actually frees the memory (page
> reclaim need do the TLB flush for MADV_FREE too). It still preserves the
> lazy memory free merit of MADV_FREE.
>
> Compared to MADV_FREE, reusing memory with the new proposal isn't
> transparent, eg must call MARK_NOFREE. But it's easy to utilize the new
> API in jemalloc.
>
> We don't have code to backup this yet, sorry. We'd like to discuss it if
> it makes sense.

That's comparable to Android's pinning / unpinning API for ashmem and I think it makes sense if it's faster.
It's different from the MADV_FREE API though, because the new allocations that are handed out won't have the usual lazy commit which MADV_FREE provides. Pages in an allocation that's handed out can still be dropped until they are actually written to. It's considered active by jemalloc either way, but only a subset of the active pages are actually committed. There's probably a use case for both of these systems.
Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
> Even if we're wrong about the aging of those MADV_FREE pages, their
> contents are invalidated; they can be discarded freely, and restoring
> them is a mere GFP_ZERO allocation. All other anonymous pages have to
> be written to disk, and potentially be read back.
>
> [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> page cache. It's the same cost to discard both types of pages, but
> restoring page cache involves IO. ]

Keep in mind that this is memory the kernel wouldn't be getting back at all if the allocator wasn't going out of its way to purge it, and it isn't going to go out of its way to purge it if that means the kernel will steal the pages when there isn't actually memory pressure. An allocator would be using MADV_DONTNEED if it didn't expect that the pages were going to be used again shortly. MADV_FREE indicates that it has time to inform the kernel that they're unused, but they could still be very hot.

> It probably makes sense to stop thinking about them as anonymous pages
> entirely at this point when it comes to aging. They're really not. The
> LRU lists are split to differentiate access patterns and cost of page
> stealing (and restoring). From that angle, MADV_FREE pages really have
> nothing in common with in-use anonymous pages, and so they shouldn't
> be on the same LRU list.
>
> That would also fix the very unfortunate and unexpected consequence of
> tying the lazy free optimization to the availability of swap space.
>
> I would prefer to see this addressed before the code goes upstream.

I don't think it would be ideal for these potentially very hot pages to be dropped before very cold pages were swapped out. It's the kind of tuning that needs to be informed by lots of real world experience and lots of testing. It wouldn't impact the API.

Whether MADV_FREE is useful as an API vs. something like a pair of system calls for pinning and unpinning memory is what should be worried about right now.
The internal implementation just needs to be correct and useful right now, not perfect. Simpler is probably better than finely tuned for an initial implementation, too.
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> With enough pages at once, though, munmap would be fine, too.

That implies lots of page faults and zeroing though. The zeroing alone is a major performance issue.

There are separate issues with munmap since it ends up resulting in a lot more virtual memory fragmentation. It would help if the kernel used first-best-fit for mmap instead of the current naive algorithm (bonus: O(log n) worst-case, not O(n)). Since allocators like jemalloc and PartitionAlloc want 2M aligned spans, mixing them with other allocators can also accelerate the VM fragmentation caused by the dumb mmap algorithm (i.e. they make a 2M aligned mapping, some other mmap user does 4k, now there's a nearly 2M gap when the next 2M region is made, and the kernel keeps going rather than reusing it). Anyway, that's a totally separate issue from this. Just felt like complaining :).

> Maybe what's really needed is a MADV_FREE variant that takes an iovec.
> On an all-cores multithreaded mm, the TLB shootdown broadcast takes
> thousands of cycles on each core more or less regardless of how much
> of the TLB gets zapped.

That would work very well. The allocator ends up having a sequence of dirty spans that it needs to purge in one go. As long as purging is fairly spread out, the cost of a single TLB shootdown isn't that bad. It is extremely bad if it needs to be done over and over to purge a bunch of ranges, which can happen if the memory has ended up being very, very fragmented despite the efforts to compact it (depends on what the application ends up doing).
Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
> From a user perspective, it doesn't depend on swap. It's just slower
> without swap because it does what MADV_DONTNEED does. The current
> implementation can be dropped in where MADV_DONTNEED was previously used.

It just wouldn't replace existing layers of purging logic until that edge case is fixed and it gains better THP integration. It's already a very useful API with significant performance wins over MADV_DONTNEED. The only risk involved in landing it is that a better feature might come along, with the worst case being that the kernel ends up with a synonym for MADV_DONTNEED (but I think there would still be a use case for this even if a pinning/unpinning API existed, as this is more precise).
Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
>>> It probably makes sense to stop thinking about them as anonymous pages
>>> entirely at this point when it comes to aging. They're really not. The
>>> LRU lists are split to differentiate access patterns and cost of page
>>> stealing (and restoring). From that angle, MADV_FREE pages really have
>>> nothing in common with in-use anonymous pages, and so they shouldn't
>>> be on the same LRU list.
>>>
>>> That would also fix the very unfortunate and unexpected consequence of
>>> tying the lazy free optimization to the availability of swap space.
>>>
>>> I would prefer to see this addressed before the code goes upstream.
>>
>> I don't think it would be ideal for these potentially very hot pages to
>> be dropped before very cold pages were swapped out. It's the kind of
>> tuning that needs to be informed by lots of real world experience and
>> lots of testing. It wouldn't impact the API.
>
> What about them is hot? They contain garbage, you have to write to
> them before you can use them. Granted, you might have to refetch
> cachelines if you don't do cacheline-aligned populating writes, but
> you can do a lot of them before it's more expensive than doing IO.

It's hot because applications churn through memory via the allocator. Drop the pages and the application is now churning through page faults and zeroing rather than simply reusing memory. It's not something that may happen, it *will* happen. A page in the page cache *may* be reused, but often won't be, especially when the I/O patterns don't line up well with the way it works.

The whole point of the feature is not requiring the allocator to have elaborate mechanisms for aging pages and throttling purging. That ends up resulting in lots of memory held by userspace where the kernel can't reclaim it under memory pressure. If it's dropped before page cache, it isn't going to be able to replace any of that logic in allocators. The page cache is speculative. Page caching by allocators is not really speculative.
Using MADV_FREE on the pages at all is speculative. The memory is probably going to be reused fairly soon (unless the process exits, and then it doesn't matter), but purging will end up reducing memory usage for the portions that aren't. It would be a different story for a full unpinning/pinning feature since that would have other use cases (speculative caches), but this is really only useful in allocators.

>> Whether MADV_FREE is useful as an API vs. something like a pair of
>> system calls for pinning and unpinning memory is what should be worried
>> about right now. The internal implementation just needs to be correct
>> and useful right now, not perfect. Simpler is probably better than it
>> being more well tuned for an initial implementation too.
>
> Yes, it wouldn't impact the API, but the dependency on swap is very
> random from a user experience and severely limits the usefulness of
> this. It should probably be addressed before this gets released. As
> this involves getting the pages off the anon LRU, we need to figure
> out where they should go instead.

From a user perspective, it doesn't depend on swap. It's just slower without swap because it does what MADV_DONTNEED does. The current implementation can be dropped in where MADV_DONTNEED was previously used.
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
On 04/11/15 12:53 AM, Daniel Micay wrote:
>> In the common case it will be passed many pages by the allocator. There
>> will still be a layer of purging logic on top of MADV_FREE but it can be
>> much thinner than the current workarounds for MADV_DONTNEED. So the
>> allocator would still be coalescing dirty ranges and only purging when
>> the ratio of dirty:clean pages rises above some threshold. It would be
>> able to weight the largest ranges for purging first rather than logic
>> based on stuff like aging as is used for MADV_DONTNEED.
>
> I would expect that jemalloc would just start putting the dirty ranges
> into the usual pair of red-black trees (with coalescing) and then doing
> purging starting from the largest spans to get back down below whatever
> dirty:clean ratio it's trying to keep. Right now, it has lots of
> other logic to deal with this since each MADV_DONTNEED call results in
> lots of zeroing and then page faults.

Er, I mean dirty:active (i.e. the ratio of unpurged, dirty pages to ones that are handed out as allocations, which is kept at something like 1:8).

A high constant cost per madvise call but quick handling of each page means that allocators will purge more aggressively than they strictly need to in one go. For example, it might need to purge 2M to meet the ratio but it could have a contiguous span of 32M of dirty pages. If the cost per page is low enough, it could just do the entire range.
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> In the common case it will be passed many pages by the allocator. There
> will still be a layer of purging logic on top of MADV_FREE but it can be
> much thinner than the current workarounds for MADV_DONTNEED. So the
> allocator would still be coalescing dirty ranges and only purging when
> the ratio of dirty:clean pages rises above some threshold. It would be
> able to weight the largest ranges for purging first rather than logic
> based on stuff like aging as is used for MADV_DONTNEED.

I would expect that jemalloc would just start putting the dirty ranges into the usual pair of red-black trees (with coalescing) and then doing purging starting from the largest spans to get back down below whatever dirty:clean ratio it's trying to keep. Right now, it has lots of other logic to deal with this since each MADV_DONTNEED call results in lots of zeroing and then page faults.
Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
> Does this set the write protect bit?
>
> What happens on architectures without hardware dirty tracking?

It's supposed to avoid needing page faults when the data is accessed again, but it can just be implemented via page faults on architectures without a way to check for access or writes. MADV_DONTNEED is also a valid implementation of MADV_FREE if it comes to that (which is what it does on swapless systems for now).

> Using the dirty bit for these semantics scares me. This API creates a
> page that can have visible nonzero contents and then can
> asynchronously and magically zero itself thereafter. That makes me
> nervous. Could we use the accessed bit instead? Then the observable
> semantics would be equivalent to having MADV_FREE either zero the page
> or do nothing, except that it doesn't make up its mind until the next
> read.

FWIW, those are already basically the semantics provided by GCC and LLVM for data the compiler considers uninitialized (they could be more aggressive since C just says it's undefined, but in practice they allow it but can produce inconsistent results even if it isn't touched).

http://llvm.org/docs/LangRef.html#undefined-values

It doesn't seem like there would be an advantage to checking if the data was written to vs. whether it was accessed if checking for both of those is comparable in performance. I don't know enough about that.

>> + ptent = pte_mkold(ptent);
>> + ptent = pte_mkclean(ptent);
>> + set_pte_at(mm, addr, pte, ptent);
>> + tlb_remove_tlb_entry(tlb, pte, addr);
>
> It looks like you are flushing the TLB. In a multithreaded program,
> that's rather expensive. Potentially silly question: would it be
> better to just zero the page immediately in a multithreaded program
> and then, when swapping out, check the page is zeroed and, if so, skip
> swapping it out? That could be done without forcing an IPI.

In the common case it will be passed many pages by the allocator.
There will still be a layer of purging logic on top of MADV_FREE but it can be much thinner than the current workarounds for MADV_DONTNEED. So the allocator would still be coalescing dirty ranges and only purging when the ratio of dirty:clean pages rises above some threshold. It would be able to purge the largest ranges first rather than using logic based on stuff like aging, as is done for MADV_DONTNEED.
Re: [PATCH 0/8] MADV_FREE support
On 01/11/15 12:51 AM, David Rientjes wrote:
> On Fri, 30 Oct 2015, Minchan Kim wrote:
>
>> MADV_FREE is on linux-next so long time. The reason was two, I think.
>>
>> 1. MADV_FREE code on reclaim path was really mess.
>>
>> 2. Andrew really want to see voice of userland people who want to use
>> the syscall.
>>
>> A few month ago, Daniel Micay (jemalloc active contributor) requested me
>> to make progress upstreaming but I was busy at that time so it took
>> so long time for me to revisit the code and finally, I clean it up the
>> mess recently so it solves the #2 issue.
>>
>> As well, Daniel and Jason (jemalloc maintainer) requested it to Andrew
>> again recently and they said it would be great to have even though
>> it has swap dependency now so Andrew decided he will do that for v4.4.
>
> First, thanks very much for refreshing the patchset and reposting after a
> series of changes have been periodically added to -mm, it makes it much
> easier.
>
> For tcmalloc, we can do some things in the allocator itself to increase
> the amount of memory backed by thp. Specifically, we can prefer to
> release Spans to pageblocks that are already not backed by thp so there is
> no additional split on each scavenge. This is somewhat easy if all memory
> is organized into hugepage-aligned pageblocks in the allocator itself.
> Second, we can prefer to release Spans of longer length on each scavenge
> so we can delay scavenging for as long as possible in a hope we can find
> more pages to coalesce. Third, we can discount refaulted released memory
> from the scavenging period.
>
> That significantly improves the amount of memory backed by thp for
> tcmalloc. The problem, however, is that tcmalloc uses MADV_DONTNEED to
> release memory to the system and MADV_FREE wouldn't help at all in a
> swapless environment.
>
> To combat that, I've proposed a new MADV bit that simply caches the
> ranges freed by the allocator per vma and places them on both a per-vma
> and per-memcg list. During reclaim, this list is iterated and ptes are
> freed after thp split period to the normal directed reclaim. Without
> memory pressure, this backs 100% of the heap with thp with a relatively
> lightweight kernel change (the majority is vma manipulation on split) and
> a couple line change to tcmalloc. When pulling memory from the returned
> freelists, the memory that we have MADV_DONTNEED'd, we need to use another
> MADV bit to remove it from this cache, so there is a second madvise(2)
> syscall involved but the freeing call is much less expensive since there
> is no pagetable walk without memory pressure or synchronous thp split.
>
> I've been looking at MADV_FREE to see if there is common ground that could
> be shared, but perhaps it's just easier to ask what your proposed strategy
> is so that tcmalloc users, especially those in swapless environments,
> would benefit from any of your work?

The current implementation requires swap because the kernel already has robust infrastructure for swapping out anonymous memory when there's memory pressure. The MADV_FREE implementation just has to hook in there and cause pages to be dropped instead of swapped out. There's no reason it couldn't be extended to work in swapless environments, but it will take additional design and implementation work. As a stop-gap, I think zram and friends will work fine as a form of swap for this.

It can definitely be improved to cooperate well with THP too. I've been following the progress, and most of the problems seem to have been with THP, and that's a very active area of development. Seems best to deal with that after a simple, working implementation lands.

The best aspect of MADV_FREE is that it completely avoids page faults when there's no memory pressure. Making use of the freed memory only triggers page faults if the pages had to be dropped because the system ran out of memory. It also avoids needing to zero the pages. The memory can also still be freed at any time if there's memory pressure again, even if it's handed out as an allocation, until it's actually touched. The call to madvise still has significant overhead, but it's much cheaper than MADV_DONTNEED.

Allocators will be able to lean on the kernel to make good decisions rather than implementing lazy freeing entirely on their own. It should improve performance *and* behavior under memory pressure since allocators can be more aggressive with it than MADV_DONTNEED.

A nice future improvement would be landing an MADV_FREE_UNDO feature to allow an attempt to pin the pages in memory again. It would make this work very well for implementing caches that are dropped under memory pressure. Windows has this via MEM_RESET (essentially MADV_FREE) and MEM_RESET_UNDO. Android has it for ashmem too (pinning/unpinning). I think browser vendors would be very interested in it.
Re: [PATCH 0/8] MADV_FREE support
On 01/11/15 12:51 AM, David Rientjes wrote: > On Fri, 30 Oct 2015, Minchan Kim wrote: > >> MADV_FREE is on linux-next so long time. The reason was two, I think. >> >> 1. MADV_FREE code on reclaim path was really mess. >> >> 2. Andrew really want to see voice of userland people who want to use >>the syscall. >> >> A few month ago, Daniel Micay(jemalloc active contributor) requested me >> to make progress upstreaming but I was busy at that time so it took >> so long time for me to revist the code and finally, I clean it up the >> mess recently so it solves the #2 issue. >> >> As well, Daniel and Jason(jemalloc maintainer) requested it to Andrew >> again recently and they said it would be great to have even though >> it has swap dependency now so Andrew decided he will do that for v4.4. >> > > First, thanks very much for refreshing the patchset and reposting after a > series of changes have been periodically added to -mm, it makes it much > easier. > > For tcmalloc, we can do some things in the allocator itself to increase > the amount of memory backed by thp. Specifically, we can prefer to > release Spans to pageblocks that are already not backed by thp so there is > no additional split on each scavenge. This is somewhat easy if all memory > is organized into hugepage-aligned pageblocks in the allocator itself. > Second, we can prefer to release Spans of longer length on each scavenge > so we can delay scavenging for as long as possible in a hope we can find > more pages to coalesce. Third, we can discount refaulted released memory > from the scavenging period. > > That significantly improves the amount of memory backed by thp for > tcmalloc. The problem, however, is that tcmalloc uses MADV_DONTNEED to > release memory to the system and MADV_FREE wouldn't help at all in a > swapless environment. > > To combat that, I've proposed a new MADV bit that simply caches the > ranges freed by the allocator per vma and places them on both a per-vma > and per-memcg list. 
During reclaim, this list is iterated and ptes are > freed after thp split period to the normal directed reclaim. Without > memory pressure, this backs 100% of the heap with thp with a relatively > lightweight kernel change (the majority is vma manipulation on split) and > a couple line change to tcmalloc. When pulling memory from the returned > freelists, the memory that we have MADV_DONTNEED'd, we need to use another > MADV bit to remove it from this cache, so there is a second madvise(2) > syscall involved but the freeing call is much less expensive since there > is no pagetable walk without memory pressure or synchronous thp split. > > I've been looking at MADV_FREE to see if there is common ground that could > be shared, but perhaps it's just easier to ask what your proposed strategy > is so that tcmalloc users, especially those in swapless environments, > would benefit from any of your work? The current implementation requires swap because the kernel already has robust infrastructure for swapping out anonymous memory when there's memory pressure. The MADV_FREE implementation just has to hook in there and cause pages to be dropped instead of swapped out. There's no reason it couldn't be extended to work in swapless environments, but it will take additional design and implementation work. As a stop-gap, I think zram and friends will work fine as a form of swap for this. It can definitely be improved to cooperate well with THP too. I've been following the progress, and most of the problems seem to have been with the THP and that's a very active area of development. Seems best to deal with that after a simple, working implementation lands. The best aspect of MADV_FREE is that it completely avoids page faults when there's no memory pressure. Making use of the freed memory only triggers page faults if the pages had to be dropped because the system ran out of memory. It also avoids needing to zero the pages. 
The memory can also still be freed at any time under renewed memory
pressure, even once it's handed out as an allocation, until it's
actually touched. The call to madvise still has significant overhead,
but it's much cheaper than MADV_DONTNEED. Allocators will be able to
lean on the kernel to make good decisions rather than implementing lazy
freeing entirely on their own. It should improve performance *and*
behavior under memory pressure, since allocators can be more aggressive
with it than with MADV_DONTNEED.

A nice future improvement would be landing an MADV_FREE_UNDO feature to
allow an attempt to pin the pages in memory again. It would make this
work very well for implementing caches that are dropped under memory
pressure. Windows has this via MEM_RESET (essentially MADV_FREE) and
MEM_RESET_UNDO. Android has it for ashmem too (pinning/unpinning). I
think browser vendors would be very interested in it.
[PATCH v4] mm: add mremap flag for preserving the old mapping
This introduces the MREMAP_RETAIN flag for preserving the source mapping
when MREMAP_MAYMOVE moves the pages to a new destination. Accesses to
the source mapping will fault and map in fresh zeroed pages.

It is currently limited to writable MAP_PRIVATE|MAP_ANONYMOUS mappings
and will return EFAULT when used on anything else. This covers the
intended use case in general purpose allocators.

For consistency, the old_len >= new_len case could decommit the pages
instead of unmapping. However, userspace can accomplish the same thing
via madvise and the flag is coherent without the additional complexity.

Motivation:

TCMalloc and jemalloc avoid releasing virtual memory in order to reduce
virtual memory fragmentation. A call to munmap or mremap would leave a
hole in the address space. Instead, unused pages are lazily returned to
the operating system via MADV_DONTNEED.

Since mremap cannot be used to elide copies, TCMalloc and jemalloc end
up being significantly slower for patterns like repeated vector / hash
table reallocations. Consider the typical vector building pattern:

#include <string.h>
#include <stdlib.h>

int main(void) {
    for (size_t i = 0; i < 100; i++) {
        void *ptr = NULL;
        size_t old_size = 0;
        for (size_t size = 4; size < (1 << 30); size *= 2) {
            ptr = realloc(ptr, size);
            if (!ptr) return 1;
            memset(ptr + old_size, 0xff, size - old_size);
            old_size = size;
        }
        free(ptr);
    }
}

Transparent huge pages disabled:

glibc (baseline, uses mremap already): 15.051s
jemalloc without MREMAP_RETAIN: 38.540s
jemalloc with MREMAP_RETAIN: 15.086s

Transparent huge pages enabled:

glibc (baseline, uses mremap already): 8.464s
jemalloc without MREMAP_RETAIN: 18.230s
jemalloc with MREMAP_RETAIN: 6.696s

In practice, in-place growth never occurs for huge allocations because
the heap grows in the downwards direction for all 3 allocators.
TCMalloc and jemalloc pay for enormous copies while glibc is only
spending time writing new elements to the vector.
Even if it was grown in the other direction, real-world applications
would end up blocking in-place growth with new allocations. The
allocators could attempt to map the source location again after an
mremap call, but there is no guarantee of success in a multi-threaded
program and fragmenting memory over time is considered unacceptable.

Signed-off-by: Daniel Micay
---
 include/linux/huge_mm.h   |  2 +-
 include/linux/mm.h        |  6 ++
 include/uapi/linux/mman.h |  1 +
 mm/huge_memory.c          |  4 ++--
 mm/memory.c               |  2 +-
 mm/mmap.c                 | 12 +++
 mm/mremap.c               | 52 +++
 7 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb..3c13b20 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -143,7 +143,7 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
					 unsigned long end,
					 long adjust_next)
 {
-	if (!vma->anon_vma || vma->vm_ops)
+	if (!vma->anon_vma || (vma->vm_ops && !vma->vm_ops->allow_huge_pages))
 		return;
 	__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..1e61036 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -273,6 +273,12 @@ struct vm_operations_struct {
 	/* called by sys_remap_file_pages() to populate non-linear mapping */
 	int (*remap_pages)(struct vm_area_struct *vma, unsigned long addr,
			   unsigned long size, pgoff_t pgoff);
+
+	/* Check if the mapping may be duplicated by MREMAP_RETAIN */
+	bool (*may_duplicate)(struct vm_area_struct *vma);
+
+	/* if there is no vm_ops table, this is considered true */
+	bool allow_huge_pages;
 };

 struct mmu_gather;

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index ade4acd..4e9a546 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -5,6 +5,7 @@

 #define MREMAP_MAYMOVE	1
 #define MREMAP_FIXED	2
+#define MREMAP_RETAIN	4

 #define OVERCOMMIT_GUESS	0
 #define OVERCOMMIT_ALWAYS	1

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..350b478 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2077,7 +2077,7 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
	 * page fault if needed.
	 */
	return 0;
-	if (vma->vm_ops)
+	if ((vma->vm_ops && !vma->vm_ops->allow_huge_pages))
		/* khugepaged not yet working on file or special mappings */
		return 0;
	VM_BUG_ON(vma->vm_flags & VM_NO_THP);
Re: [PATCH v3] mm: add mremap flag for preserving the old mapping
On 30/09/14 01:49 PM, Andy Lutomirski wrote:
>
> I think it might pay to add an explicit vm_op to authorize
> duplication, especially for non-cow mappings. IOW this kind of
> extension seems quite magical for anything that doesn't have the
> normal COW semantics, including for plain old read-only mappings.

Adding a vm_ops table to MAP_PRIVATE|MAP_ANONYMOUS mappings has a
significant performance impact. I haven't yet narrowed it down, but
there's at least one code path with a check of `!vma->vm_ops` on the
fast path. One is the transparent huge page fault path, so the
performance impact makes sense. I'll use a simpler implementation for
now since the requirements are very narrow / simple.
Re: [PATCH v3] mm: add mremap flag for preserving the old mapping
On 30/09/14 01:49 PM, Andy Lutomirski wrote:
>
> I think it might pay to add an explicit vm_op to authorize
> duplication, especially for non-cow mappings. IOW this kind of
> extension seems quite magical for anything that doesn't have the
> normal COW semantics, including for plain old read-only mappings.

This sounds like the best way forwards. Setting up the op for private,
anonymous mappings and having it check vm_flags & VM_WRITE dynamically
seems like it would be enough for the intended use case in general
purpose allocators. It can be extended to other mapping types later if
there's a compelling use case.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3] mm: add mremap flag for preserving the old mapping
On 30/09/14 01:53 AM, Andy Lutomirski wrote:
> On Mon, Sep 29, 2014 at 9:55 PM, Daniel Micay wrote:
>> This introduces the MREMAP_RETAIN flag for preserving the source mapping
>> when MREMAP_MAYMOVE moves the pages to a new destination. Accesses to
>> the source location will fault and cause fresh pages to be mapped in.
>>
>> For consistency, the old_len >= new_len case could decommit the pages
>> instead of unmapping. However, userspace can accomplish the same thing
>> via madvise and a coherent definition of the flag is possible without
>> the extra complexity.
>
> IMO this needs very clear documentation of exactly what it does.

Agreed, and thanks for the review. I'll post a slightly modified version
of the patch soon (mostly more commit message changes).

> Does it preserve the contents of the source pages? (If so, why?
> Aren't you wasting a bunch of time on page faults and possibly
> unnecessary COWs?)

The source will act as if it was just created. For an anonymous memory
mapping, it will fault on any accesses and bring in new zeroed pages. In
jemalloc, it replaces an enormous memcpy(dst, src, size) followed by
madvise(src, size, MADV_DONTNEED) with mremap. Using mremap also ends up
eliding the page faults from writes at the destination. TCMalloc has
nearly the same page allocation design, although it tries to throttle
the purging so it won't always gain as much.

> Does it work on file mappings? Can it extend file mappings while it
> moves them?

It works on file mappings. If a move occurs, there will be the usual
extended destination mapping but with the source mapping left intact. It
wouldn't be useful with existing allocators, but in theory a general
purpose allocator could expose an MMIO API in order to reuse the same
address space via MAP_FIXED/MREMAP_FIXED to reduce VM fragmentation.

> If you MREMAP_RETAIN a partially COWed private mapping, what happens?
The original mapping is zeroed in the following test, as it would be
without fork:

#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    size_t size = 1024 * 1024;
    char *orig = mmap(NULL, size, PROT_READ|PROT_WRITE,
                      MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (orig == MAP_FAILED) return 1;
    memset(orig, 5, size);
    int pid = fork();
    if (pid == -1) return 1;
    if (pid == 0) {
        /* partially COW the child's copy */
        memset(orig, 5, 1024);
        char *new = mremap(orig, size, size * 128,
                           MREMAP_MAYMOVE|4 /* MREMAP_RETAIN */);
        if (new == MAP_FAILED || new == orig) return 1;
        for (size_t i = 0; i < size; i++)
            if (new[i] != 5) return 1;
        for (size_t i = 0; i < size; i++)
            if (orig[i] != 0) return 1;
        return 0;
    }
    int status;
    if (wait(&status) == -1) return 1;
    if (WIFEXITED(status)) return WEXITSTATUS(status);
    return 1;
}

Hopefully this is the case you're referring to. :)

> Does it work on special mappings? If so, please prevent it from doing
> so. mremapping x86's vdso is a thing, and duplicating x86's vdso
> should not become a thing, because x86_32 in particular will become
> extremely confused.

I'll add a check for arch_vma_name(vma) == NULL. There's an existing
check for VM_DONTEXPAND | VM_PFNMAP when expanding allocations (the only
case this flag impacts). Are there other kinds of special mappings that
you're referring to?