Re: [kernel-hardening] Re: [PATCH] fork: make whole stack_canary random

2016-10-31 Thread Daniel Micay
On Mon, 2016-10-31 at 22:10 +0100, Florian Weimer wrote:
> * Daniel Micay:
> 
> > > It makes a lot of sense on x86_64 where it means the canary is
> > > still 56 bits. Also, you want -fstack-check for protecting
> > > against stack overflows rather than stack *buffer* overflows. SSP
> > > won't really help you in that regard. Sadly, while -fstack-check
> > > now works well in GCC 6 with little performance cost, it's not
> > > really a
> 
> I think GCC still does not treat the return address push on
> architectures which have such a CALL instruction as an implicit stack
> probe.
> 
> > > complete feature (and Clang implements it as a no-op!).
> 
> How many guard pages at the end of the stack does the kernel
> guarantee?  I saw some -fstack-check-generated code which seemed to
> jump over a single guard page.
>
> The other thing I've seen which could impact the effectiveness of
> -fstack-check: mmap *without* MAP_FIXED and a hint within stack
> allocation can create a mapping inside the stack.  That's rather
> surprising, and I'm not sure if the net result is that there actually
> is a guard page in all cases.

It's not ideal but userspace can work around it. OpenJDK and ART both do
something like walking to the end of the main thread stack during init
to install their own guard region. ART then uses all but the last guard
page it installed as a red zone to handle out-of-stack (not sure about
OpenJDK).

-fstack-check is supposed to handle a single guard page by default, and
a single guard page is all thread stacks get by default.

> > Note: talking about userspace after the entropy bit. The kernel
> > doesn't really need -fstack-check, at least in even slightly sane
> > code...
> 
> There used to be lots of discussions about kernel stack sizes ...

It should just be banning VLAs, alloca, and large stack frames, though,
if it isn't already. There wasn't even support for guard pages on
kernel stacks until recently outside grsecurity, and -fstack-check
relies on them, so it doesn't seem like a great solution for the kernel.

signature.asc
Description: This is a digitally signed message part


Re: [kernel-hardening] Re: [PATCH] fork: make whole stack_canary random

2016-10-31 Thread Daniel Micay
> It makes a lot of sense on x86_64 where it means the canary is still
> 56 bits. Also, you want -fstack-check for protecting against stack
> overflows rather than stack *buffer* overflows. SSP won't really help
> you in that regard. Sadly, while -fstack-check now works well in GCC 6
> with little performance cost, it's not really a complete feature (and
> Clang implements it as a no-op!).

Note: talking about userspace after the entropy bit. The kernel doesn't
really need -fstack-check, at least in even slightly sane code...



Re: [kernel-hardening] Re: [PATCH] fork: make whole stack_canary random

2016-10-31 Thread Daniel Micay
On Mon, 2016-10-31 at 21:45 +0100, Florian Weimer wrote:
> * Jann Horn:
> 
> > On Mon, Oct 31, 2016 at 09:04:02AM -0700, Kees Cook wrote:
> > > On Mon, Oct 31, 2016 at 7:04 AM, Jann Horn  wrote:
> > > > On machines with sizeof(unsigned long)==8, this ensures that the
> > > > more
> > > > significant 32 bits of stack_canary are random, too.
> > > > stack_canary is defined as unsigned long, all the architectures
> > > > with stack
> > > > protector support already pick the stack_canary of init as a
> > > > random
> > > > unsigned long, and get_random_long() should be as fast as
> > > > get_random_int(),
> > > > so there seems to be no good reason against this.
> > > > 
> > > > This should help if someone tries to guess a stack canary with
> > > > brute force.
> > > > 
> > > > (This change has been made in PaX already, with a different
> > > > RNG.)
> > > > 
> > > > Signed-off-by: Jann Horn 
> > > 
> > > Acked-by: Kees Cook 
> > > 
> > > (A separate change might be to make sure that the leading byte is
> > > zeroed. Entropy of the value, I think, is less important than
> > > blocking
> > > canary exposures from unbounded str* functions. Brute forcing
> > > kernel stack canaries isn't like brute forcing them in
> > > userspace...)
> > 
> > Yeah, makes sense. Especially on 64bit, 56 bits of entropy ought to
> > be
> > enough anyway.
> 
> So you two approve of the way glibc does this currently?  (See the
> other thread.)
> 
> I was under the impression that the kernel performs far less
> null-terminated string processing than the average user space
> application, especially on the stack.  (A lot of userspace code
> assumes large stacks and puts essentially arbitrarily long strings
> into VLAs.)

It makes a lot of sense on x86_64, where it means the canary is still 56
bits. Also, you want -fstack-check for protecting against stack
overflows rather than stack *buffer* overflows. SSP won't really help
you in that regard. Sadly, while -fstack-check now works well in GCC 6
with little performance cost, it's not really a complete feature (and
Clang implements it as a no-op!).



Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-10-19 Thread Daniel Micay
On Wed, 2016-10-19 at 07:26 -0300, Arnaldo Carvalho de Melo wrote:
> 
> But self profiling JITs would be useful for non-developers, on Android
> (anywhere, really), and for that it would require being able to at
> least, well, self profile, using sys_perf_event_open() by a normal
> process, limited to profiling itself, no?
> 
> This not being possible, self profiling will use some other means, its
> like sys_perf_event_open() never existed for them.
> 
> - Arnaldo

It would defeat the purpose of the security feature if it was exposed to
apps on Android. There are other ways for Chromium (including WebView)
and the ART JIT to profile. AFAIK, they've never actually considered
using perf for their JIT profiling. It wasn't something that anyone
cared about when this was implemented.

Chromium would *certainly* never use perf for this. They use a much
tighter sandbox than the Android app sandbox. It doesn't have system
calls like open(), let alone perf events. Their seccomp-bpf usage is to
reduce kernel attack surface, since it has proven to be the easiest way
out of a sandbox. They don't even allow futex() without filtering based
on the parameters anymore; that proved to be a major issue.

The only case where untrusted apps declaring the privileges they need
actually works out is when the privileges are high-level controls over
access to a user's personal information. That way, they can be exposed
to the user with the ability to reject access when it's first needed or
toggle it off later. The strength of the app sandbox isn't something
that end users can be responsible for. Permissions also don't work if
apps bypass them with local privilege escalation vulnerabilities.

Even for the base system, no perf access is better than having it used
anywhere. The difference is only that the base system can actually be
trusted to declare what it needs, but those components can be exploited
so it's still best if they are trusted as little as possible at runtime.

I could see Android completely removing unprivileged access down the
road too, not as a form of access control (apps already run as unique
uid/gid pairs, etc.) but to remove a non-trivial form of kernel attack
surface. The perf API isn't being singled out here. It just happened to
be the first case where a kernel API was restricted to developer usage
on Android. Expect more of this.

If you want perf exposed, make it secure. Stop writing kernel code in a
memory unsafe language and start using isolation. Not going to happen in
all likelihood, so kernel code will be increasingly walled off instead
since it's the biggest liability on the system. Fixing bugs on a case by
case basis doesn't work for systems that need even basic security.



Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-10-19 Thread Daniel Micay
On Wed, 2016-10-19 at 10:41 +0100, Mark Rutland wrote:
> On Mon, Oct 17, 2016 at 10:54:33AM -0400, Daniel Micay wrote:
> > On Mon, 2016-10-17 at 14:44 +0100, Mark Rutland wrote:
> > > It's also my understanding that for Android, perf_event_paranoid
> > > is
> > > lowered when the user enables developer mode (rather than only
> > > when an
> > > external debugger is attached); is that correct?
> > 
> > It's exposed as a "system property" marked as writable by the shell
> > user, so the Android Debug Bridge shell can lower it. The debugging
> > tools learned how to toggle it off automatically when they're used.
> > It
> > intentionally isn't a persist. prefixed property so the setting also
> > goes away on reboot.
> > 
> > ADB (incl. the shell user) isn't available until developer mode is
> > enabled + ADB is toggled on in the developer settings, and then it
> > still
> > requires whitelisting keys.
> 
> Ah; so I'd misunderstood somewhat.
> 
> I was under the (clearly mistaken) impression that this was lowered
> when
> developer mode was enabled, rather than only when it was both enabled
> and ADB was connected, for example.
> 
> Thanks for clearing that up!

ADB provides a shell as the 'shell' user, and that user has the ability
to toggle the sysctl. So profiling tools were able to be taught to do
that automatically. It's the only way that the 'shell' user is actually
exposed. For example, a terminal app just runs in the untrusted_app
SELinux domain as a unique unprivileged uid/gid pair, not as the much
more privileged ADB 'shell' domain.

So it doesn't actually get toggled off if you use ADB to do something else.

ADB itself is pretty much comparable to SSH, but over USB (i.e. a
key-based way of getting a shell).

The 'shell' user has tools like 'run-as' to be able to run things as
various apps (if they're marked debuggable), so in theory it could be
finer-grained and act only there, for the app being debugged. It would
be really hard to cover all use cases, though, and there may be things
to debug other than apps (and in an Android 'user' build the base
system itself isn't very debuggable anyway; you really need 'userdebug'
or 'eng', which isn't what ships on end-user devices).



Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-10-18 Thread Daniel Micay
On Tue, 2016-10-18 at 13:48 -0700, Kees Cook wrote:
> On Mon, Oct 17, 2016 at 6:44 AM, Mark Rutland 
> wrote:
> > Hi,
> > 
> > Attempt to revive discussions below...
> > 
> > On Wed, Jul 27, 2016 at 07:45:46AM -0700, Jeff Vander Stoep wrote:
> > > When kernel.perf_event_paranoid is set to 3 (or greater), disallow
> > > all access to performance events by users without CAP_SYS_ADMIN.
> > > 
> > > This new level of restriction is intended to reduce the attack
> > > surface of the kernel. Perf is a valuable tool for developers but
> > > is generally unnecessary and unused on production systems. Perf
> > > may
> > > open up an attack vector to vulnerable device-specific drivers as
> > > recently demonstrated in CVE-2016-0805, CVE-2016-0819,
> > > CVE-2016-0843, CVE-2016-3768, and CVE-2016-3843. This new level of
> > > restriction allows for a safe default to be set on production
> > > systems
> > > while leaving a simple means for developers to grant access [1].
> > > 
> > > This feature is derived from CONFIG_GRKERNSEC_PERF_HARDEN by Brad
> > > Spengler. It is based on a patch by Ben Hutchings [2]. Ben's
> > > patches
> > > have been modified and split up to address on-list feedback.
> > > 
> > > kernel.perf_event_paranoid=3 is the default on both Debian [2] and
> > > Android [3].
> > 
> > While people weren't particularly happy with this global toggle
> > approach, my understanding from face-to-face discussions at LSS2016
> > was
> > that people were happy with a more scoped restriction (e.g. using
> > capabilities or some other access control mechanism), but no-one had
> > the
> > time to work on that.
> > 
> > Does that match everyone's understanding, or am I mistaken?
> 
> That's correct: some kind of finer-grain control would be preferred to
> the maintainer, but no one has had time to work on it. (The =3 sysctl
> setting present in Android, Debian, and Ubuntu satisfies most people.)
> 
> -Kees

It's also worth noting that fine-grained control via a scoped mechanism
would likely only be used to implement *more restrictions* on Android,
not to make the feature less aggressive. It's desirable for perf events
to be disabled by default for non-root across the board on Android. The
part that's imperfect is that when a developer uses a profiling tool,
unprivileged usage is automatically enabled across the board until
reboot. Ideally, it would be enabled only for the scope where it's
needed. It would be very tricky to implement though, especially without
adding friction, and it would only have value for protecting devices
being used for development. It really doesn't seem to be worth the
trouble, especially since it doesn't persist on reboot. It's only a
temporary security hole and only for developer devices.



Re: [kernel-hardening] [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-10-17 Thread Daniel Micay
On Mon, 2016-10-17 at 14:44 +0100, Mark Rutland wrote:
> Hi,
> 
> Attempt to revive discussions below...
> 
> On Wed, Jul 27, 2016 at 07:45:46AM -0700, Jeff Vander Stoep wrote:
> > When kernel.perf_event_paranoid is set to 3 (or greater), disallow
> > all access to performance events by users without CAP_SYS_ADMIN.
> > 
> > This new level of restriction is intended to reduce the attack
> > surface of the kernel. Perf is a valuable tool for developers but
> > is generally unnecessary and unused on production systems. Perf may
> > open up an attack vector to vulnerable device-specific drivers as
> > recently demonstrated in CVE-2016-0805, CVE-2016-0819,
> > CVE-2016-0843, CVE-2016-3768, and CVE-2016-3843. This new level of
> > restriction allows for a safe default to be set on production
> > systems
> > while leaving a simple means for developers to grant access [1].
> > 
> > This feature is derived from CONFIG_GRKERNSEC_PERF_HARDEN by Brad
> > Spengler. It is based on a patch by Ben Hutchings [2]. Ben's patches
> > have been modified and split up to address on-list feedback.
> > 
> > kernel.perf_event_paranoid=3 is the default on both Debian [2] and
> > Android [3].
> 
> While people weren't particularly happy with this global toggle
> approach, my understanding from face-to-face discussions at LSS2016
> was
> that people were happy with a more scoped restriction (e.g. using
> capabilities or some other access control mechanism), but no-one had
> the
> time to work on that.
> 
> Does that match everyone's understanding, or am I mistaken?
> 
> It's also my understanding that for Android, perf_event_paranoid is
> lowered when the user enables developer mode (rather than only when an
> external debugger is attached); is that correct?

It's exposed as a "system property" marked as writable by the shell
user, so the Android Debug Bridge shell can lower it. The debugging
tools learned how to toggle it off automatically when they're used. It
intentionally isn't a 'persist.'-prefixed property, so the setting also
goes away on reboot.

ADB (incl. the shell user) isn't available until developer mode is
enabled + ADB is toggled on in the developer settings, and then it still
requires whitelisting keys.
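Concretely, the toggle the debugging tools perform looks something like
this over an authorized ADB connection (the `security.perf_harden`
property name, and its init-driven mapping onto the sysctl, is how
current Android profilers do it, but treat the exact name as an
assumption here):

```shell
# Lower kernel.perf_event_paranoid via the shell-writable property
# (assumed name: security.perf_harden):
adb shell setprop security.perf_harden 0
adb shell getprop security.perf_harden     # confirm the toggle
# Not a "persist."-prefixed property, so a reboot restores the default.
```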



Re: [kernel-hardening] Re: [PATCH 2/2] arm: apply more __ro_after_init

2016-08-10 Thread Daniel Micay
On Wed, 2016-08-10 at 10:43 +0100, Russell King - ARM Linux wrote:
> On Fri, Jun 03, 2016 at 11:40:24AM -0700, Kees Cook wrote:
> > 
> > @@ -1309,16 +1309,11 @@ void __init arm_mm_memblock_reserve(void)
> >   * Any other function or debugging method which may touch any
> > device _will_
> >   * crash the kernel.
> >   */
> > +static char vectors[PAGE_SIZE * 2] __ro_after_init
> > __aligned(PAGE_SIZE);
> >  static void __init devicemaps_init(const struct machine_desc
> > *mdesc)
> >  {
> >     struct map_desc map;
> >     unsigned long addr;
> > -   void *vectors;
> > -
> > -   /*
> > -    * Allocate the vector page early.
> > -    */
> > -   vectors = early_alloc(PAGE_SIZE * 2);
> 
> This one is not appropriate.  We _do_ write to these pages after init
> for FIQ handler updates.  See set_fiq_handler().

This is one of the many cases where pax_open_kernel()/pax_close_kernel()
are needed to make the region temporarily writable again. From
grsecurity:

@@ -95,7 +95,10 @@ void set_fiq_handler(void *start, unsigned int length)
    void *base = vectors_page;
    unsigned offset = FIQ_OFFSET;
 
+   pax_open_kernel();
    memcpy(base + offset, start, length);
+   pax_close_kernel();



Re: [kernel-hardening] Re: [PATCH 2/2] arm: apply more __ro_after_init

2016-08-10 Thread Daniel Micay
On Wed, 2016-08-10 at 10:43 +0100, Russell King - ARM Linux wrote:
> On Fri, Jun 03, 2016 at 11:40:24AM -0700, Kees Cook wrote:
> > 
> > @@ -1309,16 +1309,11 @@ void __init arm_mm_memblock_reserve(void)
> >   * Any other function or debugging method which may touch any
> > device _will_
> >   * crash the kernel.
> >   */
> > +static char vectors[PAGE_SIZE * 2] __ro_after_init
> > __aligned(PAGE_SIZE);
> >  static void __init devicemaps_init(const struct machine_desc
> > *mdesc)
> >  {
> >     struct map_desc map;
> >     unsigned long addr;
> > -   void *vectors;
> > -
> > -   /*
> > -    * Allocate the vector page early.
> > -    */
> > -   vectors = early_alloc(PAGE_SIZE * 2);
> 
> This one is not appropriate.  We _do_ write to these pages after init
> for FIQ handler updates.  See set_fiq_handler().

This is one of the many cases where pax_open_kernel/pax_close_kernel are
needed to temporarily toggle it read-only. From grsecurity:

@@ -95,7 +95,10 @@ void set_fiq_handler(void *start, unsigned int
length)
    void *base = vectors_page;
    unsigned offset = FIQ_OFFSET;
 
+   pax_open_kernel();
    memcpy(base + offset, start, length);
+   pax_close_kernel();



Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-04 Thread Daniel Micay
> My claim was not that the mainline code was impressively perfect, but
> rather that the vendor code was worse, countering a prior claim
> otherwise. Hence, reality.

You're arguing with a straw man.

I was responding to a comment about out-of-tree code, not generic
architecture perf drivers vs. alternative versions by SoC vendors.

Qualcomm and other vendors landing their drivers in mainline would be
nice, but it wouldn't make it inherently higher quality. I don't really
see what it has to do with this, which is why I responded...


Re: [kernel-hardening] [PATCH] [RFC] Introduce mmap randomization

2016-08-04 Thread Daniel Micay
On Thu, 2016-08-04 at 16:55 +, Roberts, William C wrote:
> > 
> > -Original Message-
> > From: Daniel Micay [mailto:danielmi...@gmail.com]
> > Sent: Thursday, August 4, 2016 9:53 AM
> > To: kernel-harden...@lists.openwall.com; ja...@lakedaemon.net;
> > linux-
> > m...@vger.kernel.org; linux-kernel@vger.kernel.org; akpm@linux-
> > foundation.org
> > Cc: keesc...@chromium.org; gre...@linuxfoundation.org; n...@google.co
> > m;
> > je...@google.com; saly...@android.com; dcash...@android.com
> > Subject: Re: [kernel-hardening] [PATCH] [RFC] Introduce mmap
> > randomization
> > 
> > On Tue, 2016-07-26 at 11:22 -0700, william.c.robe...@intel.com
> > wrote:
> > > 
> > > The recent get_random_long() change in get_random_range() and then
> > > the
> > > subsequent patches Jason put out, all stemmed from my tinkering
> > > with
> > > the concept of randomizing mmap.
> > > 
> > > Any feedback would be greatly appreciated, including any feedback
> > > indicating that I am idiot.
> > 
> > The RAND_THREADSTACK feature in grsecurity makes the gaps the way I
> > think
> > would be ideal, i.e. tracked as part of the appropriate VMA. It
> > would be
> > straightforward to make it more general purpose.
> 
> I am not familiar with that, thanks for pointing it out. I'll take a
> look when my time
> frees up for this again.

I'm actually wrong about that now that I look more closely...


Re: [kernel-hardening] [PATCH] [RFC] Introduce mmap randomization

2016-08-04 Thread Daniel Micay
On Tue, 2016-07-26 at 11:22 -0700, william.c.robe...@intel.com wrote:
> The recent get_random_long() change in get_random_range() and then the
> subsequent patches Jason put out, all stemmed from my tinkering
> with the concept of randomizing mmap.
> 
> Any feedback would be greatly appreciated, including any feedback
> indicating that I am an idiot.

The RAND_THREADSTACK feature in grsecurity makes the gaps the way I
think would be ideal, i.e. tracked as part of the appropriate VMA. It
would be straightforward to make it more general purpose.


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-04 Thread Daniel Micay
On Thu, 2016-08-04 at 17:10 +0100, Mark Rutland wrote:
> On Thu, Aug 04, 2016 at 11:44:28AM -0400, Daniel Micay wrote:
> > 
> > Qualcomm's drivers might be lower quality than core kernel code, but
> > they're way above the baseline set by mainline kernel drivers...
> 
> I don't think that's true for the arm/arm64 perf code.

The baseline architecture support is essentially core kernel code. I
agree it's much better than the SoC vendor code. You're spending a lot
of time auditing, fuzzing and improving the code in general, which is
not true for most drivers. They don't get that attention.

> I think we've done a reasonable job of testing and fixing those, along
> with core infrastructure issues. The perf fuzzer runs for a very long
> time on a mainline kernel without issues, while on my Nexus 5x I get a
> hard lockup after ~85 seconds (and prior to the last android update
> the
> lockup was instantaneous).
>
> From my personal experience (and as above), and talking specifically
> about PMU drivers, I think that the opposite is true. This is not to
> say
> there aren't issues; I would not be surprised if there are. But it's
> disingenuous to say that mainline code is worse than that which exists
> in a vendor kernel when the latter is demonstrably much easier to
> break
> than the former.

I wasn't talking specifically about perf.

> If there are issues you are aware of, please report them. If those
> issues only exist in non-upstream code, then the applicable concerns
> are
> somewhat different (though certainly still exist).

I'm not going to do volunteer work for a corporation. I've learned that
lesson after spending years doing it.

> But please, let's frame the argument to match reality.

The argument is framed in reality. Stating that it now often takes a few
hours to find a vulnerability with the unaltered, widely known public
perf fuzzer is not impressive. It's really an argument for claiming that
it's a significant security issue.


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-04 Thread Daniel Micay
On Thu, 2016-08-04 at 16:11 +0200, Peter Zijlstra wrote:
> On Thu, Aug 04, 2016 at 09:45:23AM -0400, Daniel Micay wrote:
> > 
> > Qualcomm's perf driver is out-of-tree along with most of their other
> > drivers. 
> 
> 
> So you're asking us to maim upstream perf for some out of tree junk?
> Srously? *plonk*

This feature doesn't come from Android. The perf events subsystem in the
mainline kernel is packed full of vulnerabilities too. The problem is so
bad that pointing one of the public fuzzers at it for a short period of
time is all that's required to start finding them.

Qualcomm's drivers might be lower quality than core kernel code, but
they're way above the baseline set by mainline kernel drivers...

Shining the same light on mainline drivers wouldn't be pretty. The work
going into hardening the Qualcomm drivers isn't happening upstream to
any comparable extent.


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-04 Thread Daniel Micay
On Thu, 2016-08-04 at 11:28 +0100, Mark Rutland wrote:
> On Wed, Aug 03, 2016 at 03:36:16PM -0400, Daniel Micay wrote:
> > 
> > There's a lot of architecture and vendor specific perf events code
> > and
> > lots of bleeding edge features. On Android, a lot of the perf events
> > vulnerabilities have been specific to the Qualcomm SoC platform.
> > Other
> > platforms are likely just receiving a lot less attention.
> 
> Are the relevant perf drivers for those platforms upstream? I've seen
> no
> patches addressing security issues in the ARMv7 krait+Scorpion PMU
> driver since it was added, and there's no ARMv8 QCOM PMU driver.
> 
> If there are outstanding issues, please report them upstream.
> 
> FWIW, I've used Vince Weaver's perf fuzzer to test the ARM PMU code
> (both the framework and drivers), so other platforms are seeing some
> attention. That said, I haven't done that recently.

Qualcomm's perf driver is out-of-tree along with most of their other
drivers. Their drivers add up to a LOT of code shared across over a
billion mobile devices, leading to the focus on them. It also helps that
there are bounties for Nexus devices, so there are multi-thousand-dollar
rewards for bugs in the Qualcomm drivers compared to nothing for other
platforms / drivers. Now that perf is only available via ADB debugging,
further perf bugs no longer technically qualify for their bounties (but
they might still pay, I don't know).


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-03 Thread Daniel Micay
> One of the strengths of linux is applications of features the authors
> of
> the software had not imagined.  Your proposals seem to be trying to
> put
> the world a tiny little box where if someone had not imagined and
> preapproved a use of a feature it should not happen.   Let's please
> avoid implementing totalitarianism to avoid malicious code exploiting
> bugs in the kernel.  I am not interested in that future.

You're describing operating systems like Android, ChromeOS and iOS.

That future is already here and the Linux kernel is the major weak point
in the attempts to build those systems based on Linux. Even for the very
restricted Chrome sandbox, it's the easiest way out.

Android similarly allows near zero access to /sys for apps and little
access to /proc beyond the /proc/PID directories belonging to an app.

> Especially when dealing with disabling code to reduce attack surface,
> when there are no known attacks, what we are actually dealing with
> is a small percentage probability reduction that a malicious attacker
> will be able to exploit the attack.

There are perf events vulnerabilities being exploited in the wild to
gain root on Android. It's not a theoretical attack vector. They're used
in both malware and rooting tools. Local privilege escalation bugs in
the kernel are common so there are a lot of alternatives but it's one of
the major sources for vulnerabilities. There's a lot of architecture and
vendor specific perf events code and lots of bleeding edge features. On
Android, a lot of the perf events vulnerabilities have been specific to
the Qualcomm SoC platform. Other platforms are likely just receiving a
lot less attention.

> Remember security is as much about availability as it is about
> integrity.  You keep imagining features that are great big denial of
> service attacks on legitimate users.

Only developers care about perf events and they still have access to it.
JIT compilers have other ways to do tracing and even if they consider
this to be the ideal API, it's not particularly important if they have
to settle for something else. In reality, it's a small compromise.

> I vote for sandboxes.  Perhaps seccomp.  Perhaps a per userns sysctl.
> Perhaps something else.

It's not possible to use the current incarnation of seccomp for this
since it can't be dynamically granted/revoked. Perhaps it would be
possible to support adding/removing or at least toggling seccomp filters
for groups of processes. That would be good enough to take care of user
ns, ptrace, perf events, etc.
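
To make the limitation concrete, here is a minimal seccomp-bpf filter (a
sketch, with a function name of my choosing) that makes perf_event_open()
fail with EACCES for the calling process. It also demonstrates the problem
described above: once installed, the filter can never be removed or relaxed
for the lifetime of the process:

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Install a filter rejecting perf_event_open() with EACCES and allowing
 * everything else. NOTE: a production filter must also validate
 * seccomp_data->arch before trusting the syscall number. */
static int install_perf_block(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			 __NR_perf_event_open, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EACCES),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
		.filter = filter,
	};

	/* Required so unprivileged processes may install filters. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Granting or revoking this dynamically per group of processes, as suggested
above, would need kernel support that doesn't exist today.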


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-03 Thread Daniel Micay
Having this in Yama would also make it probable that there would be a
security-centric default. It would end up wiping out unprivileged perf
events access on distributions using Yama for ptrace_scope unless they
make the explicit decision to disable it. Having the perf subsystem
extend the existing perf_event_paranoid sysctl leaves the control over
the upstream default in the hands of the perf subsystem, not LSMs.


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-03 Thread Daniel Micay
> The default has no impact on the "it's too coarse and limiting"
> negative property 
> of this patch, which is the show-stopper aspect. Please fix that
> aspect instead of 
> trying to argue around it.

Disabling perf events in the kernel configuration is even more limiting,
and is currently the alternative to doing it this way. It makes sense
for debugging features to be disabled in production releases of products
unless there's a way to control them where risk isn't increased. Having
a way to toggle it dynamically will allow it to remain available.
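
For reference, the control being extended is the kernel.perf_event_paranoid
sysctl. A small illustrative helper (my own, not kernel code) summarizing
the levels, where -1 through 2 are the existing upstream semantics and 3 is
the restriction added by this patch:

```c
/* Summarize kernel.perf_event_paranoid levels. Each level includes the
 * restrictions of the levels below it. */
static const char *paranoid_desc(int level)
{
	if (level < 0)
		return "no restrictions";
	switch (level) {
	case 0:
		return "no raw tracepoint access for unprivileged users";
	case 1:
		return "no CPU-wide events for unprivileged users";
	case 2:
		return "no kernel profiling for unprivileged users";
	default:	/* >= 3: the proposed Android/Debian restriction */
		return "no perf_event_open() at all without CAP_SYS_ADMIN";
	}
}
```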

> This isn't some narrow debugging mechanism we can turn on/off globally
> and forget 
> about, this is a wide scope performance measurement and event logging 
> infrastructure that is being utilized not just by developers but by
> apps and 
> runtimes as well.

The incredibly wide scope is why it's such a big security problem. If it
wasn't such a frequent source of vulnerabilities, it wouldn't have been
disabled for unprivileged users in grsecurity, Debian and then Android.
It's extended by lots of platform-specific vendor code too, so it isn't
just core kernel code that's properly reviewed.

I don't think there are runtimes using this for JIT tracing. Perhaps it
doesn't actually suit their needs. It's a theoretical use case.

> > Let me take this another way instead. What would be a better way to
> > provide a 
> > mechanism for system owners to disable perf without an LSM? (Since
> > far fewer 
> > folks run with an enforcing "big" LSM: I'm seeking as wide a
> > coverage as 
> > possible.)
> 
> Because in practice what will happen is that if the only option is to
> do something 
> drastic for sekjurity, IT departments will do it - while if there's a
> more 
> flexible mechanism that does not throw out the baby with the bath
> water that is 
> going to be used.

It's already done: Android and Debian disable it for unprivileged users
by default. It's already done for most of desktop and mobile Linux and
perhaps even most servers. Not providing functionality desired by
downstream doesn't mean it won't be provided there.

They'll keep doing it whether or not this lands. If it doesn't land, it
will only mean that mainline kernels aren't usable for making Android
devices. They need to pass the compatibility test suite, which means
having this. The mechanism could change but I don't see why the actual
requirement would.

> This is as if 20 years ago you had submitted a patch to the early
> Linux TCP/IP 
> networking code to be on/off via a global sysctl switch and told
> people that 
> "in developer mode you can have networking, talk to your admin".
> 
> We'd have told you: "this switch is too coarse and limiting, please
> implement 
> something better, like a list of routes which defines which IP ranges
> are 
> accessible, and a privileged range of listen sockets ports and some
> flexible 
> kernel side filtering mechanism to inhibit outgoing/incoming
> connections".
> 
> Global sysctls are way too coarse.

The end result of submitting an LSM hook instead of using this would
probably come down to having a global sysctl toggle in Yama. There would
be two sysctl controls for perf restrictions rather than one which is
needless complexity for the interface.

Android and Debian would be using a fine-grained perf LSM to accomplish
the same thing: globally disabling it for unprivileged users when not
doing development. The nice thing about fine-grained control would be
implementing a *more* restrictive policy for the case when it's toggled
on. It wouldn't necessarily have to be globally enabled, only enabled
for the relevant user or even a specific process, but that would be
complicated to implement.


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-02 Thread Daniel Micay
> > So the problem I have with this is that it will completely inhibit
> > development of things like JITs that self-profile to re-compile
> > frequently used code.
> 
> Or reimplement strace with sys_perf_event_open(), speeding it up
> greatly
> by not using ptrace (see 'perf trace', one such attempt), combining it
> with sys_bpf(), which can run unprivileged as well, provides lots of
> possibilities for efficient tooling that would be greatly stifled by
> such big hammer restrictions :-(

The usage on Android wouldn't impact strace. It's a debugging tool used
over the debugging shell so it could be taught to toggle on unprivileged
access to perf events as the other tools using the API were.


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-02 Thread Daniel Micay
On Tue, 2016-08-02 at 11:52 +0200, Peter Zijlstra wrote:
> On Wed, Jul 27, 2016 at 07:45:46AM -0700, Jeff Vander Stoep wrote:
> > 
> > When kernel.perf_event_paranoid is set to 3 (or greater), disallow
> > all access to performance events by users without CAP_SYS_ADMIN.
> > 
> > This new level of restriction is intended to reduce the attack
> > surface of the kernel. Perf is a valuable tool for developers but
> > is generally unnecessary and unused on production systems. Perf may
> > open up an attack vector to vulnerable device-specific drivers as
> > recently demonstrated in CVE-2016-0805, CVE-2016-0819,
> > CVE-2016-0843, CVE-2016-3768, and CVE-2016-3843.
> 
> We have bugs, we fix them; we don't kill complete infrastructure
> because of them.

It's still accessible to privileged processes either way. Android still
allows access from unprivileged processes but it can only be enabled via
the debugging shell, which is not enabled by default either.

It isn't even possible to disable the perf events infrastructure via
kernel configuration for every architecture right now. You're forcing
people to have common local privilege escalation and information leak
vulnerabilities for something few people actually use.

This patch is now a requirement for any Android devices with a security
patch level above August 2016. The only thing that not merging it is
going to accomplish is preventing a mainline kernel from ever being used
on Android devices, unless you provide an alternative it can use for the
same use case.

https://source.android.com/security/bulletin/2016-08-01.html

> > This new level of
> > restriction allows for a safe default to be set on production
> > systems
> > while leaving a simple means for developers to grant access [1].
> 
> So the problem I have with this is that it will completely inhibit
> development of things like JITs that self-profile to re-compile
> frequently used code.
> 
> I would much rather have an LSM hook where the security stuff can do
> more fine-grained control of things. Allowing some apps perf usage
> while denying others.

If the only need was controlling access per-process statically, then
using seccomp-bpf works fine. It needs to be dynamic though. I don't
think SELinux could be used to provide the functionality so it would
have to be a whole new LSM. I doubt anyone will implement that when the
necessary functionality is already available. It's already exposed only
for developers using profiling tools until they reboot, so finer grained
control wouldn't accomplish much.



Re: [kernel-hardening] Re: [PATCH] [RFC] Introduce mmap randomization

2016-07-31 Thread Daniel Micay
> > It's very hard to quantify the benefits of fine-grained
> > randomization,
> 
> ?  N = # of possible addresses.  The bigger N is, the more chances the
> attacker will trip up before finding what they were looking for.

If the attacker is forcing the creation of many objects with a function
pointer and then trying to hit one, the only thing that would help is if
the heap is very sparse with random bases within it. They don't need to
hit a specific object for an exploit to work.

The details of how the randomization is done and the guarantees that are
provided certainly matter. Individual random gaps are low entropy and
they won't add up to much higher entropy randomization even for two
objects that are far apart. The entropy has no chance to build up since
the sizes will average out.

I'm not saying it doesn't make sense to do this (it's a feature that I
really want), but there are a lot of ways to approach fine-grained mmap
randomization and the design decisions should be justified and their
impact analyzed/measured.

> > but there are other useful guarantees you could provide. It would be
> > quite helpful for the kernel to expose the option to force a
> > PROT_NONE
> > mapping after every allocation. The gaps should actually be
> > enforced.
> > 
> > So perhaps 3 things, simply exposed as off-by-default sysctl options
> > (no need for special treatment on 32-bit):
> 
> I'm certainly not an mm-developer, but this looks to me like we're
> pushing the work of creating efficient, random mappings out to
> userspace.  :-/

Exposing configuration doesn't push work to userspace. I can't see any
way that this would be done by default even on 64-bit due to the extra
VMAs, so it really needs configuration.

> > a) configurable minimum gap size in pages (for protection against
> > linear and small {under,over}flows) b) configurable minimum gap size
> > based on a ratio to allocation size (for making the heap sparse to
> > mitigate heap sprays, especially when mixed with fine-grained
> > randomization - for example 2x would add a 2M gap after a 1M
> > mapping)
> 
> mmm, this looks like an information leak.  Best to set a range of
> pages
> and pick a random number within that range for each call.

A minimum gap size provides guarantees not offered by randomization
alone. It might not make sense to approach making the heap sparse by
forcing it separately from randomization, but doing it isn't leaking
information.

Obviously the random gap would be chosen by picking a maximum size (n)
and choosing a size between [0, n], which was the next potential
variable, separate from these non-randomization-related guarantees.

> > c) configurable maximum random gap size (the random gap would be in
> > addition to the enforced minimums)
> > 
> > The randomization could just be considered an extra with minor
> > benefits rather than the whole feature. A full fine-grained
> > randomization implementation would need a higher-level form of
> > randomization than gaps in the kernel along with cooperation from
> > userspace allocators. This would make sense as one part of it
> > though.
> 
> Ok, so here's an idea.  This idea could be used in conjunction with
> random gaps, or on it's own.  It would be enhanced by userspace random
> load order.
> 
> The benefit is that with 32bit address space, and no random gapping,
> it's still not wasting much space.
> 
> Given a memory space, break it up into X bands such that there are 2*X
> possible addresses.
> 
>   |A B|C D|E F|G H| ... |2*X-2  2*X-1|
>   |--> <--|--> <--|--> <--|--> <--| ... |-->  <--|
> min  max
> 
> For each call to mmap, we randomly pick a value within [0 - 2*X).
> Assuming A=0 in the diagram above, even values grow up, odd values
> grow
> down.  Gradually consuming the single gap in the middle of each band.
> 
> How many bands to use would depend on:
>   * 32/64bit
>   * Average number of mmap calls
>   * largest single mmap call usually seen
>   * if using random gaps and range used
> 
> If the free gap in a chosen band is too small for the request, pick
> again among the other bands.
> 
> Again, I'm not an mm dev, so I might be totally smoking crack on this
> one...
> 
> thx,
> 
> Jason.

Address space fragmentation matters a lot, not only wasted space due to
memory that's explicitly reserved for random gaps. The randomization
guarantees under situations like memory exhaustion also matter.

I do think fine-grained randomization would be useful, but I think it's
unclear what a good approach would be, along with what the security
benefits would be. The malloc implementation is also very relevant.

OpenBSD has fine-grained mmap randomization with a fair bit of thought
put into how it works, but I don't think there has been much analysis of
it. The security properties really aren't clear.



Re: [kernel-hardening] Re: [PATCH] [RFC] Introduce mmap randomization

2016-07-29 Thread Daniel Micay
> > In the Project Zero Stagefright post
> > (http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html),
> > we see that the linear allocation of memory combined with the low
> > number of bits in the initial mmap offset resulted in a much more
> > predictable layout which aided the attacker. The initial random mmap
> > base range was increased by Daniel Cashman in
> > d07e22597d1d355829b7b18ac19afa912cf758d1, but we've done nothing to
> > address page relative attacks.
> > 
> > Inter-mmap randomization will decrease the predictability of later
> > mmap() allocations, which should help make data structures harder to
> > find in memory. In addition, this patch will also introduce unmapped
> > gaps between pages, preventing linear overruns from one mapping to
> > another. I am unable to quantify how much this will
> > improve security, but it should be > 0.
> 
> One person calls "unmapped gaps between pages" a feature, others call
> it
> a mess. ;-)

It's very hard to quantify the benefits of fine-grained randomization,
but there are other useful guarantees you could provide. It would be
quite helpful for the kernel to expose the option to force a PROT_NONE
mapping after every allocation. The gaps should actually be enforced.

So perhaps 3 things, simply exposed as off-by-default sysctl options (no
need for special treatment on 32-bit):

a) configurable minimum gap size in pages (for protection against linear
and small {under,over}flows)
b) configurable minimum gap size based on a ratio to allocation size
(for making the heap sparse to mitigate heap sprays, especially when
mixed with fine-grained randomization - for example 2x would add a 2M
gap after a 1M mapping)
c) configurable maximum random gap size (the random gap would be in
addition to the enforced minimums)

The randomization could just be considered an extra with minor benefits
rather than the whole feature. A full fine-grained randomization
implementation would need a higher-level form of randomization than gaps
in the kernel along with cooperation from userspace allocators. This
would make sense as one part of it though.



Re: [kernel-hardening] Re: [PATCH v2 02/11] mm: Hardened usercopy

2016-07-15 Thread Daniel Micay
> I'd like it to dump stack and be fatal to the process involved, but
> yeah, I guess BUG() would work. Creating an infrastructure for
> handling security-related Oopses can be done separately from this
> (and
> I'd like to see that added, since it's a nice bit of configurable
> reactivity to possible attacks).

In grsecurity, the oops handling also uses do_group_exit instead of
do_exit but both that change (or at least the option to do it) and the
exploit handling could be done separately from this without actually
needing special treatment for USERCOPY. It could be exposed as something
like panic_on_oops=2 as a balance between the existing options.



Re: [kernel-hardening] Re: [PATCH v2 02/11] mm: Hardened usercopy

2016-07-15 Thread Daniel Micay
> This could be a BUG, but I'd rather not panic the entire kernel.

It seems unlikely that it will panic without panic_on_oops and that's
an explicit opt-in to taking down the system on kernel logic errors
exactly like this. In grsecurity, it calls the kernel exploit handling
logic (panic if root, otherwise kill all process of that user and ban
them until reboot) but that same logic is also called for BUG via oops
handling so there's only really a distinction with panic_on_oops=1.

Does it make sense to be less fatal for a fatal assertion that's more
likely to be security-related? Maybe you're worried about having some
false positives for the whitelisting portion, but I don't think those
will lurk around very long with the way this works.



Re: [kernel-hardening] Re: [PATCH v2 2/3] Mark functions with the __nocapture attribute

2016-07-12 Thread Daniel Micay
On Tue, 2016-07-12 at 15:08 -0400, Kees Cook wrote:
> On Mon, Jul 4, 2016 at 7:42 PM, Emese Revfy 
> wrote:
> > 
> > The nocapture gcc attribute can be on functions only.
> > The attribute takes one or more unsigned integer constants as
> > parameters
> > that specify the function argument(s) of const char* type to
> > initify.
> > If the marked argument is a vararg then the plugin initifies
> > all vararg arguments.
> 
> Why is this called "nocapture"? Not captured by what? It seems like
> it
> means "initify this if possible". Am I misunderstanding its purpose?

It means they don't escape via that function, i.e. they aren't stored
anywhere to be used in any way after the call.



Re: [kernel-hardening] [PATCH 2/2] security,perf: Allow further restriction of perf_event_open

2016-06-17 Thread Daniel Micay
On Fri, 2016-06-17 at 17:00 -0300, Arnaldo Carvalho de Melo wrote:
> On Fri, Jun 17, 2016 at 12:16:47PM -0400, Daniel Micay wrote:
> > On Fri, 2016-06-17 at 08:54 +0200, Peter Zijlstra wrote:
> > > This Changelog is completely devoid of information. _WHY_ are you
> > > doing this?
>  
> > Attack surface reduction. It's possible to use seccomp-bpf for some
> > limited cases, but it's not flexible enough. There are lots of
> > information leaks and local privilege escalation vulnerabilities via
> > perf events, yet on most Linux installs it's not ever being used. So
> > turning it off by default on those installs is an easy win. The
> > holes
> > are reduced to root -> kernel (and that's not a meaningful boundary
> > in
> > mainline right now - although as is the case here, Debian has a
> > bunch of
> > securelevel patches for that).
> 
> Is ptrace also disabled on such systems, or any of the other more
> recent
> syscalls? The same arguments could probably be used to disable those:
> reduce attack surface, possibly the new ones have bugs as they are
> relatively new and it takes a long time for new syscalls to be more
> generally used, if we go on disabling them in such a way, they will
> probably never get used :-\

ptrace is allowed for third party apps within their SELinux domain, but
they all run as different users (so the kernel attack surface is there).
It's now disabled for privileged platform apps and most other domains. A
bit backwards, but removing it for third party apps would break
compatibility so it would have to be done at an API level bump. At
least, without deciding it is worth the cost like hidepid=2 in Android N
(which exposes no mechanism for exceptions in 3rd party apps, only the
base system).

If seccomp-bpf didn't imply such high system call overhead outside of
x86_64, Android would probably be walling off some new system calls. It
needs 2-phase lookup similar to x86 on ARM. Android kernels do avoid
enabling lots of kernel functionality from System V IPC to USERNS
though. New features wouldn't end up enabled if they were behind config
options without an explicit choice.

> Wouldn't the recent bump in perf_event_paranoid to 2 be enough? I.e.
> allow profiling of user tasks?

Most of the vulnerabilities are still exposed at 2. That prevents
leaking information about the kernel WITHOUT vulnerabilities though, and
would be an improvement when perf is disabled - but that doesn't really
matter much (Android kernel debugging and profiling would also still
work with 2).

> Or is there something more specific that we should disable/constrain
> to
> reduce such surface contact without using such a big hammer?

It's desired to have it globally disabled by default. Could use seccomp-
bpf to globally disable it, but that's a bigger hammer because it can't
be globally turned off without a reboot (could only profile newly
spawned processes after disabling it). Since it's only going to be
enabled by developers, trying to make it more fine-grained wouldn't
really pay off.



Re: [kernel-hardening] [PATCH 2/2] security,perf: Allow further restriction of perf_event_open

2016-06-17 Thread Daniel Micay
On Fri, 2016-06-17 at 08:54 +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 03:27:55PM -0700, Kees Cook wrote:
> > Hi guys,
> > 
> > This patch wasn't originally CCed to you (I'm fixing that now).
> > Would
> > you consider taking this into the perf tree? 
> 
> No.
> 
> > It's been in active use
> > in both Debian and Android for a while now.
> 
> Very nice of you all to finally inform us I suppose :/

It was in Debian a lot longer than Android, although the Android feature
came from a downstream variant where it was done much earlier:

https://android-review.googlesource.com/#/c/233736/

> > > > > 
> > > > > access to performance events by users without CAP_SYS_ADMIN.
> > > > > Add a Kconfig symbol CONFIG_SECURITY_PERF_EVENTS_RESTRICT that
> > > > > makes this value the default.
> > > > > 
> > > > > This is based on a similar feature in grsecurity
> > > > > (CONFIG_GRKERNSEC_PERF_HARDEN).  This version doesn't include
> > > > > making
> > > > > the variable read-only.  It also allows enabling further
> > > > > restriction
> > > > > at run-time regardless of whether the default is changed.
> 
> This Changelog is completely devoid of information. _WHY_ are you
> doing
> this?

Attack surface reduction. It's possible to use seccomp-bpf for some
limited cases, but it's not flexible enough. There are lots of
information leaks and local privilege escalation vulnerabilities via
perf events, yet on most Linux installs it's not ever being used. So
turning it off by default on those installs is an easy win. The holes
are reduced to root -> kernel (and that's not a meaningful boundary in
mainline right now - although as is the case here, Debian has a bunch of
securelevel patches for that).

signature.asc
Description: This is a digitally signed message part


Re: [kernel-hardening] Re: [PATCH 2/2] security,perf: Allow further restriction of perf_event_open

2016-06-17 Thread Daniel Micay
> As a debian user, is this a good place to complain? Because it does
> get
> it the way.

It would be relevant to whether or not it should be set to 3 by default
in the kernel without explicit configuration, but there's no proposal to
do that. Debian has to pick a trade-off between security and a tiny
roadblock for developers, and the right choice isn't the same for every
distribution.

In Android, there's userspace integration allowing it to be toggled by
the Android Debugging Bridge shell user so profiling tools are being
taught to automatically toggle it. Enabling ADB and then using it for
profiling is an implicit opt-in.

signature.asc
Description: This is a digitally signed message part


Re: [kernel-hardening] Re: [PATCH v4 0/4] SROP Mitigation: Sigreturn Cookies

2016-03-29 Thread Daniel Micay
> Then there's an unanswered question: is this patch acceptable given
> that it's an ABI break?  Security fixes are sometimes an exception to
> the "no ABI breaks" rule, but it's by no means an automatic exception.
> 
> --Andy

It seems this could be worked around in general. Processes can have a
bit tracking whether this is enabled, and CRIU can save/restore it. It
would just leave it off for resuming old saved processes.

Should CRIU really be covered by the kernel's ABI guarantee though? It
seems like this was meant to be extensible, so it's adding an extra ABI
guarantee that wasn't there before. It makes sense to freeze this ABI
for CRIU, but a version field should be added first in one final ABI
break if it's not too late.

signature.asc
Description: This is a digitally signed message part


Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled

2016-01-25 Thread Daniel Micay
> This feature is already implemented by two distros, and likely wanted
> by others. We cannot ignore that.

Data point: Arch Linux won't be enabling CONFIG_USERNS until there's a
way to disable unprivileged user namespaces; the distribution's kernel
maintainers are unwilling to carry long-term out-of-tree patches.

https://github.com/sandstorm-io/sandstorm/blob/d270755b1b55e5be6c96df2cce7c914f35f0d2a2/install.sh#L464-L474

signature.asc
Description: This is a digitally signed message part


Re: [kernel-hardening] Re: [RFC][PATCH 0/7] Sanitization of slabs based on grsecurity/PaX

2015-12-22 Thread Daniel Micay
> I am not sure what the point of this patchset is. We have a similar
> effect
> to sanitization already in the allocators through two mechanisms:

The rationale was covered earlier. Are you responding to that or did you
not see it?

> 1. Slab poisoning
> 2. Allocation with GFP_ZERO
> 
> I do not think we need a third one. You could accomplish your goals
> much
> easier without this code churn by either
> 
> 1. Improve the existing poisoning mechanism. Ensure that there are no
>    gaps. Security sensitive kernel slab caches can then be created
> with
>    the  POISONING flag set. Maybe add a Kconfig flag that enables
>    POISONING for each cache? What was the issue when you tried using
>    posining for sanitization?
> 
> 2. Add a mechanism that ensures that GFP_ZERO is set for each
> allocation.
>    That way every object you retrieve is zeroed and thus you have
> implied
>    sanitization. This also can be done in a rather simple way by
> changing
>    the  GFP_KERNEL etc constants to include __GFP_ZERO depending on a
>    Kconfig option. Or add some runtime setting of the gfp flags
> somewhere.
> 
> Generally I would favor option #2 if you must have sanitization
> because
> that is the only option to really give you a deterministic content of
> object on each allocation. Any half way measures would not work I
> think.
> 
> Note also that most allocations are already either allocations that
> zero
> the content or they are immediately initializing the content of the
> allocated object. After all the object is not really usable if the
> content is random. You may be able to avoid this whole endeavor by
> auditing the kernel for locations where the object is not initialized
> after allocation.
> 
> Once one recognizes the above it seems that sanitization is pretty
> useless. Its just another pass of writing zeroes before the allocator
> or
> uer of the allocated object sets up deterministic content of the
> object or
> -- in most cases -- zeroes it again.

Sanitization isn't just to prevent leaks from usage of uninitialized
data in later allocations. It's a mitigation for use-after-free (esp. if
it's combined with some form of delayed freeing) and it reduces the
lifetime of data.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH v2 00/13] MADV_FREE support

2015-12-05 Thread Daniel Micay
On 05/12/15 06:10 AM, Pavel Machek wrote:
> On Wed 2015-11-04 10:25:54, Minchan Kim wrote:
>> MADV_FREE is on linux-next so long time. The reason was two, I think.
>>
>> 1. MADV_FREE code on reclaim path was really mess.
> 
> Could you explain what MADV_FREE does?
> 
> Comment in code says 'free the page only when there's memory
> pressure'. So I mark my caches MADV_FREE, no memory pressure, I can
> keep using it? And if there's memory pressure, what happens? I get
> zeros? SIGSEGV?

You get zeroes. It's not designed for that use case right now; it's for
malloc implementations to use internally. There would need to be a new
feature like MADV_FREE_UNDO for it to be usable for caches, and it may
make more sense for that to be a separate feature entirely, i.e. a
different flag for marking too (not sure), since it wouldn't need to
worry about whether pages are touched.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

2015-11-13 Thread Daniel Micay
On 13/11/15 02:03 AM, Minchan Kim wrote:
> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
>>> easily when we need it. Maybe, that's what you want. Right?
>>
>> Yes, but why the access bit instead of the dirty bit for that? It could
>> always be made more strict (i.e. access bit) in the future, while going
>> the other way won't be possible. So I think the dirty bit is really the
>> more conservative choice since if it turns out to be a mistake it can be
>> fixed without a backwards incompatible change.
> 
> Absolutely true. That's why I insist on dirty bit until now although
> I didn't tell the reason. But I thought you wanted to change for using
> access bit for the future, too. It seems MADV_FREE start to bloat
> over and over again before knowing real problems and usecases.
> It's almost same situation with volatile ranges so I really want to
> stop at proper point which maintainer should decide, I hope.
> Without it, we will make the feature a lot heavy by just brain storming
> and then causes lots of churn in MM code without real bebenfit
> It would be very painful for us.

Well, I don't think you need more than a good API and an implementation
with no known bugs, kernel security concerns or backwards compatibility
issues. Configuration and API extensions are something for later (i.e.
land a baseline, then submit stuff like sysctl tunables). Just my take
on it though...



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

2015-11-12 Thread Daniel Micay
> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
> easily when we need it. Maybe, that's what you want. Right?

Yes, but why the access bit instead of the dirty bit for that? It could
always be made more strict (i.e. access bit) in the future, while going
the other way won't be possible. So I think the dirty bit is really the
more conservative choice since if it turns out to be a mistake it can be
fixed without a backwards incompatible change.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

2015-11-12 Thread Daniel Micay
On 13/11/15 01:15 AM, Minchan Kim wrote:
> On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote:
>>> I also think that the kernel should commit to either zeroing the page
>>> or leaving it unchanged in response to MADV_FREE (even if the decision
>>> of which to do is made later on).  I think that your patch series does
>>> this, but only after a few of the patches are applied (the swap entry
>>> freeing), and I think that it should be a real guaranteed part of the
>>> semantics and maybe have a test case.
>>
>> This would be a good thing to test because it would be required to add
>> MADV_FREE_UNDO down the road. It would mean the same semantics as the
>> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
>> value in that for the sake of migrating existing software too.
> 
> So, do you mean that we could implement MADV_FREE_UNDO with "read"
> opearation("just access bit marking) easily in future?
> 
> If so, it would be good reason to change MADV_FREE from dirty bit to
> access bit. Okay, I will look at that.

I just meant testing that the data is either zero or the old data if
it's read before it's written to. Not having it stay around once there
is a read. Not sure if that's what Andy meant.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

2015-11-11 Thread Daniel Micay
> I also think that the kernel should commit to either zeroing the page
> or leaving it unchanged in response to MADV_FREE (even if the decision
> of which to do is made later on).  I think that your patch series does
> this, but only after a few of the patches are applied (the swap entry
> freeing), and I think that it should be a real guaranteed part of the
> semantics and maybe have a test case.

This would be a good thing to test because it would be required to add
MADV_FREE_UNDO down the road. It would mean the same semantics as the
MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
value in that for the sake of migrating existing software too.

For one example, it could be dropped into Firefox:

https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferWindows.cpp

And in Chromium:

https://code.google.com/p/chromium/codesearch#chromium/src/base/memory/discardable_shared_memory.cc

Worth noting that both also support the API for pinning/unpinning that's
used by Android's ashmem too. Linux really needs a feature like this for
caches. Firefox simply doesn't drop the memory at all on Linux right now:

https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp

(Lock == pin, Unlock == unpin)

For reference:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-05 Thread Daniel Micay
> active:clean

active:dirty*, sigh.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-05 Thread Daniel Micay
> I posted a patch doing exactly iovec madvise. Doesn't support MADV_FREE yet
> though, but should be easy to do it.
> 
http://marc.info/?l=linux-mm&m=144615663522661&w=2

I think that would be a great way to deal with this. It keeps the nice
property of still being able to drop pages in allocations that have been
handed out but not yet touched. The allocator just needs to be designed
to do lots of purging in one go (i.e. something like an 8:1 active:clean
ratio triggers purging and it goes all the way to 16:1).



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 5/8] mm: move lazily freed pages to inactive list

2015-11-04 Thread Daniel Micay
> From a user perspective, it doesn't depend on swap. It's just slower
> without swap because it does what MADV_DONTNEED does. The current
> implementation can be dropped in where MADV_DONTNEED was previously used.

It just wouldn't replace existing layers of purging logic until that
edge case is fixed and it gains better THP integration.

It's already a very useful API with significant performance wins over
MADV_DONTNEED, so it will be useful. The only risk involved in landing
it is that a better feature might come along. Worst case scenario being
that the kernel ends up with a synonym for MADV_DONTNEED (but I think
there will still be a use case for this even if a pinning/unpinning API
existed, as this is more precise).



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 5/8] mm: move lazily freed pages to inactive list

2015-11-04 Thread Daniel Micay
>>> It probably makes sense to stop thinking about them as anonymous pages
>>> entirely at this point when it comes to aging. They're really not. The
>>> LRU lists are split to differentiate access patterns and cost of page
>>> stealing (and restoring). From that angle, MADV_FREE pages really have
>>> nothing in common with in-use anonymous pages, and so they shouldn't
>>> be on the same LRU list.
>>>
>>> That would also fix the very unfortunate and unexpected consequence of
>>> tying the lazy free optimization to the availability of swap space.
>>>
>>> I would prefer to see this addressed before the code goes upstream.
>>
>> I don't think it would be ideal for these potentially very hot pages to
>> be dropped before very cold pages were swapped out. It's the kind of
>> tuning that needs to be informed by lots of real world experience and
>> lots of testing. It wouldn't impact the API.
> 
> What about them is hot? They contain garbage, you have to write to
> them before you can use them. Granted, you might have to refetch
> cachelines if you don't do cacheline-aligned populating writes, but
> you can do a lot of them before it's more expensive than doing IO.

It's hot because applications churn through memory via the allocator.

Drop the pages and the application is now churning through page faults
and zeroing rather than simply reusing memory. It's not something that
may happen, it *will* happen. A page in the page cache *may* be reused,
but often won't be, especially when the I/O patterns don't line up well
with the way it works.

The whole point of the feature is not requiring the allocator to have
elaborate mechanisms for aging pages and throttling purging. That ends
up resulting in lots of memory held by userspace where the kernel can't
reclaim it under memory pressure. If it's dropped before page cache, it
isn't going to be able to replace any of that logic in allocators.

The page cache is speculative. Page caching by allocators is not really
speculative. Using MADV_FREE on the pages at all is speculative. The
memory is probably going to be reused fairly soon (unless the process
exits, and then it doesn't matter), but purging will end up reducing
memory usage for the portions that aren't.

It would be a different story for a full unpinning/pinning feature since
that would have other use cases (speculative caches), but this is really
only useful in allocators.

>> Whether MADV_FREE is useful as an API vs. something like a pair of
>> system calls for pinning and unpinning memory is what should be worried
>> about right now. The internal implementation just needs to be correct
>> and useful right now, not perfect. Simpler is probably better than it
>> being more well tuned for an initial implementation too.
> 
> Yes, it wouldn't impact the API, but the dependency on swap is very
> random from a user experience and severely limits the usefulness of
> this. It should probably be addressed before this gets released. As
> this involves getting the pages off the anon LRU, we need to figure
> out where they should go instead.

From a user perspective, it doesn't depend on swap. It's just slower
without swap because it does what MADV_DONTNEED does. The current
implementation can be dropped in where MADV_DONTNEED was previously used.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-04 Thread Daniel Micay
> With enough pages at once, though, munmap would be fine, too.

That implies lots of page faults and zeroing though. The zeroing alone
is a major performance issue.

There are separate issues with munmap since it ends up resulting in a
lot more virtual memory fragmentation. It would help if the kernel used
first-best-fit for mmap instead of the current naive algorithm (bonus:
O(log n) worst-case, not O(n)). Since allocators like jemalloc and
PartitionAlloc want 2M aligned spans, mixing them with other allocators
can also accelerate the VM fragmentation caused by the dumb mmap
algorithm (i.e. they make a 2M aligned mapping, some other mmap user
does 4k, now there's a nearly 2M gap when the next 2M region is made and
the kernel keeps going rather than reusing it). Anyway, that's a totally
separate issue from this. Just felt like complaining :).

> Maybe what's really needed is a MADV_FREE variant that takes an iovec.
> On an all-cores multithreaded mm, the TLB shootdown broadcast takes
> thousands of cycles on each core more or less regardless of how much
> of the TLB gets zapped.

That would work very well. The allocator ends up having a sequence of
dirty spans that it needs to purge in one go. As long as purging is
fairly spread out, the cost of a single TLB shootdown isn't that bad. It
is extremely bad if it needs to do it over and over to purge a bunch of
ranges, which can happen if the memory has ended up being very, very
fragmented despite the efforts to compact it (depends on what the
application ends up doing).





Re: [PATCH 5/8] mm: move lazily freed pages to inactive list

2015-11-04 Thread Daniel Micay
> Even if we're wrong about the aging of those MADV_FREE pages, their
> contents are invalidated; they can be discarded freely, and restoring
> them is a mere GFP_ZERO allocation. All other anonymous pages have to
> be written to disk, and potentially be read back.
> 
> [ Arguably, MADV_FREE pages should even be reclaimed before inactive
>   page cache. It's the same cost to discard both types of pages, but
>   restoring page cache involves IO. ]

Keep in mind that this is memory the kernel wouldn't be getting back at
all if the allocator wasn't going out of its way to purge it, and
allocators aren't going to go out of their way to purge it if it means
the kernel is going to steal the pages when there isn't actually memory
pressure.

An allocator would be using MADV_DONTNEED if it didn't expect that the
pages were going to be used again shortly. MADV_FREE indicates that it
has time to inform the kernel that they're unused but they could still
be very hot.

> It probably makes sense to stop thinking about them as anonymous pages
> entirely at this point when it comes to aging. They're really not. The
> LRU lists are split to differentiate access patterns and cost of page
> stealing (and restoring). From that angle, MADV_FREE pages really have
> nothing in common with in-use anonymous pages, and so they shouldn't
> be on the same LRU list.
> 
> That would also fix the very unfortunate and unexpected consequence of
> tying the lazy free optimization to the availability of swap space.
> 
> I would prefer to see this addressed before the code goes upstream.

I don't think it would be ideal for these potentially very hot pages to
be dropped before very cold pages were swapped out. It's the kind of
tuning that needs to be informed by lots of real world experience and
lots of testing. It wouldn't impact the API.

Whether MADV_FREE is useful as an API vs. something like a pair of
system calls for pinning and unpinning memory is what should be worried
about right now. The internal implementation just needs to be correct
and useful right now, not perfect. Simpler is probably better than it
being more well tuned for an initial implementation too.





Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-04 Thread Daniel Micay
> That's comparable to Android's pinning / unpinning API for ashmem and I
> think it makes sense if it's faster. It's different than the MADV_FREE
> API though, because the new allocations that are handed out won't have
> the usual lazy commit which MADV_FREE provides. Pages in an allocation
> that's handed out can still be dropped until they are actually written
> to. It's considered active by jemalloc either way, but only a subset of
> the active pages are actually committed. There's probably a use case for
> both of these systems.

Also, consider that MADV_FREE would allow jemalloc to be extremely
aggressive with purging when it actually has to do it. It can start with
the largest span of memory and it can mark more than strictly necessary
to drop below the ratio as there's no cost to using the memory again
(not even a system call).

Since the main cost is the system call itself, there's going to be
pressure to mark the largest possible spans in one go. It means that
investing in memory compaction will improve performance. I think that's
the right direction for the kernel to be guiding userspace. It will play
better with THP than the allocator trying to be very precise with
purging based on aging.





Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-04 Thread Daniel Micay
> Compared to MADV_DONTNEED, MADV_FREE's lazy memory free is a huge win to
> reduce page faults. But there is one issue remaining, the TLB flush. Both
> MADV_DONTNEED and MADV_FREE do TLB flushes. TLB flush overhead is quite
> big in contemporary multi-thread applications. In our production
> workload, we sometimes observed 80% of CPU spent on TLB flushes triggered
> by jemalloc madvise(MADV_DONTNEED). We haven't tested MADV_FREE yet, but
> the result should be similar. It's hard to avoid the TLB flush issue with
> MADV_FREE, because it helps avoid data corruption.
> 
> The new proposal tries to fix the TLB issue. We introduce two madvise
> verbs:
> 
> MARK_FREE. Userspace notifies the kernel that the memory range can be
> discarded. The kernel just records the range at this stage. Should memory
> pressure happen, page reclaim can free the memory directly regardless of
> the pte state.
> 
> MARK_NOFREE. Userspace notifies the kernel that the memory range will be
> reused soon. The kernel deletes the record and prevents page reclaim from
> discarding the memory. If the memory isn't reclaimed, userspace will
> access the old memory, otherwise normal page fault handling occurs.
> 
> The point is to let userspace notify the kernel whether memory can be
> discarded, instead of depending on the pte dirty bit used by MADV_FREE.
> With these, no TLB flush is required until page reclaim actually frees
> the memory (page reclaim needs to do the TLB flush for MADV_FREE too). It
> still preserves the lazy memory free merit of MADV_FREE.
> 
> Compared to MADV_FREE, reusing memory with the new proposal isn't
> transparent, e.g. it must call MARK_NOFREE. But it's easy to utilize the
> new API in jemalloc.
> 
> We don't have code to back this up yet, sorry. We'd like to discuss it if
> it makes sense.

That's comparable to Android's pinning / unpinning API for ashmem and I
think it makes sense if it's faster. It's different than the MADV_FREE
API though, because the new allocations that are handed out won't have
the usual lazy commit which MADV_FREE provides. Pages in an allocation
that's handed out can still be dropped until they are actually written
to. It's considered active by jemalloc either way, but only a subset of
the active pages are actually committed. There's probably a use case for
both of these systems.






Re: [PATCH 5/8] mm: move lazily freed pages to inactive list

2015-11-04 Thread Daniel Micay
> From a user perspective, it doesn't depend on swap. It's just slower
> without swap because it does what MADV_DONTNEED does. The current
> implementation can be dropped in where MADV_DONTNEED was previously used.

It just wouldn't replace existing layers of purging logic until that
edge case is fixed and it gains better THP integration.

It's already a very useful API with significant performance wins over
MADV_DONTNEED. The only risk involved in landing it is that a better
feature might come along, with the worst case being that the kernel ends
up with a synonym for MADV_DONTNEED (but I think there would still be a
use case for this even if a pinning/unpinning API existed, as this is
more precise).





Re: [PATCH 5/8] mm: move lazily freed pages to inactive list

2015-11-04 Thread Daniel Micay
>>> It probably makes sense to stop thinking about them as anonymous pages
>>> entirely at this point when it comes to aging. They're really not. The
>>> LRU lists are split to differentiate access patterns and cost of page
>>> stealing (and restoring). From that angle, MADV_FREE pages really have
>>> nothing in common with in-use anonymous pages, and so they shouldn't
>>> be on the same LRU list.
>>>
>>> That would also fix the very unfortunate and unexpected consequence of
>>> tying the lazy free optimization to the availability of swap space.
>>>
>>> I would prefer to see this addressed before the code goes upstream.
>>
>> I don't think it would be ideal for these potentially very hot pages to
>> be dropped before very cold pages were swapped out. It's the kind of
>> tuning that needs to be informed by lots of real world experience and
>> lots of testing. It wouldn't impact the API.
> 
> What about them is hot? They contain garbage, you have to write to
> them before you can use them. Granted, you might have to refetch
> cachelines if you don't do cacheline-aligned populating writes, but
> you can do a lot of them before it's more expensive than doing IO.

It's hot because applications churn through memory via the allocator.

Drop the pages and the application is now churning through page faults
and zeroing rather than simply reusing memory. It's not something that
may happen, it *will* happen. A page in the page cache *may* be reused,
but often won't be, especially when the I/O patterns don't line up well
with the way it works.

The whole point of the feature is not requiring the allocator to have
elaborate mechanisms for aging pages and throttling purging. That ends
up resulting in lots of memory held by userspace where the kernel can't
reclaim it under memory pressure. If it's dropped before page cache, it
isn't going to be able to replace any of that logic in allocators.

The page cache is speculative. Page caching by allocators is not really
speculative. Using MADV_FREE on the pages at all is speculative. The
memory is probably going to be reused fairly soon (unless the process
exits, and then it doesn't matter), but purging will end up reducing
memory usage for the portions that aren't.

It would be a different story for a full unpinning/pinning feature since
that would have other use cases (speculative caches), but this is really
only useful in allocators.

>> Whether MADV_FREE is useful as an API vs. something like a pair of
>> system calls for pinning and unpinning memory is what should be worried
>> about right now. The internal implementation just needs to be correct
>> and useful right now, not perfect. Simpler is probably better than it
>> being more well tuned for an initial implementation too.
> 
> Yes, it wouldn't impact the API, but the dependency on swap is very
> random from a user experience and severely limits the usefulness of
> this. It should probably be addressed before this gets released. As
> this involves getting the pages off the anon LRU, we need to figure
> out where they should go instead.

From a user perspective, it doesn't depend on swap. It's just slower
without swap because it does what MADV_DONTNEED does. The current
implementation can be dropped in where MADV_DONTNEED was previously used.





Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-03 Thread Daniel Micay
On 04/11/15 12:53 AM, Daniel Micay wrote:
>> In the common case it will be passed many pages by the allocator. There
>> will still be a layer of purging logic on top of MADV_FREE but it can be
>> much thinner than the current workarounds for MADV_DONTNEED. So the
>> allocator would still be coalescing dirty ranges and only purging when
>> the ratio of dirty:clean pages rises above some threshold. It would be
>> able to weight the largest ranges for purging first rather than logic
>> based on stuff like aging as is used for MADV_DONTNEED.
> 
> I would expect that jemalloc would just start putting the dirty ranges
> into the usual pair of red-black trees (with coalescing) and then doing
> purging starting from the largest spans to get back down below whatever
> dirty:clean ratio it's trying to keep. Right now, it has lots of
> other logic to deal with this since each MADV_DONTNEED call results in
> lots of zeroing and then page faults.

Er, I mean dirty:active (i.e. the ratio of unpurged, dirty pages to ones
that are handed out as allocations, which is kept at something like
1:8). A high constant cost in the madvise call but quick handling of
each page means that allocators will purge more than they strictly need
to in one go. For example, it might need to purge 2M to meet the ratio
but have a contiguous span of 32M of dirty pages. If the cost per page
is low enough, it could just do the entire range.





Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-03 Thread Daniel Micay
> In the common case it will be passed many pages by the allocator. There
> will still be a layer of purging logic on top of MADV_FREE but it can be
> much thinner than the current workarounds for MADV_DONTNEED. So the
> allocator would still be coalescing dirty ranges and only purging when
> the ratio of dirty:clean pages rises above some threshold. It would be
> able to weight the largest ranges for purging first rather than logic
> based on stuff like aging as is used for MADV_DONTNEED.

I would expect that jemalloc would just start putting the dirty ranges
into the usual pair of red-black trees (with coalescing) and then doing
purging starting from the largest spans to get back down below whatever
dirty:clean ratio it's trying to keep. Right now, it has lots of
other logic to deal with this since each MADV_DONTNEED call results in
lots of zeroing and then page faults.





Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

2015-11-03 Thread Daniel Micay
> Does this set the write protect bit?
> 
> What happens on architectures without hardware dirty tracking?

It's supposed to avoid needing page faults when the data is accessed
again, but it can just be implemented via page faults on architectures
without a way to check for access or writes. MADV_DONTNEED is also a
valid implementation of MADV_FREE if it comes to that (which is what it
does on swapless systems for now).

> Using the dirty bit for these semantics scares me.  This API creates a
> page that can have visible nonzero contents and then can
> asynchronously and magically zero itself thereafter.  That makes me
> nervous.  Could we use the accessed bit instead?  Then the observable
> semantics would be equivalent to having MADV_FREE either zero the page
> or do nothing, except that it doesn't make up its mind until the next
> read.

FWIW, those are already basically the semantics provided by GCC and LLVM
for data the compiler considers uninitialized (they could be more
aggressive since C just says it's undefined, but in practice they allow
reading it while potentially producing inconsistent results, even if
nothing else touches it).

http://llvm.org/docs/LangRef.html#undefined-values

It doesn't seem like there would be an advantage to checking whether the
data was written to vs. whether it was accessed, assuming checking for
both is comparable in performance. I don't know enough about that.

>> +   ptent = pte_mkold(ptent);
>> +   ptent = pte_mkclean(ptent);
>> +   set_pte_at(mm, addr, pte, ptent);
>> +   tlb_remove_tlb_entry(tlb, pte, addr);
> 
> It looks like you are flushing the TLB.  In a multithreaded program,
> that's rather expensive.  Potentially silly question: would it be
> better to just zero the page immediately in a multithreaded program
> and then, when swapping out, check the page is zeroed and, if so, skip
> swapping it out?  That could be done without forcing an IPI.

In the common case it will be passed many pages by the allocator. There
will still be a layer of purging logic on top of MADV_FREE but it can be
much thinner than the current workarounds for MADV_DONTNEED. So the
allocator would still be coalescing dirty ranges and only purging when
the ratio of dirty:clean pages rises above some threshold. It would be
able to weight the largest ranges for purging first rather than logic
based on stuff like aging as is used for MADV_DONTNEED.





Re: [PATCH 0/8] MADV_FREE support

2015-11-01 Thread Daniel Micay
On 01/11/15 12:51 AM, David Rientjes wrote:
> On Fri, 30 Oct 2015, Minchan Kim wrote:
> 
>> MADV_FREE is on linux-next so long time. The reason was two, I think.
>>
>> 1. MADV_FREE code on reclaim path was really mess.
>>
>> 2. Andrew really want to see voice of userland people who want to use
>>the syscall.
>>
>> A few month ago, Daniel Micay(jemalloc active contributor) requested me
>> to make progress upstreaming but I was busy at that time so it took
>> so long time for me to revist the code and finally, I clean it up the
>> mess recently so it solves the #2 issue.
>>
>> As well, Daniel and Jason(jemalloc maintainer) requested it to Andrew
>> again recently and they said it would be great to have even though
>> it has swap dependency now so Andrew decided he will do that for v4.4.
>>
> 
> First, thanks very much for refreshing the patchset and reposting after a 
> series of changes have been periodically added to -mm, it makes it much 
> easier.
> 
> For tcmalloc, we can do some things in the allocator itself to increase 
> the amount of memory backed by thp.  Specifically, we can prefer to 
> release Spans to pageblocks that are already not backed by thp so there is 
> no additional split on each scavenge.  This is somewhat easy if all memory 
> is organized into hugepage-aligned pageblocks in the allocator itself.  
> Second, we can prefer to release Spans of longer length on each scavenge 
> so we can delay scavenging for as long as possible in a hope we can find 
> more pages to coalesce.  Third, we can discount refaulted released memory 
> from the scavenging period.
> 
> That significantly improves the amount of memory backed by thp for 
> tcmalloc.  The problem, however, is that tcmalloc uses MADV_DONTNEED to 
> release memory to the system and MADV_FREE wouldn't help at all in a 
> swapless environment.
> 
> To combat that, I've proposed a new MADV bit that simply caches the 
> ranges freed by the allocator per vma and places them on both a per-vma 
> and per-memcg list.  During reclaim, this list is iterated and ptes are 
> freed after thp split period to the normal directed reclaim.  Without 
> memory pressure, this backs 100% of the heap with thp with a relatively 
> lightweight kernel change (the majority is vma manipulation on split) and 
> a couple line change to tcmalloc.  When pulling memory from the returned 
> freelists, the memory that we have MADV_DONTNEED'd, we need to use another 
> MADV bit to remove it from this cache, so there is a second madvise(2) 
> syscall involved but the freeing call is much less expensive since there 
> is no pagetable walk without memory pressure or synchronous thp split.
> 
> I've been looking at MADV_FREE to see if there is common ground that could 
> be shared, but perhaps it's just easier to ask what your proposed strategy 
> is so that tcmalloc users, especially those in swapless environments, 
> would benefit from any of your work?

The current implementation requires swap because the kernel already has
robust infrastructure for swapping out anonymous memory when there's
memory pressure. The MADV_FREE implementation just has to hook in there
and cause pages to be dropped instead of swapped out. There's no reason
it couldn't be extended to work in swapless environments, but it will
take additional design and implementation work. As a stop-gap, I think
zram and friends will work fine as a form of swap for this.
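A typical zram-as-swap setup looks something like the following (the device name, size, and priority are illustrative, and it requires root):

```shell
# Create a compressed RAM-backed swap device so MADV_FREE has swap to
# hook into on otherwise swapless systems. Sizes are illustrative.
modprobe zram num_devices=1
echo 4G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # high priority so it's preferred over disk swap
```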

It can definitely be improved to cooperate well with THP too. I've been
following the progress, and most of the problems so far have been with
THP, which is a very active area of development. It seems best to deal
with that after a simple, working implementation lands.

The best aspect of MADV_FREE is that it completely avoids page faults
when there's no memory pressure. Making use of the freed memory only
triggers page faults if the pages had to be dropped because the system
ran out of memory. It also avoids needing to zero the pages. The memory
can still be reclaimed at any time under renewed memory pressure, even
after it has been handed out as an allocation, right up until it's
actually touched.

The call to madvise still has significant overhead, but it's much
cheaper than MADV_DONTNEED. Allocators will be able to lean on the
kernel to make good decisions rather than implementing lazy freeing
entirely on their own. It should improve performance *and* behavior
under memory pressure since allocators can be more aggressive with it
than MADV_DONTNEED.

A nice future improvement would be landing an MADV_FREE_UNDO feature to
allow an attempt to pin the pages in memory again. That would make this
work very well for implementing caches that are dropped under memory
pressure. Windows has this via MEM_RESET (essentially MADV_FREE) and
MEM_RESET_UNDO. Android has it for ashmem too (pinning/unpinning). I
think browser vendors would be very interested in it.



signature.asc
Description: OpenPGP digital signature


[PATCH v4] mm: add mremap flag for preserving the old mapping

2014-10-02 Thread Daniel Micay
This introduces the MREMAP_RETAIN flag for preserving the source mapping
when MREMAP_MAYMOVE moves the pages to a new destination. Accesses to
the source mapping will fault and map in fresh zeroed pages.

It is currently limited to writable MAP_PRIVATE|MAP_ANONYMOUS mappings
and will return EFAULT when used on anything else. This covers the
intended use case in general purpose allocators.

For consistency, the old_len >= new_len case could decommit the pages
instead of unmapping. However, userspace can accomplish the same thing
via madvise and the flag is coherent without the additional complexity.

Motivation:

TCMalloc and jemalloc avoid releasing virtual memory in order to reduce
virtual memory fragmentation. A call to munmap or mremap would leave a
hole in the address space. Instead, unused pages are lazily returned to
the operating system via MADV_DONTNEED.

Since mremap cannot be used to elide copies, TCMalloc and jemalloc end
up being significantly slower for patterns like repeated vector / hash
table reallocations. Consider the typical vector building pattern:

#include <string.h>
#include <stdlib.h>

int main(void) {
    for (size_t i = 0; i < 100; i++) {
        void *ptr = NULL;
        size_t old_size = 0;
        for (size_t size = 4; size < (1 << 30); size *= 2) {
            ptr = realloc(ptr, size);
            if (!ptr) return 1;
            memset((char *)ptr + old_size, 0xff, size - old_size);
            old_size = size;
        }
        free(ptr);
    }
}

Transparent huge pages disabled:

glibc (baseline, uses mremap already): 15.051s
jemalloc without MREMAP_RETAIN: 38.540s
jemalloc with MREMAP_RETAIN: 15.086s

Transparent huge pages enabled:

glibc (baseline, uses mremap already): 8.464s
jemalloc without MREMAP_RETAIN: 18.230s
jemalloc with MREMAP_RETAIN: 6.696s

In practice, in-place growth never occurs for huge allocations because
the heap grows downwards for all 3 allocators. TCMalloc and jemalloc pay
for enormous copies while glibc only spends time writing new elements to
the vector. Even if the heap grew in the other direction, real-world
applications would end up blocking in-place growth with new allocations.

The allocators could attempt to map the source location again after an
mremap call, but there is no guarantee of success in a multi-threaded
program, and fragmenting memory over time is considered unacceptable.

Signed-off-by: Daniel Micay 
---
 include/linux/huge_mm.h   |  2 +-
 include/linux/mm.h|  6 ++
 include/uapi/linux/mman.h |  1 +
 mm/huge_memory.c  |  4 ++--
 mm/memory.c   |  2 +-
 mm/mmap.c | 12 +++
 mm/mremap.c   | 52 +++
 7 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb..3c13b20 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -143,7 +143,7 @@ static inline void vma_adjust_trans_huge(struct 
vm_area_struct *vma,
 unsigned long end,
 long adjust_next)
 {
-   if (!vma->anon_vma || vma->vm_ops)
+   if (!vma->anon_vma || (vma->vm_ops && !vma->vm_ops->allow_huge_pages))
return;
__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..1e61036 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -273,6 +273,12 @@ struct vm_operations_struct {
/* called by sys_remap_file_pages() to populate non-linear mapping */
int (*remap_pages)(struct vm_area_struct *vma, unsigned long addr,
   unsigned long size, pgoff_t pgoff);
+
+   /* Check if the mapping may be duplicated by MREMAP_RETAIN */
+   bool (*may_duplicate)(struct vm_area_struct *vma);
+
+   /* if there is no vm_ops table, this is considered true */
+   bool allow_huge_pages;
 };
 
 struct mmu_gather;
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index ade4acd..4e9a546 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -5,6 +5,7 @@
 
 #define MREMAP_MAYMOVE 1
 #define MREMAP_FIXED   2
+#define MREMAP_RETAIN  4
 
 #define OVERCOMMIT_GUESS   0
 #define OVERCOMMIT_ALWAYS  1
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..350b478 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2077,7 +2077,7 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
 * page fault if needed.
 */
return 0;
-   if (vma->vm_ops)
+   if ((vma->vm_ops && !vma->vm_ops->allow_huge_pages))
/* khugepaged not yet working on file or special mappings */
return 0;
	VM_BUG_ON(vma->vm_flags & VM_NO_THP);

Re: [PATCH v3] mm: add mremap flag for preserving the old mapping

2014-10-02 Thread Daniel Micay
On 30/09/14 01:49 PM, Andy Lutomirski wrote:
> 
> I think it might pay to add an explicit vm_op to authorize
> duplication, especially for non-cow mappings.  IOW this kind of
> extension seems quite magical for anything that doesn't have the
> normal COW semantics, including for plain old read-only mappings.

Adding a vm_ops table to MAP_PRIVATE|MAP_ANONYMOUS mappings has a
significant performance impact. I haven't narrowed it down yet, but
there's at least one code path with a `!vma->vm_ops` check on the fast
path. One is the transparent huge page fault path, so the performance
impact makes sense. I'll use a simpler implementation for now since the
requirements are very narrow / simple.





Re: [PATCH v3] mm: add mremap flag for preserving the old mapping

2014-09-30 Thread Daniel Micay
On 30/09/14 01:49 PM, Andy Lutomirski wrote:
> 
> I think it might pay to add an explicit vm_op to authorize
> duplication, especially for non-cow mappings.  IOW this kind of
> extension seems quite magical for anything that doesn't have the
> normal COW semantics, including for plain old read-only mappings.

This sounds like the best way forwards.

Setting up the op for private, anonymous mappings and having it check
vm_flags & VM_WRITE dynamically seems like it would be enough for the
intended use case in general purpose allocators. It can be extended to
other mapping types later if there's a compelling use case.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3] mm: add mremap flag for preserving the old mapping

2014-09-30 Thread Daniel Micay
On 30/09/14 01:53 AM, Andy Lutomirski wrote:
> On Mon, Sep 29, 2014 at 9:55 PM, Daniel Micay  wrote:
>> This introduces the MREMAP_RETAIN flag for preserving the source mapping
>> when MREMAP_MAYMOVE moves the pages to a new destination. Accesses to
>> the source location will fault and cause fresh pages to be mapped in.
>>
>> For consistency, the old_len >= new_len case could decommit the pages
>> instead of unmapping. However, userspace can accomplish the same thing
>> via madvise and a coherent definition of the flag is possible without
>> the extra complexity.
> 
> IMO this needs very clear documentation of exactly what it does.

Agreed, and thanks for the review. I'll post a slightly modified version
of the patch soon (mostly more commit message changes).

> Does it preserve the contents of the source pages?  (If so, why?
> Aren't you wasting a bunch of time on page faults and possibly
> unnecessary COWs?)

The source will act as if it was just created. For an anonymous memory
mapping, it will fault on any accesses and bring in new zeroed pages.

In jemalloc, it replaces an enormous memcpy(dst, src, size) followed by
madvise(src, size, MADV_DONTNEED) with a single mremap call. Using
mremap also ends up eliding page faults from writes at the destination.

TCMalloc has nearly the same page allocation design, although it tries
to throttle the purging so it won't always gain as much.

> Does it work on file mappings?  Can it extend file mappings while it moves 
> them?

It works on file mappings. If a move occurs, there will be the usual
extended destination mapping but with the source mapping left intact.

It wouldn't be useful with existing allocators, but in theory a general
purpose allocator could expose an MMIO API in order to reuse the same
address space via MAP_FIXED/MREMAP_FIXED to reduce VM fragmentation.

> If you MREMAP_RETAIN a partially COWed private mapping, what happens?

The original mapping is zeroed in the following test, as it would be
without fork:

#define _GNU_SOURCE

#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
  size_t size = 1024 * 1024;
  char *orig = mmap(NULL, size, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  memset(orig, 5, size);
  int pid = fork();
  if (pid == -1)
    return 1;
  if (pid == 0) {
    memset(orig, 5, 1024);
    char *new = mremap(orig, size, size * 128, MREMAP_MAYMOVE|4); /* 4 == MREMAP_RETAIN */
    if (new == orig) return 1;
    for (size_t i = 0; i < size; i++)
      if (new[i] != 5)
        return 1;
    for (size_t i = 0; i < size; i++)
      if (orig[i] != 0)
        return 1;
    return 0;
  }
  int status;
  if (wait(&status) == -1) return 1;
  if (WIFEXITED(status))
    return WEXITSTATUS(status);
  return 1;
}

Hopefully this is the case you're referring to. :)

> Does it work on special mappings?  If so, please prevent it from doing
> so.  mremapping x86's vdso is a thing, and duplicating x86's vdso
> should not become a thing, because x86_32 in particular will become
> extremely confused.

I'll add a check for arch_vma_name(vma) == NULL.

There's an existing check for VM_DONTEXPAND | VM_PFNMAP when expanding
allocations (the only case this flag impacts). Are there other kinds of
special mappings that you're referring to?

