Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor
On Wed, Sep 23, 2020 at 08:00:07PM +0200, Solar Designer wrote: > A couple of other things Brad kindly pointed out: > > SELinux already has similar protections (execmem, execmod): > > http://lkml.iu.edu/hypermail/linux/kernel/0508.2/0194.html > https://danwalsh.livejournal.com/6117.html Actually, that's right in Madhavan's "Introduction": "LSMs such as SELinux implement W^X" and "The W^X implementation today is not complete." I'm sorry I jumped into this thread out of context. Alexander
Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor
On Wed, Sep 23, 2020 at 04:39:31PM +0200, Florian Weimer wrote: > * Solar Designer: > > > While I share my opinion here, I don't mean that to block Madhavan's > > work. I'd rather defer to people more knowledgeable in current userland > > and ABI issues/limitations and plans on dealing with those, especially > > to Florian Weimer. I haven't seen Florian say anything specific for or > > against Madhavan's proposal, and I'd like to. (Have I missed that?) [...] > I think it's unnecessary for the libffi use case. [...] > I don't know if kernel support could > make sense in this context, but it would be a completely different > patch. Thanks. Are there currently relevant use cases where the proposed trampfd would be useful and likely actually made use of by userland - e.g., specific userland project developers saying they'd use it, or Madhavan intending to develop and contribute userland patches? Alexander
Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor
On Wed, Sep 23, 2020 at 05:18:35PM +0200, Pavel Machek wrote: > > It sure does make sense to combine ret2libc/ROP to mprotect() with one's > > own injected shellcode. Compared to doing everything from ROP, this is > > easier and more reliable across versions/builds if the desired > > payload > > Ok, so this starts to be a bit confusing. > > I thought W^X is to protect from attackers that have overflowed buffer > somewhere, but can not to do arbitrary syscalls, yet. > > You are saying that there's important class of attackers that can do > some syscalls but not arbitrary ones. They might be able to do many, most, or all arbitrary syscalls via ret2libc or such. The crucial detail is that each time they do that, they risk incompatibility with the given target system (version, build, maybe ASLR if gadgets from multiple libraries are involved). By using mprotect(), they only take this risk once (need to get the address of an mprotect() gadget and of what to change protections on right), and then they can invoke multiple syscalls from their shellcode more reliably. So for doing a lot of work, mprotect() combined with injected code can be easier and more reliable. It is also an extra option an attacker can use, in addition to doing everything via borrowed code. More flexibility for the attacker means the attacker may choose whichever approach works better in a given case (or try several). I am embarrassed for not thinking/recalling this when I first posted earlier today. It's actually obvious. I'm just getting old and rusty. > I'd like to see definition of that attacker (and perhaps description > of the system the protection is expected to be useful on -- if it is > not close to common Linux distros). There's nothing unusual about that attacker and the system. A couple of other things Brad kindly pointed out: SELinux already has similar protections (execmem, execmod): http://lkml.iu.edu/hypermail/linux/kernel/0508.2/0194.html https://danwalsh.livejournal.com/6117.html PaX MPROTECT is implemented in a way or at a layer that covers ptrace() abuse that I mentioned. (At least that's how I understood Brad.) Alexander P.S. Meanwhile, Twitter locked my account "for security purposes". Fun. I'll just let it be for now.
Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor
On Wed, Sep 23, 2020 at 11:14:56AM +0200, Solar Designer wrote: > On Wed, Sep 23, 2020 at 10:14:26AM +0200, Pavel Machek wrote: > > > Introduction > > > > > > > > > Dynamic code is used in many different user applications. Dynamic code is > > > often generated at runtime. Dynamic code can also just be a pre-defined > > > sequence of machine instructions in a data buffer. Examples of dynamic > > > code are trampolines, JIT code, DBT code, etc. > > > > > > Dynamic code is placed either in a data page or in a stack page. In order > > > to execute dynamic code, the page it resides in needs to be mapped with > > > execute permissions. Writable pages with execute permissions provide an > > > attack surface for hackers. Attackers can use this to inject malicious > > > code, modify existing code or do other harm. > > > > > > To mitigate this, LSMs such as SELinux implement W^X. That is, they may > > > not > > > allow pages to have both write and execute permissions. This prevents > > > dynamic code from executing and blocks applications that use it. To allow > > > genuine applications to run, exceptions have to be made for them (by > > > setting > > > execmem, etc) which opens the door to security issues. > > > > > > The W^X implementation today is not complete. There exist many user level > > > tricks that can be used to load and execute dynamic code. E.g., > > > > > > - Load the code into a file and map the file with R-X. > > > > > > - Load the code in an RW- page. Change the permissions to R--. Then, > > > change the permissions to R-X. > > > > > > - Load the code in an RW- page. Remap the page with R-X to get a separate > > > mapping to the same underlying physical page. > > > > > > IMO, these are all security holes as an attacker can exploit them to > > > inject > > > his own code. > > > > IMO, you are smoking crack^H^H very seriously misunderstanding what > > W^X is supposed to protect from. > > > > W^X is not supposed to protect you from attackers that can already do > > system calls. So loading code into a file then mapping the file as R-X > > is in no way security hole in W^X. > > > > If you want to provide protection from attackers that _can_ do system > > calls, fine, but please don't talk about W^X and please specify what > > types of attacks you want to prevent and why that's good thing. > > On one hand, Pavel is absolutely right. It is ridiculous to say that > "these are all security holes as an attacker can exploit them to inject > his own code." I stand corrected, due to Brad's tweet and follow-ups here: https://twitter.com/spendergrsec/status/1308728284390318082 It sure does make sense to combine ret2libc/ROP to mprotect() with one's own injected shellcode. Compared to doing everything from ROP, this is easier and more reliable across versions/builds if the desired payload is non-trivial. My own example: invoking a shell in a local attack on Linux is trivial enough to do via ret2libc only, but a connect-back shell in a remote attack might be easier and more reliably done via mprotect() + shellcode. Per the follow-ups, this was an established technique on Windows and iOS until further hardening prevented it. So it does make sense for Linux to do the same (as an option because of it breaking existing stuff), and not so much as policy enforcement for the sake of it and ease of reasoning, but mostly to force real-world exploits to be more complex and less reliable. > On the other hand, "what W^X is supposed to protect from" depends on how > the term W^X is defined (historically, by PaX and OpenBSD). It may be > that W^X is partially not a feature to defeat attacks per se, but also a > policy enforcement feature preventing use of dangerous techniques (JIT). > > Such policy might or might not make sense. It might make sense for ease > of reasoning, e.g. "I've flipped this setting, and now I'm certain the > system doesn't have JIT within a process (can still have it through > dynamically creating and invoking an entire new program), so there are > no opportunities for an attacker to inject code nor generate previously > non-existing ROP gadgets into an executable mapping within a process." > > I do find it questionable whether such policy and such reasoning make > sense beyond academia. I was wrong in the above, focusing on the wrong thing. > Then, there might be even more ways in which W^X is not perfect enough > to enable such reasoning. What about using ptrace(2) to inject code?
Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor
On Wed, Sep 23, 2020 at 10:14:26AM +0200, Pavel Machek wrote: > > Introduction > > > > > > Dynamic code is used in many different user applications. Dynamic code is > > often generated at runtime. Dynamic code can also just be a pre-defined > > sequence of machine instructions in a data buffer. Examples of dynamic > > code are trampolines, JIT code, DBT code, etc. > > > > Dynamic code is placed either in a data page or in a stack page. In order > > to execute dynamic code, the page it resides in needs to be mapped with > > execute permissions. Writable pages with execute permissions provide an > > attack surface for hackers. Attackers can use this to inject malicious > > code, modify existing code or do other harm. > > > > To mitigate this, LSMs such as SELinux implement W^X. That is, they may not > > allow pages to have both write and execute permissions. This prevents > > dynamic code from executing and blocks applications that use it. To allow > > genuine applications to run, exceptions have to be made for them (by setting > > execmem, etc) which opens the door to security issues. > > > > The W^X implementation today is not complete. There exist many user level > > tricks that can be used to load and execute dynamic code. E.g., > > > > - Load the code into a file and map the file with R-X. > > > > - Load the code in an RW- page. Change the permissions to R--. Then, > > change the permissions to R-X. > > > > - Load the code in an RW- page. Remap the page with R-X to get a separate > > mapping to the same underlying physical page. > > > > IMO, these are all security holes as an attacker can exploit them to inject > > his own code. > > IMO, you are smoking crack^H^H very seriously misunderstanding what > W^X is supposed to protect from. > > W^X is not supposed to protect you from attackers that can already do > system calls. So loading code into a file then mapping the file as R-X > is in no way security hole in W^X. > > If you want to provide protection from attackers that _can_ do system > calls, fine, but please don't talk about W^X and please specify what > types of attacks you want to prevent and why that's good thing. On one hand, Pavel is absolutely right. It is ridiculous to say that "these are all security holes as an attacker can exploit them to inject his own code." On the other hand, "what W^X is supposed to protect from" depends on how the term W^X is defined (historically, by PaX and OpenBSD). It may be that W^X is partially not a feature to defeat attacks per se, but also a policy enforcement feature preventing use of dangerous techniques (JIT). Such policy might or might not make sense. It might make sense for ease of reasoning, e.g. "I've flipped this setting, and now I'm certain the system doesn't have JIT within a process (can still have it through dynamically creating and invoking an entire new program), so there are no opportunities for an attacker to inject code nor generate previously non-existing ROP gadgets into an executable mapping within a process." I do find it questionable whether such policy and such reasoning make sense beyond academia. Then, there might be even more ways in which W^X is not perfect enough to enable such reasoning. What about using ptrace(2) to inject code? Should enabling W^X also disable ability to debug programs by non-root? We already have Yama ptrace_scope, which can achieve that at the highest setting, although that's rather inconvenient and is probably unexpected by most to be a requirement for having (ridiculously?) full W^X allowing for the academic reasoning. Personally, I am for policies that make more practical sense. For example, years ago I advocated here on kernel-hardening that we should have a mode where ELF flags enabling/disabling executable stack are ignored, and non-executable stack is always enforced. This should also be extended to default (at program startup) permissions on more than just stack (but also on .bss, typical libcs' heap allocations, etc.) However, I am not convinced there's enough value in extending the policy to restricting explicit uses of mprotect(2). Yes, PaX did that, and its emutramp.txt said "runtime code generation is by its nature incompatible with PaX's PAGEEXEC/SEGMEXEC and MPROTECT features, therefore the real solution is not in emulation but by designing a kernel API for runtime code generation and modifying userland to make use of it." However, not being convinced in the MPROTECT feature having enough practical value, I am also not convinced "a kernel API for runtime code generation and modifying userland to make use of it" is the way to go. Having static instead of dynamically-generated trampolines in userland code where possible (and making other userland/ABI changes to make that possible in more/all cases) is an obvious improvement, and IMO should be a priority over the above. While I share my opinion here, I don't mean that to block Madhavan's work. I'd
Re: [PATCH v2] Documentation/security-bugs: provide more information about linux-distros
On Thu, Jul 18, 2019 at 06:51:07PM -0700, Kees Cook wrote: > On Thu, Jul 18, 2019 at 08:39:19PM -0400, Sasha Levin wrote: > > On Thu, Jul 18, 2019 at 03:00:55PM -0700, Kees Cook wrote: > > > On Wed, Jul 17, 2019 at 07:11:03PM -0400, Sasha Levin wrote: > > > > Provide more information about how to interact with the linux-distros > > > > mailing list for disclosing security bugs. > > > > > > > > Reference the linux-distros list policy and clarify that the reporter > > > > must read and understand those policies as they differ from > > > > secur...@kernel.org's policy. > > > > > > > > Suggested-by: Solar Designer > > > > Signed-off-by: Sasha Levin > > > > > > Sorry, but NACK, see below... I like Sasha's PATCH v2 better, but if Kees insists on NACK'ing it then I suggest that we apply Sasha's first revision of the patch instead. I think either revision is an improvement on the status quo. > I think reinforcing information to avoid past mistakes is appropriate > here. Maybe, but from my perspective common past issues with Linux kernel bugs reported to linux-distros were: - The reporter having been directed to post from elsewhere (and I suspect this documentation file) without being aware of list policy. - The reporter not mentioning (and sometimes not replying even when asked) whether they're also coordinating with security@k.o or whether they want someone on linux-distros to help coordinate with security@k.o. (Maybe this is something we want to write about here.) - The Linux kernel bug having been introduced too recently to be of much interest to distros. > Reports have regularly missed the "[vs]" detail or suggested > embargoes that ended on Fridays, etc. This happens too. Regarding missing the "[vs]" detail, technically there are also a number of other conditions that also let the message through, but those are changing and are deliberately not advertised. > Sending to the distros@ list risks exposing Linux-only flaws to non-Linux > distros. Right. > This has caused leaks in the past Do you mean leaks to *BSD security teams or to the public? I'm not aware of past leaks to the public via the non-Linux distros present on the distros@ list. Are you? Alexander
Re: [PATCH 1/3] x86/asm: Pin sensitive CR0 bits
On Tue, Feb 26, 2019 at 03:36:45PM -0800, Kees Cook wrote: > static inline void native_write_cr0(unsigned long val) > { > - asm volatile("mov %0,%%cr0": : "r" (val), "m" (__force_order)); > + bool warn = false; > + > +again: > + val |= X86_CR0_WP; > + /* > + * In order to have the compiler not optimize away the check > + * in the WARN_ONCE(), mark "val" as being also an output ("+r") This comment is now slightly out of date: the check is no longer "in the WARN_ONCE()". Ditto about the comment for CR4. > + * by this asm() block so it will perform an explicit check, as > + * if it were "volatile". > + */ > + asm volatile("mov %0,%%cr0": "+r" (val) : "m" (__force_order) : ); > + /* > + * If the MOV above was used directly as a ROP gadget we can > + * notice the lack of pinned bits in "val" and start the function > + * from the beginning to gain the WP bit for sure. And do it > + * without first taking the exception for a WARN(). > + */ > + if ((val & X86_CR0_WP) != X86_CR0_WP) { > + warn = true; > + goto again; > + } > + WARN_ONCE(warn, "Attempt to unpin X86_CR0_WP, cr0 bypass attack?!\n"); > } Alexander
Re: [PATCH v2] x86/asm: Pin sensitive CR4 bits
On Wed, Feb 20, 2019 at 01:20:58PM -0800, Kees Cook wrote: > On Wed, Feb 20, 2019 at 10:49 AM Solar Designer wrote: > > > > On Wed, Feb 20, 2019 at 10:09:34AM -0800, Kees Cook wrote: > > > + if (WARN_ONCE((val & cr4_pin) != cr4_pin, "cr4 bypass attempt?!\n")) > > > + goto again; > > > > I think "goto again" is too mild a response given that it occurs after a > > successful write of a non-pinned value to CR4. I think it'd allow some > > exploits to eventually win the race: make their desired use of whatever > > functionality SMEP, etc. would have prevented - which may be just a few > > instructions they need to run - before the CR4 value is reverted after > > "goto again". I think it's one of those cases where a kernel panic > > would be more appropriate. > > It will not land upstream with a BUG() or panic(). Linus has > explicitly stated that none of this work can do that until it has > "baked" in the kernel for a couple years. OK. > In his defense, anyone sufficiently paranoid can already raise the > priority of a WARN() into a panic via sysctl kernel.panic_on_warn (and > kernel.panic_on_oops). I think there are too many uses of WARN() for anyone sane to enable that in production, whereas it'd have made sense to enable it for the few security-related uses. > > Also, WARN_ONCE possibly introduces a delay sufficient to realistically > > win this race on the first try. If we choose to warn, we should do it > > after having reverted the CR4 value, not before. > > Isn't cr4 CPU-local though? Good point. I don't know. If CR4 is per hardware thread, then the race would require an interrupt and would be much harder to win. > Couldn't we turn off interrupts to stop the race? This won't help. An attack would skip the code that disables interrupts and land right on the MOV instruction. Alexander
Re: [PATCH v2] x86/asm: Pin sensitive CR4 bits
On Wed, Feb 20, 2019 at 10:09:34AM -0800, Kees Cook wrote: > +extern volatile unsigned long cr4_pin; > + > static inline void native_write_cr4(unsigned long val) > { > +again: > + val |= cr4_pin; > asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order)); > + /* > + * If the MOV above was used directly as a ROP gadget we can > + * notice the lack of pinned bits in "val" and start the function > + * from the beginning to gain the cr4_pin bits for sure. > + */ > + if (WARN_ONCE((val & cr4_pin) != cr4_pin, "cr4 bypass attempt?!\n")) > + goto again; > } I think "goto again" is too mild a response given that it occurs after a successful write of a non-pinned value to CR4. I think it'd allow some exploits to eventually win the race: make their desired use of whatever functionality SMEP, etc. would have prevented - which may be just a few instructions they need to run - before the CR4 value is reverted after "goto again". I think it's one of those cases where a kernel panic would be more appropriate. Also, WARN_ONCE possibly introduces a delay sufficient to realistically win this race on the first try. If we choose to warn, we should do it after having reverted the CR4 value, not before. Alexander
Re: [kernel-hardening] [PATCH v3 2/2] Protected O_CREAT open in sticky directories
On Tue, Dec 05, 2017 at 11:21:00AM +0100, Salvatore Mesoraca wrote: > 2017-11-30 17:30 GMT+01:00 Solar Designer <so...@openwall.com>: > > $ strace flock /tmp/lockfile -c cat > > [...] > > open("/tmp/lockfile", O_RDONLY|O_CREAT|O_NOCTTY, 0666) = 3 > > flock(3, LOCK_EX) = 0 > > > > This use of flock(1) would be a worse vulnerability, not limited to DoS > > against the script itself but also allowing for privilege escalation > > unless (sym)link restrictions are enabled in the kernel. Adding O_EXCL > > would help (reduce the impact back to DoS against itself), and would > > require that the retry logic (like what is seen in the lock directory > > example above) be present. > > > That behavior can be certainly avoided, but of course it isn't a > > > security problem per se. > > > > I think it is a security problem per se, except in the "subtle case" > > above, and it's great that our proposed policy would catch it. > > I agree on the DoS, though, at first, I didn't consider it a "bug" because > there isn't any open mode that can prevent the DoS in this case. > If you want to avoid it, you must avoid other-users-writable directories > at all. So, It think that, if you are using a sticky directory, it's > intended behavior to let someone else "lock" you. Right. There's a worse DoS I had overlooked, though: flock(1) can also be made to create and/or lock another file (maybe in another directory). Perhaps adding O_NOFOLLOW (alongside flock(1)'s existing use of O_NOCTTY) would be a good idea, even though uses in a directory writable by someone else are inherently risky anyway. > But maybe many flock(1) users are not aware of the issue and so, sometimes, > it can be unintended. > I didn't consider privilege escalation as an issue because I always > looked at flock(1) under the assumption that the lockfile is never actually > read or modified in any way and so it shouldn't make too much difference if > it's an already existing regular file or a symlink etc. > Am I missing something? You made a good point, but yes: O_CREAT will follow a dangling symlink and there are cases where creating an empty file of an attacker-chosen pathname may allow for privilege escalation. For example, crontab(1) man page on RHEL7 says regarding /etc/cron.allow and /etc/cron.deny: "If neither of these files exists, only the super user is allowed to use cron." In that configuration, simply creating empty /etc/cron.deny grants access to crontab(1) to all users. As user: $ crontab -l You (solar) are not allowed to use this program (crontab) See crontab(1) for more information $ ln -s /etc/cron.deny /tmp/lockfile As root: # sysctl -w fs.protected_symlinks=0 fs.protected_symlinks = 0 # flock /tmp/lockfile -c echo As user: $ crontab -l no crontab for solar There may also be side-effects on open of device files (the best known example is rewinding a tape), and we won't avoid those by retrying without O_CREAT|O_EXCL. O_NOFOLLOW will help against symlinks to device files, but not against hard links (if on same device). The kernel's symlink and hardlink protections help, but in this sub-thread we were discussing detecting userspace software issues without waiting for an attack. Things like this fit David Laight's point well: programs trying to make risky (mis)uses less risky sometimes also avoid being detected by our proposed policy. That's life. Alexander
Re: [kernel-hardening] [PATCH v3 2/2] Protected O_CREAT open in sticky directories
On Tue, Dec 05, 2017 at 11:21:00AM +0100, Salvatore Mesoraca wrote: > 2017-11-30 17:30 GMT+01:00 Solar Designer : > > $ strace flock /tmp/lockfile -c cat > > [...] > > open("/tmp/lockfile", O_RDONLY|O_CREAT|O_NOCTTY, 0666) = 3 > > flock(3, LOCK_EX) = 0 > > > > This use of flock(1) would be a worse vulnerability, not limited to DoS > > against the script itself but also allowing for privilege escalation > > unless (sym)link restrictions are enabled in the kernel. Adding O_EXCL > > would help (reduce the impact back to DoS against itself), and would > > require that the retry logic (like what is seen in the lock directory > > example above) be present. > > > That behavior can be certainly avoided, but of course it isn't a > > > security problem per se. > > > > I think it is a security problem per se, except in the "subtle case" > > above, and it's great that our proposed policy would catch it. > > I agree on the DoS, though, at first, I didn't consider it a "bug" because > there isn't any open mode that can prevent the DoS in this case. > If you want to avoid it, you must avoid other-users-writable directories > at all. So, It think that, if you are using a sticky directory, it's > intended behavior to let someone else "lock" you. Right. There's a worse DoS I had overlooked, though: flock(1) can also be made to create and/or lock another file (maybe in another directory). Perhaps adding O_NOFOLLOW (alongside flock(1)'s existing use of O_NOCTTY) would be a good idea, even though uses in a directory writable by someone else are inherently risky anyway. > But maybe many flock(1) users are not aware of the issue and so, sometimes, > it can be unintended. > I didn't consider privilege escalation as an issue because I always > looked at flock(1) under the assumption that the lockfile is never actually > read or modified in any way and so it shouldn't make too much difference if > it's an already existing regular file or a symlink etc. > Am I missing something? You made a good point, but yes: O_CREAT will follow a dangling symlink and there are cases where creating an empty file of an attacker-chosen pathname may allow for privilege escalation. For example, crontab(1) man page on RHEL7 says regarding /etc/cron.allow and /etc/cron.deny: "If neither of these files exists, only the super user is allowed to use cron." In that configuration, simply creating empty /etc/cron.deny grants access to crontab(1) to all users. As user: $ crontab -l You (solar) are not allowed to use this program (crontab) See crontab(1) for more information $ ln -s /etc/cron.deny /tmp/lockfile As root: # sysctl -w fs.protected_symlinks=0 fs.protected_symlinks = 0 # flock /tmp/lockfile -c echo As user: $ crontab -l no crontab for solar There may also be side-effects on open of device files (the best known example is rewinding a tape), and we won't avoid those by retrying without O_CREAT|O_EXCL. O_NOFOLLOW will help against symlinks to device files, but not against hard links (if on same device). The kernel's symlink and hardlink protections help, but in this sub-thread we were discussing detecting userspace software issues without waiting for an attack. Things like this fit David Laight's point well: programs trying to make risky (mis)uses less risky sometimes also avoid being detected by our proposed policy. That's life. Alexander
Re: [PATCH v3 2/2] Protected O_CREAT open in sticky directories
On Thu, Nov 30, 2017 at 04:53:06PM +, David Laight wrote: > From: Salvatore Mesoraca > > if a program tries to open a file, in a sticky directory, > > with the O_CREAT flag and without the O_EXCL, it probably has a bug. > > This feature allows to detect and potentially block programs that > > act this way, it can be used to find vulnerabilities (like those > > prevented by patch #1) and to do policy enforcement. > I presume the 'vulnerabilities' are related to symlinks being created > just before the open? That's one way to exploit them. We already have a sysctl to try and block that specific attack, and we're considering adding more, for other attacks. But we'd also like an easy way to learn of the vulnerabilities without anyone trying to exploit them yet, hence this patch. > Trouble is this change breaks a lot of general use of /tmp. That's general misuse of /tmp. Things like "command > /tmp/file" without having pre-created the file with O_EXCL e.g. by mktemp(1). > I always assumed that code that cared would use O_EXCL and > everything else wasn't worth subverting. Opinions will vary on whether it's worth subverting or not, and someone may reasonably want to configure this differently on different systems. > I found code in vi (and elsewhere) that subverted these checks > by opening with O_WRONLY if stat() showed the file existed and > O_CREAT|O_EXCL if it didn't. Yes, such misuses of /tmp and the like by use of those programs - where the potential consequences would often be less severe (if your example above is correct, it sounds like it'd be data spoofing but not overwriting another file over a link) - could remain unnoticed. > I'm pretty sure that traditionally a lot of these opens were done > with O_CREAT|O_TRUNC. > Implementing that as unlink() followed by a create would stop > 'random' (ok all) symlinks being followed. That's racy. We have O_EXCL, and it must be used along with O_CREAT when the directory is untrusted. (If it were only about symlinks, we also already have O_NOFOLLOW.) > Overall I'm pretty sure this change will break things badly somewhere. Sure. We want it to visibly break what was subtly broken, so that the breakage can be seen and corrected without an attempted attack. Alexander
Re: [PATCH v3 2/2] Protected O_CREAT open in sticky directories
On Thu, Nov 30, 2017 at 04:53:06PM +, David Laight wrote: > From: Salvatore Mesoraca > > if a program tries to open a file, in a sticky directory, > > with the O_CREAT flag and without the O_EXCL, it probably has a bug. > > This feature allows to detect and potentially block programs that > > act this way, it can be used to find vulnerabilities (like those > > prevented by patch #1) and to do policy enforcement. > I presume the 'vulnerabilities' are related to symlinks being created > just before the open? That's one way to exploit them. We already have a sysctl to try and block that specific attack, and we're considering adding more, for other attacks. But we'd also like an easy way to learn of the vulnerabilities without anyone trying to exploit them yet, hence this patch. > Trouble is this change breaks a lot of general use of /tmp. That's general misuse of /tmp. Things like "command > /tmp/file" without having pre-created the file with O_EXCL e.g. by mktemp(1). > I always assumed that code that cared would use O_EXCL and > everything else wasn't worth subverting. Opinions will vary on whether it's worth subverting or not, and someone may reasonably want to configure this differently on different systems. > I found code in vi (and elsewhere) that subverted these checks > by opening with O_WRONLY if stat() showed the file existed and > O_CREAT|O_EXCL if it didn't. Yes, such misuses of /tmp and the like by use of those programs - where the potential consequences would often be less severe (if your example above is correct, it sounds like it'd be data spoofing but not overwriting another file over a link) - could remain unnoticed. > I'm pretty sure that traditionally a lot of these opens were done > with O_CREAT|O_TRUNC. > Implementing that as unlink() followed by a create would stop > 'random' (ok all) symlinks being followed. That's racy. We have O_EXCL, and it must be used along with O_CREAT when the directory is untrusted. (If it were only about symlinks, we also already have O_NOFOLLOW.) > Overall I'm pretty sure this change will break things badly somewhere. Sure. We want it to visibly break what was subtly broken, so that the breakage can be seen and corrected without an attempted attack. Alexander
Re: [kernel-hardening] [PATCH v3 2/2] Protected O_CREAT open in sticky directories
Replying to Salvatore and Ian at once, and CC'ing H. Peter Anvin and Karel Zak for util-linux flock(1). On Thu, Nov 30, 2017 at 02:57:06PM +, Ian Campbell wrote: > On Thu, 2017-11-30 at 15:39 +0100, Salvatore Mesoraca wrote: > > 2017-11-27 1:26 GMT+01:00 Solar Designer <so...@openwall.com>: > > > Why would "shared lock files based on flock()" need O_CREAT without > > > O_EXCL? Where specifically are they currently used that way? > > > > I don't think that they *need* to act like this, but this is how > > util-linux's flock(1) currently works. Oh, but you never referred to flock(1) in there, so I read flock() as implying flock(2). I think util-linux's flock(1) is relatively obscure, and as far as I can tell it isn't documented anywhere whether it may be used on untrusted lock file directories or not. A revision of the flock(1) man page seen on RHEL7 gives really weird examples using /tmp. The current util-linux-2.31/sys-utils/flock.1 is similar. strace'ing those examples, I see flock(1) literally uses the /tmp directory itself (not a file inside) as the lock, which surprisingly appears to work. Checking the util-linux tree, it looks like this was added as a feature. I suppose the good news is this doesn't appear to allow for worse than a DoS against the script using flock(1) (since a different user can also take that lock), unless any of the upper level directories are also untrusted. The man page as seen at https://linux.die.net/man/1/flock currently only lists a more complex example with /var/lock/mylockfile, where flock(1) itself is asked to access it via a pre-opened fd. Permissions on /var/lock vary between distros (e.g., a RHEL6'ish system I just checked has it as 775 root:lock, whereas a RHEL7'ish has it as symlink to ../run/lock, which is 755 root:root). Anyway, these are just examples. The reality is people will use flock(1) in various directories, some of them trusted and some not. If our proposed policy can detect and optionally break some uses in untrusted directories, that's good since such uses are currently unsafe (see below). > > And it doesn't seem an unreasonable behavior from their perspective, > > I thought that too, specifically I reasoned that using O_EXCL would > defeat the purpose of the _shared_ flock(), wouldn't it? No. O_EXCL means the attempt to O_CREAT will fail, at which point the program can retry without both of these flags. (The actual lock is obtained later via flock(2) or fcntl(2) anyway.) In fact, flock(1) already does something similar, as tested using the very first example from the man page on a RHEL7'ish system: $ strace flock /tmp -c cat [...] open("/tmp", O_RDONLY|O_CREAT|O_NOCTTY, 0666) = -1 EISDIR (Is a directory) open("/tmp", O_RDONLY|O_NOCTTY) = 3 flock(3, LOCK_EX) = 0 As you can see, when O_CREAT failed, the program retried without that flag. It could just as easily have tried O_CREAT|O_EXCL, then dropped both at once. Testing with a lock file (rather than lock directory): $ strace flock /tmp/lockfile -c cat [...] open("/tmp/lockfile", O_RDONLY|O_CREAT|O_NOCTTY, 0666) = 3 flock(3, LOCK_EX) = 0 This use of flock(1) would be a worse vulnerability, not limited to DoS against the script itself but also allowing for privilege escalation unless (sym)link restrictions are enabled in the kernel. Adding O_EXCL would help (reduce the impact back to DoS against itself), and would require that the retry logic (like what is seen in the lock directory example above) be present. On Thu, Nov 30, 2017 at 03:39:48PM +0100, Salvatore Mesoraca wrote: > so maybe other programs do that too. Maybe, and they would be similarly vulnerable if they do that in untrusted directories, and they would also need to be fixed. A subtle case is when something like this is technically done in a directory writable by someone else, but is not necessarily crossing what the program's author or the sysadmin recognize as a privilege boundary. Maybe they have accepted that this one pseudo-user can attack that other one anyway, such as because under a certain daemon's privsep model the would-be attacking pseudo-user is necessarily more privileged anyway. This is why we shouldn't by default extend this policy to all directories writable by someone else, but instead we consider introducing a sysctl with varying policy levels - initially focusing on sticky directories writable by someone else and only later optionally extending to non-sticky directories writable by someone else. The latter mode would have even greater focus on development and debugging rather than on production use, although I imagine that specific systems could eventually afford it in production as well. > I was citing that case just to make it clear that, if someone gets > a warning because of flock(1), they shouldn't be worried about
Re: [kernel-hardening] [PATCH v3 2/2] Protected O_CREAT open in sticky directories
Replying to Salvatore and Ian at once, and CC'ing H. Peter Anvin and Karel Zak for util-linux flock(1). On Thu, Nov 30, 2017 at 02:57:06PM +, Ian Campbell wrote: > On Thu, 2017-11-30 at 15:39 +0100, Salvatore Mesoraca wrote: > > 2017-11-27 1:26 GMT+01:00 Solar Designer : > > > Why would "shared lock files based on flock()" need O_CREAT without > > > O_EXCL? Where specifically are they currently used that way? > > > > I don't think that they *need* to act like this, but this is how > > util-linux's flock(1) currently works. Oh, but you never referred to flock(1) in there, so I read flock() as implying flock(2). I think util-linux's flock(1) is relatively obscure, and as far as I can tell it isn't documented anywhere whether it may be used on untrusted lock file directories or not. A revision of the flock(1) man page seen on RHEL7 gives really weird examples using /tmp. The current util-linux-2.31/sys-utils/flock.1 is similar. strace'ing those examples, I see flock(1) literally uses the /tmp directory itself (not a file inside) as the lock, which surprisingly appears to work. Checking the util-linux tree, it looks like this was added as a feature. I suppose the good news is this doesn't appear to allow for worse than a DoS against the script using flock(1) (since a different user can also take that lock), unless any of the upper level directories are also untrusted. The man page as seen at https://linux.die.net/man/1/flock currently only lists a more complex example with /var/lock/mylockfile, where flock(1) itself is asked to access it via a pre-opened fd. Permissions on /var/lock vary between distros (e.g., a RHEL6'ish system I just checked has it as 775 root:lock, whereas a RHEL7'ish has it as symlink to ../run/lock, which is 755 root:root). Anyway, these are just examples. The reality is people will use flock(1) in various directories, some of them trusted and some not. If our proposed policy can detect and optionally break some uses in untrusted directories, that's good since such uses are currently unsafe (see below). > > And it doesn't seem an unreasonable behavior from their perspective, > > I thought that too, specifically I reasoned that using O_EXCL would > defeat the purpose of the _shared_ flock(), wouldn't it? No. O_EXCL means the attempt to O_CREAT will fail, at which point the program can retry without both of these flags. (The actual lock is obtained later via flock(2) or fcntl(2) anyway.) In fact, flock(1) already does something similar, as tested using the very first example from the man page on a RHEL7'ish system: $ strace flock /tmp -c cat [...] open("/tmp", O_RDONLY|O_CREAT|O_NOCTTY, 0666) = -1 EISDIR (Is a directory) open("/tmp", O_RDONLY|O_NOCTTY) = 3 flock(3, LOCK_EX) = 0 As you can see, when O_CREAT failed, the program retried without that flag. It could just as easily have tried O_CREAT|O_EXCL, then dropped both at once. Testing with a lock file (rather than lock directory): $ strace flock /tmp/lockfile -c cat [...] open("/tmp/lockfile", O_RDONLY|O_CREAT|O_NOCTTY, 0666) = 3 flock(3, LOCK_EX) = 0 This use of flock(1) would be a worse vulnerability, not limited to DoS against the script itself but also allowing for privilege escalation unless (sym)link restrictions are enabled in the kernel. Adding O_EXCL would help (reduce the impact back to DoS against itself), and would require that the retry logic (like what is seen in the lock directory example above) be present. On Thu, Nov 30, 2017 at 03:39:48PM +0100, Salvatore Mesoraca wrote: > so maybe other programs do that too. Maybe, and they would be similarly vulnerable if they do that in untrusted directories, and they would also need to be fixed. A subtle case is when something like this is technically done in a directory writable by someone else, but is not necessarily crossing what the program's author or the sysadmin recognize as a privilege boundary. Maybe they have accepted that this one pseudo-user can attack that other one anyway, such as because under a certain daemon's privsep model the would-be attacking pseudo-user is necessarily more privileged anyway. This is why we shouldn't by default extend this policy to all directories writable by someone else, but instead we consider introducing a sysctl with varying policy levels - initially focusing on sticky directories writable by someone else and only later optionally extending to non-sticky directories writable by someone else. The latter mode would have even greater focus on development and debugging rather than on production use, although I imagine that specific systems could eventually afford it in production as well. > I was citing that case just to make it clear that, if someone gets > a warning because of flock(1), they shouldn't be worried about it. Checking the uti
Re: [PATCH v3 2/2] Protected O_CREAT open in sticky directories
On Fri, Nov 24, 2017 at 12:43:47PM +0100, Salvatore Mesoraca wrote: > 2017-11-24 11:53 GMT+01:00 David Laight: > > From: Alan Cox > >> Sent: 22 November 2017 16:52 > >> > >> On Wed, 22 Nov 2017 09:01:46 +0100 Salvatore Mesoraca > >> wrote: > >> > >> > Disallows O_CREAT open missing the O_EXCL flag, in world or > >> > group writable directories, even if the file doesn't exist yet. > >> > With few exceptions (e.g. shared lock files based on flock()) Why would "shared lock files based on flock()" need O_CREAT without O_EXCL? Where specifically are they currently used that way? My guess would be a group-writable mail spool (that would be fcntl() locks on Linux now, but it's the same thing for the purpose of this discussion), but even then I don't see a need for such open flags. If a program does that, we could want to find out and revise it (if O_CREAT|O_EXCL fails, retry without these to open the existing file, then flock() either way). > >> Enough exceptions to make it a bad idea. Maybe, but as Salvatore points out we're considering this for rather limited use - easy opt-in policy enforcement during software development and auditing (years ago, I was using an LKM for this purpose, which caught some issues in Postfix that were promptly patched), and maybe on specific production systems where the sysadmins know what they're doing. > >> Firstly if you care this much *stop* having shared writable directories. > >> We have namespaces, you don't need them. You can give every user their > >> own /tmp etc. This is good advice (and I've been doing similar with pam_mktemp, prior to namespaces), but it's not exactly for the same use case. Also, there are shared writable directories that have to remain shared unless more things are reworked. For example, mail spools and crontab directories, which could be avoided by having the corresponding files in user homes instead (and it's been done), but that's also a related risk (a privileged process accessing a directory writable by a user, even if accessing only for reading, lowering the risk impact). > > Looks like a very bad idea to me as well. > > > > Doesn't this stop all shell redirects into a shared /tmp ? The above sounds like it'd be something bad - but it's actually just right for an opt-in policy like this. A command like "program > /tmp/file" is generally unsafe (unless the file was pre-created safely, see below) and should be avoided. The fact that a lot of documentation gives commands of this kind only emphasizes the need for easy policy enforcement against such misuses of /tmp. Ideally, we'd stop most "shell redirects into a shared /tmp" - all but those into files correctly pre-created with mktemp(1) and the like. Technically, we could achieve similar by allowing O_CREAT without O_EXCL if the file exists _and_ has the same owner as our current fsuid or the owner is root. Most scripts that use mktemp(1) then use that file without switching privileges, so they'll continue working. If there's an exception to that on a system where I'd configure a policy like what's discussed here, I'd rather learn of it and be interested in looking at that unusual script. That script would be taking a risk by making one (pseudo-)user (or root) write into a file in /tmp created (even if initially safely) by another pseudo-user (not root), thus letting the latter pseudo-user attack the former (or root). Ditto for programs using mkstemp(3), but then somehow going through the pathname rather than the fd. That's not ideal, but it's sometimes safe and sometimes makes sense (such as when they need to exec other programs only accepting a pathname), so the same exception and approach applies. > > I'm pretty sure most programs use O_CREAT | O_TRUNC for output > > files - they'll all stop working. That would be great. Unfortunately (in this context), you've identified exceptions, where programs try to be smart but are not necessarily safe. It's beneficial to be able to easily stop at least some misuses of /tmp and the like, even if we can't stop all. > If some program does such a thing, that's a potential vulnerability. Exactly. > With "protected_hardlinks" you are, in most cases, safe. And "protected_symlinks", and not in terms of data spoofing attacks via FIFOs and regular files (the initial focus of this patchset). > But, still, that program has a bug and having this feature enabled will > help you notice it soon. > For that matter, I'm using this patch on my system and I don't have any > program behaving like this. > > > If there are some directories where you need to force O_EXCL you > > need to make it a property of the directory, not the kernel. > > > > David Good idea, but this would need to be checked and enforced by the kernel anyhow, so it's a matter of coarse vs. fine policy. Coarse is much easier to implement and to start using, albeit perhaps mostly not by general-purpose distros, but by
Re: [PATCH v3 2/2] Protected O_CREAT open in sticky directories
On Fri, Nov 24, 2017 at 12:43:47PM +0100, Salvatore Mesoraca wrote: > 2017-11-24 11:53 GMT+01:00 David Laight : > > From: Alan Cox > >> Sent: 22 November 2017 16:52 > >> > >> On Wed, 22 Nov 2017 09:01:46 +0100 Salvatore Mesoraca > >> wrote: > >> > >> > Disallows O_CREAT open missing the O_EXCL flag, in world or > >> > group writable directories, even if the file doesn't exist yet. > >> > With few exceptions (e.g. shared lock files based on flock()) Why would "shared lock files based on flock()" need O_CREAT without O_EXCL? Where specifically are they currently used that way? My guess would be a group-writable mail spool (that would be fcntl() locks on Linux now, but it's the same thing for the purpose of this discussion), but even then I don't see a need for such open flags. If a program does that, we could want to find out and revise it (if O_CREAT|O_EXCL fails, retry without these to open the existing file, then flock() either way). > >> Enough exceptions to make it a bad idea. Maybe, but as Salvatore points out we're considering this for rather limited use - easy opt-in policy enforcement during software development and auditing (years ago, I was using an LKM for this purpose, which caught some issues in Postfix that were promptly patched), and maybe on specific production systems where the sysadmins know what they're doing. > >> Firstly if you care this much *stop* having shared writable directories. > >> We have namespaces, you don't need them. You can give every user their > >> own /tmp etc. This is good advice (and I've been doing similar with pam_mktemp, prior to namespaces), but it's not exactly for the same use case. Also, there are shared writable directories that have to remain shared unless more things are reworked. For example, mail spools and crontab directories, which could be avoided by having the corresponding files in user homes instead (and it's been done), but that's also a related risk (a privileged process accessing a directory writable by a user, even if accessing only for reading, lowering the risk impact). > > Looks like a very bad idea to me as well. > > > > Doesn't this stop all shell redirects into a shared /tmp ? The above sounds like it'd be something bad - but it's actually just right for an opt-in policy like this. A command like "program > /tmp/file" is generally unsafe (unless the file was pre-created safely, see below) and should be avoided. The fact that a lot of documentation gives commands of this kind only emphasizes the need for easy policy enforcement against such misuses of /tmp. Ideally, we'd stop most "shell redirects into a shared /tmp" - all but those into files correctly pre-created with mktemp(1) and the like. Technically, we could achieve similar by allowing O_CREAT without O_EXCL if the file exists _and_ has the same owner as our current fsuid or the owner is root. Most scripts that use mktemp(1) then use that file without switching privileges, so they'll continue working. If there's an exception to that on a system where I'd configure a policy like what's discussed here, I'd rather learn of it and be interested in looking at that unusual script. That script would be taking a risk by making one (pseudo-)user (or root) write into a file in /tmp created (even if initially safely) by another pseudo-user (not root), thus letting the latter pseudo-user attack the former (or root). Ditto for programs using mkstemp(3), but then somehow going through the pathname rather than the fd. That's not ideal, but it's sometimes safe and sometimes makes sense (such as when they need to exec other programs only accepting a pathname), so the same exception and approach applies. > > I'm pretty sure most programs use O_CREAT | O_TRUNC for output > > files - they'll all stop working. That would be great. Unfortunately (in this context), you've identified exceptions, where programs try to be smart but are not necessarily safe. It's beneficial to be able to easily stop at least some misuses of /tmp and the like, even if we can't stop all. > If some program does such a thing, that's a potential vulnerability. Exactly. > With "protected_hardlinks" you are, in most cases, safe. And "protected_symlinks", and not in terms of data spoofing attacks via FIFOs and regular files (the initial focus of this patchset). > But, still, that program has a bug and having this feature enabled will > help you notice it soon. > For that matter, I'm using this patch on my system and I don't have any > program behaving like this. > > > If there are some directories where you need to force O_EXCL you > > need to make it a property of the directory, not the kernel. > > > > David Good idea, but this would need to be checked and enforced by the kernel anyhow, so it's a matter of coarse vs. fine policy. Coarse is much easier to implement and to start using, albeit perhaps mostly not by general-purpose distros, but by a software developer/auditor, an adventurous
Re: [kernel-hardening] [PATCH v2 0/5] stackprotector: ascii armor the stack canary
On Wed, Sep 20, 2017 at 01:18:04PM +0200, Yann Droneaud wrote: > Le mardi 19 septembre 2017 ?? 19:16 +0200, Solar Designer a ??crit : > > > > We could put/require a NUL in the middle of the canary, > > but with the full canary being only 64-bit at most that would also > > make some attacks easier. > > Are you suggesting to randomly select which byte to set to 0 in each > canary ? Definitely not. That's only 8 different possibilities per canary, and the weakest one will affect exploitability in each scenario. So that would be a fairly clear change to the worse. I suggest that we make no further changes at this time, unless someone comes up with an idea that would clearly hurt exploitation more than it helps exploitation, overall across different scenarios. Alexander
Re: [kernel-hardening] [PATCH v2 0/5] stackprotector: ascii armor the stack canary
On Wed, Sep 20, 2017 at 01:18:04PM +0200, Yann Droneaud wrote: > Le mardi 19 septembre 2017 ?? 19:16 +0200, Solar Designer a ??crit : > > > > We could put/require a NUL in the middle of the canary, > > but with the full canary being only 64-bit at most that would also > > make some attacks easier. > > Are you suggesting to randomly select which byte to set to 0 in each > canary ? Definitely not. That's only 8 different possibilities per canary, and the weakest one will affect exploitability in each scenario. So that would be a fairly clear change to the worse. I suggest that we make no further changes at this time, unless someone comes up with an idea that would clearly hurt exploitation more than it helps exploitation, overall across different scenarios. Alexander
Re: [kernel-hardening] [PATCH v2 0/5] stackprotector: ascii armor the stack canary
On Wed, May 24, 2017 at 11:57:46AM -0400, r...@redhat.com wrote: > Zero out the first byte of the stack canary value on 64 bit systems, > in order to mitigate unterminated C string overflows. > > The null byte both prevents C string functions from reading the > canary, and from writing it if the canary value were guessed or > obtained through some other means. > > Reducing the entropy by 8 bits is acceptable on 64-bit systems, > which will still have 56 bits of entropy left, but not on 32 > bit systems, so the "ascii armor" canary is only implemented on > 64-bit systems. > > Inspired by the "ascii armor" code in execshield and Daniel Micay's > linux-hardened tree. > > Also see https://github.com/thestinger/linux-hardened/ Brad trolls us all lightly with this trivia question: https://twitter.com/grsecurity/status/905246423591084033 "For #trivia can you describe one scenario where this change actually helps exploitation using similar C string funcs?" I suppose the expected answer is: The change helps exploitation when the overwriting string ends just before the canary. Its NUL overwriting the NUL byte in the canary would go undetected. Before this change, there would be a 255/256 chance of detection. I hope this was considered. The change might still be a good tradeoff, or it might not, depending on which scenarios are more likely (leak of canary value or the string required in an exploit ending at that exact byte location), and we probably lack such statistics. I am not proposing to revert the change. I had actually contemplated speaking up when this was discussed, but did not for lack of a better suggestion. We could put/require a NUL in the middle of the canary, but with the full canary being only 64-bit at most that would also make some attacks easier. So this is JFYI. No action needed on it, I think. Alexander
Re: [kernel-hardening] [PATCH v2 0/5] stackprotector: ascii armor the stack canary
On Wed, May 24, 2017 at 11:57:46AM -0400, r...@redhat.com wrote: > Zero out the first byte of the stack canary value on 64 bit systems, > in order to mitigate unterminated C string overflows. > > The null byte both prevents C string functions from reading the > canary, and from writing it if the canary value were guessed or > obtained through some other means. > > Reducing the entropy by 8 bits is acceptable on 64-bit systems, > which will still have 56 bits of entropy left, but not on 32 > bit systems, so the "ascii armor" canary is only implemented on > 64-bit systems. > > Inspired by the "ascii armor" code in execshield and Daniel Micay's > linux-hardened tree. > > Also see https://github.com/thestinger/linux-hardened/ Brad trolls us all lightly with this trivia question: https://twitter.com/grsecurity/status/905246423591084033 "For #trivia can you describe one scenario where this change actually helps exploitation using similar C string funcs?" I suppose the expected answer is: The change helps exploitation when the overwriting string ends just before the canary. Its NUL overwriting the NUL byte in the canary would go undetected. Before this change, there would be a 255/256 chance of detection. I hope this was considered. The change might still be a good tradeoff, or it might not, depending on which scenarios are more likely (leak of canary value or the string required in an exploit ending at that exact byte location), and we probably lack such statistics. I am not proposing to revert the change. I had actually contemplated speaking up when this was discussed, but did not for lack of a better suggestion. We could put/require a NUL in the middle of the canary, but with the full canary being only 64-bit at most that would also make some attacks easier. So this is JFYI. No action needed on it, I think. Alexander
Re: [RFC] Restrict writes into untrusted FIFOs and regular files
On Tue, Sep 19, 2017 at 06:06:15PM +0200, Salvatore Mesoraca wrote: > 2017-09-19 2:37 GMT+02:00 Solar Designer <so...@openwall.com>: > > On Mon, Sep 18, 2017 at 02:00:50PM -0700, Kees Cook wrote: > >> On Fri, Sep 15, 2017 at 1:43 AM, Salvatore Mesoraca > >> <s.mesorac...@gmail.com> wrote: > >> > +protected_regular_files: > >> > + > >> > +This protection is similar to protected_fifos, but it > >> > +avoids writes to an attacker-controlled regular file, where program > >> > +expected to create one. > >> > + > >> > +When set to "0", regular files writing is unrestricted. > >> > + > >> > +When set to "1" don't allow O_CREAT open on regular files that we > >> > +don't own in world writable sticky directories, unless they are > >> > +owned by the owner of the directory. [...] > > Although this is sufficient against attacks (if the kernel's check for > > these properties is not racy; I don't know if it is), for the policy > > enforcement use case and reason we might want to support a simpler mode > > where O_CREAT without O_EXCL would be disallowed in sticky directories > > (only world writable? or also writable by anyone other than us? - e.g., > > it'd catch some unsafe uses of mail spools) regardless of whether a > > file of that name already exists or not. Maybe extra settings: > > > > When set to "2" also don't allow O_CREAT open without O_EXCL in > > world-writable sticky directories (even if the files don't already > > exist, for consistent policy enforcement) > > > > When set to "3" also don't allow O_CREAT open on regular files that we > > don't own in sticky directories writable by anyone else, unless the > > files are owned by the owner of the directory. > > > > When set to "4" also don't allow O_CREAT open without O_EXCL in > > sticky directories writable by anyone else (even if the files don't > > already exist, for consistent policy enforcement) > > > > Or maybe "2" and "4" should be a separate knob, so that "3" could be > > used without the policy enforcement aspect of "2", although enabling > > this separate knob at the highest level would make protected_regular > > redundant. > > > > I could envision further levels for non-sticky world-writable and > > writable-by-others directories, and even for unsafe writes without > > O_CREAT and unsafe reads, but then the protected_regular name would > > become even more misleading as without O_CREAT the program could > > actually intend to work with a non-regular file. > > > > Let's avoid further scope creep for now, but have this in mind. As I > > had mentioned in another thread on kernel-hardening, policy enforcement > > like this implemented in a kernel module helped me find weaknesses in > > old Postfix' privsep implementation, which were promptly patched (that > > was many years ago). Having this generally available and easy to enable > > could result in more findings like this by more people. > > > > A setting similar to "3" above should probably also exist for > > protected_fifos (would be "2" there). > > I think I could add "3" to both protected_fifos and protected_regulars > actually using "2" for both. And then add another sysctl for modes > "2" and "4" for both fifos and regular files. Sounds good to me. The third sysctl (or several) could be introduced with a separate patch, focusing on file access safety policy warnings and enforcement in general rather than on any specific file types. Alexander
Re: [RFC] Restrict writes into untrusted FIFOs and regular files
On Tue, Sep 19, 2017 at 06:06:15PM +0200, Salvatore Mesoraca wrote: > 2017-09-19 2:37 GMT+02:00 Solar Designer : > > On Mon, Sep 18, 2017 at 02:00:50PM -0700, Kees Cook wrote: > >> On Fri, Sep 15, 2017 at 1:43 AM, Salvatore Mesoraca > >> wrote: > >> > +protected_regular_files: > >> > + > >> > +This protection is similar to protected_fifos, but it > >> > +avoids writes to an attacker-controlled regular file, where program > >> > +expected to create one. > >> > + > >> > +When set to "0", regular files writing is unrestricted. > >> > + > >> > +When set to "1" don't allow O_CREAT open on regular files that we > >> > +don't own in world writable sticky directories, unless they are > >> > +owned by the owner of the directory. [...] > > Although this is sufficient against attacks (if the kernel's check for > > these properties is not racy; I don't know if it is), for the policy > > enforcement use case and reason we might want to support a simpler mode > > where O_CREAT without O_EXCL would be disallowed in sticky directories > > (only world writable? or also writable by anyone other than us? - e.g., > > it'd catch some unsafe uses of mail spools) regardless of whether a > > file of that name already exists or not. Maybe extra settings: > > > > When set to "2" also don't allow O_CREAT open without O_EXCL in > > world-writable sticky directories (even if the files don't already > > exist, for consistent policy enforcement) > > > > When set to "3" also don't allow O_CREAT open on regular files that we > > don't own in sticky directories writable by anyone else, unless the > > files are owned by the owner of the directory. > > > > When set to "4" also don't allow O_CREAT open without O_EXCL in > > sticky directories writable by anyone else (even if the files don't > > already exist, for consistent policy enforcement) > > > > Or maybe "2" and "4" should be a separate knob, so that "3" could be > > used without the policy enforcement aspect of "2", although enabling > > this separate knob at the highest level would make protected_regular > > redundant. > > > > I could envision further levels for non-sticky world-writable and > > writable-by-others directories, and even for unsafe writes without > > O_CREAT and unsafe reads, but then the protected_regular name would > > become even more misleading as without O_CREAT the program could > > actually intend to work with a non-regular file. > > > > Let's avoid further scope creep for now, but have this in mind. As I > > had mentioned in another thread on kernel-hardening, policy enforcement > > like this implemented in a kernel module helped me find weaknesses in > > old Postfix' privsep implementation, which were promptly patched (that > > was many years ago). Having this generally available and easy to enable > > could result in more findings like this by more people. > > > > A setting similar to "3" above should probably also exist for > > protected_fifos (would be "2" there). > > I think I could add "3" to both protected_fifos and protected_regulars > actually using "2" for both. And then add another sysctl for modes > "2" and "4" for both fifos and regular files. Sounds good to me. The third sysctl (or several) could be introduced with a separate patch, focusing on file access safety policy warnings and enforcement in general rather than on any specific file types. Alexander
Re: [RFC] Restrict writes into untrusted FIFOs and regular files
On Mon, Sep 18, 2017 at 02:00:50PM -0700, Kees Cook wrote: > On Fri, Sep 15, 2017 at 1:43 AM, Salvatore Mesoraca> wrote: > > The purpose is to make data spoofing attacks harder. > > Do you have any examples of attacks (CVEs, blog posts, etc) that you > could link to in this commit? I doubt there are (m)any examples of attacks and blog posts on this, because most systems didn't have similar (sym)link restrictions until recently and those attacks are simpler. As to CVEs, I expect that a large subset of CVEs assigned to "improper usage of temporary files" or "symlink attacks" or such are about vulnerabilities that are also exploitable into data spoofing, even if perhaps with different (and often lesser) impact than they are when exploited via (sym)links. IIRC, the attacks were first pointed out on LKML circa 1997 as an argument against (sym)link restrictions - that they're not enough on their own against some attacks anyway, because FIFOs and even regular files could also be used for those. My response was to include FIFO restrictions (on top of (sym)link restrictions) into -ow patches, and now the same became relevant for mainline (as it has the optional (sym)link restrictions now). One thing that has changed since is the addition of inotify, which probably made use of regular files for similar attacks much more practical. This is one reason why we might also protect against maybe-maliciously pre-created regular files now. Another reason is that we, or at least some users/sysadmins, may want to discourage misuse of /tmp, /dev/shm, and similar directories in general. Those directories are supposed to only be accessed through appropriate library interfaces, or using specific known-safe procedures. Trying to O_CREAT a file in there without O_EXCL is almost never right. It makes sense to provide an easy way to disallow that as policy enforcement. > bike shed: I think "_files" is redundant, since we're already in proc/sys/fs/ +1 > > +protected_regular_files: > > + > > +This protection is similar to protected_fifos, but it > > +avoids writes to an attacker-controlled regular file, where program > > +expected to create one. > > + > > +When set to "0", regular files writing is unrestricted. > > + > > +When set to "1" don't allow O_CREAT open on regular files that we > > +don't own in world writable sticky directories, unless they are > > +owned by the owner of the directory. > > + > > +This protection is based on the restrictions in Openwall. While symlink/hardlink/FIFO restrictions were in -ow patches, the regular file restriction was not - we're just inventing it now, even if upon my suggestion. Although this is sufficient against attacks (if the kernel's check for these properties is not racy; I don't know if it is), for the policy enforcement use case and reason we might want to support a simpler mode where O_CREAT without O_EXCL would be disallowed in sticky directories (only world writable? or also writable by anyone other than us? - e.g., it'd catch some unsafe uses of mail spools) regardless of whether a file of that name already exists or not. Maybe extra settings: When set to "2" also don't allow O_CREAT open without O_EXCL in world-writable sticky directories (even if the files don't already exist, for consistent policy enforcement) When set to "3" also don't allow O_CREAT open on regular files that we don't own in sticky directories writable by anyone else, unless the files are owned by the owner of the directory. When set to "4" also don't allow O_CREAT open without O_EXCL in sticky directories writable by anyone else (even if the files don't already exist, for consistent policy enforcement) Or maybe "2" and "4" should be a separate knob, so that "3" could be used without the policy enforcement aspect of "2", although enabling this separate knob at the highest level would make protected_regular redundant. I could envision further levels for non-sticky world-writable and writable-by-others directories, and even for unsafe writes without O_CREAT and unsafe reads, but then the protected_regular name would become even more misleading as without O_CREAT the program could actually intend to work with a non-regular file. Let's avoid further scope creep for now, but have this in mind. As I had mentioned in another thread on kernel-hardening, policy enforcement like this implemented in a kernel module helped me find weaknesses in old Postfix' privsep implementation, which were promptly patched (that was many years ago). Having this generally available and easy to enable could result in more findings like this by more people. A setting similar to "3" above should probably also exist for protected_fifos (would be "2" there). > Otherwise, yeah, looks good! :) +1 Alexander
Re: [RFC] Restrict writes into untrusted FIFOs and regular files
On Mon, Sep 18, 2017 at 02:00:50PM -0700, Kees Cook wrote: > On Fri, Sep 15, 2017 at 1:43 AM, Salvatore Mesoraca > wrote: > > The purpose is to make data spoofing attacks harder. > > Do you have any examples of attacks (CVEs, blog posts, etc) that you > could link to in this commit? I doubt there are (m)any examples of attacks and blog posts on this, because most systems didn't have similar (sym)link restrictions until recently and those attacks are simpler. As to CVEs, I expect that a large subset of CVEs assigned to "improper usage of temporary files" or "symlink attacks" or such are about vulnerabilities that are also exploitable into data spoofing, even if perhaps with different (and often lesser) impact than they are when exploited via (sym)links. IIRC, the attacks were first pointed out on LKML circa 1997 as an argument against (sym)link restrictions - that they're not enough on their own against some attacks anyway, because FIFOs and even regular files could also be used for those. My response was to include FIFO restrictions (on top of (sym)link restrictions) into -ow patches, and now the same became relevant for mainline (as it has the optional (sym)link restrictions now). One thing that has changed since is the addition of inotify, which probably made use of regular files for similar attacks much more practical. This is one reason why we might also protect against maybe-maliciously pre-created regular files now. Another reason is that we, or at least some users/sysadmins, may want to discourage misuse of /tmp, /dev/shm, and similar directories in general. Those directories are supposed to only be accessed through appropriate library interfaces, or using specific known-safe procedures. Trying to O_CREAT a file in there without O_EXCL is almost never right. It makes sense to provide an easy way to disallow that as policy enforcement. > bike shed: I think "_files" is redundant, since we're already in proc/sys/fs/ +1 > > +protected_regular_files: > > + > > +This protection is similar to protected_fifos, but it > > +avoids writes to an attacker-controlled regular file, where program > > +expected to create one. > > + > > +When set to "0", regular files writing is unrestricted. > > + > > +When set to "1" don't allow O_CREAT open on regular files that we > > +don't own in world writable sticky directories, unless they are > > +owned by the owner of the directory. > > + > > +This protection is based on the restrictions in Openwall. While symlink/hardlink/FIFO restrictions were in -ow patches, the regular file restriction was not - we're just inventing it now, even if upon my suggestion. Although this is sufficient against attacks (if the kernel's check for these properties is not racy; I don't know if it is), for the policy enforcement use case and reason we might want to support a simpler mode where O_CREAT without O_EXCL would be disallowed in sticky directories (only world writable? or also writable by anyone other than us? - e.g., it'd catch some unsafe uses of mail spools) regardless of whether a file of that name already exists or not. Maybe extra settings: When set to "2" also don't allow O_CREAT open without O_EXCL in world-writable sticky directories (even if the files don't already exist, for consistent policy enforcement) When set to "3" also don't allow O_CREAT open on regular files that we don't own in sticky directories writable by anyone else, unless the files are owned by the owner of the directory. When set to "4" also don't allow O_CREAT open without O_EXCL in sticky directories writable by anyone else (even if the files don't already exist, for consistent policy enforcement) Or maybe "2" and "4" should be a separate knob, so that "3" could be used without the policy enforcement aspect of "2", although enabling this separate knob at the highest level would make protected_regular redundant. I could envision further levels for non-sticky world-writable and writable-by-others directories, and even for unsafe writes without O_CREAT and unsafe reads, but then the protected_regular name would become even more misleading as without O_CREAT the program could actually intend to work with a non-regular file. Let's avoid further scope creep for now, but have this in mind. As I had mentioned in another thread on kernel-hardening, policy enforcement like this implemented in a kernel module helped me find weaknesses in old Postfix' privsep implementation, which were promptly patched (that was many years ago). Having this generally available and easy to enable could result in more findings like this by more people. A setting similar to "3" above should probably also exist for protected_fifos (would be "2" there). > Otherwise, yeah, looks good! :) +1 Alexander
Re: [vs-plain] Re: [PATCH] mm: larger stack guard gap, between vmas
Hi all, On Tue, Jul 04, 2017 at 07:16:06PM -0600, kseifr...@redhat.com wrote: > This issue occurs post stackguard patches correct? Fixing it sounds like > this might go beyond hardening and into CVE territory. Since this thread is public on LKML, as it should be, it's no longer valid to be CC'ed to linux-distros, which is for handling of embargoed issues only. So please drop linux-distros from further CC's (I moved linux-distros to Bcc on this reply, just so they know what happened). If specific security issues are identified (such as with LibreOffice and Java), then ideally those should be posted to oss-security as separate reports. I'd appreciate it if anyone takes care of that (regardless of CVE worthiness). In fact, I already mentioned this thread in: http://www.openwall.com/lists/oss-security/2017/07/05/11 Thank you! Alexander
Re: [vs-plain] Re: [PATCH] mm: larger stack guard gap, between vmas
Hi all, On Tue, Jul 04, 2017 at 07:16:06PM -0600, kseifr...@redhat.com wrote: > This issue occurs post stackguard patches correct? Fixing it sounds like > this might go beyond hardening and into CVE territory. Since this thread is public on LKML, as it should be, it's no longer valid to be CC'ed to linux-distros, which is for handling of embargoed issues only. So please drop linux-distros from further CC's (I moved linux-distros to Bcc on this reply, just so they know what happened). If specific security issues are identified (such as with LibreOffice and Java), then ideally those should be posted to oss-security as separate reports. I'd appreciate it if anyone takes care of that (regardless of CVE worthiness). In fact, I already mentioned this thread in: http://www.openwall.com/lists/oss-security/2017/07/05/11 Thank you! Alexander
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
> >>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer <so...@openwall.com> > >>> wrote: > >>> > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: > >>> >> *) When modules_autoload_mode is set to (2), automatic module loading > >>> >> is > >>> >> disabled for all. Once set, this value can not be changed. > >>> > > >>> > What purpose does this securelevel-like property ("Once set, this value > >>> > can not be changed.") serve here? I think this mode 2 is needed, but > >>> > without this extra property, which is bypassable by e.g. explicitly > >>> > loaded kernel modules anyway (and that's OK). On Mon, May 22, 2017 at 04:07:56PM -0700, Kees Cook wrote: > I'm on the fence. For modules_disabled and Yama, it was tied to > CAP_SYS_ADMIN, basically designed to be a at-boot setting that could > not later be undone by an attacker gaining that privilege, keeping > them out of either kernel memory or existing user process memory. > Here, it's CAP_SYS_MODULE... it's hard to imagine the situation where > a CAP_SYS_MODULE-capable process could write to this sysctl but NOT > issue direct modprobe requests, but it's _possible_ via crazy symlink > games to trick capable processes into writing to sysctls. We've seen > this multiple times before, and it's a way for attackers to turn a > single privileged write into a privileged exec. OK, tricking a process via crazy symlink games is finally a potentially valid reason. The question then becomes: are there perhaps so many other important sysctl's, disk files, etc. (which the vulnerable capable process could similarly be tricked into writing) so that specifically resetting modules_autoload_mode isn't particularly lucrative? I think that the answer to that is usually yes. Another related question: do we really want to inconsistently single out a handful of sysctl's for this kind of extra protection? I think not. I agree there are some other settings where being unable to reset them makes sense, but I think this isn't one of those. > I might turn the question around, though: why would we want to have it > changeable at this setting? Convenience for the sysadmin - being able to correct one's error (e.g., wrong order of shell commands), respond to new findings (thought module autoloading was unneeded after some point, then found out some software relies on it), change one's mind, reuse a system differently than originally intended without a forced reboot. > I'm fine leaving that piece off, either way. I'm also fine with either decision. I just thought I'd point out what looked weird to me. I think this is an important patch that should get in, but primarily for modules_autoload_mode=1, which many distros could make the default (and maybe the kernel eventually should?) For modules_autoload_mode=2, we already seem to have the equivalent of modprobe=/bin/true (or does it differ subtly, maybe in return values?), which I already use at startup on a GPU box like this (preloading modules so that the OpenCL backends wouldn't need the autoloading): nvidia-smi nvidia-modprobe -u -c=0 #modprobe nvidia_uvm #modprobe fglrx sysctl -w kernel.modprobe=/bin/true sysctl -w kernel.hotplug=/bin/true but it's good to also have this supported more explicitly and more consistently through modules_autoload_mode=2 while we're at it. So I support having this mode as well. I just question the need to have it non-resettable. Alexander
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
> >>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer > >>> wrote: > >>> > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: > >>> >> *) When modules_autoload_mode is set to (2), automatic module loading > >>> >> is > >>> >> disabled for all. Once set, this value can not be changed. > >>> > > >>> > What purpose does this securelevel-like property ("Once set, this value > >>> > can not be changed.") serve here? I think this mode 2 is needed, but > >>> > without this extra property, which is bypassable by e.g. explicitly > >>> > loaded kernel modules anyway (and that's OK). On Mon, May 22, 2017 at 04:07:56PM -0700, Kees Cook wrote: > I'm on the fence. For modules_disabled and Yama, it was tied to > CAP_SYS_ADMIN, basically designed to be a at-boot setting that could > not later be undone by an attacker gaining that privilege, keeping > them out of either kernel memory or existing user process memory. > Here, it's CAP_SYS_MODULE... it's hard to imagine the situation where > a CAP_SYS_MODULE-capable process could write to this sysctl but NOT > issue direct modprobe requests, but it's _possible_ via crazy symlink > games to trick capable processes into writing to sysctls. We've seen > this multiple times before, and it's a way for attackers to turn a > single privileged write into a privileged exec. OK, tricking a process via crazy symlink games is finally a potentially valid reason. The question then becomes: are there perhaps so many other important sysctl's, disk files, etc. (which the vulnerable capable process could similarly be tricked into writing) so that specifically resetting modules_autoload_mode isn't particularly lucrative? I think that the answer to that is usually yes. Another related question: do we really want to inconsistently single out a handful of sysctl's for this kind of extra protection? I think not. I agree there are some other settings where being unable to reset them makes sense, but I think this isn't one of those. > I might turn the question around, though: why would we want to have it > changeable at this setting? Convenience for the sysadmin - being able to correct one's error (e.g., wrong order of shell commands), respond to new findings (thought module autoloading was unneeded after some point, then found out some software relies on it), change one's mind, reuse a system differently than originally intended without a forced reboot. > I'm fine leaving that piece off, either way. I'm also fine with either decision. I just thought I'd point out what looked weird to me. I think this is an important patch that should get in, but primarily for modules_autoload_mode=1, which many distros could make the default (and maybe the kernel eventually should?) For modules_autoload_mode=2, we already seem to have the equivalent of modprobe=/bin/true (or does it differ subtly, maybe in return values?), which I already use at startup on a GPU box like this (preloading modules so that the OpenCL backends wouldn't need the autoloading): nvidia-smi nvidia-modprobe -u -c=0 #modprobe nvidia_uvm #modprobe fglrx sysctl -w kernel.modprobe=/bin/true sysctl -w kernel.hotplug=/bin/true but it's good to also have this supported more explicitly and more consistently through modules_autoload_mode=2 while we're at it. So I support having this mode as well. I just question the need to have it non-resettable. Alexander
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
On Mon, May 22, 2017 at 03:49:15PM +0200, Djalal Harouni wrote: > On Mon, May 22, 2017 at 2:08 PM, Solar Designer <so...@openwall.com> wrote: > > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: > >> *) When modules_autoload_mode is set to (2), automatic module loading is > >> disabled for all. Once set, this value can not be changed. > > > > What purpose does this securelevel-like property ("Once set, this value > > can not be changed.") serve here? I think this mode 2 is needed, but > > without this extra property, which is bypassable by e.g. explicitly > > loaded kernel modules anyway (and that's OK). > > My reasoning about "Once set, this value can not be changed" is mainly for: > > If you have some systems where modules are not updated for any given > reason, then the only one who will be able to load a module is an > administrator, basically this is a shortcut for: > > * Apps/services can run with CAP_NET_ADMIN but they are not allowed to > auto-load 'netdev' modules. > > * Explicitly loading modules can be guarded by seccomp filters *per* > app, so even if these apps have > CAP_SYS_MODULE they won't be able to explicitly load modules, one > has to remount some sysctl /proc/ entries read-only here and remove > CAP_SYS_ADMIN for all apps anyway. > > This mainly serves the purpose of these systems that do not receive > updates, if I don't want to expose those kernel interfaces what should > I do ? then if I want to unload old versions and replace them with new > ones what operation should be allowed ? and only real root of the > system can do it. Hence, the "Once set, this value can not be changed" > is more of a shortcut, also the idea was put in my mind based on how > "modules_disabled" is disabled forever, and some other interfaces. I > would say: it is easy to handle a transition from 1) "hey this system > is still up to date, some features should be exposed" to 2) "this > system is not up to date anymore, only root should expose some > features..." > > Hmm, I am not sure if this answers your question ? :-) This answers my question, but in a way that I summarize as "there's no good reason to include this securelevel-like property". > I definitively don't want to fall into "modules_disabled" trap where > is it too strict! "Once set, this value can not be changed" means for > some users do not set it otherwise the system is unusable... > > Maybe an extra "4" mode for that ? better get it right. I think you should simply exclude this property from mode 2. The module autoloading restrictions aren't meant to reduce root's powers; they're only meant to protect processes from shooting themselves and the system in the foot inadvertently (confused deputy). modules_disabled may be different in that respect, although with the rest of the kernel lacking securelevel-like support the point is moot. We had working securelevel in 2.0.34 through 2.0.40 inclusive, but we've lost it in 2.1+ with cap-bound apparently never becoming as complete a replacement for it and having been lost/broken further in 2.6.25+. I regret this, but that's a different story. Like I say, module autoloading doesn't even fit in with those restrictions - it's about a totally different threat model. Alexander
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
On Mon, May 22, 2017 at 03:49:15PM +0200, Djalal Harouni wrote: > On Mon, May 22, 2017 at 2:08 PM, Solar Designer wrote: > > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: > >> *) When modules_autoload_mode is set to (2), automatic module loading is > >> disabled for all. Once set, this value can not be changed. > > > > What purpose does this securelevel-like property ("Once set, this value > > can not be changed.") serve here? I think this mode 2 is needed, but > > without this extra property, which is bypassable by e.g. explicitly > > loaded kernel modules anyway (and that's OK). > > My reasoning about "Once set, this value can not be changed" is mainly for: > > If you have some systems where modules are not updated for any given > reason, then the only one who will be able to load a module is an > administrator, basically this is a shortcut for: > > * Apps/services can run with CAP_NET_ADMIN but they are not allowed to > auto-load 'netdev' modules. > > * Explicitly loading modules can be guarded by seccomp filters *per* > app, so even if these apps have > CAP_SYS_MODULE they won't be able to explicitly load modules, one > has to remount some sysctl /proc/ entries read-only here and remove > CAP_SYS_ADMIN for all apps anyway. > > This mainly serves the purpose of these systems that do not receive > updates, if I don't want to expose those kernel interfaces what should > I do ? then if I want to unload old versions and replace them with new > ones what operation should be allowed ? and only real root of the > system can do it. Hence, the "Once set, this value can not be changed" > is more of a shortcut, also the idea was put in my mind based on how > "modules_disabled" is disabled forever, and some other interfaces. I > would say: it is easy to handle a transition from 1) "hey this system > is still up to date, some features should be exposed" to 2) "this > system is not up to date anymore, only root should expose some > features..." > > Hmm, I am not sure if this answers your question ? :-) This answers my question, but in a way that I summarize as "there's no good reason to include this securelevel-like property". > I definitively don't want to fall into "modules_disabled" trap where > is it too strict! "Once set, this value can not be changed" means for > some users do not set it otherwise the system is unusable... > > Maybe an extra "4" mode for that ? better get it right. I think you should simply exclude this property from mode 2. The module autoloading restrictions aren't meant to reduce root's powers; they're only meant to protect processes from shooting themselves and the system in the foot inadvertently (confused deputy). modules_disabled may be different in that respect, although with the rest of the kernel lacking securelevel-like support the point is moot. We had working securelevel in 2.0.34 through 2.0.40 inclusive, but we've lost it in 2.1+ with cap-bound apparently never becoming as complete a replacement for it and having been lost/broken further in 2.6.25+. I regret this, but that's a different story. Like I say, module autoloading doesn't even fit in with those restrictions - it's about a totally different threat model. Alexander
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
Hi Djalal, Thank you for your work on this! On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: > *) When modules_autoload_mode is set to (2), automatic module loading is > disabled for all. Once set, this value can not be changed. What purpose does this securelevel-like property ("Once set, this value can not be changed.") serve here? I think this mode 2 is needed, but without this extra property, which is bypassable by e.g. explicitly loaded kernel modules anyway (and that's OK). I'm sorry if this has been discussed before. Alexander
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
Hi Djalal, Thank you for your work on this! On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: > *) When modules_autoload_mode is set to (2), automatic module loading is > disabled for all. Once set, this value can not be changed. What purpose does this securelevel-like property ("Once set, this value can not be changed.") serve here? I think this mode 2 is needed, but without this extra property, which is bypassable by e.g. explicitly loaded kernel modules anyway (and that's OK). I'm sorry if this has been discussed before. Alexander
Re: [oss-security] Linux kernel ping socket / AF_LLC connect() sin_family race
On Fri, Mar 24, 2017 at 03:21:06PM -0700, Eric Dumazet wrote: > Looks easy enough to fix ? Oh. Probably. Thanks. Need to test, but I guess you already did? > diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c > index > 2af6244b83e27ae384e96cf071c10c5a89674804..ccfbce13a6333a65dab64e4847dd510dfafb1b43 > 100644 > --- a/net/ipv4/ping.c > +++ b/net/ipv4/ping.c > @@ -156,17 +156,18 @@ int ping_hash(struct sock *sk) > void ping_unhash(struct sock *sk) > { > struct inet_sock *isk = inet_sk(sk); > + > pr_debug("ping_unhash(isk=%p,isk->num=%u)\n", isk, isk->inet_num); > + write_lock_bh(_table.lock); > if (sk_hashed(sk)) { > - write_lock_bh(_table.lock); > hlist_nulls_del(>sk_nulls_node); > sk_nulls_node_init(>sk_nulls_node); > sock_put(sk); > isk->inet_num = 0; > isk->inet_sport = 0; > sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); > - write_unlock_bh(_table.lock); > } > + write_unlock_bh(_table.lock); > } > EXPORT_SYMBOL_GPL(ping_unhash); FWIW, in Pavel's original implementation for 2.4.32 (unused), this was: static void ping_v4_unhash(struct sock *sk) { DEBUG(("ping_v4_unhash(sk=%p,sk->num=%u)\n", sk, sk->num)); write_lock_bh(_hash_lock); if (sk->pprev) { if (sk->next) sk->next->pprev = sk->pprev; *sk->pprev = sk->next; sk->pprev = NULL; sk->num = 0; sock_prot_dec_use(sk->prot); __sock_put(sk); } write_unlock_bh(_hash_lock); } Looks like the erroneous optimization (not expecting concurrent activity on the same socket?) was introduced during conversion to 2.6's hlists. So far this cursed function had 3 bugs, two of them security (including this one) and one probably benign (or if not, then effectively a subset of this bug as it performed some unneeded / stale debugging work before acquiring the lock), with all 3 introduced in forward-porting. Maybe the nature of forward-porting activity makes people relatively inattentive ("compiles with the new interfaces and still works? must be correct"), compared to when writing new code. Anyhow, I share some responsibility for this mess, for having advocated this patch being forward-ported and merged back then. I still like having this functionality and its userspace security benefits... but I don't like the kernel bugs. Alexander
Re: [oss-security] Linux kernel ping socket / AF_LLC connect() sin_family race
On Fri, Mar 24, 2017 at 03:21:06PM -0700, Eric Dumazet wrote: > Looks easy enough to fix ? Oh. Probably. Thanks. Need to test, but I guess you already did? > diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c > index > 2af6244b83e27ae384e96cf071c10c5a89674804..ccfbce13a6333a65dab64e4847dd510dfafb1b43 > 100644 > --- a/net/ipv4/ping.c > +++ b/net/ipv4/ping.c > @@ -156,17 +156,18 @@ int ping_hash(struct sock *sk) > void ping_unhash(struct sock *sk) > { > struct inet_sock *isk = inet_sk(sk); > + > pr_debug("ping_unhash(isk=%p,isk->num=%u)\n", isk, isk->inet_num); > + write_lock_bh(_table.lock); > if (sk_hashed(sk)) { > - write_lock_bh(_table.lock); > hlist_nulls_del(>sk_nulls_node); > sk_nulls_node_init(>sk_nulls_node); > sock_put(sk); > isk->inet_num = 0; > isk->inet_sport = 0; > sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); > - write_unlock_bh(_table.lock); > } > + write_unlock_bh(_table.lock); > } > EXPORT_SYMBOL_GPL(ping_unhash); FWIW, in Pavel's original implementation for 2.4.32 (unused), this was: static void ping_v4_unhash(struct sock *sk) { DEBUG(("ping_v4_unhash(sk=%p,sk->num=%u)\n", sk, sk->num)); write_lock_bh(_hash_lock); if (sk->pprev) { if (sk->next) sk->next->pprev = sk->pprev; *sk->pprev = sk->next; sk->pprev = NULL; sk->num = 0; sock_prot_dec_use(sk->prot); __sock_put(sk); } write_unlock_bh(_hash_lock); } Looks like the erroneous optimization (not expecting concurrent activity on the same socket?) was introduced during conversion to 2.6's hlists. So far this cursed function had 3 bugs, two of them security (including this one) and one probably benign (or if not, then effectively a subset of this bug as it performed some unneeded / stale debugging work before acquiring the lock), with all 3 introduced in forward-porting. Maybe the nature of forward-porting activity makes people relatively inattentive ("compiles with the new interfaces and still works? must be correct"), compared to when writing new code. Anyhow, I share some responsibility for this mess, for having advocated this patch being forward-ported and merged back then. I still like having this functionality and its userspace security benefits... but I don't like the kernel bugs. Alexander
non-response to IPI kills wrong(?) task, confuses logs
Hi, Adam Zabrocki, CC'ed here, informed me of the following: There is an undesirable situation in SMP Linux machines when sending an IPI via the smp_call_function_single() API: /* * smp_call_function_single - Run a function on a specific CPU * @func: The function to run. This must be fast and non-blocking. * @info: An arbitrary pointer to pass to the function. * @wait: If true, wait until function has completed on other CPUs. * * Returns 0 on success, else a negative status code. */ int smp_call_function_single(int cpu, smp_call_func_t func, void *info, int wait) It runs "func" on a specific CPU, and if the "wait" flag is true, it waits until "func" has completed work before returning to the caller. This is a useful feature used to control when it is safe to continue and potentially parse data generated by another CPU. Unfortunately, Linux can kill the task waiting on its smp_call_function_single(). The NMI watchdog is detecting a "soft lockup" on the CPU which calls smp_call_function_single() with the "wait" flag set. In practice it is not always the desirable behavior. It is possible that the target CPU of the IPI was non-responsive and would never (or not in time) pick up the IPI sent to it. From the perspective of the task sending the IPI, the target CPU never finished the work, so smp_call_function_single() continues its busy wait loop. In the meantime, the NMI watchdog can kill the busy-waiting task instead of somehow "unlocking" the CPU that did not handle the IPI in time. This is often visible in the logs like e.g.: Jul 18 00:43:06 ubuntu kernel: [1578895.078397] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [thermald:714] ... Jul 18 00:43:06 ubuntu kernel: [1578895.078470] CPU: 1 PID: 714 Comm: thermald Tainted: GW OEL 4.2.0-30-generic #36-Ubuntu Jul 18 00:43:06 ubuntu kernel: [1578895.078472] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 Jul 18 00:43:06 ubuntu kernel: [1578895.078475] task: 88003776e040 ti: 88003839 task.ti: 88003839 Jul 18 00:43:06 ubuntu kernel: [1578895.078477] RIP: 0010:[] [] smp_call_function_single+0xd8/0x130 Jul 18 00:43:06 ubuntu kernel: [1578895.078486] RSP: 0018:880038393c98 EFLAGS: 0202 Jul 18 00:43:06 ubuntu kernel: [1578895.078487] RAX: RBX: RCX: Jul 18 00:43:06 ubuntu kernel: [1578895.078489] RDX: 0001 RSI: 0100 RDI: 0292 Jul 18 00:43:06 ubuntu kernel: [1578895.078490] RBP: 880038393ce8 R08: R09: 0001 Jul 18 00:43:06 ubuntu kernel: [1578895.078492] R10: 8800175de000 R11: 880035123180 R12: 880034c1c2d0 Jul 18 00:43:06 ubuntu kernel: [1578895.078493] R13: 0292 R14: 0292 R15: 880038393c38 Jul 18 00:43:06 ubuntu kernel: [1578895.078495] FS: 7fa118895700() GS:88003d64() knlGS: Jul 18 00:43:06 ubuntu kernel: [1578895.078497] CS: 0010 DS: ES: CR0: 80050033 Jul 18 00:43:06 ubuntu kernel: [1578895.078499] CR2: 7fabda352000 CR3: 34c22000 CR4: 001406e0 Jul 18 00:43:06 ubuntu kernel: [1578895.078508] Stack: Jul 18 00:43:06 ubuntu kernel: [1578895.078510] 8121e2c4 880038393cd8 Jul 18 00:43:06 ubuntu kernel: [1578895.078513] 813f86b0 880038393d00 0003 5a1685d7 Jul 18 00:43:06 ubuntu kernel: [1578895.078515] 880038393d4c 880038393d48 880038393d38 813f884d Jul 18 00:43:06 ubuntu kernel: [1578895.078518] Call Trace: Jul 18 00:43:06 ubuntu kernel: [1578895.078526] [] ? mntput+0x24/0x40 Jul 18 00:43:06 ubuntu kernel: [1578895.078532] [] ? ucs2_strncmp+0x50/0x50 Jul 18 00:43:06 ubuntu kernel: [1578895.078642] [] rdmsr_on_cpu+0x5d/0x90 Jul 18 00:43:06 ubuntu kernel: [1578895.078650] [] show_temp+0xa7/0xe0 [coretemp] Jul 18 00:43:06 ubuntu kernel: [1578895.078656] [] dev_attr_show+0x20/0x50 Jul 18 00:43:06 ubuntu kernel: [1578895.078662] [] ? mutex_lock+0x16/0x40 Jul 18 00:43:06 ubuntu kernel: [1578895.078667] [] sysfs_kf_seq_show+0xbc/0x130 Jul 18 00:43:06 ubuntu kernel: [1578895.078669] [] kernfs_seq_show+0x23/0x30 Jul 18 00:43:06 ubuntu kernel: [1578895.078672] [] seq_read+0xec/0x390 Jul 18 00:43:06 ubuntu kernel: [1578895.078675] [] kernfs_fop_read+0x10a/0x160 Jul 18 00:43:06 ubuntu kernel: [1578895.078679] [] ? security_file_permission+0xa0/0xc0 Jul 18 00:43:06 ubuntu kernel: [1578895.078682] [] __vfs_read+0x18/0x40 Jul 18 00:43:06 ubuntu kernel: [1578895.078685] [] vfs_read+0x87/0x130 Jul 18 00:43:06 ubuntu kernel: [1578895.078687] [] SyS_read+0x55/0xc0 Jul 18 00:43:07 ubuntu kernel: [1578895.078690] [] entry_SYSCALL_64_fastpath+0x16/0x75 Jul 18 00:43:07 ubuntu kernel: [1578895.078692] Code: 25 28 00 00 00 75 70 48 83 c4 40 5b 41 5c 5d c3 48 8d 75 c8
non-response to IPI kills wrong(?) task, confuses logs
Hi, Adam Zabrocki, CC'ed here, informed me of the following: There is an undesirable situation in SMP Linux machines when sending an IPI via the smp_call_function_single() API: /* * smp_call_function_single - Run a function on a specific CPU * @func: The function to run. This must be fast and non-blocking. * @info: An arbitrary pointer to pass to the function. * @wait: If true, wait until function has completed on other CPUs. * * Returns 0 on success, else a negative status code. */ int smp_call_function_single(int cpu, smp_call_func_t func, void *info, int wait) It runs "func" on a specific CPU, and if the "wait" flag is true, it waits until "func" has completed work before returning to the caller. This is a useful feature used to control when it is safe to continue and potentially parse data generated by another CPU. Unfortunately, Linux can kill the task waiting on its smp_call_function_single(). The NMI watchdog is detecting a "soft lockup" on the CPU which calls smp_call_function_single() with the "wait" flag set. In practice it is not always the desirable behavior. It is possible that the target CPU of the IPI was non-responsive and would never (or not in time) pick up the IPI sent to it. From the perspective of the task sending the IPI, the target CPU never finished the work, so smp_call_function_single() continues its busy wait loop. In the meantime, the NMI watchdog can kill the busy-waiting task instead of somehow "unlocking" the CPU that did not handle the IPI in time. This is often visible in the logs like e.g.: Jul 18 00:43:06 ubuntu kernel: [1578895.078397] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [thermald:714] ... Jul 18 00:43:06 ubuntu kernel: [1578895.078470] CPU: 1 PID: 714 Comm: thermald Tainted: GW OEL 4.2.0-30-generic #36-Ubuntu Jul 18 00:43:06 ubuntu kernel: [1578895.078472] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 Jul 18 00:43:06 ubuntu kernel: [1578895.078475] task: 88003776e040 ti: 88003839 task.ti: 88003839 Jul 18 00:43:06 ubuntu kernel: [1578895.078477] RIP: 0010:[] [] smp_call_function_single+0xd8/0x130 Jul 18 00:43:06 ubuntu kernel: [1578895.078486] RSP: 0018:880038393c98 EFLAGS: 0202 Jul 18 00:43:06 ubuntu kernel: [1578895.078487] RAX: RBX: RCX: Jul 18 00:43:06 ubuntu kernel: [1578895.078489] RDX: 0001 RSI: 0100 RDI: 0292 Jul 18 00:43:06 ubuntu kernel: [1578895.078490] RBP: 880038393ce8 R08: R09: 0001 Jul 18 00:43:06 ubuntu kernel: [1578895.078492] R10: 8800175de000 R11: 880035123180 R12: 880034c1c2d0 Jul 18 00:43:06 ubuntu kernel: [1578895.078493] R13: 0292 R14: 0292 R15: 880038393c38 Jul 18 00:43:06 ubuntu kernel: [1578895.078495] FS: 7fa118895700() GS:88003d64() knlGS: Jul 18 00:43:06 ubuntu kernel: [1578895.078497] CS: 0010 DS: ES: CR0: 80050033 Jul 18 00:43:06 ubuntu kernel: [1578895.078499] CR2: 7fabda352000 CR3: 34c22000 CR4: 001406e0 Jul 18 00:43:06 ubuntu kernel: [1578895.078508] Stack: Jul 18 00:43:06 ubuntu kernel: [1578895.078510] 8121e2c4 880038393cd8 Jul 18 00:43:06 ubuntu kernel: [1578895.078513] 813f86b0 880038393d00 0003 5a1685d7 Jul 18 00:43:06 ubuntu kernel: [1578895.078515] 880038393d4c 880038393d48 880038393d38 813f884d Jul 18 00:43:06 ubuntu kernel: [1578895.078518] Call Trace: Jul 18 00:43:06 ubuntu kernel: [1578895.078526] [] ? mntput+0x24/0x40 Jul 18 00:43:06 ubuntu kernel: [1578895.078532] [] ? ucs2_strncmp+0x50/0x50 Jul 18 00:43:06 ubuntu kernel: [1578895.078642] [] rdmsr_on_cpu+0x5d/0x90 Jul 18 00:43:06 ubuntu kernel: [1578895.078650] [] show_temp+0xa7/0xe0 [coretemp] Jul 18 00:43:06 ubuntu kernel: [1578895.078656] [] dev_attr_show+0x20/0x50 Jul 18 00:43:06 ubuntu kernel: [1578895.078662] [] ? mutex_lock+0x16/0x40 Jul 18 00:43:06 ubuntu kernel: [1578895.078667] [] sysfs_kf_seq_show+0xbc/0x130 Jul 18 00:43:06 ubuntu kernel: [1578895.078669] [] kernfs_seq_show+0x23/0x30 Jul 18 00:43:06 ubuntu kernel: [1578895.078672] [] seq_read+0xec/0x390 Jul 18 00:43:06 ubuntu kernel: [1578895.078675] [] kernfs_fop_read+0x10a/0x160 Jul 18 00:43:06 ubuntu kernel: [1578895.078679] [] ? security_file_permission+0xa0/0xc0 Jul 18 00:43:06 ubuntu kernel: [1578895.078682] [] __vfs_read+0x18/0x40 Jul 18 00:43:06 ubuntu kernel: [1578895.078685] [] vfs_read+0x87/0x130 Jul 18 00:43:06 ubuntu kernel: [1578895.078687] [] SyS_read+0x55/0xc0 Jul 18 00:43:07 ubuntu kernel: [1578895.078690] [] entry_SYSCALL_64_fastpath+0x16/0x75 Jul 18 00:43:07 ubuntu kernel: [1578895.078692] Code: 25 28 00 00 00 75 70 48 83 c4 40 5b 41 5c 5d c3 48 8d 75 c8
Re: [patch] lib: check for strcpy() overflows to fixed length buffers
On Wed, Apr 30, 2014 at 06:08:44PM +0300, Dan Carpenter wrote: > There are sometimes where we know that we are doing an strcpy() into a > fixed length buffer. In those cases, we could verify that the strcpy() > doesn't overflow. This patch introduces DEBUG_STRICT_SLOW_STRCPY_CHECKS > if you want to check for that. FWIW, I had posted similar macros for userland strcpy() and friends to the security-audit list (now defunct) in 1998. Someone preserved a copy here (although the indentation is lost): http://www.opennet.ru/soft/0432.html In (weird) use, with proper indentation: http://www.merit.edu/mail.archives/nanog/2000-02/msg00366.html https://github.com/tureba/trinoo/blob/master/strfix.h Personally, I was using this at the time for building known-broken software like wu-ftpd, where the risk of false positives felt lower than the risk of buffer overflow bugs being in fact present in the code. I used gcc's Statement Exprs extension, which is also used in the Linux kernel a lot: http://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html So maybe you should, too. (That is, if you want to go ahead with this approach for code that isn't meant to be as broken as wu-ftpd was.) This lets us propagate the original return value. To determine the destination buffer size, I simply used sizeof() and skipped my added protection in case the size looked like that of a pointer. Now you have those nice new gcc features instead. :-) > The downside is that it makes strcpy slower. I guess the slowdown is mostly from the added strlen(). I avoided it by using strncat(), so I had truncation instead of detection. It is unclear which is better. Other functions I did this for are strcat(), sprintf(), vsprintf(). Alexander -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] lib: check for strcpy() overflows to fixed length buffers
On Wed, Apr 30, 2014 at 06:08:44PM +0300, Dan Carpenter wrote: There are sometimes where we know that we are doing an strcpy() into a fixed length buffer. In those cases, we could verify that the strcpy() doesn't overflow. This patch introduces DEBUG_STRICT_SLOW_STRCPY_CHECKS if you want to check for that. FWIW, I had posted similar macros for userland strcpy() and friends to the security-audit list (now defunct) in 1998. Someone preserved a copy here (although the indentation is lost): http://www.opennet.ru/soft/0432.html In (weird) use, with proper indentation: http://www.merit.edu/mail.archives/nanog/2000-02/msg00366.html https://github.com/tureba/trinoo/blob/master/strfix.h Personally, I was using this at the time for building known-broken software like wu-ftpd, where the risk of false positives felt lower than the risk of buffer overflow bugs being in fact present in the code. I used gcc's Statement Exprs extension, which is also used in the Linux kernel a lot: http://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html So maybe you should, too. (That is, if you want to go ahead with this approach for code that isn't meant to be as broken as wu-ftpd was.) This lets us propagate the original return value. To determine the destination buffer size, I simply used sizeof() and skipped my added protection in case the size looked like that of a pointer. Now you have those nice new gcc features instead. :-) The downside is that it makes strcpy slower. I guess the slowdown is mostly from the added strlen(). I avoided it by using strncat(), so I had truncation instead of detection. It is unclear which is better. Other functions I did this for are strcat(), sprintf(), vsprintf(). Alexander -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/