Re: "fs/namei.c: keep track of nd->root refcount status" causes boot panic
On Tue, Sep 03, 2019 at 06:56:10PM +0100, Al Viro wrote:
> On Tue, Sep 03, 2019 at 08:39:30AM -0700, Christoph Hellwig wrote:
> > > There's much nastier situation than "new upstream kernel released,
> > > need to rebuild" - it's bisect in mainline trying to locate something...
> >
> > I really don't get the point. And it's not like we've card about
> > this anywhere else. And jumping wildly around with the numeric values
> > for constants will lead to bugs like the one you added and fixed again
> > and again.
>
> The thing is, there are several groups - it's not as if all additions
> were guaranteed to be at the end. So either we play with renumbering
> again and again, or we are back to the square one...
>
> Is there any common trick that would allow to verify the lack of duplicates
> at the build time?

What about:

static_assert(
	(LOOKUP_FOLLOW^LOOKUP_DIRECTORY^LOOKUP_AUTOMOUNT^LOOKUP_EMPTY^LOOKUP_DOWN^
	 LOOKUP_REVAL^LOOKUP_RCU^
	 LOOKUP_OPEN^LOOKUP_CREATE^LOOKUP_EXCL^LOOKUP_RENAME_TARGET^
	 LOOKUP_PARENT^LOOKUP_NO_REVAL^LOOKUP_JUMPED^LOOKUP_ROOT^LOOKUP_ROOT_GRABBED)
	==
	(LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_AUTOMOUNT|LOOKUP_EMPTY|LOOKUP_DOWN|
	 LOOKUP_REVAL|LOOKUP_RCU|
	 LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL|LOOKUP_RENAME_TARGET|
	 LOOKUP_PARENT|LOOKUP_NO_REVAL|LOOKUP_JUMPED|LOOKUP_ROOT|LOOKUP_ROOT_GRABBED),
	"duplicated LOOKUP_* constant");

?

- Kevin
[PATCH] libertas: Add missing sentinel at end of if_usb.c fw_table
This sentinel tells the firmware loading process when to stop.

Reported-and-tested-by: syzbot+98156c174c5a2cad9...@syzkaller.appspotmail.com
Signed-off-by: Kevin Easton
---
 drivers/net/wireless/marvell/libertas/if_usb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/marvell/libertas/if_usb.c b/drivers/net/wireless/marvell/libertas/if_usb.c
index f1622f0ff8c9..fe3142d85d1e 100644
--- a/drivers/net/wireless/marvell/libertas/if_usb.c
+++ b/drivers/net/wireless/marvell/libertas/if_usb.c
@@ -50,7 +50,8 @@ static const struct lbs_fw_table fw_table[] = {
 	{ MODEL_8388, "libertas/usb8388_v5.bin", NULL },
 	{ MODEL_8388, "libertas/usb8388.bin", NULL },
 	{ MODEL_8388, "usb8388.bin", NULL },
-	{ MODEL_8682, "libertas/usb8682.bin", NULL }
+	{ MODEL_8682, "libertas/usb8682.bin", NULL },
+	{ 0, NULL, NULL }
 };
 
 static const struct usb_device_id if_usb_table[] = {
-- 
2.11.0
Re: [PATCH 4.19 053/105] mm/mincore.c: make mincore() more conservative
On Wed, May 22, 2019 at 11:21:11AM +0200, Michal Hocko wrote:
> On Wed 22-05-19 10:57:41, Pavel Machek wrote:
> > Hi!
> >
> > > commit 134fca9063ad4851de767d1768180e5dede9a881 upstream.
> > >
> > > The semantics of what mincore() considers to be resident is not
> > > completely clear, but Linux has always (since 2.3.52, which is when
> > > mincore() was initially done) treated it as "page is available in page
> > > cache".
> > >
> > > That's potentially a problem, as that [in]directly exposes
> > > meta-information about pagecache / memory mapping state even about
> > > memory not strictly belonging to the process executing the syscall,
> > > opening possibilities for sidechannel attacks.
> > >
> > > Change the semantics of mincore() so that it only reveals pagecache
> > > information for non-anonymous mappings that belog to files that the
> > > calling process could (if it tried to) successfully open for writing;
> > > otherwise we'd be including shared non-exclusive mappings, which
> > >
> > >  - is the sidechannel
> > >
> > >  - is not the usecase for mincore(), as that's primarily used for data,
> > >    not (shared) text
> >
> > ...
> >
> > > @@ -189,8 +205,13 @@ static long do_mincore(unsigned long add
> > >  	vma = find_vma(current->mm, addr);
> > >  	if (!vma || addr < vma->vm_start)
> > >  		return -ENOMEM;
> > > -	mincore_walk.mm = vma->vm_mm;
> > >  	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> > > +	if (!can_do_mincore(vma)) {
> > > +		unsigned long pages = DIV_ROUND_UP(end - addr, PAGE_SIZE);
> > > +		memset(vec, 1, pages);
> > > +		return pages;
> > > +	}
> > > +	mincore_walk.mm = vma->vm_mm;
> > >  	err = walk_page_range(addr, end, &mincore_walk);
> >
> > We normally return errors when we deny permissions; but this one just
> > returns success and wrong data.
> >
> > Could we return -EPERM there? If not, should it at least get a
> > comment?
>
> This was a deliberate decision AFAIR. We cannot return failure because
> this could lead to an unexpected userspace failure. We are pretendeing
> that those pages are present because that is the safest option -
> e.g. consider an application which tries to refault until the page is
> present...

Yes, in particular several userspace applications I found used mincore()
to find out whether a particular range is mapped at all or not, treating
any error as "unmapped" and any non-error return as "mapped".

- Kevin
Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
>
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes. And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit. I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable. I suppose it would be nifty if there were a way for a
> process, by mutual agreement, to reparent itself to an unrelated
> process.
>
> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.
>
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc. The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.

The idea of a system call that takes an infinitely-extendable laundry
list of operations to perform in kernel space seems quite inelegant, if
only for the error-reporting reason.

Instead, I suggest that what you'd want is a way to create a new
embryonic process that has no address space and isn't yet schedulable.
You then just need other-process-directed variants of all the normal
setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
etc. Then when it's all set up you pr_execve() to kick it off.

- Kevin
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
On Tue, Jan 15, 2019 at 11:52:25PM -0800, Josh Snyder wrote:
> On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet wrote:
> >
> > There is a difference with your previous patch though, that used to list no
> > page in core when it didn't know; this patch lists pages as in core when it
> > refuses to tell. I don't think that's very important, though.
>
> Is there a reason not to return -EPERM in this case?

When I was looking through the Debian Code Search results, quite a few
of the hits were for code that uses mincore() as a way to check if
_anything_ is mapped at an address or not. This code doesn't care about
the in core / not in core result of mincore(), just whether it returned
an error or not.

I think a new error return would break most of the instances of that
code I saw.

- Kevin
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> > https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.
>
> Yeah, heh.
>
> And the first hit is 'fincore', which probably nobody cares about
> anyway, but it does
>
>         fd = open (name, O_RDONLY)
>         ..
>         mmap(window, len, PROT_NONE, MAP_PRIVATE, ..
>
> so if we want to keep that working, we'd really need to actually check
> file ownership rather than just looking at f_mode.
>
> But I don't know if anybody *uses* and cares about fincore, and it's
> particularly questionable for non-root users.
> ...
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.

I've gone through everything in the Debian code search, and this is the
stuff that seems like it would be affected at all by the current patch:

util-linux
    Contains 'fincore' as already noted above.

e2fsprogs
    e4defrag tries to drop pages that it caused to be loaded into the
    page cache, but it's not clear that this ever worked as designed
    anyway (it calls mincore() before ioctl(fd, EXT4_IOC_MOVE_EXT ..)
    but then after the sync_file_range it drops the pages that *were*
    in the page cache at the time of mincore()).

pgfincore
    postgresql extension used to try to dump/restore page cache status
    of database backing files across reboots. It uses a fresh mapping
    with mincore() to try to determine the current page cache status
    of a file.

nocache
    LD_PRELOAD library that tries to drop any pages that the victim
    program has caused to be loaded into the page cache, uses mincore
    on a fresh mapping to see what was resident beforehand. Also
    includes 'cachestats' command that's basically another 'fincore'.

xfsprogs
    xfs_io has a 'mincore' sub-command that is roughly equivalent to
    'fincore'.

vmtouch
    vmtouch is "Portable file system cache diagnostics and control",
    among other things it implements 'fincore' type functionality, and
    one of its touted use-cases is "Preserving virtual memory profile
    when failing over servers".

qemu
    qemu uses mincore() with a fresh PROT_NONE, MAP_PRIVATE mapping to
    implement the "x-check-cache-dropped" option.
    ( https://patchwork.kernel.org/patch/10395865/ )

(Everything else I could see was either looking at anonymous VMAs, its
own existing mapping that it's been using for actual IO, or was just
using mincore() to see if an address was part of any mapping at all).

- Kevin
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> > https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.
>
> Anyway, the Debian code search just results in mostly non-present
> stuff. It's sad that google code search is no more. It was great for
> exactly these kinds of questions.

If you select the "Group search results by Debian source package" option
on the search results page it makes it a lot easier to skim through.

It looks to me like Firefox is expecting mincore() not to fail on
libraries that it has mapped:

https://sources.debian.org/src/firefox-esr/60.4.0esr-1/mozglue/linker/BaseElf.cpp/?hl=98#L98

- Kevin

> The mono runtime seems to have some mono_pages_not_faulted() function,
> but I don't know if people use it for file mappings, and I couldn't
> find any interesting users of it.
>
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.
>
> Linus
Re: Can we drop upstream Linux x32 support?
On Thu, Dec 13, 2018 at 10:05:14AM +0100, Richard Weinberger wrote:
> On Thu, Dec 13, 2018 at 6:03 AM Kevin Easton wrote:
> >
> > On Tue, Dec 11, 2018 at 11:29:14AM +0100, John Paul Adrian Glaubitz wrote:
> > ...
> > > I can't say anything about the syscall interface. However, what I do know
> > > is that the weird combination of a 32-bit userland with a 64-bit kernel
> > > interface is sometimes causing issues. For example, application code usually
> > > expects things like time_t to be 32-bit on a 32-bit system. However, this
> > > isn't the case for x32 which is why code fails to build.
> >
> > OpenBSD and NetBSD both have 64-bit time_t on 32-bit systems and have
> > had for four or five years at this point.
>
> They can also do flag-day changes and break existing applications, Linux not.

Sure, but the point is that most widely-used software has probably by
now come into contact with systems where time_t is bigger than long.

- Kevin
Re: Can we drop upstream Linux x32 support?
On Tue, Dec 11, 2018 at 11:29:14AM +0100, John Paul Adrian Glaubitz wrote:
...
> I can't say anything about the syscall interface. However, what I do know
> is that the weird combination of a 32-bit userland with a 64-bit kernel
> interface is sometimes causing issues. For example, application code usually
> expects things like time_t to be 32-bit on a 32-bit system. However, this
> isn't the case for x32 which is why code fails to build.

OpenBSD and NetBSD both have 64-bit time_t on 32-bit systems and have
had for four or five years at this point.

- Kevin
Re: [PATCH] Add /proc/pid_generation
On Wed, Nov 21, 2018 at 12:14:44PM -0800, Daniel Colascione wrote:
> This change adds a per-pid-namespace 64-bit generation number,
> incremented on PID rollover, and exposes it via a new proc file
> /proc/pid_generation. By examining this file before and after /proc
> enumeration, user code can detect the potential reuse of a PID and
> restart the task enumeration process, repeating until it gets a
> coherent snapshot.

I see downthread this patch has been withdrawn, but nonetheless I'm
still curious - does this actually solve the problem? It seems to me
that a PID could be reused within a scan even if the generation number
remains the same at the beginning and end of the scan:

Say you have a very long-lived task with PID 500 allocated in
generation 0. PID allocation has since wrapped, and we are now
allocating from the start of the range again, with generation 1.

We begin a scan of /proc and read the generation (1). At this point our
task dies, and PID 500 is then reallocated to a new task. We finish our
scan; the generation is still 1, but PID 500 is now ambiguous.

Am I wrong?

- Kevin
Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
On Mon, May 07, 2018 at 04:03:25PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > so it should be allocated with kzalloc() to ensure all structure padding
> > is zeroed.
> >
> > Signed-off-by: Kevin Easton <ke...@guarana.org>
> > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
> > ---
> >  drivers/vhost/vhost.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index f3bd8e9..1b84dcff 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
> >  /* Create a new message. */
> >  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
> >  {
> > -	struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
> > +	struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
> >  	if (!node)
> >  		return NULL;
> >  	node->vq = vq;
>
> Let's just init the msg though.
>
> OK it seems this is the best we can do for now,
> we need a new feature bit to fix it for 32 bit
> userspace on 64 bit kernels.
>
> Does the following help?

Yes, the reproducer doesn't trigger the error with that patch applied.

- Kevin
Re: [PATCH] vhost: make msg padding explicit
On Wed, May 02, 2018 at 05:19:27PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 10:04:46AM -0400, David Miller wrote:
> > From: "Michael S. Tsirkin"
> > Date: Wed, 2 May 2018 16:36:37 +0300
> >
> > > Ouch. True - and in particular the 32 bit ABI on 64 bit kernels doesn't
> > > work at all. Hmm. It's relatively new and maybe there aren't any 32 bit
> > > users yet. Thoughts?
> >
> > If it's been in a released kernel version, we really aren't at liberty
> > to play "maybe nobody uses this" UAPI changing games.
> >
> > Please send me a revert.
>
> Sent. Looking at compat mess now.

Just naming the padding field isn't sufficient anyway, there also needs
to be code in vhost_new_msg() to initialise it using the name.

- Kevin
Re: [PATCH] vhost: make msg padding explicit
On Tue, May 01, 2018 at 02:05:51PM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <m...@redhat.com>
> Date: Tue, 1 May 2018 20:19:19 +0300
>
> > On Tue, May 01, 2018 at 11:28:22AM -0400, David Miller wrote:
> >> From: "Michael S. Tsirkin" <m...@redhat.com>
> >> Date: Fri, 27 Apr 2018 19:02:05 +0300
> >>
> >> > There's a 32 bit hole just after type. It's best to
> >> > give it a name, this way compiler is forced to initialize
> >> > it with rest of the structure.
> >> >
> >> > Reported-by: Kevin Easton <ke...@guarana.org>
> >> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >>
> >> Michael, will you be sending this directly to Linus or would you like
> >> me to apply it to net or net-next?
> >>
> >> Thanks.
> >
> > I'd prefer you to apply it for net and cc stable if possible.
>
> Ok, applied, and added to my -stable submission queue.

Hold on, this patch changes the layout for i386 (where there is no
padding at all). And it's part of UAPI.

- Kevin
Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
On Fri, Apr 27, 2018 at 09:07:56PM -0400, Kevin Easton wrote:
> On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > > so it should be allocated with kzalloc() to ensure all structure padding
> > > is zeroed.
> > >
> > > Signed-off-by: Kevin Easton <ke...@guarana.org>
> > > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
> >
> > Does it help if a patch naming the padding is applied,
> > and then we init just the relevant field?
> > Just curious.
>
> No, I don't believe that is sufficient to fix the problem.

Scratch that - somehow I missed the "..and then we init just the
relevant field" part, sorry.

There's still the padding after the vhost_iotlb_msg to consider. It's
named in the union, but I don't think that's guaranteed to be
initialised when the iotlb member of the union is used to initialise
things.

> I didn't name the padding in my original patch because I wasn't sure
> if the padding actually exists on 32 bit architectures?

This might still be a concern, too?

At the end of the day, zeroing 96 bytes (the full size of
vhost_msg_node) is pretty quick.

- Kevin
Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > so it should be allocated with kzalloc() to ensure all structure padding
> > is zeroed.
> >
> > Signed-off-by: Kevin Easton <ke...@guarana.org>
> > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
>
> Does it help if a patch naming the padding is applied,
> and then we init just the relevant field?
> Just curious.

No, I don't believe that is sufficient to fix the problem.

The structure is allocated by kmalloc(), then individual fields are
initialised. The named padding would be forced to be initialised if the
structure were initialised with a struct initialiser, but that's not the
case here. The compiler is free to leave padding0 with whatever junk
kmalloc() left there.

Having said that, naming the padding *does* help - technically, the
compiler is allowed to put whatever it likes in the padding every time
you modify the struct. It really needs both.

I didn't name the padding in my original patch because I wasn't sure
if the padding actually exists on 32 bit architectures?

- Kevin
[PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
The struct vhost_msg within struct vhost_msg_node is copied to userspace,
so it should be allocated with kzalloc() to ensure all structure padding
is zeroed.

Signed-off-by: Kevin Easton <ke...@guarana.org>
Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f3bd8e9..1b84dcff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
 /* Create a new message. */
 struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
 {
-	struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
+	struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
 	if (!node)
 		return NULL;
 	node->vq = vq;
-- 
2.8.1
Re: KASAN: slab-out-of-bounds Read in pfkey_add
On Mon, Apr 09, 2018 at 01:56:36AM -0400, Kevin Easton wrote:
> On Sun, Apr 08, 2018 at 09:04:33PM -0700, Eric Biggers wrote:
> ...
> >
> > Looks like this is going to be fixed by
> > https://patchwork.kernel.org/patch/10327883/ ("af_key: Always verify length of
> > provided sadb_key"), but it's not applied yet to the ipsec tree yet. Kevin, for
> > future reference, for syzbot bugs it would be helpful to reply to the original
> > bug report and say that a patch was sent out, or even better send the patch as a
> > reply to the bug report email, e.g.
> >
> > git format-patch --in-reply-to="<001a114292fadd3e250560706...@google.com>"
> >
> > for this one (and the Message ID can be found in the syzkaller-bugs archive even
> > if the email isn't in your inbox).
>
> Sure, I can do that.

I recalled one reason I _didn't_ do this - the message ID is retrievable
from the archived email, but because the archive is Google Groups the
message recipients aren't (only masked).

- Kevin
Re: [PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
On Mon, Apr 09, 2018 at 12:34:42PM +0200, Steffen Klassert wrote: > On Sat, Apr 07, 2018 at 11:40:47AM -0400, Kevin Easton wrote: > > Several places use (x + 7) / 8 to convert from a number of bits to a number > > of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency > > with other parts of the same file. > > > > Signed-off-by: Kevin Easton <ke...@guarana.org> > > Please resubmit this one to ipsec-next after the > merge window. Thanks! Will do! - Kevin
Re: KASAN: slab-out-of-bounds Read in pfkey_add
On Sun, Apr 08, 2018 at 09:04:33PM -0700, Eric Biggers wrote: ... > > Looks like this is going to be fixed by > https://patchwork.kernel.org/patch/10327883/ ("af_key: Always verify length of > provided sadb_key"), but it's not applied yet to the ipsec tree yet. Kevin, > for > future reference, for syzbot bugs it would be helpful to reply to the original > bug report and say that a patch was sent out, or even better send the patch > as a > reply to the bug report email, e.g. > > git format-patch > --in-reply-to="<001a114292fadd3e250560706...@google.com>" > > for this one (and the Message ID can be found in the syzkaller-bugs archive > even > if the email isn't in your inbox). Sure, I can do that. - Kevin
[PATCH v2 1/2] af_key: Always verify length of provided sadb_key
Key extensions (struct sadb_key) include a user-specified number of key bits. The kernel uses that number to determine how much key data to copy out of the message in pfkey_msg2xfrm_state(). The length of the sadb_key message must be verified to be long enough, even in the case of SADB_X_AALG_NULL. Furthermore, the sadb_key_len value must be long enough to include both the key data and the struct sadb_key itself. Introduce a helper function verify_key_len(), and call it from parse_exthdrs() where other exthdr types are similarly checked for correctness. Signed-off-by: Kevin Easton <ke...@guarana.org> Reported-by: syzbot+5022a34ca5a3d49b84223653fab632dfb7b4c...@syzkaller.appspotmail.com --- net/key/af_key.c | 45 +++-- 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 7e2e718..e62e52e 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -437,6 +437,24 @@ static int verify_address_len(const void *p) return 0; } +static inline int sadb_key_len(const struct sadb_key *key) +{ + int key_bytes = DIV_ROUND_UP(key->sadb_key_bits, 8); + + return DIV_ROUND_UP(sizeof(struct sadb_key) + key_bytes, + sizeof(uint64_t)); +} + +static int verify_key_len(const void *p) +{ + const struct sadb_key *key = p; + + if (sadb_key_len(key) > key->sadb_key_len) + return -EINVAL; + + return 0; +} + static inline int pfkey_sec_ctx_len(const struct sadb_x_sec_ctx *sec_ctx) { return DIV_ROUND_UP(sizeof(struct sadb_x_sec_ctx) + @@ -533,16 +551,25 @@ static int parse_exthdrs(struct sk_buff *skb, const struct sadb_msg *hdr, void * return -EINVAL; if (ext_hdrs[ext_type-1] != NULL) return -EINVAL; - if (ext_type == SADB_EXT_ADDRESS_SRC || - ext_type == SADB_EXT_ADDRESS_DST || - ext_type == SADB_EXT_ADDRESS_PROXY || - ext_type == SADB_X_EXT_NAT_T_OA) { + switch (ext_type) { + case SADB_EXT_ADDRESS_SRC: + case SADB_EXT_ADDRESS_DST: + case SADB_EXT_ADDRESS_PROXY: + case SADB_X_EXT_NAT_T_OA: if (verify_address_len(p)) return -EINVAL; - } - if 
(ext_type == SADB_X_EXT_SEC_CTX) { + break; + case SADB_X_EXT_SEC_CTX: if (verify_sec_ctx_len(p)) return -EINVAL; + break; + case SADB_EXT_KEY_AUTH: + case SADB_EXT_KEY_ENCRYPT: + if (verify_key_len(p)) + return -EINVAL; + break; + default: + break; } ext_hdrs[ext_type-1] = (void *) p; } @@ -1104,14 +1131,12 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, key = ext_hdrs[SADB_EXT_KEY_AUTH - 1]; if (key != NULL && sa->sadb_sa_auth != SADB_X_AALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); key = ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key != NULL && sa->sadb_sa_encrypt != SADB_EALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); x = xfrm_state_alloc(net); -- 2.8.1
[PATCH v2 0/2] af_key: Fix for sadb_key memcpy read overrun
As found by syzbot, af_key does not properly validate the key length in sadb_key messages from userspace. This can result in copying from beyond the end of the sadb_key part of the message, or indeed beyond the end of the entire packet. Both these patches apply cleanly to ipsec-next. Based on Steffen's feedback I have re-ordered them so that patch 1 contains only the fix, which I would suggest is also a stable tree candidate, while patch 2 is purely a cleanup. Kevin Easton (2): af_key: Always verify length of provided sadb_key af_key: Use DIV_ROUND_UP() instead of open-coded equivalent net/key/af_key.c | 58 1 file changed, 42 insertions(+), 16 deletions(-) -- 2.8.1
[PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
Several places use (x + 7) / 8 to convert from a number of bits to a number of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency with other parts of the same file. Signed-off-by: Kevin Easton <ke...@guarana.org> --- net/key/af_key.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index e62e52e..f3ebb84 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -822,12 +822,12 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, if (add_keys) { if (x->aalg && x->aalg->alg_key_len) { auth_key_size = - PFKEY_ALIGN8((x->aalg->alg_key_len + 7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->aalg->alg_key_len, 8)); size += sizeof(struct sadb_key) + auth_key_size; } if (x->ealg && x->ealg->alg_key_len) { encrypt_key_size = - PFKEY_ALIGN8((x->ealg->alg_key_len+7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->ealg->alg_key_len, 8)); size += sizeof(struct sadb_key) + encrypt_key_size; } } @@ -987,7 +987,8 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_exttype = SADB_EXT_KEY_AUTH; key->sadb_key_bits = x->aalg->alg_key_len; key->sadb_key_reserved = 0; - memcpy(key + 1, x->aalg->alg_key, (x->aalg->alg_key_len+7)/8); + memcpy(key + 1, x->aalg->alg_key, + DIV_ROUND_UP(x->aalg->alg_key_len, 8)); } /* encrypt key */ if (add_keys && encrypt_key_size) { @@ -998,7 +999,7 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_bits = x->ealg->alg_key_len; key->sadb_key_reserved = 0; memcpy(key + 1, x->ealg->alg_key, - (x->ealg->alg_key_len+7)/8); + DIV_ROUND_UP(x->ealg->alg_key_len, 8)); } /* sa */ @@ -1193,7 +1194,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, goto out; } if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->aalg = kmalloc(sizeof(*x->aalg) + keysize, GFP_KERNEL); if (!x->aalg) { err = -ENOMEM; @@ -1232,7 +1233,7 @@ static struct xfrm_state 
* pfkey_msg2xfrm_state(struct net *net, } key = (struct sadb_key*) ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->ealg = kmalloc(sizeof(*x->ealg) + keysize, GFP_KERNEL); if (!x->ealg) { err = -ENOMEM; -- 2.8.1
Re: [PATCH 1/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
On Wed, Mar 28, 2018 at 07:59:25AM +0200, Steffen Klassert wrote: > On Mon, Mar 26, 2018 at 07:39:16AM -0400, Kevin Easton wrote: > > Several places use (x + 7) / 8 to convert from a number of bits to a number > > of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency > > with other parts of the same file. > > > > Signed-off-by: Kevin Easton <ke...@guarana.org> > > Is this a fix or just a cleanup? > If it is just a cleanup, please resent it based on ipsec-next. Hi Steffen, This patch (1/2) is just a cleanup, but it's in preparation for patch 2/2 which is a fix and modifies some of the same lines. - Kevin
[PATCH 1/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
Several places use (x + 7) / 8 to convert from a number of bits to a number of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency with other parts of the same file. Signed-off-by: Kevin Easton <ke...@guarana.org> --- net/key/af_key.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 7e2e718..911b68d 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -795,12 +795,12 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, if (add_keys) { if (x->aalg && x->aalg->alg_key_len) { auth_key_size = - PFKEY_ALIGN8((x->aalg->alg_key_len + 7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->aalg->alg_key_len, 8)); size += sizeof(struct sadb_key) + auth_key_size; } if (x->ealg && x->ealg->alg_key_len) { encrypt_key_size = - PFKEY_ALIGN8((x->ealg->alg_key_len+7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->ealg->alg_key_len, 8)); size += sizeof(struct sadb_key) + encrypt_key_size; } } @@ -960,7 +960,8 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_exttype = SADB_EXT_KEY_AUTH; key->sadb_key_bits = x->aalg->alg_key_len; key->sadb_key_reserved = 0; - memcpy(key + 1, x->aalg->alg_key, (x->aalg->alg_key_len+7)/8); + memcpy(key + 1, x->aalg->alg_key, + DIV_ROUND_UP(x->aalg->alg_key_len, 8)); } /* encrypt key */ if (add_keys && encrypt_key_size) { @@ -971,7 +972,7 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_bits = x->ealg->alg_key_len; key->sadb_key_reserved = 0; memcpy(key + 1, x->ealg->alg_key, - (x->ealg->alg_key_len+7)/8); + DIV_ROUND_UP(x->ealg->alg_key_len, 8)); } /* sa */ @@ -1104,14 +1105,14 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, key = ext_hdrs[SADB_EXT_KEY_AUTH - 1]; if (key != NULL && sa->sadb_sa_auth != SADB_X_AALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + (DIV_ROUND_UP(key->sadb_key_bits, 8) 
== 0 || +DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) return ERR_PTR(-EINVAL); key = ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key != NULL && sa->sadb_sa_encrypt != SADB_EALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + (DIV_ROUND_UP(key->sadb_key_bits, 8) == 0 || +DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) return ERR_PTR(-EINVAL); x = xfrm_state_alloc(net); @@ -1168,7 +1169,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, goto out; } if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->aalg = kmalloc(sizeof(*x->aalg) + keysize, GFP_KERNEL); if (!x->aalg) { err = -ENOMEM; @@ -1207,7 +1208,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, } key = (struct sadb_key*) ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->ealg = kmalloc(sizeof(*x->ealg) + keysize, GFP_KERNEL); if (!x->ealg) { err = -ENOMEM; -- 2.8.1
[PATCH 2/2] af_key: Always verify length of provided sadb_key
Key extensions (struct sadb_key) include a user-specified number of key bits. The kernel uses that number to determine how much key data to copy out of the message in pfkey_msg2xfrm_state(). The length of the sadb_key message must be verified to be long enough, even in the case of SADB_X_AALG_NULL. Furthermore, the sadb_key_len value must be long enough to include both the key data and the struct sadb_key itself. Introduce a helper function verify_key_len(), and call it from parse_exthdrs() where other exthdr types are similarly checked for correctness. Signed-off-by: Kevin Easton <ke...@guarana.org> Reported-by: syzbot+5022a34ca5a3d49b84223653fab632dfb7b4c...@syzkaller.appspotmail.com --- net/key/af_key.c | 45 +++-- 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 911b68d..f3ebb84 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -437,6 +437,24 @@ static int verify_address_len(const void *p) return 0; } +static inline int sadb_key_len(const struct sadb_key *key) +{ + int key_bytes = DIV_ROUND_UP(key->sadb_key_bits, 8); + + return DIV_ROUND_UP(sizeof(struct sadb_key) + key_bytes, + sizeof(uint64_t)); +} + +static int verify_key_len(const void *p) +{ + const struct sadb_key *key = p; + + if (sadb_key_len(key) > key->sadb_key_len) + return -EINVAL; + + return 0; +} + static inline int pfkey_sec_ctx_len(const struct sadb_x_sec_ctx *sec_ctx) { return DIV_ROUND_UP(sizeof(struct sadb_x_sec_ctx) + @@ -533,16 +551,25 @@ static int parse_exthdrs(struct sk_buff *skb, const struct sadb_msg *hdr, void * return -EINVAL; if (ext_hdrs[ext_type-1] != NULL) return -EINVAL; - if (ext_type == SADB_EXT_ADDRESS_SRC || - ext_type == SADB_EXT_ADDRESS_DST || - ext_type == SADB_EXT_ADDRESS_PROXY || - ext_type == SADB_X_EXT_NAT_T_OA) { + switch (ext_type) { + case SADB_EXT_ADDRESS_SRC: + case SADB_EXT_ADDRESS_DST: + case SADB_EXT_ADDRESS_PROXY: + case SADB_X_EXT_NAT_T_OA: if (verify_address_len(p)) return -EINVAL; - } - if 
(ext_type == SADB_X_EXT_SEC_CTX) { + break; + case SADB_X_EXT_SEC_CTX: if (verify_sec_ctx_len(p)) return -EINVAL; + break; + case SADB_EXT_KEY_AUTH: + case SADB_EXT_KEY_ENCRYPT: + if (verify_key_len(p)) + return -EINVAL; + break; + default: + break; } ext_hdrs[ext_type-1] = (void *) p; } @@ -1105,14 +1132,12 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, key = ext_hdrs[SADB_EXT_KEY_AUTH - 1]; if (key != NULL && sa->sadb_sa_auth != SADB_X_AALG_NULL && - (DIV_ROUND_UP(key->sadb_key_bits, 8) == 0 || -DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); key = ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key != NULL && sa->sadb_sa_encrypt != SADB_EALG_NULL && - (DIV_ROUND_UP(key->sadb_key_bits, 8) == 0 || -DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); x = xfrm_state_alloc(net); -- 2.8.1
[PATCH 0/2] af_key: Fix for sadb_key memcpy read overrun
As found by syzbot, af_key does not properly validate the key length in sadb_key messages from userspace. This can result in copying from beyond the end of the sadb_key part of the message, or indeed beyond the end of the entire packet. Kevin Easton (2): af_key: Use DIV_ROUND_UP() instead of open-coded equivalent af_key: Always verify length of provided sadb_key net/key/af_key.c | 58 1 file changed, 42 insertions(+), 16 deletions(-) -- 2.8.1
Re: [RFC PATCH 2/6] fs: provide a generic compat_sys_truncate64() implementation
On Sun, Mar 18, 2018 at 05:10:52PM +0100, Dominik Brodowski wrote: > The compat_sys_truncate64() implementations in mips, powerpc, s390, sparc > and x86 only differed based on whether the u64 parameter needed padding > and on its endianness. > ... > +#ifdef __ARCH_WANT_COMPAT_SYS_TRUNCATE64 > +#if defined(__ARCH_WANT_COMPAT_SYS_WITH_PADDING) && \ > + defined(__ARCH_WANT_LE_COMPAT_SYS) > +COMPAT_SYSCALL_DEFINE4(truncate64, const char __user *, filename, u32 > padding, > +unsigned int, offset_low, unsigned int, offset_high) > +#elif defined(__ARCH_WANT_COMPAT_SYS_WITH_PADDING) && \ > + !defined(__ARCH_WANT_LE_COMPAT_SYS) > +COMPAT_SYSCALL_DEFINE4(truncate64, const char __user *, filename, u32 > padding, > +unsigned int, offset_high, unsigned int, offset_low) Notwithstanding the other comments, shouldn't there be a comma between 'u32' and 'padding' in those? - Kevin
Re: [RFC PATCH 4/7] kconfig: support new special property shell=
On Sat, Feb 10, 2018 at 10:05:12AM -0800, Randy Dunlap wrote: > On 02/10/2018 12:55 AM, Ulf Magnusson wrote: > > How many compilers don't support -fno-stack-protector by the way? > > > > config CC_HAS_STACKPROTECTOR_STRONG > > bool > > option shell="$CC -Werror -fstack-protector-strong -c -x c > > /dev/null" > > > > config CC_HAS_STACKPROTECTOR_REGULAR > > bool > > option shell="$CC -Werror -fstack-protector -c -x c /dev/null" > > > > config CC_HAS_STACKPROTECTOR_NONE > > bool > > default y > > option shell="$CC -Werror -fno-stack-protector -c -x c > > /dev/null" > > I ran: > gcc -Werror -fno-stack-protector -c -x c /dev/null > > It worked (gcc (SUSE Linux) 4.8.5) but it did leave a null.o file for me. > Might need to add that to 'make clean' or just rm it immediately. gcc -Werror -fno-stack-protector -c -x c /dev/null -o /dev/null seems to DTRT without leaving anything behind. - Kevin
Re: [PATCH 30/35] x86/speculation: Use Indirect Branch Prediction Barrier in context switch
On Thu, Jan 18, 2018 at 04:38:32PM -0800, Tim Chen wrote: > On 01/18/2018 05:48 AM, Peter Zijlstra wrote: > > > >+/* > >+ * Avoid user/user BTB poisoning by flushing the branch > >predictor > >+ * when switching between processes. This stops one process from > >+ * doing spectre-v2 attacks on another process's data. > >+ */ > >+indirect_branch_prediction_barrier(); > >+ > > Some optimizations can be done here to avoid overhead in barrier call. > > For example, don't do the barrier if prev and next mm are > the same. If the two process trust each other, or the new process > already have rights to look into the previous process, > the barrier could be skipped. Isn't it the other way around with the BTB poisoning? previous is potentially attacking next, so the barrier can be avoided only if previous is allowed to ptrace next? - Kevin
Re: [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
> From: Dave Hansen > > The "SYSENTER" stack is used for a lot more than SYSENTER now. > Give it a better string to display in stack dumps. > > We should probably cleanse the 64-bit code of the remaining > "SYSENTER" nomenclature too at some point. > > Signed-off-by: Dave Hansen > --- > > b/arch//x86/kernel/dumpstack_64.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff -puN arch//x86/kernel/dumpstack_64.c~SYSENTER-rename > arch//x86/kernel/dumpstack_64.c > --- a/arch//x86/kernel/dumpstack_64.c~SYSENTER-rename 2017-12-01 > 12:43:16.768707737 -0800 > +++ b/arch//x86/kernel/dumpstack_64.c 2017-12-01 13:19:21.741702337 -0800 > @@ -37,8 +37,14 @@ const char *stack_type_name(enum stack_t > if (type == STACK_TYPE_IRQ) > return "IRQ"; > > - if (type == STACK_TYPE_SYSENTER) > - return "SYSENTER"; > + if (type == STACK_TYPE_SYSENTER) { > + /* > + * On 64-bit, we have a generic entry stack that we > + * use for all the kernel try points, including > + * SYSENTER. ITYM "kernel entry points". - Kevin
Re: [GIT pull] printk updates for 4.15
On Sat, Nov 18, 2017 at 01:26:07AM +0100, Thomas Gleixner wrote: > On Thu, 16 Nov 2017, Thomas Gleixner wrote: > > I hope that I can find a few spare cycles to whip up a POC patch which does > > not make stuff fall apart immediately. > > Here you go. It survived suspend resume in a VM. > > Thanks, > > tglx > > 8< > Subject: timekeeping: Make monotonic behave like boottime > From: Thomas Gleixner > Date: Fri, 17 Nov 2017 11:46:48 +0100 > > Clock MONOTONIC is not fast forwarded by the time spent in suspend on > resume. This is only done for BOOTTIME. > > It would be desired to get rid of that difference, but the difference > between clock MONOTONIC and clock BOOTTIME is well documented so there > might be applications which depend on that behaviour. As a comment from the userspace peanut gallery, I personally hope this does pass muster. The existing POSIX wording implies that MONOTONIC doesn't stop counting over suspend, and that's what you need when you want to know the time elapsed between two external events. - Kevin
Re: [PATCH 05/14] isdn: isdnloop: suppress a gcc-7 warning
On Fri, Jul 14, 2017 at 12:37:05PM +0200, Arnd Bergmann wrote: > On Fri, Jul 14, 2017 at 12:08 PM, Joe Perches wrote: > > On Fri, 2017-07-14 at 11:25 +0200, Arnd Bergmann wrote: > >> We test whether a bit is set in a mask here, which is correct > >> but gcc warns about it as it thinks it might be confusing: > >> > >> drivers/isdn/isdnloop/isdnloop.c:412:37: error: ?: using integer constants > >> in boolean context, the expression will always evaluate to 'true' > >> [-Werror=int-in-bool-context] ... > > Perhaps this is a logic defect and should be: > > > > if (!(card->flags & ((channel) ? ISDNLOOP_FLAGS_B2ACTIVE : > > ISDNLOOP_FLAGS_B1ACTIVE))) > > Yes, good catch. I had thought about it for a bit whether that would be > the answer, but come to the wrong conclusion on my own. > > Note that the version you suggested will still have the warning, so I think > it needs to be It shouldn't - the warning is for using an integer *constant* in boolean context, but the result of & isn't a constant and should be fine. !(flags & mask) is a very common idiom. - Kevin
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 04:23:56PM +0200, Michal Hocko wrote: > On Wed 05-07-17 13:19:40, Ben Hutchings wrote: > > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote: > > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings > > > wrote: > > > > > > > > We have: > > > > > > > > bottom = 0xff803fff > > > > sp = 0xb178 > > > > > > > > The relevant mappings are: > > > > > > > > ff7fc000-ff7fd000 rwxp 00:00 0 > > > > fffdd000-e000 rw-p 00:00 > > > > 0 [stack] > > > > > > Ugh. So that stack is actually 8MB in size, but the alloca() is about > > > to use up almost all of it, and there's only about 28kB left between > > > "bottom" and that 'rwx' mapping. > > > > > > Still, that rwx mapping is interesting: it is a single page, and it > > > really is almost exactly 8MB below the stack. > > > > > > In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from > > > the top of that odd one-page allocation (0xff7fd000). > > > > > > Can you find out where that is allocated? Perhaps a breakpoint on > > > mmap, with a condition to catch that particular one? > > [...] > > > > Found it, and it's now clear why only i386 is affected: > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852 > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881 > > This is really worrying. This doesn't look like a gap at all. It is a > mapping which actually contains a code and so we should absolutely not > allow to scribble over it. So I am afraid the only way forward is to > allow per process stack gap and run this particular program to have a > smaller gap. We basically have two ways. Either /proc//$file or > a prctl inherited on exec. The later is a smaller code. What do you > think? On the plus side, the code in that page (a single RET) is only executed once when the workaround function is called. Notice that 'codebuf' is never even returned out of that function. 
The only reason they even leave that page mapped is to stop the exec shield limit from being lowered on them again. - Kevin
Re: [PATCH v3 04/13] crypto/rng: ensure that the RNG is ready before using
On Tue, Jun 06, 2017 at 05:56:20AM +0200, Jason A. Donenfeld wrote: > Hey Ted, > > On Tue, Jun 6, 2017 at 5:00 AM, Theodore Ts'o wrote: > > Note that crypto_rng_reset() is called by big_key_init() in > > security/keys/big_key.c as a late_initcall(). So if we are on a > > system where the crng doesn't get initialized until during the system > > boot scripts, and big_key is compiled directly into the kernel, the > > boot could end up deadlocking. > > > > There may be other instances of where crypto_rng_reset() is called by > > an initcall, so big_key_init() may not be an exhaustive enumeration of > > potential problems. But this is an example of why the synchronous > > API, although definitely much more convenient, can end up being a trap > > for the unwary > > Thanks for pointing this out. I'll look more closely into it and see > if I can figure out a good way of approaching this. Would it work for wait_for_random_bytes() to include a WARN_ON(system_state < SYSTEM_RUNNING); to catch those kinds of cases? - Kevin
Re: [PATCHv1, RFC 0/8] Boot-time switching between 4- and 5-level paging
On Thu, May 25, 2017 at 05:40:16PM -0700, Andy Lutomirski wrote: > On Thu, May 25, 2017 at 4:24 PM, Linus Torvalds > > wrote: > > On Thu, May 25, 2017 at 1:33 PM, Kirill A. Shutemov > > wrote: > >> Here' my first attempt to bring boot-time between 4- and 5-level paging. > >> It looks not too terrible to me. I've expected it to be worse. > > > > If I read this right, you just made it a global on/off thing. > > > > May I suggest possibly a different model entirely? Can you make it a > > per-mm flag instead? > > > > And then we > > > > (a) make all kthreads use the 4-level page tables > > > > (b) which means that all the init code uses the 4-level page tables > > > > (c) which means that all those checks for "start_secondary" etc can > > just go away, because those all run with 4-level page tables. > > > > Or is it just much too expensive to switch between 4-level and 5-level > > paging at run-time? > > > > Even ignoring expensiveness, I'm not convinced it's practical. AFAICT > you can't atomically switch the paging mode and CR3, so either you > need some magic page table with trampoline that works in both modes > (which is presumably doable with some trickery) or you need to flip > paging off. Good luck if an NMI hits in the mean time. There was > code like that once upon a time for EFI mixed mode, but it got deleted > due to triple-faults. According to Intel's documentation you pretty much have to disable paging anyway: "The processor allows software to modify CR4.LA57 only outside of IA-32e mode. In IA-32e mode, an attempt to modify CR4.LA57 using the MOV CR instruction causes a general-protection exception (#GP)." (If it weren't for that, maybe you could point the last entry in the PML4 at the PML4 itself, so it also works as a PML5 for accessing kernel addresses? And of course make sure nothing gets loaded above 0xff80). - Kevin
Re: [PATCH] kernel/fork.c: use sizeof() instead of sizeof
On Tue, Feb 02, 2016 at 05:04:06PM +, Al Viro wrote: > FWIW, the actual rules are > unary-expression: postfix-expression | > ++ unary-expression | > -- unary-expression | > - cast-expression | > + cast-expression | > ! cast-expression | > ~ cast-expression | > * cast-expression | > & cast-expression | > sizeof unary-expression | > sizeof ( type-name ) > cast-expression: unary-expression | >( type-name ) cast-expression > Note that while e.g. > ++ ++ n > is allowed by grammar, it runs afoul of the constraint for ++ argument, which > must be a modifiable lvalue. None of the operators above yield that, so > the rules for ++ and -- might as well have been ++ postfix-expression and > -- postfix-expression resp. Unless I'm mistaken, * cast-expression yields an lvalue. - Kevin
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: > On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: > > > Let me re-ask the question that I asked last week (and was apparently > > > ignored). Why not trying to use the lazytime feature instead of > > > pointing a head straight at the application's --- and system > > > administrators' --- heads? > > > > Sorry Ted, I thought I responded already. > > > > The goal is to avoid inode writeout entirely when we can, and > > as I understand it lazytime will still force writeout before the inode > > is dropped from the cache. In systems like Ceph in particular, the > > IOs can be spread across lots of files, so simply deferring writeout > > doesn't always help. > > Sure, but it would reduce the writeout by orders of magnitude. I can > understand if you want to reduce it further, but it might be good > enough for your purposes. > > I considered doing the equivalent of O_NOMTIME for our purposes at > $WORK, and our use case is actually not that different from Ceph's > (i.e., using a local disk file system to support a cluster file > system), and lazytime was (a) something I figured was something I > could upstream in good conscience, and (b) was more than good enough > for us. A safer alternative might be a chattr file attribute that if set, the mtime is not updated on writes, and stat() on the file always shows the mtime as "right now". At least that way, the file won't accidentally get left out of backups that rely on the mtime. (If the file attribute is unset, you immediately update the mtime then too, and from then on the file is back to normal). - Kevin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
On Thu, May 07, 2015 at 08:18:43PM +0100, One Thousand Gnomes wrote: > > We could keep heap metadata as R/O and only make it R/W inside of > > malloc() itself to catch corruption more quickly. > > If you implement multiple malloc pools you can chop up lots of stuff. > > In library land it isn't just stuff like malloc, you can use it as > a debug weapon to protect library private data from naughty application > code. How could a library (or debugger, for that matter) arbitrate ownership of the protection domains with the application? One interesting use for it might be to provide an interface to allocate memory and associate it with a lock that's supposed to be held while accessing that memory. The allocation function hashes the lock address down to one of the 15 non-zero protection domains and applies that key to the memory, the lock function then adds RW access to the appropriate protection domain and the unlock function removes it. - Kevin
Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s
On Tue, Dec 02, 2014 at 03:35:17PM +1100, Alex Dubov wrote: > Unfortunately, using facilities like Unix domain sockets to merely pass file > descriptors between "worker" processes is unnecessarily difficult, due to > the following common consideration: > > 1. Domain sockets and named pipes are persistent objects. Applications must > manage their lifetime and devise unambiguous access schemes in case multiple > application instances are to be run within the same OS instance. Usually, they > would also require a writable file system to be mounted. I believe this particular issue has long been addressed in Linux, with the "abstract namespace" domain sockets. These aren't persistent - they go away when the bound socket is closed - and they don't need a writable filesystem. If you derived the name in the abstract namespace from your PID (or better, application identifier and PID) then you would have exactly the same "unambiguous access" scheme as your proposal. > int sendfd(pid_t pid, int sig, int fd) PIDs tend to be regarded as a bit of an iffy way to refer to another process, because they tend to be racy. If the process you think you're talking to dies, and has its PID reused by another unrelated sendfd()-aware process, you've just sent your open file to somewhere unexpected. You can avoid that if the process is a child of yours, but in that case you could have set up a no-fuss domain socket connection with socketpair() too. - Kevin
Re: [PATCH v8 08/10] x86, mpx: add prctl commands PR_MPX_REGISTER, PR_MPX_UNREGISTER
On Wed, Sep 17, 2014 at 09:43:09PM -0700, Dave Hansen wrote: > On 09/17/2014 08:23 PM, Kevin Easton wrote: > > I was actually thinking that the kernel would take care of the xsave / > > xrstor (for current), updating tsk->thread.fpu.state (for non-running > > threads) and sending an IPI for threads running on other CPUs. > > > > Of course userspace can always then manually change the bounds directory > > address itself, but then it's quite clear that they're doing something > > unsupported. Just an idea, anyway. > > What's the benefit of that? > > As it stands now, MPX is likely to be enabled well before any threads > are created, and the MPX enabling state will be inherited by the new > thread at clone() time. The current mechanism allows a thread to > individually enable or disable MPX independently of the other threads. > > I think it makes it both more complicated and less flexible. I was assuming that if an application did want to enable MPX after threads had already been created, it would generally want to enable it simultaneously across all threads. This would be a lot easier for the kernel than for the application. - Kevin
Re: [PATCH v8 08/10] x86, mpx: add prctl commands PR_MPX_REGISTER, PR_MPX_UNREGISTER
On Thu, Sep 18, 2014 at 12:40:29AM +0000, Ren, Qiaowei wrote: > > Would it be prudent to use an error code other than EINVAL for the > > "hardware doesn't support it" case? > > > Seems like no specific error code for this case. ENXIO would probably be OK. It's not too important as long as it's documented. > > >> @@ -2011,6 +2017,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned > > long, arg2, unsigned long, arg3, > >>me->mm->def_flags &= ~VM_NOHUGEPAGE; > >>up_write(&me->mm->mmap_sem); > >>break; > >> + case PR_MPX_REGISTER: > >> + error = MPX_REGISTER(me); > >> + break; > >> + case PR_MPX_UNREGISTER: > >> + error = MPX_UNREGISTER(me); > >> + break; > > > > If you pass me->mm from prctl, that makes it clear that it's > > per-process not per-thread, just like PR_SET_DUMPABLE / PR_GET_DUMPABLE. > > > > This code should also enforce nulls in arg2 / arg3 / arg4 / arg5 if > > it's not using them, otherwise you'll be sunk if you ever want to use them > > later. > > > > It seems like it only makes sense for all threads using the mm to have > > the same bounds directory set. If the interface was changed to > > directly pass the address, then could the kernel take care of setting > > it for *all* of the threads in the process? This seems like something > > that would be easier for the kernel to do than userspace. > > > If the interface was changed to this, it will be possible for insane > application to pass error bounds directory address to kernel. We still > have to call fpu_xsave() to check this. I was actually thinking that the kernel would take care of the xsave / xrstor (for current), updating tsk->thread.fpu.state (for non-running threads) and sending an IPI for threads running on other CPUs. Of course userspace can always then manually change the bounds directory address itself, but then it's quite clear that they're doing something unsupported. Just an idea, anyway.
- Kevin
Re: [PATCH] x86/vdso: Add prctl to set per-process VDSO load
On Tue, Sep 16, 2014 at 05:05:51PM -0700, Richard Larocque wrote: > + case PR_SET_VDSO: > + if (arg2 == PR_VDSO_ENABLE) > + me->signal->disable_vdso = 0; > + else if (arg2 == PR_VDSO_DISABLE) > + me->signal->disable_vdso = 1; > + else > + return -EINVAL; > + break; > + case PR_GET_VDSO: > + if (!me->signal->disable_vdso) > + error = put_user(PR_VDSO_ENABLE, (int __user *)arg2); > + else > + error = put_user(PR_VDSO_DISABLE, (int __user *)arg2); > + break; Perhaps both of those should do if (arg3 || arg4 || arg5) return -EINVAL; - Kevin
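The unused-argument check suggested here is mechanical enough to factor into a helper; a sketch (the helper name is hypothetical, not anything in the kernel):

```c
#include <errno.h>

/* Reject non-zero values in prctl() argument slots the option does
 * not use, so those slots can later be given a meaning without
 * breaking old binaries that happened to pass garbage in them. */
static long prctl_reject_unused(unsigned long arg3, unsigned long arg4,
                                unsigned long arg5)
{
    if (arg3 || arg4 || arg5)
        return -EINVAL;
    return 0;
}
```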
Re: [PATCH v8 08/10] x86, mpx: add prctl commands PR_MPX_REGISTER, PR_MPX_UNREGISTER
On Thu, Sep 11, 2014 at 04:46:48PM +0800, Qiaowei Ren wrote: > +static __user void *task_get_bounds_dir(struct task_struct *tsk) > +{ > + struct xsave_struct *xsave_buf; > + > + fpu_xsave(&tsk->thread.fpu); > + xsave_buf = &(tsk->thread.fpu.state->xsave); > + if (!(xsave_buf->bndcsr.cfg_reg_u & MPX_BNDCFG_ENABLE_FLAG)) > + return NULL; > + > + return (void __user *)(unsigned long)(xsave_buf->bndcsr.cfg_reg_u & > + MPX_BNDCFG_ADDR_MASK); > +} This only makes sense if called with 'current', so is there any need for the function argument? > + > +int mpx_register(struct task_struct *tsk) > +{ > + struct mm_struct *mm = tsk->mm; > + > + if (!cpu_has_mpx) > + return -EINVAL; > + > + /* > + * runtime in the userspace will be responsible for allocation of > + * the bounds directory. Then, it will save the base of the bounds > + * directory into XSAVE/XRSTOR Save Area and enable MPX through > + * XRSTOR instruction. > + * > + * fpu_xsave() is expected to be very expensive. In order to do > + * performance optimization, here we get the base of the bounds > + * directory and then save it into mm_struct to be used in future. > + */ > + mm->bd_addr = task_get_bounds_dir(tsk); > + if (!mm->bd_addr) > + return -EINVAL; > + > + return 0; > +} > + > +int mpx_unregister(struct task_struct *tsk) > +{ > + struct mm_struct *mm = current->mm; > + > + if (!cpu_has_mpx) > + return -EINVAL; > + > + mm->bd_addr = NULL; > + return 0; > +} If that's changed, then mpx_register() and mpx_unregister() don't need a task_struct, just an mm_struct. Probably these functions should be locking mmap_sem. Would it be prudent to use an error code other than EINVAL for the "hardware doesn't support it" case?
> @@ -2011,6 +2017,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, > arg2, unsigned long, arg3, > me->mm->def_flags &= ~VM_NOHUGEPAGE; > up_write(&me->mm->mmap_sem); > break; > + case PR_MPX_REGISTER: > + error = MPX_REGISTER(me); > + break; > + case PR_MPX_UNREGISTER: > + error = MPX_UNREGISTER(me); > + break; If you pass me->mm from prctl, that makes it clear that it's per-process not per-thread, just like PR_SET_DUMPABLE / PR_GET_DUMPABLE. This code should also enforce nulls in arg2 / arg3 / arg4 / arg5 if it's not using them, otherwise you'll be sunk if you ever want to use them later. It seems like it only makes sense for all threads using the mm to have the same bounds directory set. If the interface was changed to directly pass the address, then could the kernel take care of setting it for *all* of the threads in the process? This seems like something that would be easier for the kernel to do than userspace. - Kevin
Re: [PATCH] fs: Cleanup string initializations (char[] instead of char *)
On Sat, May 17, 2014 at 05:53:45PM +0100, Al Viro wrote: > On Sat, May 17, 2014 at 05:00:18PM +0200, Manuel Schölling wrote: > > Initializations like 'char *foo = "bar"' will create two variables: a static > > string and a pointer (foo) to that static string. Instead 'char foo[] = > > "bar"' > > will declare a single variable and will end up in shorter > > assembly (according to Jeff Garzik on the KernelJanitor's TODO list). > > The hell it will. Compare assembler generated e.g. for 32bit x86 before > and after. > > > { > > char *dp; > > char *status = "disabled"; > > - const char * flags = "flags: "; > > + const char flags[] = "flags: "; > > The first variant puts address of constant array into local variable > (on stack or in a register). The second one fills local _array_ - the > string itself goes on stack. Right - if you're going to do this, you really want to make the array static as well: static const char flags[] = "flags: "; (it's very unlikely that you would want a const array that isn't static). As Al showed though, usually the optimiser will figure things out on its own, and the pointer variable in the original variant gets optimised away. There's also optimisations that apply to string literals but don't apply to arrays, like string literal merging, which mean that the original could well still be preferred. - Kevin
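The object-size difference being discussed can be seen directly with sizeof; a small standalone sketch (nothing kernel-specific is assumed):

```c
#include <stddef.h>

/* The string is part of the array object itself, so sizeof gives the
 * seven characters of "flags: " plus the terminating NUL.  Being
 * static const, it lives in read-only data and is not copied onto
 * the stack at each call. */
static const char flags[] = "flags: ";

size_t pointer_variant_size(void)
{
    const char *p = "flags: ";  /* two objects: pointer + anonymous literal */
    return sizeof p;            /* size of the pointer, not the string */
}

size_t array_variant_size(void)
{
    return sizeof flags;        /* 8: the string data itself */
}
```

A non-static local `const char flags[]` would instead be re-initialized on the stack at every function entry, which is the codegen difference Al points out.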
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > > I'm just dying to hear a "normal" use case then. :) > > So the more "normal" use case would be marking objects volatile and > then non-volatile w/o accessing them in-between. In this case the > zero-fill vs SIGBUS semantics don't really matter, its really just a > trade off in how we handle applications deviating (intentionally or > not) from this use case. > > So to maybe flesh out the context here for folks who are following > along (but weren't in the hallway at LSF :), Johannes made a fairly > interesting proposal (Johannes: Please correct me here where I'm maybe > slightly off here) to use only the dirty bits of the ptes to mark a > page as volatile. Then the kernel could reclaim these clean pages as > it needed, and when we marked the range as non-volatile, the pages > would be re-dirtied and if any of the pages were missing, we could > return a flag with the purged state. This had some different > semantics than what I've been working with for awhile (for example, > any writes to pages would implicitly clear volatility), so I wasn't > completely comfortable with it, but figured I'd think about it to see > if it could be done. Particularly since it would in some ways simplify > tmpfs/shm shared volatility that I'd eventually like to do. ... > Now, while for the case I'm personally most interested in (ashmem), > zero-fill would technically be ok, since that's what Android does. > Even so, I don't think its the best approach for the interface, since > applications may end up quite surprised by the results when they > accidentally don't follow the "don't touch volatile pages" rule. > > That point beside, I think the other problem with the page-cleaning > volatility approach is that there are other awkward side effects. For > example: Say an application marks a range as volatile. One page in the > range is then purged.
The application, due to a bug or otherwise, > reads the volatile range. This causes the page to be zero-filled in, > and the application silently uses the corrupted data (which isn't > great). More problematic though, is that by faulting the page in, > they've in effect lost the purge state for that page. When the > application then goes to mark the range as non-volatile, all pages are > present, so we'd return that no pages were purged. From an > application perspective this is pretty ugly. The write-implicitly-clears-volatile semantics would actually be an advantage for some use cases. If you have a volatile cache of many sub-page-size objects, the application can just include at the start of each page "int present, in_use;". "present" is set to non-zero before marking volatile, and when the application wants to unmark as volatile it writes to "in_use" and tests the value of "present". No need for a syscall at all, although it does take a minor fault. The syscall would be better for the case of large objects, though. Or is that fatally flawed? - Kevin
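A userspace sketch of that in-page marker idea, with the kernel's purge (zero-fill) simulated by a stub so the bookkeeping is visible - the structure layout and function names are illustrative only:

```c
#include <string.h>

/* Each cache page starts with 'present' and 'in_use'.  'present' is
 * set before the range is marked volatile.  Under the proposed
 * write-clears-volatile semantics, writing 'in_use' takes the page
 * off the volatile list, and a zero 'present' afterwards means the
 * kernel purged (zero-filled) it in the meantime.  purge_page()
 * stands in for the kernel reclaiming the page. */
struct cache_page {
    int present;
    int in_use;
    char data[4096 - 2 * sizeof(int)];
};

static void mark_volatile(struct cache_page *p)
{
    p->present = 1;            /* survives only if the page is not purged */
    p->in_use = 0;
    /* real code: issue the volatile-range call on this page here */
}

static void purge_page(struct cache_page *p)
{
    memset(p, 0, sizeof(*p));  /* kernel purge == zero-fill */
}

/* Returns 1 if the cached object is still there, 0 if it was purged. */
static int unmark_volatile(struct cache_page *p)
{
    p->in_use = 1;             /* the write clears volatility */
    return p->present;         /* zero => page was purged while volatile */
}
```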
Re: [RFC] [PATCH] Pre-emption control for userspace
On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote: > Anything else? If it was possible to make the time remaining in the current timeslice available to userspace through the vdso, the thread could do something like: if (sys_timeleft() < CRITICAL_SECTION_SIZE) yield(); lock(); to avoid running out of timeslice in the middle of the critical section. - Kevin
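Sketched out, with the proposed vdso call stubbed - sys_timeleft() does not exist, so a test double stands in for it here; only the decision logic is real:

```c
/* Stand-in for the hypothetical vdso call returning the nanoseconds
 * left in the current timeslice; a test can set this directly. */
static unsigned long long fake_timeslice_left_ns = 50000;

static unsigned long long timeslice_left_ns(void)
{
    return fake_timeslice_left_ns;
}

/* Yield first if the critical section would likely outlast the
 * remaining slice, so we don't get preempted while holding the lock. */
static int should_yield_before_lock(unsigned long long critical_ns)
{
    return timeslice_left_ns() < critical_ns;
}
```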
Re: Update of file offset on write() etc. is non-atomic with I/O
On Fri, Feb 21, 2014 at 07:01:31AM +0100, Michael Kerrisk (man-pages) wrote: > Here's the full list from POSIX.1-2008/SUSv4 Section XSI 2.9.7: > > [[ > 2.9.7 Thread Interactions with Regular File Operations > > All of the following functions shall be atomic with respect to each > other in the effects specified in > POSIX.1-2008 when they operate on regular files or symbolic links: > > chmod( ) > chown( ) > close( ) > creat( ) > dup2( ) > fchmod( ) > fchmodat( ) > fchown( ) > fchownat( ) > fcntl( ) > fstat( ) > fstatat( ) > ftruncate( ) > lchown( ) > link( ) > linkat( ) > lseek( ) > lstat( ) > open( ) > openat( ) > pread( ) > read( ) > readlink( ) > readlinkat( ) > readv( ) > pwrite( ) > rename( ) > renameat( ) > stat( ) > symlink( ) > symlinkat( ) > truncate( ) > unlink( ) > unlinkat( ) > utime( ) > utimensat( ) > utimes( ) > write( ) > writev( ) > > If two threads each call one of these functions, each call shall > either see all of the specified effects > of the other call, or none of them. > ]] > > I'd bet that there's a bunch of violations to be found, but the > read/write f_pos case is one of the most egregious. > > For example, I got curious about stat() versus rename(). If one > stat()s a directory while a subdirectory is being renamed to a new > name within that directory, does the link count of the parent > directory ever change--that is, could stat() ever see a changed link > count in the middle of the rename()? My experiments suggest that it > can. I suppose it would have to be a very unusual application that > would be troubled by that, but it does appear to be a violation of > 2.9.7. A directory isn't a regular file or symlink, though, so the warranty would seem to be void in that case. If you {f}stat() a regular file that is being concurrently rename()d then the link count shouldn't vary, though.
- Kevin
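One practical consequence of the f_pos discussion: threads that each use an explicit offset via pread()/pwrite() never consult or update the shared file offset, so the offset-update atomicity question doesn't arise for them at all. A sketch (the record-based layout is purely illustrative):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read fixed-size record 'rec' at an explicit offset.  Because
 * pread() takes the offset as an argument, concurrent callers on the
 * same descriptor cannot race on a shared f_pos.  Returns bytes read
 * or -1 on error. */
static ssize_t read_record(int fd, void *buf, size_t len,
                           off_t rec, size_t recsz)
{
    return pread(fd, buf, len, rec * (off_t)recsz);
}
```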
Re: 3.12: kernel panic when resuming from suspend to RAM (x86_64)
On Sun, Nov 17, 2013 at 11:06:12PM +0100, Borislav Petkov wrote: > and the virtual address in rIP is 8106f5c3, i.e. the same one > as in the photo. Thus, the CALL instruction tries to call the timer > function 'fn' which we pass as an argument to call_timer_fn. > > However, the address we're trying to call in %r14 is garbage: > 0x455300323d504544 and not in canonical form, causing the #GP. That's part of an ASCII string, "DEP=2\0SE", so if that looks familiar to anyone (part of a kernel command line?) it might give a clue to where the timer callback pointer is being overwritten. > So basically what happens is suspend to RAM corrupts something > containing one or more timer functions and we end up calling crap after > resume. - Kevin
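The decoding is easy to check: interpret the register value as the eight little-endian bytes x86 would have stored in memory (a standalone sketch, not taken from the oops itself):

```c
#include <string.h>

/* Split a 64-bit value into the little-endian byte order in which
 * x86 stores it: 0x455300323d504544 comes out as
 * 'D','E','P','=','2','\0','S','E' - the "DEP=2\0SE" fragment
 * mentioned in the mail. */
static void decode_le64(unsigned long long v, unsigned char out[8])
{
    int i;

    for (i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (8 * i));
}
```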
Re: [PATCH 0/7] preempt_count rework -v2
On Thu, Sep 12, 2013 at 04:20:40AM +0200, Peter Zijlstra wrote: > On Wed, Sep 11, 2013 at 04:02:14PM -0700, Linus Torvalds wrote: > > On Wed, Sep 11, 2013 at 11:59 AM, Peter Zijlstra > > wrote: > > > > > > OK, stripped it down further, I couldn't quite see how to collapse the > > > unary and binary operator variants though :/ > > > > Ok, this looks pretty good. I assume it works too? ;) > > Only compile tested that one.. the below is kvm boot tested until root > mount -- I'll try on actual hardware when I've gotten some sleep. > > I split the thing up into two macros GEN_UNARY_RMWcc and > GEN_BINARY_RMWcc which ends up being more readable as well as smaller > code overall. If you wanted to collapse the unary and binary variants as you mentioned upthread, you could do something like (for the CC_HAVE_ASM_GOTO case):

#define GEN_RMWcc(fullop, cc, ...)				\
do {								\
	asm volatile goto (fullop				\
			"j" cc " %l[cc_label]"			\
			: : __VA_ARGS__				\
			: "memory" : cc_label);			\
	return 0;						\
cc_label:							\
	return 1;						\
} while (0)

#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
	GEN_RMWcc(op " " arg0 ";", cc, "m" (var))

#define GEN_BINARY_RMWcc(op, var, val, arg0, cc) \
	GEN_RMWcc(op " %1, " arg0 ";", cc, "m" (var), "er" (val))

- Kevin
Re: [PATCH jiffies] Avoid undefined behavior from signed overflow
Quoting "Paul E. McKenney" :

> On Mon, Jul 29, 2013 at 03:30:35PM +1000, c...@guarana.org wrote:
> > Quoting "Paul E. McKenney" :
> > ...
> > > Note that the C standard considers the cast from signed to
> > > unsigned to be implementation-defined, see 6.3.1.3p3.
> > ...
> > Don't worry, the cast from signed to unsigned is actually well-defined -
> > the relevant part is 6.3.1.3p2 (in C99):
> >
> > > Otherwise, if the new type is unsigned, the value is converted by
> > > repeatedly adding or subtracting one more than the maximum value that
> > > can be represented in the new type until the value is in the range of
> > > the new type.
>
> Yep, but we are going in the other direction, from unsigned to signed.

Ahh, there's an error in the commit message (it says signed to unsigned).

    - Kevin

This message was sent using IMP, the Internet Messaging Program.
[patch] clone.2, execve.2: Describe interaction of execve(2) and CLONE_FILES
Hi Michael,

This very small patch against man-pages-3.50 adds a description of the way
in which a shared file descriptor table, as created by the CLONE_FILES flag
of clone(2), is unshared when a process calls execve(2).

It's easy to see this from the source - there's an unshare_files() call
early in do_execve_common().

    - Kevin

diff -urN man-pages-3.50.orig/man2/clone.2 man-pages-3.50/man2/clone.2
--- man-pages-3.50.orig/man2/clone.2	2013-03-15 18:17:35.0 +1100
+++ man-pages-3.50/man2/clone.2	2013-03-28 21:50:04.745932956 +1100
@@ -180,6 +180,9 @@
 .BR fcntl (2)
 .B F_SETFD
 operation), the other process is also affected.
+If a process sharing a file descriptor table calls
+.BR execve (2),
+its file descriptor table is duplicated (unshared).
 If
 .B CLONE_FILES
diff -urN man-pages-3.50.orig/man2/execve.2 man-pages-3.50/man2/execve.2
--- man-pages-3.50.orig/man2/execve.2	2013-03-15 18:17:35.0 +1100
+++ man-pages-3.50/man2/execve.2	2013-03-28 22:02:56.793844616 +1100
@@ -204,6 +204,11 @@
 .B SIGCHLD
 (see
 .BR clone (2)).
+.IP *
+The file descriptor table is unshared, undoing the effect of the
+.B CLONE_FILES
+flag of
+.BR clone (2).
 .PP
 Note the following further points:
 .IP * 3
Re: New system call wanted: fdreopen
Quoting Tristan Wibberley :

> Why
> ===
>
> A common idiom on Linux is to open a file and keep the fd open so that
> the underlying file can be unlinked from its directory. But if the file
> needs to be read from several different parts of the codebase then due
> to the file descriptor having exactly one read pointer those different
> parts must be synchronised which is a relatively difficult task.

Another alternative is to use pwrite() / pread(), which do not affect the
file pointer. They're in POSIX.

    - Kevin
Re: [PATCH v1] firmware loader: introduce module parameter to customize fw search path
Quoting Al Viro :

> On Thu, Oct 25, 2012 at 08:38:25PM -0700, Linus Torvalds wrote:
> > It's valid to cast a non-const pointer to a const one. It's the
> > *other* way around that is invalid.
> >
> > So marking fw_path[] as having 'const char *' elements just means that
> > we won't be changing those elements through the fw_path[] array
> > (correct: we only read them). The fact that one of those same pointers
> > is then also available through a non-const pointer variable means that
> > they can change through *that* pointer, but that doesn't change the
> > fact that fw_path[] itself contains const pointers.
> >
> > Remember: in C, a "const pointer" does *not* mean that the thing it
> > points to cannot change. It only means that it cannot change through
> > *that* pointer.
>
> It's a bit trickier, unfortunately - pointer to pointer to const char
> and pointer to pointer to char do not mix. Just for fun, try to
> constify envp and argv arguments of call_usermodehelper()...

That's because if it _was_ allowed, you could use it to silently launder
the const away:

const char *c = "rodata";
char *x;
const char **y;

y = &x;
*y = c;
/* We now have (const char) values accessible through a (char *) pointer x */