Re: "fs/namei.c: keep track of nd->root refcount status" causes boot panic
On Tue, Sep 03, 2019 at 06:56:10PM +0100, Al Viro wrote:
> On Tue, Sep 03, 2019 at 08:39:30AM -0700, Christoph Hellwig wrote:
> > > There's much nastier situation than "new upstream kernel released,
> > > need to rebuild" - it's bisect in mainline trying to locate something...
> >
> > I really don't get the point. And it's not like we've card about
> > this anywhere else. And jumping wildly around with the numeric values
> > for constants will lead to bugs like the one you added and fixed again
> > and again.
>
> The thing is, there are several groups - it's not as if all additions
> were guaranteed to be at the end. So either we play with renumbering
> again and again, or we are back to the square one...
>
> Is there any common trick that would allow to verify the lack of duplicates
> at the build time?

What about:

static_assert(
	(LOOKUP_FOLLOW^LOOKUP_DIRECTORY^LOOKUP_AUTOMOUNT^LOOKUP_EMPTY^LOOKUP_DOWN^
	 LOOKUP_REVAL^LOOKUP_RCU^
	 LOOKUP_OPEN^LOOKUP_CREATE^LOOKUP_EXCL^LOOKUP_RENAME_TARGET^
	 LOOKUP_PARENT^LOOKUP_NO_REVAL^LOOKUP_JUMPED^LOOKUP_ROOT^LOOKUP_ROOT_GRABBED)
	==
	(LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_AUTOMOUNT|LOOKUP_EMPTY|LOOKUP_DOWN|
	 LOOKUP_REVAL|LOOKUP_RCU|
	 LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL|LOOKUP_RENAME_TARGET|
	 LOOKUP_PARENT|LOOKUP_NO_REVAL|LOOKUP_JUMPED|LOOKUP_ROOT|LOOKUP_ROOT_GRABBED),
	"duplicated LOOKUP_* constant");

?

- Kevin
[PATCH] libertas: Add missing sentinel at end of if_usb.c fw_table
This sentinel tells the firmware loading process when to stop.

Reported-and-tested-by: syzbot+98156c174c5a2cad9...@syzkaller.appspotmail.com
Signed-off-by: Kevin Easton
---
 drivers/net/wireless/marvell/libertas/if_usb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/marvell/libertas/if_usb.c b/drivers/net/wireless/marvell/libertas/if_usb.c
index f1622f0ff8c9..fe3142d85d1e 100644
--- a/drivers/net/wireless/marvell/libertas/if_usb.c
+++ b/drivers/net/wireless/marvell/libertas/if_usb.c
@@ -50,7 +50,8 @@ static const struct lbs_fw_table fw_table[] = {
 	{ MODEL_8388, "libertas/usb8388_v5.bin", NULL },
 	{ MODEL_8388, "libertas/usb8388.bin", NULL },
 	{ MODEL_8388, "usb8388.bin", NULL },
-	{ MODEL_8682, "libertas/usb8682.bin", NULL }
+	{ MODEL_8682, "libertas/usb8682.bin", NULL },
+	{ 0, NULL, NULL }
 };
 
 static const struct usb_device_id if_usb_table[] = {
-- 
2.11.0
Re: [PATCH 4.19 053/105] mm/mincore.c: make mincore() more conservative
On Wed, May 22, 2019 at 11:21:11AM +0200, Michal Hocko wrote:
> On Wed 22-05-19 10:57:41, Pavel Machek wrote:
> > Hi!
> >
> > > commit 134fca9063ad4851de767d1768180e5dede9a881 upstream.
> > >
> > > The semantics of what mincore() considers to be resident is not
> > > completely clear, but Linux has always (since 2.3.52, which is when
> > > mincore() was initially done) treated it as "page is available in page
> > > cache".
> > >
> > > That's potentially a problem, as that [in]directly exposes
> > > meta-information about pagecache / memory mapping state even about
> > > memory not strictly belonging to the process executing the syscall,
> > > opening possibilities for sidechannel attacks.
> > >
> > > Change the semantics of mincore() so that it only reveals pagecache
> > > information for non-anonymous mappings that belog to files that the
> > > calling process could (if it tried to) successfully open for writing;
> > > otherwise we'd be including shared non-exclusive mappings, which
> > >
> > >  - is the sidechannel
> > >
> > >  - is not the usecase for mincore(), as that's primarily used for data,
> > >    not (shared) text
> >
> > ...
> >
> > > @@ -189,8 +205,13 @@ static long do_mincore(unsigned long add
> > >  	vma = find_vma(current->mm, addr);
> > >  	if (!vma || addr < vma->vm_start)
> > >  		return -ENOMEM;
> > > -	mincore_walk.mm = vma->vm_mm;
> > >  	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> > > +	if (!can_do_mincore(vma)) {
> > > +		unsigned long pages = DIV_ROUND_UP(end - addr, PAGE_SIZE);
> > > +		memset(vec, 1, pages);
> > > +		return pages;
> > > +	}
> > > +	mincore_walk.mm = vma->vm_mm;
> > >  	err = walk_page_range(addr, end, &mincore_walk);
> >
> > We normally return errors when we deny permissions; but this one just
> > returns success and wrong data.
> >
> > Could we return -EPERM there? If not, should it at least get a
> > comment?
>
> This was a deliberate decision AFAIR. We cannot return failure because
> this could lead to an unexpected userspace failure. We are pretendeing
> that those pages are present because that is the safest option -
> e.g. consider an application which tries to refault until the page is
> present...

Yes, in particular several userspace applications I found used mincore()
to find out whether a particular range is mapped at all or not, treating
any error as "unmapped" and any non-error return as "mapped".

- Kevin
Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
>
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes. And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit. I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable. I suppose it would be nifty if there were a way for a
> process, by mutual agreement, to reparent itself to an unrelated
> process.
>
> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.
>
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc. The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.

The idea of a system call that takes an infinitely-extendable laundry
list of operations to perform in kernel space seems quite inelegant, if
only for the error-reporting reason.

Instead, I suggest that what you'd want is a way to create a new
embryonic process that has no address space and isn't yet schedulable.
You then just need other-process-directed variants of all the normal
setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
etc. Then when it's all set up you pr_execve() to kick it off.

- Kevin
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
On Tue, Jan 15, 2019 at 11:52:25PM -0800, Josh Snyder wrote:
> On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet wrote:
> >
> > There is a difference with your previous patch though, that used to list no
> > page in core when it didn't know; this patch lists pages as in core when it
> > refuses to tell. I don't think that's very important, though.
>
> Is there a reason not to return -EPERM in this case?

When I was looking through the Debian Code Search results, quite a few
of the hits were for code that uses mincore() as a way to check if
_anything_ is mapped at an address or not. This code doesn't care about
the in core / not in core result of mincore(), just whether it returned
an error or not.

I think a new error return would break most of the instances of that
code I saw.

- Kevin
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> > https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.
>
> Yeah, heh.
>
> And the first hit is 'fincore', which probably nobody cares about
> anyway, but it does
>
>         fd = open (name, O_RDONLY)
>         ..
>         mmap(window, len, PROT_NONE, MAP_PRIVATE, ..
>
> so if we want to keep that working, we'd really need to actually check
> file ownership rather than just looking at f_mode.
>
> But I don't know if anybody *uses* and cares about fincore, and it's
> particularly questionable for non-root users.
> ...
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.

I've gone through everything in the Debian code search, and this is the
stuff that seems like it would be affected at all by the current patch:

util-linux
    Contains 'fincore' as already noted above.

e2fsprogs
    e4defrag tries to drop pages that it caused to be loaded into the
    page cache, but it's not clear that this ever worked as designed
    anyway (it calls mincore() before ioctl(fd, EXT4_IOC_MOVE_EXT ..)
    but then after the sync_file_range it drops the pages that *were*
    in the page cache at the time of mincore()).

pgfincore
    postgresql extension used to try to dump/restore page cache status
    of database backing files across reboots. It uses a fresh mapping
    with mincore() to try to determine the current page cache status
    of a file.

nocache
    LD_PRELOAD library that tries to drop any pages that the victim
    program has caused to be loaded into the page cache, uses mincore
    on a fresh mapping to see what was resident beforehand. Also
    includes 'cachestats' command that's basically another 'fincore'.

xfsprogs
    xfs_io has a 'mincore' sub-command that is roughly equivalent to
    'fincore'.

vmtouch
    vmtouch is "Portable file system cache diagnostics and control",
    among other things it implements 'fincore' type functionality, and
    one of its touted use-cases is "Preserving virtual memory profile
    when failing over servers".

qemu
    qemu uses mincore() with a fresh PROT_NONE, MAP_PRIVATE mapping to
    implement the "x-check-cache-dropped" option.
    ( https://patchwork.kernel.org/patch/10395865/ )

(Everything else I could see was either looking at anonymous VMAs, its
own existing mapping that it's been using for actual IO, or was just
using mincore() to see if an address was part of any mapping at all).

- Kevin
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> > https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.
>
> Anyway, the Debian code search just results in mostly non-present
> stuff. It's sad that google code search is no more. It was great for
> exactly these kinds of questions.

If you select the "Group search results by Debian source package" option
on the search results page it makes it a lot easier to skim through.

It looks to me like Firefox is expecting mincore() not to fail on
libraries that it has mapped:

https://sources.debian.org/src/firefox-esr/60.4.0esr-1/mozglue/linker/BaseElf.cpp/?hl=98#L98

- Kevin

> The mono runtime seems to have some mono_pages_not_faulted() function,
> but I don't know if people use it for file mappings, and I couldn't
> find any interesting users of it.
>
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.
>
> Linus
Re: Can we drop upstream Linux x32 support?
On Thu, Dec 13, 2018 at 10:05:14AM +0100, Richard Weinberger wrote:
> On Thu, Dec 13, 2018 at 6:03 AM Kevin Easton wrote:
> >
> > On Tue, Dec 11, 2018 at 11:29:14AM +0100, John Paul Adrian Glaubitz wrote:
> > ...
> > > I can't say anything about the syscall interface. However, what I do know
> > > is that the weird combination of a 32-bit userland with a 64-bit kernel
> > > interface is sometimes causing issues. For example, application code usually
> > > expects things like time_t to be 32-bit on a 32-bit system. However, this
> > > isn't the case for x32 which is why code fails to build.
> >
> > OpenBSD and NetBSD both have 64-bit time_t on 32-bit systems and have
> > had for four or five years at this point.
>
> They can also do flag-day changes and break existing applications, Linux not.

Sure, but the point is that most widely-used software has probably by
now come into contact with systems where time_t is bigger than long.

- Kevin
Re: Can we drop upstream Linux x32 support?
On Tue, Dec 11, 2018 at 11:29:14AM +0100, John Paul Adrian Glaubitz wrote:
...
> I can't say anything about the syscall interface. However, what I do know
> is that the weird combination of a 32-bit userland with a 64-bit kernel
> interface is sometimes causing issues. For example, application code usually
> expects things like time_t to be 32-bit on a 32-bit system. However, this
> isn't the case for x32 which is why code fails to build.

OpenBSD and NetBSD both have 64-bit time_t on 32-bit systems and have
had for four or five years at this point.

- Kevin
Re: [PATCH] Add /proc/pid_generation
On Wed, Nov 21, 2018 at 12:14:44PM -0800, Daniel Colascione wrote:
> This change adds a per-pid-namespace 64-bit generation number,
> incremented on PID rollover, and exposes it via a new proc file
> /proc/pid_generation. By examining this file before and after /proc
> enumeration, user code can detect the potential reuse of a PID and
> restart the task enumeration process, repeating until it gets a
> coherent snapshot.

I see downthread this patch has been withdrawn, but nonetheless I'm
still curious - does this actually solve the problem? It seems to me
that a PID could be reused within a scan even if the generation number
remains the same at the beginning and end of the scan:

Say you have a very long-lived task with PID 500 allocated in
generation 0. PID allocation has since wrapped, and we are now
allocating from the start of the range again, with generation 1.

We begin a scan of /proc and read the generation (1). At this point our
task dies, and PID 500 is then reallocated to a new task. We finish our
scan; the generation is still 1, but PID 500 is now ambiguous.

Am I wrong?

- Kevin
Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
On Mon, May 07, 2018 at 04:03:25PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > so it should be allocated with kzalloc() to ensure all structure padding
> > is zeroed.
> >
> > Signed-off-by: Kevin Easton <ke...@guarana.org>
> > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
> > ---
> >  drivers/vhost/vhost.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index f3bd8e9..1b84dcff 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
> >  /* Create a new message. */
> >  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
> >  {
> > -	struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
> > +	struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
> >  	if (!node)
> >  		return NULL;
> >  	node->vq = vq;
>
> Let's just init the msg though.
>
> OK it seems this is the best we can do for now,
> we need a new feature bit to fix it for 32 bit
> userspace on 64 bit kernels.
>
> Does the following help?

Yes, the reproducer doesn't trigger the error with that patch applied.

- Kevin
Re: [PATCH] vhost: make msg padding explicit
On Wed, May 02, 2018 at 05:19:27PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 10:04:46AM -0400, David Miller wrote:
> > From: "Michael S. Tsirkin"
> > Date: Wed, 2 May 2018 16:36:37 +0300
> >
> > > Ouch. True - and in particular the 32 bit ABI on 64 bit kernels doesn't
> > > work at all. Hmm. It's relatively new and maybe there aren't any 32 bit
> > > users yet. Thoughts?
> >
> > If it's been in a released kernel version, we really aren't at liberty
> > to play "maybe nobody uses this" UAPI changing games.
> >
> > Please send me a revert.
>
> Sent. Looking at compat mess now.

Just naming the padding field isn't sufficient anyway, there also needs
to be code in vhost_new_msg() to initialise it using the name.

- Kevin
Re: [PATCH] vhost: make msg padding explicit
On Tue, May 01, 2018 at 02:05:51PM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <m...@redhat.com>
> Date: Tue, 1 May 2018 20:19:19 +0300
>
> > On Tue, May 01, 2018 at 11:28:22AM -0400, David Miller wrote:
> >> From: "Michael S. Tsirkin" <m...@redhat.com>
> >> Date: Fri, 27 Apr 2018 19:02:05 +0300
> >>
> >> > There's a 32 bit hole just after type. It's best to
> >> > give it a name, this way compiler is forced to initialize
> >> > it with rest of the structure.
> >> >
> >> > Reported-by: Kevin Easton <ke...@guarana.org>
> >> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >>
> >> Michael, will you be sending this directly to Linus or would you like
> >> me to apply it to net or net-next?
> >>
> >> Thanks.
> >
> > I'd prefer you to apply it for net and cc stable if possible.
>
> Ok, applied, and added to my -stable submission queue.

Hold on, this patch changes the layout for i386 (where there is no
padding at all). And it's part of UAPI.

- Kevin
Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
On Fri, Apr 27, 2018 at 09:07:56PM -0400, Kevin Easton wrote:
> On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > > so it should be allocated with kzalloc() to ensure all structure padding
> > > is zeroed.
> > >
> > > Signed-off-by: Kevin Easton <ke...@guarana.org>
> > > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
> >
> > Does it help if a patch naming the padding is applied,
> > and then we init just the relevant field?
> > Just curious.
>
> No, I don't believe that is sufficient to fix the problem.

Scratch that - somehow I missed the "..and then we init just the
relevant field" part, sorry.

There's still the padding after the vhost_iotlb_msg to consider. It's
named in the union, but I don't think that's guaranteed to be
initialised when the iotlb member of the union is used to initialise
things.

> I didn't name the padding in my original patch because I wasn't sure
> if the padding actually exists on 32 bit architectures?

This might still be a concern, too?

At the end of the day, zeroing 96 bytes (the full size of
vhost_msg_node) is pretty quick.

- Kevin
Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > so it should be allocated with kzalloc() to ensure all structure padding
> > is zeroed.
> >
> > Signed-off-by: Kevin Easton <ke...@guarana.org>
> > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
>
> Does it help if a patch naming the padding is applied,
> and then we init just the relevant field?
> Just curious.

No, I don't believe that is sufficient to fix the problem.

The structure is allocated by kmalloc(), then individual fields are
initialised. The named padding would be forced to be initialised if the
structure were initialised with a struct initialiser, but that's not the
case here. The compiler is free to leave padding0 with whatever junk
kmalloc() left there.

Having said that, naming the padding *does* help - technically, the
compiler is allowed to put whatever it likes in the padding every time
you modify the struct. It really needs both.

I didn't name the padding in my original patch because I wasn't sure
if the padding actually exists on 32 bit architectures?

- Kevin
[PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
The struct vhost_msg within struct vhost_msg_node is copied to userspace,
so it should be allocated with kzalloc() to ensure all structure padding
is zeroed.

Signed-off-by: Kevin Easton <ke...@guarana.org>
Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f3bd8e9..1b84dcff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
 /* Create a new message. */
 struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
 {
-	struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
+	struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
 	if (!node)
 		return NULL;
 	node->vq = vq;
-- 
2.8.1
Re: KASAN: slab-out-of-bounds Read in pfkey_add
On Mon, Apr 09, 2018 at 01:56:36AM -0400, Kevin Easton wrote:
> On Sun, Apr 08, 2018 at 09:04:33PM -0700, Eric Biggers wrote:
> ...
> >
> > Looks like this is going to be fixed by
> > https://patchwork.kernel.org/patch/10327883/ ("af_key: Always verify length of
> > provided sadb_key"), but it's not applied yet to the ipsec tree yet. Kevin, for
> > future reference, for syzbot bugs it would be helpful to reply to the original
> > bug report and say that a patch was sent out, or even better send the patch as a
> > reply to the bug report email, e.g.
> >
> > git format-patch --in-reply-to="<001a114292fadd3e250560706...@google.com>"
> >
> > for this one (and the Message ID can be found in the syzkaller-bugs archive even
> > if the email isn't in your inbox).
>
> Sure, I can do that.

I recalled one reason I _didn't_ do this - the message ID is retrievable
from the archived email, but because the archive is Google Groups the
message recipients aren't (only masked).

- Kevin
Re: [PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
On Mon, Apr 09, 2018 at 12:34:42PM +0200, Steffen Klassert wrote: > On Sat, Apr 07, 2018 at 11:40:47AM -0400, Kevin Easton wrote: > > Several places use (x + 7) / 8 to convert from a number of bits to a number > > of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency > > with other parts of the same file. > > > > Signed-off-by: Kevin Easton <ke...@guarana.org> > > Please resubmit this one to ipsec-next after the > merge window. Thanks! Will do! - Kevin
Re: KASAN: slab-out-of-bounds Read in pfkey_add
On Sun, Apr 08, 2018 at 09:04:33PM -0700, Eric Biggers wrote: ... > > Looks like this is going to be fixed by > https://patchwork.kernel.org/patch/10327883/ ("af_key: Always verify length of > provided sadb_key"), but it's not applied yet to the ipsec tree yet. Kevin, > for > future reference, for syzbot bugs it would be helpful to reply to the original > bug report and say that a patch was sent out, or even better send the patch > as a > reply to the bug report email, e.g. > > git format-patch > --in-reply-to="<001a114292fadd3e250560706...@google.com>" > > for this one (and the Message ID can be found in the syzkaller-bugs archive > even > if the email isn't in your inbox). Sure, I can do that. - Kevin
[PATCH v2 1/2] af_key: Always verify length of provided sadb_key
Key extensions (struct sadb_key) include a user-specified number of key bits. The kernel uses that number to determine how much key data to copy out of the message in pfkey_msg2xfrm_state(). The length of the sadb_key message must be verified to be long enough, even in the case of SADB_X_AALG_NULL. Furthermore, the sadb_key_len value must be long enough to include both the key data and the struct sadb_key itself. Introduce a helper function verify_key_len(), and call it from parse_exthdrs() where other exthdr types are similarly checked for correctness. Signed-off-by: Kevin Easton <ke...@guarana.org> Reported-by: syzbot+5022a34ca5a3d49b84223653fab632dfb7b4c...@syzkaller.appspotmail.com --- net/key/af_key.c | 45 +++-- 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 7e2e718..e62e52e 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -437,6 +437,24 @@ static int verify_address_len(const void *p) return 0; } +static inline int sadb_key_len(const struct sadb_key *key) +{ + int key_bytes = DIV_ROUND_UP(key->sadb_key_bits, 8); + + return DIV_ROUND_UP(sizeof(struct sadb_key) + key_bytes, + sizeof(uint64_t)); +} + +static int verify_key_len(const void *p) +{ + const struct sadb_key *key = p; + + if (sadb_key_len(key) > key->sadb_key_len) + return -EINVAL; + + return 0; +} + static inline int pfkey_sec_ctx_len(const struct sadb_x_sec_ctx *sec_ctx) { return DIV_ROUND_UP(sizeof(struct sadb_x_sec_ctx) + @@ -533,16 +551,25 @@ static int parse_exthdrs(struct sk_buff *skb, const struct sadb_msg *hdr, void * return -EINVAL; if (ext_hdrs[ext_type-1] != NULL) return -EINVAL; - if (ext_type == SADB_EXT_ADDRESS_SRC || - ext_type == SADB_EXT_ADDRESS_DST || - ext_type == SADB_EXT_ADDRESS_PROXY || - ext_type == SADB_X_EXT_NAT_T_OA) { + switch (ext_type) { + case SADB_EXT_ADDRESS_SRC: + case SADB_EXT_ADDRESS_DST: + case SADB_EXT_ADDRESS_PROXY: + case SADB_X_EXT_NAT_T_OA: if (verify_address_len(p)) return -EINVAL; - } - if 
(ext_type == SADB_X_EXT_SEC_CTX) { + break; + case SADB_X_EXT_SEC_CTX: if (verify_sec_ctx_len(p)) return -EINVAL; + break; + case SADB_EXT_KEY_AUTH: + case SADB_EXT_KEY_ENCRYPT: + if (verify_key_len(p)) + return -EINVAL; + break; + default: + break; } ext_hdrs[ext_type-1] = (void *) p; } @@ -1104,14 +1131,12 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, key = ext_hdrs[SADB_EXT_KEY_AUTH - 1]; if (key != NULL && sa->sadb_sa_auth != SADB_X_AALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); key = ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key != NULL && sa->sadb_sa_encrypt != SADB_EALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); x = xfrm_state_alloc(net); -- 2.8.1
[PATCH v2 0/2] af_key: Fix for sadb_key memcpy read overrun
As found by syzbot, af_key does not properly validate the key length in sadb_key messages from userspace. This can result in copying from beyond the end of the sadb_key part of the message, or indeed beyond the end of the entire packet. Both these patches apply cleanly to ipsec-next. Based on Steffen's feedback I have re-ordered them so that patch 1 contains only the fix, which I would suggest is also a stable tree candidate, while patch 2 is purely a cleanup. Kevin Easton (2): af_key: Always verify length of provided sadb_key af_key: Use DIV_ROUND_UP() instead of open-coded equivalent net/key/af_key.c | 58 1 file changed, 42 insertions(+), 16 deletions(-) -- 2.8.1
[PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
Several places use (x + 7) / 8 to convert from a number of bits to a number of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency with other parts of the same file. Signed-off-by: Kevin Easton <ke...@guarana.org> --- net/key/af_key.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index e62e52e..f3ebb84 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -822,12 +822,12 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, if (add_keys) { if (x->aalg && x->aalg->alg_key_len) { auth_key_size = - PFKEY_ALIGN8((x->aalg->alg_key_len + 7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->aalg->alg_key_len, 8)); size += sizeof(struct sadb_key) + auth_key_size; } if (x->ealg && x->ealg->alg_key_len) { encrypt_key_size = - PFKEY_ALIGN8((x->ealg->alg_key_len+7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->ealg->alg_key_len, 8)); size += sizeof(struct sadb_key) + encrypt_key_size; } } @@ -987,7 +987,8 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_exttype = SADB_EXT_KEY_AUTH; key->sadb_key_bits = x->aalg->alg_key_len; key->sadb_key_reserved = 0; - memcpy(key + 1, x->aalg->alg_key, (x->aalg->alg_key_len+7)/8); + memcpy(key + 1, x->aalg->alg_key, + DIV_ROUND_UP(x->aalg->alg_key_len, 8)); } /* encrypt key */ if (add_keys && encrypt_key_size) { @@ -998,7 +999,7 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_bits = x->ealg->alg_key_len; key->sadb_key_reserved = 0; memcpy(key + 1, x->ealg->alg_key, - (x->ealg->alg_key_len+7)/8); + DIV_ROUND_UP(x->ealg->alg_key_len, 8)); } /* sa */ @@ -1193,7 +1194,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, goto out; } if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->aalg = kmalloc(sizeof(*x->aalg) + keysize, GFP_KERNEL); if (!x->aalg) { err = -ENOMEM; @@ -1232,7 +1233,7 @@ static struct xfrm_state 
* pfkey_msg2xfrm_state(struct net *net, } key = (struct sadb_key*) ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->ealg = kmalloc(sizeof(*x->ealg) + keysize, GFP_KERNEL); if (!x->ealg) { err = -ENOMEM; -- 2.8.1
Re: [PATCH 1/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
On Wed, Mar 28, 2018 at 07:59:25AM +0200, Steffen Klassert wrote: > On Mon, Mar 26, 2018 at 07:39:16AM -0400, Kevin Easton wrote: > > Several places use (x + 7) / 8 to convert from a number of bits to a number > > of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency > > with other parts of the same file. > > > > Signed-off-by: Kevin Easton <ke...@guarana.org> > > Is this a fix or just a cleanup? > If it is just a cleanup, please resent it based on ipsec-next. Hi Steffen, This patch (1/2) is just a cleanup, but it's in preparation for patch 2/2 which is a fix and modifies some of the same lines. - Kevin
[PATCH 1/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
Several places use (x + 7) / 8 to convert from a number of bits to a number of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency with other parts of the same file. Signed-off-by: Kevin Easton <ke...@guarana.org> --- net/key/af_key.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 7e2e718..911b68d 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -795,12 +795,12 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, if (add_keys) { if (x->aalg && x->aalg->alg_key_len) { auth_key_size = - PFKEY_ALIGN8((x->aalg->alg_key_len + 7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->aalg->alg_key_len, 8)); size += sizeof(struct sadb_key) + auth_key_size; } if (x->ealg && x->ealg->alg_key_len) { encrypt_key_size = - PFKEY_ALIGN8((x->ealg->alg_key_len+7) / 8); + PFKEY_ALIGN8(DIV_ROUND_UP(x->ealg->alg_key_len, 8)); size += sizeof(struct sadb_key) + encrypt_key_size; } } @@ -960,7 +960,8 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_exttype = SADB_EXT_KEY_AUTH; key->sadb_key_bits = x->aalg->alg_key_len; key->sadb_key_reserved = 0; - memcpy(key + 1, x->aalg->alg_key, (x->aalg->alg_key_len+7)/8); + memcpy(key + 1, x->aalg->alg_key, + DIV_ROUND_UP(x->aalg->alg_key_len, 8)); } /* encrypt key */ if (add_keys && encrypt_key_size) { @@ -971,7 +972,7 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x, key->sadb_key_bits = x->ealg->alg_key_len; key->sadb_key_reserved = 0; memcpy(key + 1, x->ealg->alg_key, - (x->ealg->alg_key_len+7)/8); + DIV_ROUND_UP(x->ealg->alg_key_len, 8)); } /* sa */ @@ -1104,14 +1105,14 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, key = ext_hdrs[SADB_EXT_KEY_AUTH - 1]; if (key != NULL && sa->sadb_sa_auth != SADB_X_AALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + (DIV_ROUND_UP(key->sadb_key_bits, 8) 
== 0 || +DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) return ERR_PTR(-EINVAL); key = ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key != NULL && sa->sadb_sa_encrypt != SADB_EALG_NULL && - ((key->sadb_key_bits+7) / 8 == 0 || -(key->sadb_key_bits+7) / 8 > key->sadb_key_len * sizeof(uint64_t))) + (DIV_ROUND_UP(key->sadb_key_bits, 8) == 0 || +DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) return ERR_PTR(-EINVAL); x = xfrm_state_alloc(net); @@ -1168,7 +1169,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, goto out; } if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->aalg = kmalloc(sizeof(*x->aalg) + keysize, GFP_KERNEL); if (!x->aalg) { err = -ENOMEM; @@ -1207,7 +1208,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, } key = (struct sadb_key*) ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key) - keysize = (key->sadb_key_bits + 7) / 8; + keysize = DIV_ROUND_UP(key->sadb_key_bits, 8); x->ealg = kmalloc(sizeof(*x->ealg) + keysize, GFP_KERNEL); if (!x->ealg) { err = -ENOMEM; -- 2.8.1
[PATCH 2/2] af_key: Always verify length of provided sadb_key
Key extensions (struct sadb_key) include a user-specified number of key bits. The kernel uses that number to determine how much key data to copy out of the message in pfkey_msg2xfrm_state(). The length of the sadb_key message must be verified to be long enough, even in the case of SADB_X_AALG_NULL. Furthermore, the sadb_key_len value must be long enough to include both the key data and the struct sadb_key itself. Introduce a helper function verify_key_len(), and call it from parse_exthdrs() where other exthdr types are similarly checked for correctness. Signed-off-by: Kevin Easton <ke...@guarana.org> Reported-by: syzbot+5022a34ca5a3d49b84223653fab632dfb7b4c...@syzkaller.appspotmail.com --- net/key/af_key.c | 45 +++-- 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 911b68d..f3ebb84 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -437,6 +437,24 @@ static int verify_address_len(const void *p) return 0; } +static inline int sadb_key_len(const struct sadb_key *key) +{ + int key_bytes = DIV_ROUND_UP(key->sadb_key_bits, 8); + + return DIV_ROUND_UP(sizeof(struct sadb_key) + key_bytes, + sizeof(uint64_t)); +} + +static int verify_key_len(const void *p) +{ + const struct sadb_key *key = p; + + if (sadb_key_len(key) > key->sadb_key_len) + return -EINVAL; + + return 0; +} + static inline int pfkey_sec_ctx_len(const struct sadb_x_sec_ctx *sec_ctx) { return DIV_ROUND_UP(sizeof(struct sadb_x_sec_ctx) + @@ -533,16 +551,25 @@ static int parse_exthdrs(struct sk_buff *skb, const struct sadb_msg *hdr, void * return -EINVAL; if (ext_hdrs[ext_type-1] != NULL) return -EINVAL; - if (ext_type == SADB_EXT_ADDRESS_SRC || - ext_type == SADB_EXT_ADDRESS_DST || - ext_type == SADB_EXT_ADDRESS_PROXY || - ext_type == SADB_X_EXT_NAT_T_OA) { + switch (ext_type) { + case SADB_EXT_ADDRESS_SRC: + case SADB_EXT_ADDRESS_DST: + case SADB_EXT_ADDRESS_PROXY: + case SADB_X_EXT_NAT_T_OA: if (verify_address_len(p)) return -EINVAL; - } - if 
(ext_type == SADB_X_EXT_SEC_CTX) { + break; + case SADB_X_EXT_SEC_CTX: if (verify_sec_ctx_len(p)) return -EINVAL; + break; + case SADB_EXT_KEY_AUTH: + case SADB_EXT_KEY_ENCRYPT: + if (verify_key_len(p)) + return -EINVAL; + break; + default: + break; } ext_hdrs[ext_type-1] = (void *) p; } @@ -1105,14 +1132,12 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net, key = ext_hdrs[SADB_EXT_KEY_AUTH - 1]; if (key != NULL && sa->sadb_sa_auth != SADB_X_AALG_NULL && - (DIV_ROUND_UP(key->sadb_key_bits, 8) == 0 || -DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); key = ext_hdrs[SADB_EXT_KEY_ENCRYPT-1]; if (key != NULL && sa->sadb_sa_encrypt != SADB_EALG_NULL && - (DIV_ROUND_UP(key->sadb_key_bits, 8) == 0 || -DIV_ROUND_UP(key->sadb_key_bits, 8) > key->sadb_key_len * sizeof(uint64_t))) + key->sadb_key_bits == 0) return ERR_PTR(-EINVAL); x = xfrm_state_alloc(net); -- 2.8.1
[PATCH 0/2] af_key: Fix for sadb_key memcpy read overrun
As found by syzbot, af_key does not properly validate the key length in sadb_key messages from userspace. This can result in copying from beyond the end of the sadb_key part of the message, or indeed beyond the end of the entire packet. Kevin Easton (2): af_key: Use DIV_ROUND_UP() instead of open-coded equivalent af_key: Always verify length of provided sadb_key net/key/af_key.c | 58 1 file changed, 42 insertions(+), 16 deletions(-) -- 2.8.1
Re: [RFC PATCH 2/6] fs: provide a generic compat_sys_truncate64() implementation
On Sun, Mar 18, 2018 at 05:10:52PM +0100, Dominik Brodowski wrote: > The compat_sys_truncate64() implementations in mips, powerpc, s390, sparc > and x86 only differed based on whether the u64 parameter needed padding > and on its endianness. > ... > +#ifdef __ARCH_WANT_COMPAT_SYS_TRUNCATE64 > +#if defined(__ARCH_WANT_COMPAT_SYS_WITH_PADDING) && \ > + defined(__ARCH_WANT_LE_COMPAT_SYS) > +COMPAT_SYSCALL_DEFINE4(truncate64, const char __user *, filename, u32 > padding, > +unsigned int, offset_low, unsigned int, offset_high) > +#elif defined(__ARCH_WANT_COMPAT_SYS_WITH_PADDING) && \ > + !defined(__ARCH_WANT_LE_COMPAT_SYS) > +COMPAT_SYSCALL_DEFINE4(truncate64, const char __user *, filename, u32 > padding, > +unsigned int, offset_high, unsigned int, offset_low) Notwithstanding the other comments, shouldn't there be a comma between 'u32' and 'padding' in those? - Kevin
Re: [RFC PATCH 4/7] kconfig: support new special property shell=
On Sat, Feb 10, 2018 at 10:05:12AM -0800, Randy Dunlap wrote: > On 02/10/2018 12:55 AM, Ulf Magnusson wrote: > > How many compilers don't support -fno-stack-protector by the way? > > > > config CC_HAS_STACKPROTECTOR_STRONG > > bool > > option shell="$CC -Werror -fstack-protector-strong -c -x c > > /dev/null" > > > > config CC_HAS_STACKPROTECTOR_REGULAR > > bool > > option shell="$CC -Werror -fstack-protector -c -x c /dev/null" > > > > config CC_HAS_STACKPROTECTOR_NONE > > bool > > default y > > option shell="$CC -Werror -fno-stack-protector -c -x c > > /dev/null" > > I ran: > gcc -Werror -fno-stack-protector -c -x c /dev/null > > It worked (gcc (SUSE Linux) 4.8.5) but it did leave a null.o file for me. > Might need to add that to 'make clean' or just rm it immediately. gcc -Werror -fno-stack-protector -c -x c /dev/null -o /dev/null seems to DTRT without leaving anything behind. - Kevin
Re: [PATCH 30/35] x86/speculation: Use Indirect Branch Prediction Barrier in context switch
On Thu, Jan 18, 2018 at 04:38:32PM -0800, Tim Chen wrote: > On 01/18/2018 05:48 AM, Peter Zijlstra wrote: > > > >+/* > >+ * Avoid user/user BTB poisoning by flushing the branch > >predictor > >+ * when switching between processes. This stops one process from > >+ * doing spectre-v2 attacks on another process's data. > >+ */ > >+indirect_branch_prediction_barrier(); > >+ > > Some optimizations can be done here to avoid overhead in barrier call. > > For example, don't do the barrier if prev and next mm are > the same. If the two process trust each other, or the new process > already have rights to look into the previous process, > the barrier could be skipped. Isn't it the other way around with the BTB poisoning? previous is potentially attacking next, so the barrier can be avoided only if previous is allowed to ptrace next? - Kevin
Re: [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
> From: Dave Hansen > > The "SYSENTER" stack is used for a lot more than SYSENTER now. > Give it a better string to display in stack dumps. > > We should probably cleanse the 64-bit code of the remaining > "SYSENTER" nomenclature too at some point. > > Signed-off-by: Dave Hansen > --- > > b/arch//x86/kernel/dumpstack_64.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff -puN arch//x86/kernel/dumpstack_64.c~SYSENTER-rename > arch//x86/kernel/dumpstack_64.c > --- a/arch//x86/kernel/dumpstack_64.c~SYSENTER-rename 2017-12-01 > 12:43:16.768707737 -0800 > +++ b/arch//x86/kernel/dumpstack_64.c 2017-12-01 13:19:21.741702337 -0800 > @@ -37,8 +37,14 @@ const char *stack_type_name(enum stack_t > if (type == STACK_TYPE_IRQ) > return "IRQ"; > > - if (type == STACK_TYPE_SYSENTER) > - return "SYSENTER"; > + if (type == STACK_TYPE_SYSENTER) { > + /* > + * On 64-bit, we have a generic entry stack that we > + * use for all the kernel try points, including > + * SYSENTER. ITYM "kernel entry points". - Kevin
Re: [GIT pull] printk updates for 4.15
On Sat, Nov 18, 2017 at 01:26:07AM +0100, Thomas Gleixner wrote: > On Thu, 16 Nov 2017, Thomas Gleixner wrote: > > I hope that I can find a few spare cycles to whip up a POC patch which does > > not make stuff fall apart immediately. > > Here you go. It survived suspend resume in a VM. > > Thanks, > > tglx > > 8< > Subject: timekeeping: Make monotonic behave like boottime > From: Thomas Gleixner > Date: Fri, 17 Nov 2017 11:46:48 +0100 > > Clock MONOTONIC is not fast forwarded by the time spent in suspend on > resume. This is only done for BOOTTIME. > > It would be desired to get rid of that difference, but the difference > between clock MONOTONIC and clock BOOTTIME is well documented so there > might be applications which depend on that behaviour. As a comment from the userspace peanut gallery, I personally hope this does pass muster. The existing POSIX wording implies that MONOTONIC doesn't stop counting over suspend, and that's what you need when you want to know the time elapsed between two external events. - Kevin
Re: [PATCH 05/14] isdn: isdnloop: suppress a gcc-7 warning
On Fri, Jul 14, 2017 at 12:37:05PM +0200, Arnd Bergmann wrote: > On Fri, Jul 14, 2017 at 12:08 PM, Joe Perches wrote: > > On Fri, 2017-07-14 at 11:25 +0200, Arnd Bergmann wrote: > >> We test whether a bit is set in a mask here, which is correct > >> but gcc warns about it as it thinks it might be confusing: > >> > >> drivers/isdn/isdnloop/isdnloop.c:412:37: error: ?: using integer constants > >> in boolean context, the expression will always evaluate to 'true' > >> [-Werror=int-in-bool-context] ... > > Perhaps this is a logic defect and should be: > > > > if (!(card->flags & ((channel) ? ISDNLOOP_FLAGS_B2ACTIVE : > > ISDNLOOP_FLAGS_B1ACTIVE))) > > Yes, good catch. I had thought about it for a bit whether that would be > the answer, but come to the wrong conclusion on my own. > > Note that the version you suggested will still have the warning, so I think > it needs to be It shouldn't - the warning is for using an integer *constant* in boolean context, but the result of & isn't a constant and should be fine. !(flags & mask) is a very common idiom. - Kevin
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 04:23:56PM +0200, Michal Hocko wrote: > On Wed 05-07-17 13:19:40, Ben Hutchings wrote: > > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote: > > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings > > > wrote: > > > > > > > > We have: > > > > > > > > bottom = 0xff803fff > > > > sp = 0xb178 > > > > > > > > The relevant mappings are: > > > > > > > > ff7fc000-ff7fd000 rwxp 00:00 0 > > > > fffdd000-e000 rw-p 00:00 > > > > 0 [stack] > > > > > > Ugh. So that stack is actually 8MB in size, but the alloca() is about > > > to use up almost all of it, and there's only about 28kB left between > > > "bottom" and that 'rwx' mapping. > > > > > > Still, that rwx mapping is interesting: it is a single page, and it > > > really is almost exactly 8MB below the stack. > > > > > > In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from > > > the top of that odd one-page allocation (0xff7fd000). > > > > > > Can you find out where that is allocated? Perhaps a breakpoint on > > > mmap, with a condition to catch that particular one? > > [...] > > > > Found it, and it's now clear why only i386 is affected: > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852 > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881 > > This is really worrying. This doesn't look like a gap at all. It is a > mapping which actually contains a code and so we should absolutely not > allow to scribble over it. So I am afraid the only way forward is to > allow per process stack gap and run this particular program to have a > smaller gap. We basically have two ways. Either /proc//$file or > a prctl inherited on exec. The later is a smaller code. What do you > think? On the plus side, the code in that page (a single RET) is only executed once when the workaround function is called. Notice that 'codebuf' is never even returned out of that function. 
The only reason they even leave that page mapped is to stop the exec shield limit from being lowered on them again. - Kevin
Re: [PATCH v3 04/13] crypto/rng: ensure that the RNG is ready before using
On Tue, Jun 06, 2017 at 05:56:20AM +0200, Jason A. Donenfeld wrote: > Hey Ted, > > On Tue, Jun 6, 2017 at 5:00 AM, Theodore Ts'o wrote: > > Note that crypto_rng_reset() is called by big_key_init() in > > security/keys/big_key.c as a late_initcall(). So if we are on a > > system where the crng doesn't get initialized until during the system > > boot scripts, and big_key is compiled directly into the kernel, the > > boot could end up deadlocking. > > > > There may be other instances of where crypto_rng_reset() is called by > > an initcall, so big_key_init() may not be an exhaustive enumeration of > > potential problems. But this is an example of why the synchronous > > API, although definitely much more convenient, can end up being a trap > > for the unwary > > Thanks for pointing this out. I'll look more closely into it and see > if I can figure out a good way of approaching this. Would it work for wait_for_random_bytes() to include a WARN_ON(system_state < SYSTEM_RUNNING); to catch those kinds of cases? - Kevin
Re: [PATCHv1, RFC 0/8] Boot-time switching between 4- and 5-level paging
On Thu, May 25, 2017 at 05:40:16PM -0700, Andy Lutomirski wrote: > On Thu, May 25, 2017 at 4:24 PM, Linus Torvalds > > wrote: > > On Thu, May 25, 2017 at 1:33 PM, Kirill A. Shutemov > > wrote: > >> Here' my first attempt to bring boot-time between 4- and 5-level paging. > >> It looks not too terrible to me. I've expected it to be worse. > > > > If I read this right, you just made it a global on/off thing. > > > > May I suggest possibly a different model entirely? Can you make it a > > per-mm flag instead? > > > > And then we > > > > (a) make all kthreads use the 4-level page tables > > > > (b) which means that all the init code uses the 4-level page tables > > > > (c) which means that all those checks for "start_secondary" etc can > > just go away, because those all run with 4-level page tables. > > > > Or is it just much too expensive to switch between 4-level and 5-level > > paging at run-time? > > > > Even ignoring expensiveness, I'm not convinced it's practical. AFAICT > you can't atomically switch the paging mode and CR3, so either you > need some magic page table with trampoline that works in both modes > (which is presumably doable with some trickery) or you need to flip > paging off. Good luck if an NMI hits in the mean time. There was > code like that once upon a time for EFI mixed mode, but it got deleted > due to triple-faults. According to Intel's documentation you pretty much have to disable paging anyway: "The processor allows software to modify CR4.LA57 only outside of IA-32e mode. In IA-32e mode, an attempt to modify CR4.LA57 using the MOV CR instruction causes a general-protection exception (#GP)." (If it weren't for that, maybe you could point the last entry in the PML4 at the PML4 itself, so it also works as a PML5 for accessing kernel addresses? And of course make sure nothing gets loaded above 0xff80). - Kevin
Re: [PATCH] kernel/fork.c: use sizeof() instead of sizeof
On Tue, Feb 02, 2016 at 05:04:06PM +, Al Viro wrote: > FWIW, the actual rules are > unary-expression: postfix-expression | > ++ unary-expression | > -- unary-expression | > - cast-expression | > + cast-expression | > ! cast-expression | > ~ cast-expression | > * cast-expression | > & cast-expression | > sizeof unary-expression | > sizeof ( type-name ) > cast-expression: unary-expression | >( type-name ) cast-expression > Note that while e.g. > ++ ++ n > is allowed by grammar, it runs afoul of the constraint for ++ argument, which > must be a modifiable lvalue. None of the operators above yield that, so > the rules for ++ and -- might as well have been ++ postfix-expression and > -- postfix-expression resp. Unless I'm mistaken, * cast-expression yields an lvalue. - Kevin
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: > On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: > > > Let me re-ask the question that I asked last week (and was apparently > > > ignored). Why not trying to use the lazytime feature instead of > > > pointing a head straight at the application's --- and system > > > administrators' --- heads? > > > > Sorry Ted, I thought I responded already. > > > > The goal is to avoid inode writeout entirely when we can, and > > as I understand it lazytime will still force writeout before the inode > > is dropped from the cache. In systems like Ceph in particular, the > > IOs can be spread across lots of files, so simply deferring writeout > > doesn't always help. > > Sure, but it would reduce the writeout by orders of magnitude. I can > understand if you want to reduce it further, but it might be good > enough for your purposes. > > I considered doing the equivalent of O_NOMTIME for our purposes at > $WORK, and our use case is actually not that different from Ceph's > (i.e., using a local disk file system to support a cluster file > system), and lazytime was (a) something I figured was something I > could upstream in good conscience, and (b) was more than good enough > for us. A safer alternative might be a chattr file attribute that if set, the mtime is not updated on writes, and stat() on the file always shows the mtime as "right now". At least that way, the file won't accidentally get left out of backups that rely on the mtime. (If the file attribute is unset, you immediately update the mtime then too, and from then on the file is back to normal). - Kevin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
On Thu, May 07, 2015 at 08:18:43PM +0100, One Thousand Gnomes wrote: > > We could keep heap metadata as R/O and only make it R/W inside of > > malloc() itself to catch corruption more quickly. > > If you implement multiple malloc pools you can chop up lots of stuff. > > In library land it isn't just stuff like malloc, you can use it as > a debug weapon to protect library private data from naughty application > code. How could a library (or debugger, for that matter) arbitrate ownership of the protection domains with the application? One interesting use for it might be to provide an interface to allocate memory and associate it with a lock that's supposed to be held while accessing that memory. The allocation function hashes the lock address down to one of the 15 non-zero protection domains and applies that key to the memory, the lock function then adds RW access to the appropriate protection domain and the unlock function removes it. - Kevin
Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s
On Tue, Dec 02, 2014 at 03:35:17PM +1100, Alex Dubov wrote: > Unfortunately, using facilities like Unix domain sockets to merely pass file > descriptors between "worker" processes is unnecessarily difficult, due to > the following common consideration: > > 1. Domain sockets and named pipes are persistent objects. Applications must > manage their lifetime and devise unambiguous access schemes in case multiple > application instances are to be run within the same OS instance. Usually, they > would also require a writable file system to be mounted. I believe this particular issue has long been addressed in Linux, with the "abstract namespace" domain sockets. These aren't persistent - they go away when the bound socket is closed - and they don't need a writable filesystem. If you derived the name in the abstract namespace from your PID (or better, application identifier and PID) then you would have exactly the same "unambiguous access" scheme as your proposal. > int sendfd(pid_t pid, int sig, int fd) PIDs tend to be regarded as a bit of an iffy way to refer to another process, because they tend to be racy. If the process you think you're talking to dies, and has its PID reused by another unrelated sendfd()-aware process, you've just sent your open file to somewhere unexpected. You can avoid that if the process is a child of yours, but in that case you could have set up a no-fuss domain socket connection with socketpair() too. - Kevin
Re: [PATCH v8 08/10] x86, mpx: add prctl commands PR_MPX_REGISTER, PR_MPX_UNREGISTER
On Wed, Sep 17, 2014 at 09:43:09PM -0700, Dave Hansen wrote: > On 09/17/2014 08:23 PM, Kevin Easton wrote: > > I was actually thinking that the kernel would take care of the xsave / > > xrstor (for current), updating tsk->thread.fpu.state (for non-running > > threads) and sending an IPI for threads running on other CPUs. > > > > Of course userspace can always then manually change the bounds directory > > address itself, but then it's quite clear that they're doing something > > unsupported. Just an idea, anyway. > > What's the benefit of that? > > As it stands now, MPX is likely to be enabled well before any threads > are created, and the MPX enabling state will be inherited by the new > thread at clone() time. The current mechanism allows a thread to > individually enable or disable MPX independently of the other threads. > > I think it makes it both more complicated and less flexible. I was assuming that if an application did want to enable MPX after threads had already been created, it would generally want to enable it simultaneously across all threads. This would be a lot easier for the kernel than for the application. - Kevin
Re: [PATCH v8 08/10] x86, mpx: add prctl commands PR_MPX_REGISTER, PR_MPX_UNREGISTER
On Thu, Sep 18, 2014 at 12:40:29AM +0000, Ren, Qiaowei wrote: > > Would it be prudent to use an error code other than EINVAL for the > > "hardware doesn't support it" case? > > > Seems like no specific error code for this case. ENXIO would probably be OK. It's not too important as long as it's documented. > > >> @@ -2011,6 +2017,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned > > long, arg2, unsigned long, arg3, > >>me->mm->def_flags &= ~VM_NOHUGEPAGE; > >>up_write(&me->mm->mmap_sem); > >>break; > >> + case PR_MPX_REGISTER: > >> + error = MPX_REGISTER(me); > >> + break; > >> + case PR_MPX_UNREGISTER: > >> + error = MPX_UNREGISTER(me); > >> + break; > > > > If you pass me->mm from prctl, that makes it clear that it's > > per-process not per-thread, just like PR_SET_DUMPABLE / PR_GET_DUMPABLE. > > > > This code should also enforce nulls in arg2 / arg3 / arg4 / arg5 if > > it's not using them, otherwise you'll be sunk if you ever want to use them > > later. > > > > It seems like it only makes sense for all threads using the mm to have > > the same bounds directory set. If the interface was changed to > > directly pass the address, then could the kernel take care of setting > > it for *all* of the threads in the process? This seems like something > > that would be easier for the kernel to do than userspace. > > > If the interface was changed to this, it will be possible for insane > application to pass error bounds directory address to kernel. We still > have to call fpu_xsave() to check this. I was actually thinking that the kernel would take care of the xsave / xrstor (for current), updating tsk->thread.fpu.state (for non-running threads) and sending an IPI for threads running on other CPUs. Of course userspace can always then manually change the bounds directory address itself, but then it's quite clear that they're doing something unsupported. Just an idea, anyway.
- Kevin
Re: [PATCH] x86/vdso: Add prctl to set per-process VDSO load
On Tue, Sep 16, 2014 at 05:05:51PM -0700, Richard Larocque wrote: > + case PR_SET_VDSO: > + if (arg2 == PR_VDSO_ENABLE) > + me->signal->disable_vdso = 0; > + else if (arg2 == PR_VDSO_DISABLE) > + me->signal->disable_vdso = 1; > + else > + return -EINVAL; > + break; > + case PR_GET_VDSO: > + if (!me->signal->disable_vdso) > + error = put_user(PR_VDSO_ENABLE, (int __user *)arg2); > + else > + error = put_user(PR_VDSO_DISABLE, (int __user *)arg2); > + break; Perhaps both of those should do if (arg3 || arg4 || arg5) return -EINVAL; - Kevin
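The unused-argument check suggested here is mechanical enough to factor into a helper; a sketch (the helper name is hypothetical, not anything in the kernel):

```c
#include <errno.h>

/* Reject non-zero values in prctl() argument slots the option does
 * not use, so those slots can later be given a meaning without
 * breaking old binaries that happened to pass garbage in them. */
static long prctl_reject_unused(unsigned long arg3, unsigned long arg4,
                                unsigned long arg5)
{
    if (arg3 || arg4 || arg5)
        return -EINVAL;
    return 0;
}
```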
Re: [PATCH v8 08/10] x86, mpx: add prctl commands PR_MPX_REGISTER, PR_MPX_UNREGISTER
On Thu, Sep 11, 2014 at 04:46:48PM +0800, Qiaowei Ren wrote: > +static __user void *task_get_bounds_dir(struct task_struct *tsk) > +{ > + struct xsave_struct *xsave_buf; > + > + fpu_xsave(&tsk->thread.fpu); > + xsave_buf = &(tsk->thread.fpu.state->xsave); > + if (!(xsave_buf->bndcsr.cfg_reg_u & MPX_BNDCFG_ENABLE_FLAG)) > + return NULL; > + > + return (void __user *)(unsigned long)(xsave_buf->bndcsr.cfg_reg_u & > + MPX_BNDCFG_ADDR_MASK); > +} This only makes sense if called with 'current', so is there any need for the function argument? > + > +int mpx_register(struct task_struct *tsk) > +{ > + struct mm_struct *mm = tsk->mm; > + > + if (!cpu_has_mpx) > + return -EINVAL; > + > + /* > + * runtime in the userspace will be responsible for allocation of > + * the bounds directory. Then, it will save the base of the bounds > + * directory into XSAVE/XRSTOR Save Area and enable MPX through > + * XRSTOR instruction. > + * > + * fpu_xsave() is expected to be very expensive. In order to do > + * performance optimization, here we get the base of the bounds > + * directory and then save it into mm_struct to be used in future. > + */ > + mm->bd_addr = task_get_bounds_dir(tsk); > + if (!mm->bd_addr) > + return -EINVAL; > + > + return 0; > +} > + > +int mpx_unregister(struct task_struct *tsk) > +{ > + struct mm_struct *mm = current->mm; > + > + if (!cpu_has_mpx) > + return -EINVAL; > + > + mm->bd_addr = NULL; > + return 0; > +} If that's changed, then mpx_register() and mpx_unregister() don't need a task_struct, just an mm_struct. Probably these functions should be locking mmap_sem. Would it be prudent to use an error code other than EINVAL for the "hardware doesn't support it" case?
> @@ -2011,6 +2017,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, > arg2, unsigned long, arg3, > me->mm->def_flags &= ~VM_NOHUGEPAGE; > up_write(&me->mm->mmap_sem); > break; > + case PR_MPX_REGISTER: > + error = MPX_REGISTER(me); > + break; > + case PR_MPX_UNREGISTER: > + error = MPX_UNREGISTER(me); > + break; If you pass me->mm from prctl, that makes it clear that it's per-process not per-thread, just like PR_SET_DUMPABLE / PR_GET_DUMPABLE. This code should also enforce nulls in arg2 / arg3 / arg4 / arg5 if it's not using them, otherwise you'll be sunk if you ever want to use them later. It seems like it only makes sense for all threads using the mm to have the same bounds directory set. If the interface was changed to directly pass the address, then could the kernel take care of setting it for *all* of the threads in the process? This seems like something that would be easier for the kernel to do than userspace. - Kevin
Re: [PATCH] fs: Cleanup string initializations (char[] instead of char *)
On Sat, May 17, 2014 at 05:53:45PM +0100, Al Viro wrote: > On Sat, May 17, 2014 at 05:00:18PM +0200, Manuel Schölling wrote: > > Initializations like 'char *foo = "bar"' will create two variables: a static > > string and a pointer (foo) to that static string. Instead 'char foo[] = > > "bar"' > > will declare a single variable and will end up in shorter > > assembly (according to Jeff Garzik on the KernelJanitor's TODO list). > > The hell it will. Compare assembler generated e.g. for 32bit x86 before > and after. > > > { > > char *dp; > > char *status = "disabled"; > > - const char * flags = "flags: "; > > + const char flags[] = "flags: "; > > The first variant puts address of constant array into local variable > (on stack or in a register). The second one fills local _array_ - the > string itself goes on stack. Right - if you're going to do this, you really want to make the array static as well: static const char flags[] = "flags: "; (it's very unlikely that you would want a const array that isn't static). As Al showed though, usually the optimiser will figure things out on its own, and the pointer variable in the original variant gets optimised away. There's also optimisations that apply to string literals but don't apply to arrays, like string literal merging, which mean that the original could well still be preferred. - Kevin
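The object-size difference being discussed can be seen directly with sizeof; a small standalone sketch (nothing kernel-specific is assumed):

```c
#include <stddef.h>

/* The string is part of the array object itself, so sizeof gives the
 * seven characters of "flags: " plus the terminating NUL.  Being
 * static const, it lives in read-only data and is not copied onto
 * the stack at each call. */
static const char flags[] = "flags: ";

size_t pointer_variant_size(void)
{
    const char *p = "flags: ";  /* two objects: pointer + anonymous literal */
    return sizeof p;            /* size of the pointer, not the string */
}

size_t array_variant_size(void)
{
    return sizeof flags;        /* 8: the string data itself */
}
```

A non-static local `const char flags[]` would instead be re-initialized on the stack at every function entry, which is the codegen difference Al points out.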
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > > I'm just dying to hear a "normal" use case then. :) > > So the more "normal" use case would be marking objects volatile and > then non-volatile w/o accessing them in-between. In this case the > zero-fill vs SIGBUS semantics don't really matter, its really just a > trade off in how we handle applications deviating (intentionally or > not) from this use case. > > So to maybe flesh out the context here for folks who are following > along (but weren't in the hallway at LSF :), Johannes made a fairly > interesting proposal (Johannes: Please correct me here where I'm maybe > slightly off here) to use only the dirty bits of the ptes to mark a > page as volatile. Then the kernel could reclaim these clean pages as > it needed, and when we marked the range as non-volatile, the pages > would be re-dirtied and if any of the pages were missing, we could > return a flag with the purged state. This had some different > semantics than what I've been working with for awhile (for example, > any writes to pages would implicitly clear volatility), so I wasn't > completely comfortable with it, but figured I'd think about it to see > if it could be done. Particularly since it would in some ways simplify > tmpfs/shm shared volatility that I'd eventually like to do. ... > Now, while for the case I'm personally most interested in (ashmem), > zero-fill would technically be ok, since that's what Android does. > Even so, I don't think its the best approach for the interface, since > applications may end up quite surprised by the results when they > accidentally don't follow the "don't touch volatile pages" rule. > > That point beside, I think the other problem with the page-cleaning > volatility approach is that there are other awkward side effects. For > example: Say an application marks a range as volatile. One page in the > range is then purged.
The application, due to a bug or otherwise, > reads the volatile range. This causes the page to be zero-filled in, > and the application silently uses the corrupted data (which isn't > great). More problematic though, is that by faulting the page in, > they've in effect lost the purge state for that page. When the > application then goes to mark the range as non-volatile, all pages are > present, so we'd return that no pages were purged. From an > application perspective this is pretty ugly. The write-implicitly-clears-volatile semantics would actually be an advantage for some use cases. If you have a volatile cache of many sub-page-size objects, the application can just include at the start of each page "int present, in_use;". "present" is set to non-zero before marking volatile, and when the application wants to unmark as volatile it writes to "in_use" and tests the value of "present". No need for a syscall at all, although it does take a minor fault. The syscall would be better for the case of large objects, though. Or is that fatally flawed? - Kevin
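A userspace sketch of that in-page marker idea, with the kernel's purge (zero-fill) simulated by a stub so the bookkeeping is visible - the structure layout and function names are illustrative only:

```c
#include <string.h>

/* Each cache page starts with 'present' and 'in_use'.  'present' is
 * set before the range is marked volatile.  Under the proposed
 * write-clears-volatile semantics, writing 'in_use' takes the page
 * off the volatile list, and a zero 'present' afterwards means the
 * kernel purged (zero-filled) it in the meantime.  purge_page()
 * stands in for the kernel reclaiming the page. */
struct cache_page {
    int present;
    int in_use;
    char data[4096 - 2 * sizeof(int)];
};

static void mark_volatile(struct cache_page *p)
{
    p->present = 1;            /* survives only if the page is not purged */
    p->in_use = 0;
    /* real code: issue the volatile-range call on this page here */
}

static void purge_page(struct cache_page *p)
{
    memset(p, 0, sizeof(*p));  /* kernel purge == zero-fill */
}

/* Returns 1 if the cached object is still there, 0 if it was purged. */
static int unmark_volatile(struct cache_page *p)
{
    p->in_use = 1;             /* the write clears volatility */
    return p->present;         /* zero => page was purged while volatile */
}
```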
Re: [RFC] [PATCH] Pre-emption control for userspace
On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote: > Anything else? If it was possible to make the time remaining in the current timeslice available to userspace through the vdso, the thread could do something like: if (sys_timeleft() < CRITICAL_SECTION_SIZE) yield(); lock(); to avoid running out of timeslice in the middle of the critical section. - Kevin
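Sketched out, with the proposed vdso call stubbed - sys_timeleft() does not exist, so a test double stands in for it here; only the decision logic is real:

```c
/* Stand-in for the hypothetical vdso call returning the nanoseconds
 * left in the current timeslice; a test can set this directly. */
static unsigned long long fake_timeslice_left_ns = 50000;

static unsigned long long timeslice_left_ns(void)
{
    return fake_timeslice_left_ns;
}

/* Yield first if the critical section would likely outlast the
 * remaining slice, so we don't get preempted while holding the lock. */
static int should_yield_before_lock(unsigned long long critical_ns)
{
    return timeslice_left_ns() < critical_ns;
}
```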
Re: Update of file offset on write() etc. is non-atomic with I/O
On Fri, Feb 21, 2014 at 07:01:31AM +0100, Michael Kerrisk (man-pages) wrote: > Here's the full list from POSIX.1-2008/SUSv4 Section XSI 2.9.7: > > [[ > 2.9.7 Thread Interactions with Regular File Operations > > All of the following functions shall be atomic with respect to each > other in the effects specified in > POSIX.1-2008 when they operate on regular files or symbolic links: > > chmod( ) > chown( ) > close( ) > creat( ) > dup2( ) > fchmod( ) > fchmodat( ) > fchown( ) > fchownat( ) > fcntl( ) > fstat( ) > fstatat( ) > ftruncate( ) > lchown( ) > link( ) > linkat( ) > lseek( ) > lstat( ) > open( ) > openat( ) > pread( ) > read( ) > readlink( ) > readlinkat( ) > readv( ) > pwrite( ) > rename( ) > renameat( ) > stat( ) > symlink( ) > symlinkat( ) > truncate( ) > unlink( ) > unlinkat( ) > utime( ) > utimensat( ) > utimes( ) > write( ) > writev( ) > > If two threads each call one of these functions, each call shall > either see all of the specified effects > of the other call, or none of them. > ]] > > I'd bet that there's a bunch of violations to be found, but the > read/write f_pos case is one of the most egregious. > > For example, I got curious about stat() versus rename(). If one > stat()s a directory while a subdirectory is being renamed to a new > name within that directory, does the link count of the parent > directory ever change--that is, could stat() ever see a changed link > count in the middle of the rename()? My experiments suggest that it > can. I suppose it would have to be a very unusual application that > would be troubled by that, but it does appear to be a violation of > 2.9.7. A directory isn't a regular file or symlink, though, so the warranty would seem to be void in that case. If you {f}stat() a regular file that is being concurrently rename()d then the link count shouldn't vary, though.
- Kevin
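One practical consequence of the f_pos discussion: threads that each use an explicit offset via pread()/pwrite() never consult or update the shared file offset, so the offset-update atomicity question doesn't arise for them at all. A sketch (the record-based layout is purely illustrative):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read fixed-size record 'rec' at an explicit offset.  Because
 * pread() takes the offset as an argument, concurrent callers on the
 * same descriptor cannot race on a shared f_pos.  Returns bytes read
 * or -1 on error. */
static ssize_t read_record(int fd, void *buf, size_t len,
                           off_t rec, size_t recsz)
{
    return pread(fd, buf, len, rec * (off_t)recsz);
}
```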
Re: 3.12: kernel panic when resuming from suspend to RAM (x86_64)
On Sun, Nov 17, 2013 at 11:06:12PM +0100, Borislav Petkov wrote: > and the virtual address in rIP is 8106f5c3, i.e. the same one > as in the photo. Thus, the CALL instruction tries to call the timer > function 'fn' which we pass as an argument to call_timer_fn. > > However, the address we're trying to call in %r14 is garbage: > 0x455300323d504544 and not in canonical form, causing the #GP. That's part of an ASCII string, "DEP=2\0SE", so if that looks familiar to anyone (part of a kernel command line?) it might give a clue to where the timer callback pointer is being overwritten. > So basically what happens is suspend to RAM corrupts something > containing one or more timer functions and we end up calling crap after > resume. - Kevin
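The decoding is easy to check: interpret the register value as the eight little-endian bytes x86 would have stored in memory (a standalone sketch, not taken from the oops itself):

```c
#include <string.h>

/* Split a 64-bit value into the little-endian byte order in which
 * x86 stores it: 0x455300323d504544 comes out as
 * 'D','E','P','=','2','\0','S','E' - the "DEP=2\0SE" fragment
 * mentioned in the mail. */
static void decode_le64(unsigned long long v, unsigned char out[8])
{
    int i;

    for (i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (8 * i));
}
```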
Re: [PATCH 0/7] preempt_count rework -v2
On Thu, Sep 12, 2013 at 04:20:40AM +0200, Peter Zijlstra wrote: > On Wed, Sep 11, 2013 at 04:02:14PM -0700, Linus Torvalds wrote: > > On Wed, Sep 11, 2013 at 11:59 AM, Peter Zijlstra > > wrote: > > > > > > OK, stripped it down further, I couldn't quite see how to collapse the > > > unary and binary operator variants though :/ > > > > Ok, this looks pretty good. I assume it works too? ;) > > Only compile tested that one.. the below is kvm boot tested until root > mount -- I'll try on actual hardware when I've gotten some sleep. > > I split the thing up into two macros GEN_UNARY_RMWcc and > GEN_BINARY_RMWcc which ends up being more readable as well as smaller > code overall. If you wanted to collapse the unary and binary variants as you mentioned upthread, you could do something like (for the CC_HAVE_ASM_GOTO case):

#define GEN_RMWcc(fullop, cc, ...)				\
do {								\
	asm volatile goto (fullop				\
			"j" cc " %l[cc_label]"			\
			: : __VA_ARGS__				\
			: "memory" : cc_label);			\
	return 0;						\
cc_label:							\
	return 1;						\
} while (0)

#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
	GEN_RMWcc(op " " arg0 ";", cc, "m" (var))

#define GEN_BINARY_RMWcc(op, var, val, arg0, cc) \
	GEN_RMWcc(op " %1, " arg0 ";", cc, "m" (var), "er" (val))

- Kevin
Re: [PATCH jiffies] Avoid undefined behavior from signed overflow
Quoting "Paul E. McKenney" :

> On Mon, Jul 29, 2013 at 03:30:35PM +1000, c...@guarana.org wrote:
> > Quoting "Paul E. McKenney" :
> > ...
> > > Note that the C standard considers the cast from signed to
> > > unsigned to be implementation-defined, see 6.3.1.3p3.
> > ...
> > Don't worry, the cast from signed to unsigned is actually well-defined -
> > the relevant part is 6.3.1.3p2 (in C99):
> >
> > > Otherwise, if the new type is unsigned, the value is converted by
> > > repeatedly adding or subtracting one more than the maximum value that
> > > can be represented in the new type until the value is in the range of
> > > the new type.
>
> Yep, but we are going in the other direction, from unsigned to signed.

Ahh, there's an error in the commit message (it says signed to unsigned).

    - Kevin

This message was sent using IMP, the Internet Messaging Program.
[patch] clone.2, execve.2: Describe interaction of execve(2) and CLONE_FILES
Hi Michael,

This very small patch against man-pages-3.50 adds a description of the way
in which a shared file descriptor table, as created by the CLONE_FILES flag
of clone(2), is unshared when a process calls execve(2).

It's easy to see this from the source - there's an unshare_files() call
early in do_execve_common().

    - Kevin

diff -urN man-pages-3.50.orig/man2/clone.2 man-pages-3.50/man2/clone.2
--- man-pages-3.50.orig/man2/clone.2	2013-03-15 18:17:35.0 +1100
+++ man-pages-3.50/man2/clone.2	2013-03-28 21:50:04.745932956 +1100
@@ -180,6 +180,9 @@
 .BR fcntl (2)
 .B F_SETFD
 operation), the other process is also affected.
+If a process sharing a file descriptor table calls
+.BR execve (2),
+its file descriptor table is duplicated (unshared).
 If
 .B CLONE_FILES
diff -urN man-pages-3.50.orig/man2/execve.2 man-pages-3.50/man2/execve.2
--- man-pages-3.50.orig/man2/execve.2	2013-03-15 18:17:35.0 +1100
+++ man-pages-3.50/man2/execve.2	2013-03-28 22:02:56.793844616 +1100
@@ -204,6 +204,11 @@
 .B SIGCHLD
 (see
 .BR clone (2)).
+.IP *
+The file descriptor table is unshared, undoing the effect of the
+.B CLONE_FILES
+flag of
+.BR clone (2).
 .PP
 Note the following further points:
 .IP * 3
Re: New system call wanted: fdreopen
Quoting Tristan Wibberley :

> Why
> ===
>
> A common idiom on Linux is to open a file and keep the fd open so that
> the underlying file can be unlinked from its directory. But if the file
> needs to be read from several different parts of the codebase then due
> to the file descriptor having exactly one read pointer those different
> parts must be synchronised which is a relatively difficult task.

Another alternative is to use pwrite() / pread(), which do not affect the
file pointer. They're in POSIX.

    - Kevin
Re: [PATCH v1] firmware loader: introduce module parameter to customize fw search path
Quoting Al Viro :

> On Thu, Oct 25, 2012 at 08:38:25PM -0700, Linus Torvalds wrote:
> > It's valid to cast a non-const pointer to a const one. It's the
> > *other* way around that is invalid.
> >
> > So marking fw_path[] as having 'const char *' elements just means that
> > we won't be changing those elements through the fw_path[] array
> > (correct: we only read them). The fact that one of those same pointers
> > is then also available through a non-const pointer variable means that
> > they can change through *that* pointer, but that doesn't change the
> > fact that fw_path[] itself contains const pointers.
> >
> > Remember: in C, a "const pointer" does *not* mean that the thing it
> > points to cannot change. It only means that it cannot change through
> > *that* pointer.
>
> It's a bit trickier, unfortunately - pointer to pointer to const char
> and pointer to pointer to char do not mix. Just for fun, try to
> constify envp and argv arguments of call_usermodehelper()...

That's because if it _was_ allowed, you could use it to silently launder
the const away:

const char *c = "rodata";
char *x;
const char **y;

y = &x;
*y = c;
/* We now have (const char) values accessible through a (char *) pointer x */