subject:"Re\: \[PATCHv2 1\/1\] Documentation\: describe how to add a system call"

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-08-01 Thread H. Peter Anvin

Desnoyers ,"linux-...@vger.kernel.org" 
,"linux-kernel@vger.kernel.org" 
,Peter Zijlstra 
Message-ID: 

Let me see if I can unbury it - probably not until Monday, though.

On July 31, 2015 11:18:06 PM PDT, Josh Triplett  wrote:
>On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote:
>> On 07/31/2015 09:32 PM, Josh Triplett wrote:
>> > 
>> > Sure, agreed.  But I really hope we don't create new kernel ABIs
>that
>> > involve constructs like that.
>> > 
>> 
>> It's worth noting I have pushed for auto-marshalling in general for a
>> long time, not the least to get rid of the awful syscall(3) wrapper. 
>I
>> even built a prototype but didn't have time to productize it.
>
>Interesting!  Can you post the prototype?
>
>- Josh Triplett

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-08-01 Thread Josh Triplett

On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote:
> On 07/31/2015 09:32 PM, Josh Triplett wrote:
> > 
> > Sure, agreed.  But I really hope we don't create new kernel ABIs that
> > involve constructs like that.
> > 
> 
> It's worth noting I have pushed for auto-marshalling in general for a
> long time, not the least to get rid of the awful syscall(3) wrapper.  I
> even built a prototype but didn't have time to productize it.

Interesting!  Can you post the prototype?

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-08-01 Thread H. Peter Anvin

Desnoyers mathieu.desnoy...@efficios.com,linux-...@vger.kernel.org 
linux-...@vger.kernel.org,linux-kernel@vger.kernel.org 
linux-kernel@vger.kernel.org,Peter Zijlstra a.p.zijls...@chello.nl
Message-ID: d190a265-cbf0-460e-b12a-df0e802c5...@zytor.com

Let me see if I can unbury it - probably not until Monday, though.

On July 31, 2015 11:18:06 PM PDT, Josh Triplett j...@joshtriplett.org wrote:
On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote:
 On 07/31/2015 09:32 PM, Josh Triplett wrote:
  
  Sure, agreed.  But I really hope we don't create new kernel ABIs
that
  involve constructs like that.
  
 
 It's worth noting I have pushed for auto-marshalling in general for a
 long time, not the least to get rid of the awful syscall(3) wrapper. 
I
 even built a prototype but didn't have time to productize it.

Interesting!  Can you post the prototype?

- Josh Triplett

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-08-01 Thread Josh Triplett

On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote:
 On 07/31/2015 09:32 PM, Josh Triplett wrote:
  
  Sure, agreed.  But I really hope we don't create new kernel ABIs that
  involve constructs like that.
  
 
 It's worth noting I have pushed for auto-marshalling in general for a
 long time, not the least to get rid of the awful syscall(3) wrapper.  I
 even built a prototype but didn't have time to productize it.

Interesting!  Can you post the prototype?

- Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread H. Peter Anvin

On 07/31/2015 09:32 PM, Josh Triplett wrote:
> 
> Sure, agreed.  But I really hope we don't create new kernel ABIs that
> involve constructs like that.
> 

It's worth noting I have pushed for auto-marshalling in general for a
long time, not the least to get rid of the awful syscall(3) wrapper.  I
even built a prototype but didn't have time to productize it.

-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Josh Triplett

On Fri, Jul 31, 2015 at 03:54:56PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 31, 2015 at 3:08 PM,   wrote:
> > On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 31, 2015 at 1:59 PM,   wrote:
> >> > Agreed.  I think the proposal above would be a net improvement, but
> >> > ideally you'd want something that's annotated and generates automatic
> >> > marshalling code.
> >> >
> >>
> >> I assume this is idle musing.  If, however, we were to actually do
> >> this, I'd suggest we seriously consider speaking the Cap'n Proto
> >> serialization format.  It's quite nice, it encodes and decodes *very*
> >> quickly and, unlike TLV schemes, you don't have to read it in order,
> >> making the read-side code less awkward.
> >
> > That seems like *massive* overkill for a kernel<->userspace syscall
> > interface.  I was more thinking about having a few standardized marshal
> > types, and incrementally adding more when more patterns show up.  For a
> > first pass, just automatically running copy_from_user and
> > copy_param_struct on appropriate sets of __user parameters identified as
> > such in a structured text file seems quite sufficient.  (Plus
> > automatically generating syscalls.h from that.)
> 
> If a param struct does the trick, then I agree.  It's when you start
> having lists and other variable-size stuff that it gets messier.

Sure, agreed.  But I really hope we don't create new kernel ABIs that
involve constructs like that.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Andy Lutomirski

On Fri, Jul 31, 2015 at 3:08 PM,   wrote:
> On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 31, 2015 at 1:59 PM,   wrote:
>> > Agreed.  I think the proposal above would be a net improvement, but
>> > ideally you'd want something that's annotated and generates automatic
>> > marshalling code.
>> >
>>
>> I assume this is idle musing.  If, however, we were to actually do
>> this, I'd suggest we seriously consider speaking the Cap'n Proto
>> serialization format.  It's quite nice, it encodes and decodes *very*
>> quickly and, unlike TLV schemes, you don't have to read it in order,
>> making the read-side code less awkward.
>
> That seems like *massive* overkill for a kernel<->userspace syscall
> interface.  I was more thinking about having a few standardized marshal
> types, and incrementally adding more when more patterns show up.  For a
> first pass, just automatically running copy_from_user and
> copy_param_struct on appropriate sets of __user parameters identified as
> such in a structured text file seems quite sufficient.  (Plus
> automatically generating syscalls.h from that.)

If a param struct does the trick, then I agree.  It's when you start
having lists and other variable-size stuff that it gets messier.

For reference, my Cap'n Proto parser (just the basic structure of
messages, not the actual schema bits, although the latter is more or
less just some structs) is about 300 lines of code.  It's arguably
simpler than nlattr, once you throw nesting in.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread josh

On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 31, 2015 at 1:59 PM,   wrote:
> > Agreed.  I think the proposal above would be a net improvement, but
> > ideally you'd want something that's annotated and generates automatic
> > marshalling code.
> >
> 
> I assume this is idle musing.  If, however, we were to actually do
> this, I'd suggest we seriously consider speaking the Cap'n Proto
> serialization format.  It's quite nice, it encodes and decodes *very*
> quickly and, unlike TLV schemes, you don't have to read it in order,
> making the read-side code less awkward.

That seems like *massive* overkill for a kernel<->userspace syscall
interface.  I was more thinking about having a few standardized marshal
types, and incrementally adding more when more patterns show up.  For a
first pass, just automatically running copy_from_user and
copy_param_struct on appropriate sets of __user parameters identified as
such in a structured text file seems quite sufficient.  (Plus
automatically generating syscalls.h from that.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Andy Lutomirski

On Fri, Jul 31, 2015 at 1:59 PM,   wrote:
> On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote:
>> On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett  wrote:
>> > On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
>> >> On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett  
>> >> wrote:
>> >> > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
>> >> >> I like this, it's a good description of both options. I'm still biased
>> >> >> about the approach: I prefer flags, since pointers to user structures
>> >> >> complicate syscall filtering. ;)
>> >> >
>> >> > Seems like we should do two things to make that easier:
>> >> >
>> >> > 1) Create a standardized kernel mechanism for parameter-struct handling,
>> >> >implementing the recommendations mentioned here.
>> >>
>> >> It's been suggested in the past that nlmsg is appropriate for such a
>> >> thing, but I remain suspicious. :)
>> >
>> > Likewise. :)
>> >
>> >> > 2) Integrate into that mechanism a way to filter the resulting parameter
>> >> >struct with BPF *after* it has been copied to kernel space (and thus
>> >> >can no longer be tampered with).
>> >>
>> >> Yeah, this is a irritating part: the structures operated on are copied
>> >> from userspace adhoc in each syscall. Doing argument checking would
>> >> mean double copies initially, and perhaps teaching syscalls about
>> >> optional "already copied" arguments or something as an optimization.
>> >
>> > No, double copies can't work for security reasons.  Because otherwise
>> > you could race the kernel from another thread, substituting different
>> > values after the check and before the use.
>>
>> Right, the double copy method would require setting up a per-thread
>> userspace memory mapping that was read-only from userspace but
>> writable from kernel space.
>
> Which seems like a lot more trouble than just copying it once.
>
>> > I think the right API looks *roughly* like this:
>> >
>> > int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
>> > user_len, void __user *user_struct)
>> > {
>> > if (user_len > kernel_len)
>> > return -EINVAL;
>> > if (user_len && copy_from_user(kernel_struct, user_struct, 
>> > user_len))
>> > return -EFAULT;
>> > if (user_len < kernel_len)
>> > memset(kernel_struct + user_len, 0, kernel_len - user_len);
>> > return 0;
>> > }
>> >
>> > #define copy_param_struct(kernel_struct, user_len, user_struct) 
>> > _copy_param_struct( \
>> > sizeof(*kernel_struct) + 
>> > BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
>> > kernel_struct, user_len, user_struct)
>> >
>> >
>> > Then the syscall looks like this:
>> >
>> > SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct 
>> > xyzzy_params __user *user_params)
>> > {
>> > int ret;
>> > struct xyzzy_params params;
>> >
>> > ret = copy_param_struct(, user_params_len, user_params);
>> > if (ret)
>> > return ret;
>> > ...
>> >
>> >
>> > And you could then hook copy_params_struct to add arbitrary additional
>> > syscall parameter validation.  Bonus if there's some way to make the
>> > copy and validation occur before the syscall is ever invoked, rather
>> > than inside the syscall, but that would require adding fancier syscall
>> > definition mechanisms that autogenerate such code.
>>
>> The trouble is that the hook for the syscall (both seccomp and ptrace)
>> happens before the sys_* function executes. So the param extract
>> suddenly becomes optional. As in, did ptrace/seccomp already extract
>> the args? If so, use that copy, else copy them out myself now that I
>> need them, etc.
>>
>> It's entirely doable, but it's going to require some careful design.
>
> Agreed.  I think the proposal above would be a net improvement, but
> ideally you'd want something that's annotated and generates automatic
> marshalling code.
>

I assume this is idle musing.  If, however, we were to actually do
this, I'd suggest we seriously consider speaking the Cap'n Proto
serialization format.  It's quite nice, it encodes and decodes *very*
quickly and, unlike TLV schemes, you don't have to read it in order,
making the read-side code less awkward.

(I wouldn't suggest that we use the official C++11 library, but the
format might be a good idea.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread josh

On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote:
> On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett  wrote:
> > On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
> >> On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett  
> >> wrote:
> >> > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
> >> >> I like this, it's a good description of both options. I'm still biased
> >> >> about the approach: I prefer flags, since pointers to user structures
> >> >> complicate syscall filtering. ;)
> >> >
> >> > Seems like we should do two things to make that easier:
> >> >
> >> > 1) Create a standardized kernel mechanism for parameter-struct handling,
> >> >implementing the recommendations mentioned here.
> >>
> >> It's been suggested in the past that nlmsg is appropriate for such a
> >> thing, but I remain suspicious. :)
> >
> > Likewise. :)
> >
> >> > 2) Integrate into that mechanism a way to filter the resulting parameter
> >> >struct with BPF *after* it has been copied to kernel space (and thus
> >> >can no longer be tampered with).
> >>
> >> Yeah, this is a irritating part: the structures operated on are copied
> >> from userspace adhoc in each syscall. Doing argument checking would
> >> mean double copies initially, and perhaps teaching syscalls about
> >> optional "already copied" arguments or something as an optimization.
> >
> > No, double copies can't work for security reasons.  Because otherwise
> > you could race the kernel from another thread, substituting different
> > values after the check and before the use.
> 
> Right, the double copy method would require setting up a per-thread
> userspace memory mapping that was read-only from userspace but
> writable from kernel space.

Which seems like a lot more trouble than just copying it once.

> > I think the right API looks *roughly* like this:
> >
> > int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
> > user_len, void __user *user_struct)
> > {
> > if (user_len > kernel_len)
> > return -EINVAL;
> > if (user_len && copy_from_user(kernel_struct, user_struct, 
> > user_len))
> > return -EFAULT;
> > if (user_len < kernel_len)
> > memset(kernel_struct + user_len, 0, kernel_len - user_len);
> > return 0;
> > }
> >
> > #define copy_param_struct(kernel_struct, user_len, user_struct) 
> > _copy_param_struct( \
> > sizeof(*kernel_struct) + 
> > BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
> > kernel_struct, user_len, user_struct)
> >
> >
> > Then the syscall looks like this:
> >
> > SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct 
> > xyzzy_params __user *user_params)
> > {
> > int ret;
> > struct xyzzy_params params;
> >
> > ret = copy_param_struct(, user_params_len, user_params);
> > if (ret)
> > return ret;
> > ...
> >
> >
> > And you could then hook copy_params_struct to add arbitrary additional
> > syscall parameter validation.  Bonus if there's some way to make the
> > copy and validation occur before the syscall is ever invoked, rather
> > than inside the syscall, but that would require adding fancier syscall
> > definition mechanisms that autogenerate such code.
> 
> The trouble is that the hook for the syscall (both seccomp and ptrace)
> happens before the sys_* function executes. So the param extract
> suddenly becomes optional. As in, did ptrace/seccomp already extract
> the args? If so, use that copy, else copy them out myself now that I
> need them, etc.
> 
> It's entirely doable, but it's going to require some careful design.

Agreed.  I think the proposal above would be a net improvement, but
ideally you'd want something that's annotated and generates automatic
marshalling code.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Kees Cook

On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett  wrote:
> On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
>> On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett  
>> wrote:
>> > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
>> >> I like this, it's a good description of both options. I'm still biased
>> >> about the approach: I prefer flags, since pointers to user structures
>> >> complicate syscall filtering. ;)
>> >
>> > Seems like we should do two things to make that easier:
>> >
>> > 1) Create a standardized kernel mechanism for parameter-struct handling,
>> >implementing the recommendations mentioned here.
>>
>> It's been suggested in the past that nlmsg is appropriate for such a
>> thing, but I remain suspicious. :)
>
> Likewise. :)
>
>> > 2) Integrate into that mechanism a way to filter the resulting parameter
>> >struct with BPF *after* it has been copied to kernel space (and thus
>> >can no longer be tampered with).
>>
>> Yeah, this is a irritating part: the structures operated on are copied
>> from userspace adhoc in each syscall. Doing argument checking would
>> mean double copies initially, and perhaps teaching syscalls about
>> optional "already copied" arguments or something as an optimization.
>
> No, double copies can't work for security reasons.  Because otherwise
> you could race the kernel from another thread, substituting different
> values after the check and before the use.

Right, the double copy method would require setting up a per-thread
userspace memory mapping that was read-only from userspace but
writable from kernel space.

> I think the right API looks *roughly* like this:
>
> int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
> user_len, void __user *user_struct)
> {
> if (user_len > kernel_len)
> return -EINVAL;
> if (user_len && copy_from_user(kernel_struct, user_struct, user_len))
> return -EFAULT;
> if (user_len < kernel_len)
> memset(kernel_struct + user_len, 0, kernel_len - user_len);
> return 0;
> }
>
> #define copy_param_struct(kernel_struct, user_len, user_struct) 
> _copy_param_struct( \
> sizeof(*kernel_struct) + 
> BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
> kernel_struct, user_len, user_struct)
>
>
> Then the syscall looks like this:
>
> SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params 
> __user *user_params)
> {
> int ret;
> struct xyzzy_params params;
>
> ret = copy_param_struct(, user_params_len, user_params);
> if (ret)
> return ret;
> ...
>
>
> And you could then hook copy_params_struct to add arbitrary additional
> syscall parameter validation.  Bonus if there's some way to make the
> copy and validation occur before the syscall is ever invoked, rather
> than inside the syscall, but that would require adding fancier syscall
> definition mechanisms that autogenerate such code.

The trouble is that the hook for the syscall (both seccomp and ptrace)
happens before the sys_* function executes. So the param extract
suddenly becomes optional. As in, did ptrace/seccomp already extract
the args? If so, use that copy, else copy them out myself now that I
need them, etc.

It's entirely doable, but it's going to require some careful design.

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread David Drysdale

On Fri, Jul 31, 2015 at 2:06 PM, Josh Triplett  wrote:
> On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote:
>> On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett  wrote:
>> > On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
>> >> +needs to be governed by the appropriate Linux capability bit (checked 
>> >> with a
>> >> +call to capable()), as described in the capabilities(7) man page.
>> >> +
>> >> + - If there is an existing capability that governs related 
>> >> functionality, then
>> >> +   use that.  However, avoid combining lots of only vaguely related 
>> >> functions
>> >> +   together under the same bit, as this goes against capabilities' 
>> >> purpose of
>> >> +   splitting the power of root.  In particular, avoid adding new uses of 
>> >> the
>> >> +   already overly-general CAP_SYS_ADMIN capability.
>> >> + - If there is no related capability, then consider adding a new 
>> >> capability
>> >> +   bit -- but bear in mind that the numbering space is limited, and each 
>> >> new
>> >> +   bit needs to be understood and administered by sysadmins.
>> >
>> > Many previous discussions on this topic seem to have concluded that it's
>> > almost impossible to add a new capability without breaking existing
>> > programs.  I would suggest not even mentioning this possibility.
>>
>> I'm not particularly knowledgable about capabilities (at least, not the
>> POSIX.1e/Linux kind), so I'll confess that I got this suggestion from
>> Michael Kerrisk.
>>
>> Michael, do you agree that we should just drop the possibility of adding
>> new capability bits?
>>
>> Also, Josh, do you have any references to the earlier discussions on the
>> topic so I can update the Sources section?
>
> No direct links handy at the moment without some searching, but one
> iteration of it came up when Matthew Garrett suggested adding
> CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability
> semantics suggested that the way the capability space was defined and
> controlled by userspace meant that adding any new capability would
> effectively break userspace ABI.  The userspace ABI for capabilities is
> not clear; some applications drop all possible capabilities and could
> get surprised by a new capability being dropped, while other
> applications drop only capabilities they know about and could get
> surprised by a new capability *not* being dropped.

BTW, I left this paragraph unchanged in the v3 version I just sent
out -- I'll update it for v4 when I get back from vacation, depending
on what discussion occurs in the meantime...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Josh Triplett

On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote:
> On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett  wrote:
> > On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
> >> Add a document describing the process of adding a new system call,
> >> including the need for a flags argument for future compatibility, and
> >> covering 32-bit/64-bit concerns (albeit in an x86-centric way).
> >>
> >> Signed-off-by: David Drysdale 
> >> Reviewed-by: Michael Kerrisk 
> >> Reviewed-by: Eric B Munson 
> >> Reviewed-by: Kees Cook 
> >> Reviewed-by: Randy Dunlap 
> >
> > A few comments below.  I also agree with Ingo's comment about
> > documenting the param-structure approach in addition to flags.
> 
> Many thanks for looking through the doc in detail!
> 
> I'll send out an updated doc today; I'm then on vacation for the next
> week so I'll collate further feedback after I return.

Thanks for the quick response and for incorporating these changes.

> > Some things to add a this point, before talking about adding a new file
> > descriptor:
> >
> > If you want to provide userspace with a handle to a kernel object, do so
> > in the form of a file descriptor.  Do not invent a new type of userspace
> > object handle with its own namespace.  The kernel already has numerous
> > system calls and well-defined semantics for managing and passing around
> > file descriptors, as well as for cleaning up when the file descriptor is
> > no longer in use.
> >
> > If your new system call returns a file descriptor, think about what it
> > means to use the poll(2) family of syscalls on that file descriptor.  If
> > you want to send an event to userspace associated with your kernel
> > object, you should do so by making your file descriptor ready for
> > reading or writing.
> 
> Good points, I've added a couple of paragraphs along these lines:
> 
> "
> If your new system call allows userspace to refer to a kernel object, it
> should use a file descriptor as the handle for that object -- don't invent a
> new type of userspace object handle when the kernel already has mechanisms and
> well-defined semantics for using file descriptors.
> 
> [snip: existing text on CLOEXEC]
> 
> If your system call returns a new file descriptor, you should also consider
> what it means to use the poll(2) family of system calls on that file
> descriptor. Making a file descriptor ready for reading or writing is the
> normal way for the kernel to indicate to userspace that an event has
> occurred on the corresponding kernel object.
> "

These look good to me.

> >> +needs to be governed by the appropriate Linux capability bit (checked 
> >> with a
> >> +call to capable()), as described in the capabilities(7) man page.
> >> +
> >> + - If there is an existing capability that governs related functionality, 
> >> then
> >> +   use that.  However, avoid combining lots of only vaguely related 
> >> functions
> >> +   together under the same bit, as this goes against capabilities' 
> >> purpose of
> >> +   splitting the power of root.  In particular, avoid adding new uses of 
> >> the
> >> +   already overly-general CAP_SYS_ADMIN capability.
> >> + - If there is no related capability, then consider adding a new 
> >> capability
> >> +   bit -- but bear in mind that the numbering space is limited, and each 
> >> new
> >> +   bit needs to be understood and administered by sysadmins.
> >
> > Many previous discussions on this topic seem to have concluded that it's
> > almost impossible to add a new capability without breaking existing
> > programs.  I would suggest not even mentioning this possibility.
> 
> I'm not particularly knowledgable about capabilities (at least, not the
> POSIX.1e/Linux kind), so I'll confess that I got this suggestion from
> Michael Kerrisk.
> 
> Michael, do you agree that we should just drop the possibility of adding
> new capability bits?
> 
> Also, Josh, do you have any references to the earlier discussions on the
> topic so I can update the Sources section?

No direct links handy at the moment without some searching, but one
iteration of it came up when Matthew Garrett suggested adding
CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability
semantics suggested that the way the capability space was defined and
controlled by userspace meant that adding any new capability would
effectively break userspace ABI.  The userspace ABI for capabilities is
not clear; some applications drop all possible capabilities and could
get surprised by a new capability being dropped, while other
applications drop only capabilities they know about and could get
surprised by a new capability *not* being dropped.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread David Drysdale

On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett  wrote:
> On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
>> Add a document describing the process of adding a new system call,
>> including the need for a flags argument for future compatibility, and
>> covering 32-bit/64-bit concerns (albeit in an x86-centric way).
>>
>> Signed-off-by: David Drysdale 
>> Reviewed-by: Michael Kerrisk 
>> Reviewed-by: Eric B Munson 
>> Reviewed-by: Kees Cook 
>> Reviewed-by: Randy Dunlap 
>
> A few comments below.  I also agree with Ingo's comment about
> documenting the param-structure approach in addition to flags.

Many thanks for looking through the doc in detail!

I'll send out an updated doc today; I'm then on vacation for the next
week so I'll collate further feedback after I return.

>> --- /dev/null
>> +++ b/Documentation/adding-syscalls.txt
>> @@ -0,0 +1,463 @@
>> +Adding a New System Call
>> +
>> +
>> +This document describes what's involved in adding a new system call to the
>> +Linux kernel, over and above the normal submission advice in
>> +Documentation/SubmittingPatches.
>> +
>> +
>> +System Call Alternatives
>> +
>> +
>> +The first thing to consider when adding a new system call is whether one of
>> +the alternatives might be suitable instead.  Although system calls are the
>> +most traditional and most obvious interaction points between userspace and 
>> the
>> +kernel, there are other possibilities -- choose what fits best for your
>> +interface.
>> +
>> + - If the operations involved can be made to look like a filesystem-like
>> +   object, it may make more sense to create a new filesystem or device.  
>> This
>> +   also makes it easier to encapsulate the new functionality in a kernel 
>> module
>> +   rather than requiring it to be built into the main kernel.
>> + - If the new functionality involves operations where the kernel 
>> notifies
>> +   userspace that something has happened, then returning a new file
>> +   descriptor for the relevant object allows userspace to use
>> +   poll/select/epoll to receive that notification.
>> + - However, operations that don't map to read(2)/write(2)-like 
>> operations
>> +   have to be implemented as ioctl(2) requests, which can lead to a
>> +   somewhat opaque API.
>> + - If you're just exposing runtime system information, a new node in sysfs
>> +   (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be
>> +   more appropriate.  However, access to these mechanisms requires that the
>> +   relevant filesystem is mounted, which might not always be the case (e.g.
>> +   in a namespaced/sandboxed/chrooted environment).  Avoid adding anything 
>> to
>> +   debugfs, as this is not considered a 'production' interface to userspace.
>
> s/anything/any "API"/

Done.

>> + - If the operation is specific to a particular file or file descriptor, 
>> then
>> +   an additional fcntl(2) command option may be more appropriate.  However,
>> +   fcntl(2) is a multiplexing system call that hides a lot of complexity, so
>> +   this option is best for when the new function is closely analogous to
>> +   existing fcntl(2) functionality, or the new functionality is very simple
>> +   (for example, getting/setting a simple flag related to a file 
>> descriptor).
>> + - If the operation is specific to a particular task or process, then an
>> +   additional prctl(2) command option may be more appropriate.  As with
>> +   fcntl(2), this system call is a complicated multiplexor so is best 
>> reserved
>> +   for near-analogs of existing prctl() commands or getting/setting a simple
>> +   flag related to a process.
>> +
>> +
>> +Designing the API
>> +-
>> +
>> +A new system call forms part of the API of the kernel, and has to be 
>> supported
>> +indefinitely.  As such, it's a very good idea to explicitly discuss the
>> +interface on the kernel mailing list, and to plan for future extensions of 
>> the
>> +interface.  In particular:
>> +
>> +  **Include a flags argument for every new system call**
>> +
>> +The syscall table is littered with historical examples where this wasn't 
>> done,
>> +together with the corresponding follow-up system calls (eventfd/eventfd2,
>> +dup2/dup3, inotify_init/inotify_init1,  pipe/pipe2, renameat/renameat2), so
>> +learn from the history of the kernel and include a flags argument from the
>> +start.
>> +
>> +Also, to make sure that userspace programs can safely use flags between 
>> kernel
>> +versions, check whether the flags value holds any unknown flags, and reject 
>> the
>> +sycall (with EINVAL) if it does:
>> +
>> +if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
>> +return -EINVAL;
>> +
>> +(If no flags values are used yet, check that the flags argument is zero.)
>
> Some things to add a this point, before talking about adding a new file
> descriptor:
>
> If you want to provide userspace with a handle to a

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Josh Triplett

On Fri, Jul 31, 2015 at 03:54:56PM -0700, Andy Lutomirski wrote:
 On Fri, Jul 31, 2015 at 3:08 PM,  j...@joshtriplett.org wrote:
  On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote:
  On Fri, Jul 31, 2015 at 1:59 PM,  j...@joshtriplett.org wrote:
   Agreed.  I think the proposal above would be a net improvement, but
   ideally you'd want something that's annotated and generates automatic
   marshalling code.
  
 
  I assume this is idle musing.  If, however, we were to actually do
  this, I'd suggest we seriously consider speaking the Cap'n Proto
  serialization format.  It's quite nice, it encodes and decodes *very*
  quickly and, unlike TLV schemes, you don't have to read it in order,
  making the read-side code less awkward.
 
  That seems like *massive* overkill for a kernel-userspace syscall
  interface.  I was more thinking about having a few standardized marshal
  types, and incrementally adding more when more patterns show up.  For a
  first pass, just automatically running copy_from_user and
  copy_param_struct on appropriate sets of __user parameters identified as
  such in a structured text file seems quite sufficient.  (Plus
  automatically generating syscalls.h from that.)
 
 If a param struct does the trick, then I agree.  It's when you start
 having lists and other variable-size stuff that it gets messier.

Sure, agreed.  But I really hope we don't create new kernel ABIs that
involve constructs like that.

- Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread H. Peter Anvin

On 07/31/2015 09:32 PM, Josh Triplett wrote:
 
 Sure, agreed.  But I really hope we don't create new kernel ABIs that
 involve constructs like that.
 

It's worth noting I have pushed for auto-marshalling in general for a
long time, not the least to get rid of the awful syscall(3) wrapper.  I
even built a prototype but didn't have time to productize it.

-hpa


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Josh Triplett

On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote:
 On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett j...@joshtriplett.org wrote:
  On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
  Add a document describing the process of adding a new system call,
  including the need for a flags argument for future compatibility, and
  covering 32-bit/64-bit concerns (albeit in an x86-centric way).
 
  Signed-off-by: David Drysdale drysd...@google.com
  Reviewed-by: Michael Kerrisk mtk.manpa...@gmail.com
  Reviewed-by: Eric B Munson emun...@akamai.com
  Reviewed-by: Kees Cook keesc...@chromium.org
  Reviewed-by: Randy Dunlap rdun...@infradead.org
 
  A few comments below.  I also agree with Ingo's comment about
  documenting the param-structure approach in addition to flags.
 
 Many thanks for looking through the doc in detail!
 
 I'll send out an updated doc today; I'm then on vacation for the next
 week so I'll collate further feedback after I return.

Thanks for the quick response and for incorporating these changes.

  Some things to add a this point, before talking about adding a new file
  descriptor:
 
  If you want to provide userspace with a handle to a kernel object, do so
  in the form of a file descriptor.  Do not invent a new type of userspace
  object handle with its own namespace.  The kernel already has numerous
  system calls and well-defined semantics for managing and passing around
  file descriptors, as well as for cleaning up when the file descriptor is
  no longer in use.
 
  If your new system call returns a file descriptor, think about what it
  means to use the poll(2) family of syscalls on that file descriptor.  If
  you want to send an event to userspace associated with your kernel
  object, you should do so by making your file descriptor ready for
  reading or writing.
 
 Good points, I've added a couple of paragraphs along these lines:
 
 
 If your new system call allows userspace to refer to a kernel object, it
 should use a file descriptor as the handle for that object -- don't invent a
 new type of userspace object handle when the kernel already has mechanisms and
 well-defined semantics for using file descriptors.
 
 [snip: existing text on CLOEXEC]
 
 If your system call returns a new file descriptor, you should also consider
 what it means to use the poll(2) family of system calls on that file
 descriptor. Making a file descriptor ready for reading or writing is the
 normal way for the kernel to indicate to userspace that an event has
 occurred on the corresponding kernel object.
 

These look good to me.

  +needs to be governed by the appropriate Linux capability bit (checked 
  with a
  +call to capable()), as described in the capabilities(7) man page.
  +
  + - If there is an existing capability that governs related functionality, 
  then
  +   use that.  However, avoid combining lots of only vaguely related 
  functions
  +   together under the same bit, as this goes against capabilities' 
  purpose of
  +   splitting the power of root.  In particular, avoid adding new uses of 
  the
  +   already overly-general CAP_SYS_ADMIN capability.
  + - If there is no related capability, then consider adding a new 
  capability
  +   bit -- but bear in mind that the numbering space is limited, and each 
  new
  +   bit needs to be understood and administered by sysadmins.
 
  Many previous discussions on this topic seem to have concluded that it's
  almost impossible to add a new capability without breaking existing
  programs.  I would suggest not even mentioning this possibility.
 
 I'm not particularly knowledgable about capabilities (at least, not the
 POSIX.1e/Linux kind), so I'll confess that I got this suggestion from
 Michael Kerrisk.
 
 Michael, do you agree that we should just drop the possibility of adding
 new capability bits?
 
 Also, Josh, do you have any references to the earlier discussions on the
 topic so I can update the Sources section?

No direct links handy at the moment without some searching, but one
iteration of it came up when Matthew Garrett suggested adding
CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability
semantics suggested that the way the capability space was defined and
controlled by userspace meant that adding any new capability would
effectively break userspace ABI.  The userspace ABI for capabilities is
not clear; some applications drop all possible capabilities and could
get surprised by a new capability being dropped, while other
applications drop only capabilities they know about and could get
surprised by a new capability *not* being dropped.

- Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread David Drysdale

On Fri, Jul 31, 2015 at 2:06 PM, Josh Triplett j...@joshtriplett.org wrote:
 On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote:
 On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett j...@joshtriplett.org wrote:
  On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
  +needs to be governed by the appropriate Linux capability bit (checked 
  with a
  +call to capable()), as described in the capabilities(7) man page.
  +
  + - If there is an existing capability that governs related 
  functionality, then
  +   use that.  However, avoid combining lots of only vaguely related 
  functions
  +   together under the same bit, as this goes against capabilities' 
  purpose of
  +   splitting the power of root.  In particular, avoid adding new uses of 
  the
  +   already overly-general CAP_SYS_ADMIN capability.
  + - If there is no related capability, then consider adding a new 
  capability
  +   bit -- but bear in mind that the numbering space is limited, and each 
  new
  +   bit needs to be understood and administered by sysadmins.
 
  Many previous discussions on this topic seem to have concluded that it's
  almost impossible to add a new capability without breaking existing
  programs.  I would suggest not even mentioning this possibility.

 I'm not particularly knowledgable about capabilities (at least, not the
 POSIX.1e/Linux kind), so I'll confess that I got this suggestion from
 Michael Kerrisk.

 Michael, do you agree that we should just drop the possibility of adding
 new capability bits?

 Also, Josh, do you have any references to the earlier discussions on the
 topic so I can update the Sources section?

 No direct links handy at the moment without some searching, but one
 iteration of it came up when Matthew Garrett suggested adding
 CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability
 semantics suggested that the way the capability space was defined and
 controlled by userspace meant that adding any new capability would
 effectively break userspace ABI.  The userspace ABI for capabilities is
 not clear; some applications drop all possible capabilities and could
 get surprised by a new capability being dropped, while other
 applications drop only capabilities they know about and could get
 surprised by a new capability *not* being dropped.

BTW, I left this paragraph unchanged in the v3 version I just sent
out -- I'll update it for v4 when I get back from vacation, depending
on what discussion occurs in the meantime...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Kees Cook

On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett j...@joshtriplett.org wrote:
 On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
 On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org 
 wrote:
  On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
  I like this, it's a good description of both options. I'm still biased
  about the approach: I prefer flags, since pointers to user structures
  complicate syscall filtering. ;)
 
  Seems like we should do two things to make that easier:
 
  1) Create a standardized kernel mechanism for parameter-struct handling,
 implementing the recommendations mentioned here.

 It's been suggested in the past that nlmsg is appropriate for such a
 thing, but I remain suspicious. :)

 Likewise. :)

  2) Integrate into that mechanism a way to filter the resulting parameter
 struct with BPF *after* it has been copied to kernel space (and thus
 can no longer be tampered with).

 Yeah, this is a irritating part: the structures operated on are copied
 from userspace adhoc in each syscall. Doing argument checking would
 mean double copies initially, and perhaps teaching syscalls about
 optional already copied arguments or something as an optimization.

 No, double copies can't work for security reasons.  Because otherwise
 you could race the kernel from another thread, substituting different
 values after the check and before the use.

Right, the double copy method would require setting up a per-thread
userspace memory mapping that was read-only from userspace but
writable from kernel space.

 I think the right API looks *roughly* like this:

 int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
 user_len, void __user *user_struct)
 {
 if (user_len  kernel_len)
 return -EINVAL;
 if (user_len  copy_from_user(kernel_struct, user_struct, user_len))
 return -EFAULT;
 if (user_len  kernel_len)
 memset(kernel_struct + user_len, 0, kernel_len - user_len);
 return 0;
 }

 #define copy_param_struct(kernel_struct, user_len, user_struct) 
 _copy_param_struct( \
 sizeof(*kernel_struct) + 
 BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
 kernel_struct, user_len, user_struct)


 Then the syscall looks like this:

 SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params 
 __user *user_params)
 {
 int ret;
 struct xyzzy_params params;

 ret = copy_param_struct(params, user_params_len, user_params);
 if (ret)
 return ret;
 ...


 And you could then hook copy_params_struct to add arbitrary additional
 syscall parameter validation.  Bonus if there's some way to make the
 copy and validation occur before the syscall is ever invoked, rather
 than inside the syscall, but that would require adding fancier syscall
 definition mechanisms that autogenerate such code.

The trouble is that the hook for the syscall (both seccomp and ptrace)
happens before the sys_* function executes. So the param extract
suddenly becomes optional. As in, did ptrace/seccomp already extract
the args? If so, use that copy, else copy them out myself now that I
need them, etc.

It's entirely doable, but it's going to require some careful design.

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread josh

On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote:
 On Fri, Jul 31, 2015 at 1:59 PM,  j...@joshtriplett.org wrote:
  Agreed.  I think the proposal above would be a net improvement, but
  ideally you'd want something that's annotated and generates automatic
  marshalling code.
 
 
 I assume this is idle musing.  If, however, we were to actually do
 this, I'd suggest we seriously consider speaking the Cap'n Proto
 serialization format.  It's quite nice, it encodes and decodes *very*
 quickly and, unlike TLV schemes, you don't have to read it in order,
 making the read-side code less awkward.

That seems like *massive* overkill for a kernel-userspace syscall
interface.  I was more thinking about having a few standardized marshal
types, and incrementally adding more when more patterns show up.  For a
first pass, just automatically running copy_from_user and
copy_param_struct on appropriate sets of __user parameters identified as
such in a structured text file seems quite sufficient.  (Plus
automatically generating syscalls.h from that.)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread josh

On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote:
 On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett j...@joshtriplett.org wrote:
  On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
  On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org 
  wrote:
   On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
   I like this, it's a good description of both options. I'm still biased
   about the approach: I prefer flags, since pointers to user structures
   complicate syscall filtering. ;)
  
   Seems like we should do two things to make that easier:
  
   1) Create a standardized kernel mechanism for parameter-struct handling,
  implementing the recommendations mentioned here.
 
  It's been suggested in the past that nlmsg is appropriate for such a
  thing, but I remain suspicious. :)
 
  Likewise. :)
 
   2) Integrate into that mechanism a way to filter the resulting parameter
  struct with BPF *after* it has been copied to kernel space (and thus
  can no longer be tampered with).
 
  Yeah, this is a irritating part: the structures operated on are copied
  from userspace adhoc in each syscall. Doing argument checking would
  mean double copies initially, and perhaps teaching syscalls about
  optional already copied arguments or something as an optimization.
 
  No, double copies can't work for security reasons.  Because otherwise
  you could race the kernel from another thread, substituting different
  values after the check and before the use.
 
 Right, the double copy method would require setting up a per-thread
 userspace memory mapping that was read-only from userspace but
 writable from kernel space.

Which seems like a lot more trouble than just copying it once.

  I think the right API looks *roughly* like this:
 
  int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
  user_len, void __user *user_struct)
  {
  if (user_len  kernel_len)
  return -EINVAL;
  if (user_len  copy_from_user(kernel_struct, user_struct, 
  user_len))
  return -EFAULT;
  if (user_len  kernel_len)
  memset(kernel_struct + user_len, 0, kernel_len - user_len);
  return 0;
  }
 
  #define copy_param_struct(kernel_struct, user_len, user_struct) 
  _copy_param_struct( \
  sizeof(*kernel_struct) + 
  BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
  kernel_struct, user_len, user_struct)
 
 
  Then the syscall looks like this:
 
  SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct 
  xyzzy_params __user *user_params)
  {
  int ret;
  struct xyzzy_params params;
 
  ret = copy_param_struct(params, user_params_len, user_params);
  if (ret)
  return ret;
  ...
 
 
  And you could then hook copy_params_struct to add arbitrary additional
  syscall parameter validation.  Bonus if there's some way to make the
  copy and validation occur before the syscall is ever invoked, rather
  than inside the syscall, but that would require adding fancier syscall
  definition mechanisms that autogenerate such code.
 
 The trouble is that the hook for the syscall (both seccomp and ptrace)
 happens before the sys_* function executes. So the param extract
 suddenly becomes optional. As in, did ptrace/seccomp already extract
 the args? If so, use that copy, else copy them out myself now that I
 need them, etc.
 
 It's entirely doable, but it's going to require some careful design.

Agreed.  I think the proposal above would be a net improvement, but
ideally you'd want something that's annotated and generates automatic
marshalling code.

- Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Andy Lutomirski

On Fri, Jul 31, 2015 at 1:59 PM,  j...@joshtriplett.org wrote:
 On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote:
 On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett j...@joshtriplett.org wrote:
  On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
  On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org 
  wrote:
   On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
   I like this, it's a good description of both options. I'm still biased
   about the approach: I prefer flags, since pointers to user structures
   complicate syscall filtering. ;)
  
   Seems like we should do two things to make that easier:
  
   1) Create a standardized kernel mechanism for parameter-struct handling,
  implementing the recommendations mentioned here.
 
  It's been suggested in the past that nlmsg is appropriate for such a
  thing, but I remain suspicious. :)
 
  Likewise. :)
 
   2) Integrate into that mechanism a way to filter the resulting parameter
  struct with BPF *after* it has been copied to kernel space (and thus
  can no longer be tampered with).
 
  Yeah, this is a irritating part: the structures operated on are copied
  from userspace adhoc in each syscall. Doing argument checking would
  mean double copies initially, and perhaps teaching syscalls about
  optional already copied arguments or something as an optimization.
 
  No, double copies can't work for security reasons.  Because otherwise
  you could race the kernel from another thread, substituting different
  values after the check and before the use.

 Right, the double copy method would require setting up a per-thread
 userspace memory mapping that was read-only from userspace but
 writable from kernel space.

 Which seems like a lot more trouble than just copying it once.

  I think the right API looks *roughly* like this:
 
  int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
  user_len, void __user *user_struct)
  {
  if (user_len  kernel_len)
  return -EINVAL;
  if (user_len  copy_from_user(kernel_struct, user_struct, 
  user_len))
  return -EFAULT;
  if (user_len  kernel_len)
  memset(kernel_struct + user_len, 0, kernel_len - user_len);
  return 0;
  }
 
  #define copy_param_struct(kernel_struct, user_len, user_struct) 
  _copy_param_struct( \
  sizeof(*kernel_struct) + 
  BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
  kernel_struct, user_len, user_struct)
 
 
  Then the syscall looks like this:
 
  SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct 
  xyzzy_params __user *user_params)
  {
  int ret;
  struct xyzzy_params params;
 
  ret = copy_param_struct(params, user_params_len, user_params);
  if (ret)
  return ret;
  ...
 
 
  And you could then hook copy_params_struct to add arbitrary additional
  syscall parameter validation.  Bonus if there's some way to make the
  copy and validation occur before the syscall is ever invoked, rather
  than inside the syscall, but that would require adding fancier syscall
  definition mechanisms that autogenerate such code.

 The trouble is that the hook for the syscall (both seccomp and ptrace)
 happens before the sys_* function executes. So the param extract
 suddenly becomes optional. As in, did ptrace/seccomp already extract
 the args? If so, use that copy, else copy them out myself now that I
 need them, etc.

 It's entirely doable, but it's going to require some careful design.

 Agreed.  I think the proposal above would be a net improvement, but
 ideally you'd want something that's annotated and generates automatic
 marshalling code.


I assume this is idle musing.  If, however, we were to actually do
this, I'd suggest we seriously consider speaking the Cap'n Proto
serialization format.  It's quite nice, it encodes and decodes *very*
quickly and, unlike TLV schemes, you don't have to read it in order,
making the read-side code less awkward.

(I wouldn't suggest that we use the official C++11 library, but the
format might be a good idea.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread Andy Lutomirski

On Fri, Jul 31, 2015 at 3:08 PM,  j...@joshtriplett.org wrote:
 On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote:
 On Fri, Jul 31, 2015 at 1:59 PM,  j...@joshtriplett.org wrote:
  Agreed.  I think the proposal above would be a net improvement, but
  ideally you'd want something that's annotated and generates automatic
  marshalling code.
 

 I assume this is idle musing.  If, however, we were to actually do
 this, I'd suggest we seriously consider speaking the Cap'n Proto
 serialization format.  It's quite nice, it encodes and decodes *very*
 quickly and, unlike TLV schemes, you don't have to read it in order,
 making the read-side code less awkward.

 That seems like *massive* overkill for a kernel-userspace syscall
 interface.  I was more thinking about having a few standardized marshal
 types, and incrementally adding more when more patterns show up.  For a
 first pass, just automatically running copy_from_user and
 copy_param_struct on appropriate sets of __user parameters identified as
 such in a structured text file seems quite sufficient.  (Plus
 automatically generating syscalls.h from that.)

If a param struct does the trick, then I agree.  It's when you start
having lists and other variable-size stuff that it gets messier.

For reference, my Cap'n Proto parser (just the basic structure of
messages, not the actual schema bits, although the latter is more or
less just some structs) is about 300 lines of code.  It's arguably
simpler than nlattr, once you throw nesting in.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-31 Thread David Drysdale

On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett j...@joshtriplett.org wrote:
 On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
 Add a document describing the process of adding a new system call,
 including the need for a flags argument for future compatibility, and
 covering 32-bit/64-bit concerns (albeit in an x86-centric way).

 Signed-off-by: David Drysdale drysd...@google.com
 Reviewed-by: Michael Kerrisk mtk.manpa...@gmail.com
 Reviewed-by: Eric B Munson emun...@akamai.com
 Reviewed-by: Kees Cook keesc...@chromium.org
 Reviewed-by: Randy Dunlap rdun...@infradead.org

 A few comments below.  I also agree with Ingo's comment about
 documenting the param-structure approach in addition to flags.

Many thanks for looking through the doc in detail!

I'll send out an updated doc today; I'm then on vacation for the next
week so I'll collate further feedback after I return.

 --- /dev/null
 +++ b/Documentation/adding-syscalls.txt
 @@ -0,0 +1,463 @@
 +Adding a New System Call
 +
 +
 +This document describes what's involved in adding a new system call to the
 +Linux kernel, over and above the normal submission advice in
 +Documentation/SubmittingPatches.
 +
 +
 +System Call Alternatives
 +
 +
 +The first thing to consider when adding a new system call is whether one of
 +the alternatives might be suitable instead.  Although system calls are the
 +most traditional and most obvious interaction points between userspace and 
 the
 +kernel, there are other possibilities -- choose what fits best for your
 +interface.
 +
 + - If the operations involved can be made to look like a filesystem-like
 +   object, it may make more sense to create a new filesystem or device.  
 This
 +   also makes it easier to encapsulate the new functionality in a kernel 
 module
 +   rather than requiring it to be built into the main kernel.
 + - If the new functionality involves operations where the kernel 
 notifies
 +   userspace that something has happened, then returning a new file
 +   descriptor for the relevant object allows userspace to use
 +   poll/select/epoll to receive that notification.
 + - However, operations that don't map to read(2)/write(2)-like 
 operations
 +   have to be implemented as ioctl(2) requests, which can lead to a
 +   somewhat opaque API.
 + - If you're just exposing runtime system information, a new node in sysfs
 +   (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be
 +   more appropriate.  However, access to these mechanisms requires that the
 +   relevant filesystem is mounted, which might not always be the case (e.g.
 +   in a namespaced/sandboxed/chrooted environment).  Avoid adding anything 
 to
 +   debugfs, as this is not considered a 'production' interface to userspace.

 s/anything/any API/

Done.

 + - If the operation is specific to a particular file or file descriptor, 
 then
 +   an additional fcntl(2) command option may be more appropriate.  However,
 +   fcntl(2) is a multiplexing system call that hides a lot of complexity, so
 +   this option is best for when the new function is closely analogous to
 +   existing fcntl(2) functionality, or the new functionality is very simple
 +   (for example, getting/setting a simple flag related to a file 
 descriptor).
 + - If the operation is specific to a particular task or process, then an
 +   additional prctl(2) command option may be more appropriate.  As with
 +   fcntl(2), this system call is a complicated multiplexor so is best 
 reserved
 +   for near-analogs of existing prctl() commands or getting/setting a simple
 +   flag related to a process.
 +
 +
 +Designing the API
 +-
 +
 +A new system call forms part of the API of the kernel, and has to be 
 supported
 +indefinitely.  As such, it's a very good idea to explicitly discuss the
 +interface on the kernel mailing list, and to plan for future extensions of 
 the
 +interface.  In particular:
 +
 +  **Include a flags argument for every new system call**
 +
 +The syscall table is littered with historical examples where this wasn't 
 done,
 +together with the corresponding follow-up system calls (eventfd/eventfd2,
 +dup2/dup3, inotify_init/inotify_init1,  pipe/pipe2, renameat/renameat2), so
 +learn from the history of the kernel and include a flags argument from the
 +start.
 +
 +Also, to make sure that userspace programs can safely use flags between 
 kernel
 +versions, check whether the flags value holds any unknown flags, and reject 
 the
 +sycall (with EINVAL) if it does:
 +
 +if (flags  ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
 +return -EINVAL;
 +
 +(If no flags values are used yet, check that the flags argument is zero.)

 Some things to add a this point, before talking about adding a new file
 descriptor:

 If you want to provide userspace with a handle to a kernel object, do so
 in the form of a file descriptor.  Do not invent a new type of

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 06:02:34PM -0700, Josh Triplett wrote:
> On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
> > On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett  
> > wrote:
> > > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
> > >> I like this, it's a good description of both options. I'm still biased
> > >> about the approach: I prefer flags, since pointers to user structures
> > >> complicate syscall filtering. ;)
> > >
> > > Seems like we should do two things to make that easier:
> > >
> > > 1) Create a standardized kernel mechanism for parameter-struct handling,
> > >implementing the recommendations mentioned here.
> > 
> > It's been suggested in the past that nlmsg is appropriate for such a
> > thing, but I remain suspicious. :)
> 
> Likewise. :)
> 
> > > 2) Integrate into that mechanism a way to filter the resulting parameter
> > >struct with BPF *after* it has been copied to kernel space (and thus
> > >can no longer be tampered with).
> > 
> > Yeah, this is a irritating part: the structures operated on are copied
> > from userspace adhoc in each syscall. Doing argument checking would
> > mean double copies initially, and perhaps teaching syscalls about
> > optional "already copied" arguments or something as an optimization.
> 
> No, double copies can't work for security reasons.  Because otherwise
> you could race the kernel from another thread, substituting different
> values after the check and before the use.
> 
> I think the right API looks *roughly* like this:
> 
> int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
> user_len, void __user *user_struct)
> {
>   if (user_len > kernel_len)
>   return -EINVAL;
>   if (user_len && copy_from_user(kernel_struct, user_struct, user_len))
>   return -EFAULT;
>   if (user_len < kernel_len)
>   memset(kernel_struct + user_len, 0, kernel_len - user_len);
>   return 0;
> }
> 
> #define copy_param_struct(kernel_struct, user_len, user_struct) 
> _copy_param_struct( \
>   sizeof(*kernel_struct) + 
> BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
>   kernel_struct, user_len, user_struct)
> 
> 
> Then the syscall looks like this:
> 
> SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params 
> __user *user_params)

Missed a couple of commas here (after the types and before the names).

> {
>   int ret;
>   struct xyzzy_params params;
> 
>   ret = copy_param_struct(, user_params_len, user_params);
>   if (ret)
>   return ret;
>   ...
> 
> 
> And you could then hook copy_params_struct to add arbitrary additional
> syscall parameter validation.  Bonus if there's some way to make the
> copy and validation occur before the syscall is ever invoked, rather
> than inside the syscall, but that would require adding fancier syscall
> definition mechanisms that autogenerate such code.
> 
> - Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
> On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett  wrote:
> > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
> >> I like this, it's a good description of both options. I'm still biased
> >> about the approach: I prefer flags, since pointers to user structures
> >> complicate syscall filtering. ;)
> >
> > Seems like we should do two things to make that easier:
> >
> > 1) Create a standardized kernel mechanism for parameter-struct handling,
> >implementing the recommendations mentioned here.
> 
> It's been suggested in the past that nlmsg is appropriate for such a
> thing, but I remain suspicious. :)

Likewise. :)

> > 2) Integrate into that mechanism a way to filter the resulting parameter
> >struct with BPF *after* it has been copied to kernel space (and thus
> >can no longer be tampered with).
> 
> Yeah, this is a irritating part: the structures operated on are copied
> from userspace adhoc in each syscall. Doing argument checking would
> mean double copies initially, and perhaps teaching syscalls about
> optional "already copied" arguments or something as an optimization.

No, double copies can't work for security reasons.  Because otherwise
you could race the kernel from another thread, substituting different
values after the check and before the use.

I think the right API looks *roughly* like this:

int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, 
void __user *user_struct)
{
if (user_len > kernel_len)
return -EINVAL;
if (user_len && copy_from_user(kernel_struct, user_struct, user_len))
return -EFAULT;
if (user_len < kernel_len)
memset(kernel_struct + user_len, 0, kernel_len - user_len);
return 0;
}

#define copy_param_struct(kernel_struct, user_len, user_struct) 
_copy_param_struct( \
sizeof(*kernel_struct) + 
BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
kernel_struct, user_len, user_struct)


Then the syscall looks like this:

SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params 
__user *user_params)
{
int ret;
struct xyzzy_params params;

ret = copy_param_struct(, user_params_len, user_params);
if (ret)
return ret;
...


And you could then hook copy_params_struct to add arbitrary additional
syscall parameter validation.  Bonus if there's some way to make the
copy and validation occur before the syscall is ever invoked, rather
than inside the syscall, but that would require adding fancier syscall
definition mechanisms that autogenerate such code.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Kees Cook

On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett  wrote:
> On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
>> I like this, it's a good description of both options. I'm still biased
>> about the approach: I prefer flags, since pointers to user structures
>> complicate syscall filtering. ;)
>
> Seems like we should do two things to make that easier:
>
> 1) Create a standardized kernel mechanism for parameter-struct handling,
>implementing the recommendations mentioned here.

It's been suggested in the past that nlmsg is appropriate for such a
thing, but I remain suspicious. :)

> 2) Integrate into that mechanism a way to filter the resulting parameter
>struct with BPF *after* it has been copied to kernel space (and thus
>can no longer be tampered with).

Yeah, this is a irritating part: the structures operated on are copied
from userspace adhoc in each syscall. Doing argument checking would
mean double copies initially, and perhaps teaching syscalls about
optional "already copied" arguments or something as an optimization.

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
> I like this, it's a good description of both options. I'm still biased
> about the approach: I prefer flags, since pointers to user structures
> complicate syscall filtering. ;)

Seems like we should do two things to make that easier:

1) Create a standardized kernel mechanism for parameter-struct handling,
   implementing the recommendations mentioned here.
2) Integrate into that mechanism a way to filter the resulting parameter
   struct with BPF *after* it has been copied to kernel space (and thus
   can no longer be tampered with).

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
> Add a document describing the process of adding a new system call,
> including the need for a flags argument for future compatibility, and
> covering 32-bit/64-bit concerns (albeit in an x86-centric way).
> 
> Signed-off-by: David Drysdale 
> Reviewed-by: Michael Kerrisk 
> Reviewed-by: Eric B Munson 
> Reviewed-by: Kees Cook 
> Reviewed-by: Randy Dunlap 

A few comments below.  I also agree with Ingo's comment about
documenting the param-structure approach in addition to flags.

> --- /dev/null
> +++ b/Documentation/adding-syscalls.txt
> @@ -0,0 +1,463 @@
> +Adding a New System Call
> +
> +
> +This document describes what's involved in adding a new system call to the
> +Linux kernel, over and above the normal submission advice in
> +Documentation/SubmittingPatches.
> +
> +
> +System Call Alternatives
> +
> +
> +The first thing to consider when adding a new system call is whether one of
> +the alternatives might be suitable instead.  Although system calls are the
> +most traditional and most obvious interaction points between userspace and 
> the
> +kernel, there are other possibilities -- choose what fits best for your
> +interface.
> +
> + - If the operations involved can be made to look like a filesystem-like
> +   object, it may make more sense to create a new filesystem or device.  This
> +   also makes it easier to encapsulate the new functionality in a kernel 
> module
> +   rather than requiring it to be built into the main kernel.
> + - If the new functionality involves operations where the kernel notifies
> +   userspace that something has happened, then returning a new file
> +   descriptor for the relevant object allows userspace to use
> +   poll/select/epoll to receive that notification.
> + - However, operations that don't map to read(2)/write(2)-like operations
> +   have to be implemented as ioctl(2) requests, which can lead to a
> +   somewhat opaque API.
> + - If you're just exposing runtime system information, a new node in sysfs
> +   (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be
> +   more appropriate.  However, access to these mechanisms requires that the
> +   relevant filesystem is mounted, which might not always be the case (e.g.
> +   in a namespaced/sandboxed/chrooted environment).  Avoid adding anything to
> +   debugfs, as this is not considered a 'production' interface to userspace.

s/anything/any "API"/

> + - If the operation is specific to a particular file or file descriptor, then
> +   an additional fcntl(2) command option may be more appropriate.  However,
> +   fcntl(2) is a multiplexing system call that hides a lot of complexity, so
> +   this option is best for when the new function is closely analogous to
> +   existing fcntl(2) functionality, or the new functionality is very simple
> +   (for example, getting/setting a simple flag related to a file descriptor).
> + - If the operation is specific to a particular task or process, then an
> +   additional prctl(2) command option may be more appropriate.  As with
> +   fcntl(2), this system call is a complicated multiplexor so is best 
> reserved
> +   for near-analogs of existing prctl() commands or getting/setting a simple
> +   flag related to a process.
> +
> +
> +Designing the API
> +-
> +
> +A new system call forms part of the API of the kernel, and has to be 
> supported
> +indefinitely.  As such, it's a very good idea to explicitly discuss the
> +interface on the kernel mailing list, and to plan for future extensions of 
> the
> +interface.  In particular:
> +
> +  **Include a flags argument for every new system call**
> +
> +The syscall table is littered with historical examples where this wasn't 
> done,
> +together with the corresponding follow-up system calls (eventfd/eventfd2,
> +dup2/dup3, inotify_init/inotify_init1,  pipe/pipe2, renameat/renameat2), so
> +learn from the history of the kernel and include a flags argument from the
> +start.
> +
> +Also, to make sure that userspace programs can safely use flags between 
> kernel
> +versions, check whether the flags value holds any unknown flags, and reject 
> the
> +sycall (with EINVAL) if it does:
> +
> +if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
> +return -EINVAL;
> +
> +(If no flags values are used yet, check that the flags argument is zero.)

Some things to add a this point, before talking about adding a new file
descriptor:

If you want to provide userspace with a handle to a kernel object, do so
in the form of a file descriptor.  Do not invent a new type of userspace
object handle with its own namespace.  The kernel already has numerous
system calls and well-defined semantics for managing and passing around
file descriptors, as well as for cleaning up when the file descriptor is
no longer in use.

If your new system call returns a file

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 10:38:31AM +0200, Ingo Molnar wrote:
> When the system call is extended in the future on the kernel side, with 'u64 
> param_4', then the structure expands from an old size of 24 to a new size of 
> 32 
> bytes. The following scenarios might occur:
> 
>  - the common case: new user-space calls the new kernel code, ->size is 32 on 
> both 
>sides.
> 
>  - old binaries might call the kernel with params->size == 24, in which case 
> the 
>kernel sets the new fields to 0. The new feature should be written
>accordingly, so that a value of 0 means the old behavior.
> 
>  - new binaries might run on old kernels, with params->size == 32. In this 
> case 
>the old kernel will check that all the new fields it does not know about 
> are 
>set to 0 - if they are nonzero (if the new feature is used) it returns 
> with 
>-ENOSYS or -EINVAL.

Nit: it seems easier, rather than having the kernel check this, to have
userspace only use the minimum version of the structure that contains
the features they use, and then have older kernels reject sizes bigger
than they understand.  If you don't need param_4, don't use the version
of the structure that has param_4.  That also means the kernel doesn't
need to copy and read those values.  Either approach works, though.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Kees Cook

On Thu, Jul 30, 2015 at 4:10 AM, David Drysdale  wrote:
> On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar  wrote:
>>
>> * David Drysdale  wrote:
>>
>>> +Designing the API
>>> +-
>>> +
>>> +A new system call forms part of the API of the kernel, and has to be 
>>> supported
>>> +indefinitely.  As such, it's a very good idea to explicitly discuss the
>>> +interface on the kernel mailing list, and to plan for future extensions of 
>>> the
>>> +interface.  In particular:
>>> +
>>> +  **Include a flags argument for every new system call**
>>
>> Sorry, but I think that's bad avice, because even a 'flags' field is 
>> inflexible
>> and stupid in many cases - it fosters an 'ioctl' kind of design.
>>
>>> +The syscall table is littered with historical examples where this wasn't 
>>> done,
>>> +together with the corresponding follow-up system calls (eventfd/eventfd2,
>>> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so
>>> +learn from the history of the kernel and include a flags argument from the
>>> +start.
>>
>> The syscall table is also littered with system calls that have an argument 
>> space
>> considerably larger than what 6 parameters can express, where various 
>> 'flags' are
>> used to bring in different parts of new APIs, in a rather messy way.
>>
>> The right approach IMHO is to think about how extensible a system call is 
>> expected
>> to be, and to plan accordingly.
>>
>> If you are anywhere close to 6 parameters, you should not introduce 'flags' 
>> but
>> you should _reduce_ the number of parameters to a clean essential of 2 or 3
>> parameters and should shuffle parameters out to a separate 
>> 'parameters/attributes'
>> structure that is passed in by pointer:
>>
>> SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params);
>>
>> And it's the design of 'struct params' that determines future flexibility of 
>> the
>> interface. A very flexible approach is to not use flags but a 'size' 
>> argument:
>>
>> struct params {
>> u32 size;
>> u32 param_1;
>> u64 param_2;
>> u64 param_3;
>> };
>>
>> Where 'size' is set by user-space to the size of 'struct params' known to it 
>> at
>> build time:
>>
>> params->size = sizeof(*params);
>>
>> In the normal case the kernel will get param->size == sizeof(*params) as 
>> known to
>> the kernel.
>>
>> When the system call is extended in the future on the kernel side, with 'u64
>> param_4', then the structure expands from an old size of 24 to a new size of 
>> 32
>> bytes. The following scenarios might occur:
>>
>>  - the common case: new user-space calls the new kernel code, ->size is 32 
>> on both
>>sides.
>>
>>  - old binaries might call the kernel with params->size == 24, in which case 
>> the
>>kernel sets the new fields to 0. The new feature should be written
>>accordingly, so that a value of 0 means the old behavior.
>>
>>  - new binaries might run on old kernels, with params->size == 32. In this 
>> case
>>the old kernel will check that all the new fields it does not know about 
>> are
>>set to 0 - if they are nonzero (if the new feature is used) it returns 
>> with
>>-ENOSYS or -EINVAL.
>>
>> With this approach we have both backwards and forwards binary compatibility: 
>> new
>> binaries will run on old kernels just fine, even if they have ->size set to 
>> 32, as
>> long as they make use of the features.
>>
>> This design simplifies application design considerably: as new code can 
>> mostly
>> forget about old ABIs, there's no multiple versions to be taken care of, 
>> there's
>> just a single 'struct param' known to both sides, and there's no version 
>> skew.
>>
>> We are using such a design in perf_event_open(), see perf_copy_attr() in
>> kernel/events/core.c. And yes, ironically that system call still has a 
>> historic
>> 'flags' argument, but it's not used anymore for extension: we've made over 30
>> extensions to the ABI in the last 3 years, which would have been impossible 
>> with a
>> 'flags' approach.
>>
>> Thanks,
>>
>> Ingo
>
> Fair point, there are other ways to ensure extendability that just the
> simple flags
> approach -- and as you say, for more sophisticated interfaces flags might 
> hinder
> more than they help.
>
> How about the attached text as an attempt to cover both bases?
>
> Thanks,
> David
>
>
> --
>
> Designing the API: Planning for Extension
> -
>
> A new system call forms part of the API of the kernel, and has to be supported
> indefinitely.  As such, it's a very good idea to explicitly discuss the
> interface on the kernel mailing list, and it's important to plan for future
> extensions of the interface.
>
> (The syscall table is littered with historical examples where this wasn't 
> done,
> together with the corresponding follow-up system calls -- eventfd/eventfd2,
> dup2/dup3,

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Greg Kroah-Hartman

On Thu, Jul 30, 2015 at 06:30:07PM +0200, Cyril Hrubis wrote:
> Hi!
> > +Testing
> > +---
> > +
> > +A new system call should obviously be tested; it is also useful to provide
> > +reviewers with a demonstration of how user space programs will use the 
> > system
> > +call.  A good way to combine these aims is to include a simple self-test
> > +program in a new directory under tools/testing/selftests/.
> 
> I know that this is a bit off topic, but since the selftest is now the
> official place to add kernel testcases to let me rant about it a bit.
> 
> It's a bit of a pain seeing people reinvent the wheel and trying to
> figure out consistent test interface while most of the problems has been
> solved in LTP[1] test library quite some time ago. Especially use of the
> SAFE_MACROS[2] would simplify writing test setups quite a lot.
> 
> I wonder if we can at least share the test library, pulling it out of
> LTP, or at least the interesting parts, wouldn't be hard at all.
> 
> [1]
> https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure
> 
> [2]
> https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros

That would be great, please send patches to do so to the linux-api
mailing list and the test maintainer.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Cyril Hrubis

Hi!
> +Testing
> +---
> +
> +A new system call should obviously be tested; it is also useful to provide
> +reviewers with a demonstration of how user space programs will use the system
> +call.  A good way to combine these aims is to include a simple self-test
> +program in a new directory under tools/testing/selftests/.

I know that this is a bit off topic, but since the selftest is now the
official place to add kernel testcases to let me rant about it a bit.

It's a bit of a pain seeing people reinvent the wheel and trying to
figure out consistent test interface while most of the problems has been
solved in LTP[1] test library quite some time ago. Especially use of the
SAFE_MACROS[2] would simplify writing test setups quite a lot.

I wonder if we can at least share the test library, pulling it out of
LTP, or at least the interesting parts, wouldn't be hard at all.

[1]
https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure

[2]
https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros


-- 
Cyril Hrubis
chru...@suse.cz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread David Drysdale

On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar  wrote:
>
> * David Drysdale  wrote:
>
>> +Designing the API
>> +-
>> +
>> +A new system call forms part of the API of the kernel, and has to be 
>> supported
>> +indefinitely.  As such, it's a very good idea to explicitly discuss the
>> +interface on the kernel mailing list, and to plan for future extensions of 
>> the
>> +interface.  In particular:
>> +
>> +  **Include a flags argument for every new system call**
>
> Sorry, but I think that's bad avice, because even a 'flags' field is 
> inflexible
> and stupid in many cases - it fosters an 'ioctl' kind of design.
>
>> +The syscall table is littered with historical examples where this wasn't 
>> done,
>> +together with the corresponding follow-up system calls (eventfd/eventfd2,
>> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so
>> +learn from the history of the kernel and include a flags argument from the
>> +start.
>
> The syscall table is also littered with system calls that have an argument 
> space
> considerably larger than what 6 parameters can express, where various 'flags' 
> are
> used to bring in different parts of new APIs, in a rather messy way.
>
> The right approach IMHO is to think about how extensible a system call is 
> expected
> to be, and to plan accordingly.
>
> If you are anywhere close to 6 parameters, you should not introduce 'flags' 
> but
> you should _reduce_ the number of parameters to a clean essential of 2 or 3
> parameters and should shuffle parameters out to a separate 
> 'parameters/attributes'
> structure that is passed in by pointer:
>
> SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params);
>
> And it's the design of 'struct params' that determines future flexibility of 
> the
> interface. A very flexible approach is to not use flags but a 'size' argument:
>
> struct params {
> u32 size;
> u32 param_1;
> u64 param_2;
> u64 param_3;
> };
>
> Where 'size' is set by user-space to the size of 'struct params' known to it 
> at
> build time:
>
> params->size = sizeof(*params);
>
> In the normal case the kernel will get param->size == sizeof(*params) as 
> known to
> the kernel.
>
> When the system call is extended in the future on the kernel side, with 'u64
> param_4', then the structure expands from an old size of 24 to a new size of 
> 32
> bytes. The following scenarios might occur:
>
>  - the common case: new user-space calls the new kernel code, ->size is 32 on 
> both
>sides.
>
>  - old binaries might call the kernel with params->size == 24, in which case 
> the
>kernel sets the new fields to 0. The new feature should be written
>accordingly, so that a value of 0 means the old behavior.
>
>  - new binaries might run on old kernels, with params->size == 32. In this 
> case
>the old kernel will check that all the new fields it does not know about 
> are
>set to 0 - if they are nonzero (if the new feature is used) it returns with
>-ENOSYS or -EINVAL.
>
> With this approach we have both backwards and forwards binary compatibility: 
> new
> binaries will run on old kernels just fine, even if they have ->size set to 
> 32, as
> long as they make use of the features.
>
> This design simplifies application design considerably: as new code can mostly
> forget about old ABIs, there's no multiple versions to be taken care of, 
> there's
> just a single 'struct param' known to both sides, and there's no version skew.
>
> We are using such a design in perf_event_open(), see perf_copy_attr() in
> kernel/events/core.c. And yes, ironically that system call still has a 
> historic
> 'flags' argument, but it's not used anymore for extension: we've made over 30
> extensions to the ABI in the last 3 years, which would have been impossible 
> with a
> 'flags' approach.
>
> Thanks,
>
> Ingo

Fair point, there are other ways to ensure extendability that just the
simple flags
approach -- and as you say, for more sophisticated interfaces flags might hinder
more than they help.

How about the attached text as an attempt to cover both bases?

Thanks,
David


--

Designing the API: Planning for Extension
-

A new system call forms part of the API of the kernel, and has to be supported
indefinitely.  As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls -- eventfd/eventfd2,
dup2/dup3, inotify_init/inotify_init1,  pipe/pipe2, renameat/renameat2 -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred way
to allow for future

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Ingo Molnar

* David Drysdale  wrote:

> +Designing the API
> +-
> +
> +A new system call forms part of the API of the kernel, and has to be 
> supported
> +indefinitely.  As such, it's a very good idea to explicitly discuss the
> +interface on the kernel mailing list, and to plan for future extensions of 
> the
> +interface.  In particular:
> +
> +  **Include a flags argument for every new system call**

Sorry, but I think that's bad avice, because even a 'flags' field is inflexible 
and stupid in many cases - it fosters an 'ioctl' kind of design.

> +The syscall table is littered with historical examples where this wasn't 
> done, 
> +together with the corresponding follow-up system calls (eventfd/eventfd2, 
> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so 
> +learn from the history of the kernel and include a flags argument from the 
> +start.

The syscall table is also littered with system calls that have an argument 
space 
considerably larger than what 6 parameters can express, where various 'flags' 
are 
used to bring in different parts of new APIs, in a rather messy way.

The right approach IMHO is to think about how extensible a system call is 
expected 
to be, and to plan accordingly.

If you are anywhere close to 6 parameters, you should not introduce 'flags' but 
you should _reduce_ the number of parameters to a clean essential of 2 or 3 
parameters and should shuffle parameters out to a separate 
'parameters/attributes' 
structure that is passed in by pointer:

SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params);

And it's the design of 'struct params' that determines future flexibility of 
the 
interface. A very flexible approach is to not use flags but a 'size' argument:

struct params {
u32 size;
u32 param_1;
u64 param_2;
u64 param_3;
};

Where 'size' is set by user-space to the size of 'struct params' known to it at 
build time:

params->size = sizeof(*params);

In the normal case the kernel will get param->size == sizeof(*params) as known 
to 
the kernel.

When the system call is extended in the future on the kernel side, with 'u64 
param_4', then the structure expands from an old size of 24 to a new size of 32 
bytes. The following scenarios might occur:

 - the common case: new user-space calls the new kernel code, ->size is 32 on 
both 
   sides.

 - old binaries might call the kernel with params->size == 24, in which case 
the 
   kernel sets the new fields to 0. The new feature should be written
   accordingly, so that a value of 0 means the old behavior.

 - new binaries might run on old kernels, with params->size == 32. In this case 
   the old kernel will check that all the new fields it does not know about are 
   set to 0 - if they are nonzero (if the new feature is used) it returns with 
   -ENOSYS or -EINVAL.

With this approach we have both backwards and forwards binary compatibility: 
new 
binaries will run on old kernels just fine, even if they have ->size set to 32, 
as 
long as they make use of the features.

This design simplifies application design considerably: as new code can mostly 
forget about old ABIs, there's no multiple versions to be taken care of, 
there's 
just a single 'struct param' known to both sides, and there's no version skew.

We are using such a design in perf_event_open(), see perf_copy_attr() in 
kernel/events/core.c. And yes, ironically that system call still has a historic 
'flags' argument, but it's not used anymore for extension: we've made over 30 
extensions to the ABI in the last 3 years, which would have been impossible 
with a 
'flags' approach.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Kees Cook

On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote:
 On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
 I like this, it's a good description of both options. I'm still biased
 about the approach: I prefer flags, since pointers to user structures
 complicate syscall filtering. ;)

 Seems like we should do two things to make that easier:

 1) Create a standardized kernel mechanism for parameter-struct handling,
implementing the recommendations mentioned here.

It's been suggested in the past that nlmsg is appropriate for such a
thing, but I remain suspicious. :)

 2) Integrate into that mechanism a way to filter the resulting parameter
struct with BPF *after* it has been copied to kernel space (and thus
can no longer be tampered with).

Yeah, this is a irritating part: the structures operated on are copied
from userspace adhoc in each syscall. Doing argument checking would
mean double copies initially, and perhaps teaching syscalls about
optional already copied arguments or something as an optimization.

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
 On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote:
  On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
  I like this, it's a good description of both options. I'm still biased
  about the approach: I prefer flags, since pointers to user structures
  complicate syscall filtering. ;)
 
  Seems like we should do two things to make that easier:
 
  1) Create a standardized kernel mechanism for parameter-struct handling,
 implementing the recommendations mentioned here.
 
 It's been suggested in the past that nlmsg is appropriate for such a
 thing, but I remain suspicious. :)

Likewise. :)

  2) Integrate into that mechanism a way to filter the resulting parameter
 struct with BPF *after* it has been copied to kernel space (and thus
 can no longer be tampered with).
 
 Yeah, this is a irritating part: the structures operated on are copied
 from userspace adhoc in each syscall. Doing argument checking would
 mean double copies initially, and perhaps teaching syscalls about
 optional already copied arguments or something as an optimization.

No, double copies can't work for security reasons.  Because otherwise
you could race the kernel from another thread, substituting different
values after the check and before the use.

I think the right API looks *roughly* like this:

int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, 
void __user *user_struct)
{
if (user_len  kernel_len)
return -EINVAL;
if (user_len  copy_from_user(kernel_struct, user_struct, user_len))
return -EFAULT;
if (user_len  kernel_len)
memset(kernel_struct + user_len, 0, kernel_len - user_len);
return 0;
}

#define copy_param_struct(kernel_struct, user_len, user_struct) 
_copy_param_struct( \
sizeof(*kernel_struct) + 
BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
kernel_struct, user_len, user_struct)


Then the syscall looks like this:

SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params 
__user *user_params)
{
int ret;
struct xyzzy_params params;

ret = copy_param_struct(params, user_params_len, user_params);
if (ret)
return ret;
...


And you could then hook copy_params_struct to add arbitrary additional
syscall parameter validation.  Bonus if there's some way to make the
copy and validation occur before the syscall is ever invoked, rather
than inside the syscall, but that would require adding fancier syscall
definition mechanisms that autogenerate such code.

- Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 06:02:34PM -0700, Josh Triplett wrote:
 On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote:
  On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org 
  wrote:
   On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
   I like this, it's a good description of both options. I'm still biased
   about the approach: I prefer flags, since pointers to user structures
   complicate syscall filtering. ;)
  
   Seems like we should do two things to make that easier:
  
   1) Create a standardized kernel mechanism for parameter-struct handling,
  implementing the recommendations mentioned here.
  
  It's been suggested in the past that nlmsg is appropriate for such a
  thing, but I remain suspicious. :)
 
 Likewise. :)
 
   2) Integrate into that mechanism a way to filter the resulting parameter
  struct with BPF *after* it has been copied to kernel space (and thus
  can no longer be tampered with).
  
  Yeah, this is a irritating part: the structures operated on are copied
  from userspace adhoc in each syscall. Doing argument checking would
  mean double copies initially, and perhaps teaching syscalls about
  optional already copied arguments or something as an optimization.
 
 No, double copies can't work for security reasons.  Because otherwise
 you could race the kernel from another thread, substituting different
 values after the check and before the use.
 
 I think the right API looks *roughly* like this:
 
 int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t 
 user_len, void __user *user_struct)
 {
   if (user_len  kernel_len)
   return -EINVAL;
   if (user_len  copy_from_user(kernel_struct, user_struct, user_len))
   return -EFAULT;
   if (user_len  kernel_len)
   memset(kernel_struct + user_len, 0, kernel_len - user_len);
   return 0;
 }
 
 #define copy_param_struct(kernel_struct, user_len, user_struct) 
 _copy_param_struct( \
   sizeof(*kernel_struct) + 
 BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \
   kernel_struct, user_len, user_struct)
 
 
 Then the syscall looks like this:
 
 SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params 
 __user *user_params)

Missed a couple of commas here (after the types and before the names).

 {
   int ret;
   struct xyzzy_params params;
 
   ret = copy_param_struct(params, user_params_len, user_params);
   if (ret)
   return ret;
   ...
 
 
 And you could then hook copy_params_struct to add arbitrary additional
 syscall parameter validation.  Bonus if there's some way to make the
 copy and validation occur before the syscall is ever invoked, rather
 than inside the syscall, but that would require adding fancier syscall
 definition mechanisms that autogenerate such code.
 
 - Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote:
 I like this, it's a good description of both options. I'm still biased
 about the approach: I prefer flags, since pointers to user structures
 complicate syscall filtering. ;)

Seems like we should do two things to make that easier:

1) Create a standardized kernel mechanism for parameter-struct handling,
   implementing the recommendations mentioned here.
2) Integrate into that mechanism a way to filter the resulting parameter
   struct with BPF *after* it has been copied to kernel space (and thus
   can no longer be tampered with).

- Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Ingo Molnar


* David Drysdale drysd...@google.com wrote:

 +Designing the API
 +-
 +
 +A new system call forms part of the API of the kernel, and has to be 
 supported
 +indefinitely.  As such, it's a very good idea to explicitly discuss the
 +interface on the kernel mailing list, and to plan for future extensions of 
 the
 +interface.  In particular:
 +
 +  **Include a flags argument for every new system call**

Sorry, but I think that's bad avice, because even a 'flags' field is inflexible 
and stupid in many cases - it fosters an 'ioctl' kind of design.

 +The syscall table is littered with historical examples where this wasn't 
 done, 
 +together with the corresponding follow-up system calls (eventfd/eventfd2, 
 +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so 
 +learn from the history of the kernel and include a flags argument from the 
 +start.

The syscall table is also littered with system calls that have an argument 
space 
considerably larger than what 6 parameters can express, where various 'flags' 
are 
used to bring in different parts of new APIs, in a rather messy way.

The right approach IMHO is to think about how extensible a system call is 
expected 
to be, and to plan accordingly.

If you are anywhere close to 6 parameters, you should not introduce 'flags' but 
you should _reduce_ the number of parameters to a clean essential of 2 or 3 
parameters and should shuffle parameters out to a separate 
'parameters/attributes' 
structure that is passed in by pointer:

SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params);

And it's the design of 'struct params' that determines future flexibility of 
the 
interface. A very flexible approach is to not use flags but a 'size' argument:

struct params {
u32 size;
u32 param_1;
u64 param_2;
u64 param_3;
};

Where 'size' is set by user-space to the size of 'struct params' known to it at 
build time:

params-size = sizeof(*params);

In the normal case the kernel will get param-size == sizeof(*params) as known 
to 
the kernel.

When the system call is extended in the future on the kernel side, with 'u64 
param_4', then the structure expands from an old size of 24 to a new size of 32 
bytes. The following scenarios might occur:

 - the common case: new user-space calls the new kernel code, -size is 32 on 
both 
   sides.

 - old binaries might call the kernel with params-size == 24, in which case 
the 
   kernel sets the new fields to 0. The new feature should be written
   accordingly, so that a value of 0 means the old behavior.

 - new binaries might run on old kernels, with params-size == 32. In this case 
   the old kernel will check that all the new fields it does not know about are 
   set to 0 - if they are nonzero (if the new feature is used) it returns with 
   -ENOSYS or -EINVAL.

With this approach we have both backwards and forwards binary compatibility: 
new 
binaries will run on old kernels just fine, even if they have -size set to 32, 
as 
long as they make use of the features.

This design simplifies application design considerably: as new code can mostly 
forget about old ABIs, there's no multiple versions to be taken care of, 
there's 
just a single 'struct param' known to both sides, and there's no version skew.

We are using such a design in perf_event_open(), see perf_copy_attr() in 
kernel/events/core.c. And yes, ironically that system call still has a historic 
'flags' argument, but it's not used anymore for extension: we've made over 30 
extensions to the ABI in the last 3 years, which would have been impossible 
with a 
'flags' approach.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Kees Cook

On Thu, Jul 30, 2015 at 4:10 AM, David Drysdale drysd...@google.com wrote:
 On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar mi...@kernel.org wrote:

 * David Drysdale drysd...@google.com wrote:

 +Designing the API
 +-
 +
 +A new system call forms part of the API of the kernel, and has to be 
 supported
 +indefinitely.  As such, it's a very good idea to explicitly discuss the
 +interface on the kernel mailing list, and to plan for future extensions of 
 the
 +interface.  In particular:
 +
 +  **Include a flags argument for every new system call**

 Sorry, but I think that's bad avice, because even a 'flags' field is 
 inflexible
 and stupid in many cases - it fosters an 'ioctl' kind of design.

 +The syscall table is littered with historical examples where this wasn't 
 done,
 +together with the corresponding follow-up system calls (eventfd/eventfd2,
 +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so
 +learn from the history of the kernel and include a flags argument from the
 +start.

 The syscall table is also littered with system calls that have an argument 
 space
 considerably larger than what 6 parameters can express, where various 
 'flags' are
 used to bring in different parts of new APIs, in a rather messy way.

 The right approach IMHO is to think about how extensible a system call is 
 expected
 to be, and to plan accordingly.

 If you are anywhere close to 6 parameters, you should not introduce 'flags' 
 but
 you should _reduce_ the number of parameters to a clean essential of 2 or 3
 parameters and should shuffle parameters out to a separate 
 'parameters/attributes'
 structure that is passed in by pointer:

 SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params);

 And it's the design of 'struct params' that determines future flexibility of 
 the
 interface. A very flexible approach is to not use flags but a 'size' 
 argument:

 struct params {
 u32 size;
 u32 param_1;
 u64 param_2;
 u64 param_3;
 };

 Where 'size' is set by user-space to the size of 'struct params' known to it 
 at
 build time:

 params-size = sizeof(*params);

 In the normal case the kernel will get param-size == sizeof(*params) as 
 known to
 the kernel.

 When the system call is extended in the future on the kernel side, with 'u64
 param_4', then the structure expands from an old size of 24 to a new size of 
 32
 bytes. The following scenarios might occur:

  - the common case: new user-space calls the new kernel code, -size is 32 
 on both
sides.

  - old binaries might call the kernel with params-size == 24, in which case 
 the
kernel sets the new fields to 0. The new feature should be written
accordingly, so that a value of 0 means the old behavior.

  - new binaries might run on old kernels, with params-size == 32. In this 
 case
the old kernel will check that all the new fields it does not know about 
 are
set to 0 - if they are nonzero (if the new feature is used) it returns 
 with
-ENOSYS or -EINVAL.

 With this approach we have both backwards and forwards binary compatibility: 
 new
 binaries will run on old kernels just fine, even if they have -size set to 
 32, as
 long as they make use of the features.

 This design simplifies application design considerably: as new code can 
 mostly
 forget about old ABIs, there's no multiple versions to be taken care of, 
 there's
 just a single 'struct param' known to both sides, and there's no version 
 skew.

 We are using such a design in perf_event_open(), see perf_copy_attr() in
 kernel/events/core.c. And yes, ironically that system call still has a 
 historic
 'flags' argument, but it's not used anymore for extension: we've made over 30
 extensions to the ABI in the last 3 years, which would have been impossible 
 with a
 'flags' approach.

 Thanks,

 Ingo

 Fair point, there are other ways to ensure extendability that just the
 simple flags
 approach -- and as you say, for more sophisticated interfaces flags might 
 hinder
 more than they help.

 How about the attached text as an attempt to cover both bases?

 Thanks,
 David


 --

 Designing the API: Planning for Extension
 -

 A new system call forms part of the API of the kernel, and has to be supported
 indefinitely.  As such, it's a very good idea to explicitly discuss the
 interface on the kernel mailing list, and it's important to plan for future
 extensions of the interface.

 (The syscall table is littered with historical examples where this wasn't 
 done,
 together with the corresponding follow-up system calls -- eventfd/eventfd2,
 dup2/dup3, inotify_init/inotify_init1,  pipe/pipe2, renameat/renameat2 -- so
 learn from the history of the kernel and plan for extensions from the start.)

 For simpler system calls that only take a couple of arguments, the preferred 
 way
 to allow

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 10:38:31AM +0200, Ingo Molnar wrote:
 When the system call is extended in the future on the kernel side, with 'u64 
 param_4', then the structure expands from an old size of 24 to a new size of 
 32 
 bytes. The following scenarios might occur:
 
  - the common case: new user-space calls the new kernel code, -size is 32 on 
 both 
sides.
 
  - old binaries might call the kernel with params-size == 24, in which case 
 the 
kernel sets the new fields to 0. The new feature should be written
accordingly, so that a value of 0 means the old behavior.
 
  - new binaries might run on old kernels, with params-size == 32. In this 
 case 
the old kernel will check that all the new fields it does not know about 
 are 
set to 0 - if they are nonzero (if the new feature is used) it returns 
 with 
-ENOSYS or -EINVAL.

Nit: it seems easier, rather than having the kernel check this, to have
userspace only use the minimum version of the structure that contains
the features they use, and then have older kernels reject sizes bigger
than they understand.  If you don't need param_4, don't use the version
of the structure that has param_4.  That also means the kernel doesn't
need to copy and read those values.  Either approach works, though.

- Josh Triplett
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Greg Kroah-Hartman

On Thu, Jul 30, 2015 at 06:30:07PM +0200, Cyril Hrubis wrote:
 Hi!
  +Testing
  +---
  +
  +A new system call should obviously be tested; it is also useful to provide
  +reviewers with a demonstration of how user space programs will use the 
  system
  +call.  A good way to combine these aims is to include a simple self-test
  +program in a new directory under tools/testing/selftests/.
 
 I know that this is a bit off topic, but since the selftest is now the
 official place to add kernel testcases to let me rant about it a bit.
 
 It's a bit of a pain seeing people reinvent the wheel and trying to
 figure out consistent test interface while most of the problems has been
 solved in LTP[1] test library quite some time ago. Especially use of the
 SAFE_MACROS[2] would simplify writing test setups quite a lot.
 
 I wonder if we can at least share the test library, pulling it out of
 LTP, or at least the interesting parts, wouldn't be hard at all.
 
 [1]
 https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure
 
 [2]
 https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros

That would be great, please send patches to do so to the linux-api
mailing list and the test maintainer.

thanks,

greg k-h
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Cyril Hrubis

Hi!
 +Testing
 +---
 +
 +A new system call should obviously be tested; it is also useful to provide
 +reviewers with a demonstration of how user space programs will use the system
 +call.  A good way to combine these aims is to include a simple self-test
 +program in a new directory under tools/testing/selftests/.

I know that this is a bit off topic, but since the selftest is now the
official place to add kernel testcases to let me rant about it a bit.

It's a bit of a pain seeing people reinvent the wheel and trying to
figure out consistent test interface while most of the problems has been
solved in LTP[1] test library quite some time ago. Especially use of the
SAFE_MACROS[2] would simplify writing test setups quite a lot.

I wonder if we can at least share the test library, pulling it out of
LTP, or at least the interesting parts, wouldn't be hard at all.

[1]
https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure

[2]
https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros


-- 
Cyril Hrubis
chru...@suse.cz
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread Josh Triplett

On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote:
 Add a document describing the process of adding a new system call,
 including the need for a flags argument for future compatibility, and
 covering 32-bit/64-bit concerns (albeit in an x86-centric way).
 
 Signed-off-by: David Drysdale drysd...@google.com
 Reviewed-by: Michael Kerrisk mtk.manpa...@gmail.com
 Reviewed-by: Eric B Munson emun...@akamai.com
 Reviewed-by: Kees Cook keesc...@chromium.org
 Reviewed-by: Randy Dunlap rdun...@infradead.org

A few comments below.  I also agree with Ingo's comment about
documenting the param-structure approach in addition to flags.

 --- /dev/null
 +++ b/Documentation/adding-syscalls.txt
 @@ -0,0 +1,463 @@
 +Adding a New System Call
 +
 +
 +This document describes what's involved in adding a new system call to the
 +Linux kernel, over and above the normal submission advice in
 +Documentation/SubmittingPatches.
 +
 +
 +System Call Alternatives
 +
 +
 +The first thing to consider when adding a new system call is whether one of
 +the alternatives might be suitable instead.  Although system calls are the
 +most traditional and most obvious interaction points between userspace and 
 the
 +kernel, there are other possibilities -- choose what fits best for your
 +interface.
 +
 + - If the operations involved can be made to look like a filesystem-like
 +   object, it may make more sense to create a new filesystem or device.  This
 +   also makes it easier to encapsulate the new functionality in a kernel 
 module
 +   rather than requiring it to be built into the main kernel.
 + - If the new functionality involves operations where the kernel notifies
 +   userspace that something has happened, then returning a new file
 +   descriptor for the relevant object allows userspace to use
 +   poll/select/epoll to receive that notification.
 + - However, operations that don't map to read(2)/write(2)-like operations
 +   have to be implemented as ioctl(2) requests, which can lead to a
 +   somewhat opaque API.
 + - If you're just exposing runtime system information, a new node in sysfs
 +   (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be
 +   more appropriate.  However, access to these mechanisms requires that the
 +   relevant filesystem is mounted, which might not always be the case (e.g.
 +   in a namespaced/sandboxed/chrooted environment).  Avoid adding anything to
 +   debugfs, as this is not considered a 'production' interface to userspace.

s/anything/any API/

 + - If the operation is specific to a particular file or file descriptor, then
 +   an additional fcntl(2) command option may be more appropriate.  However,
 +   fcntl(2) is a multiplexing system call that hides a lot of complexity, so
 +   this option is best for when the new function is closely analogous to
 +   existing fcntl(2) functionality, or the new functionality is very simple
 +   (for example, getting/setting a simple flag related to a file descriptor).
 + - If the operation is specific to a particular task or process, then an
 +   additional prctl(2) command option may be more appropriate.  As with
 +   fcntl(2), this system call is a complicated multiplexor so is best 
 reserved
 +   for near-analogs of existing prctl() commands or getting/setting a simple
 +   flag related to a process.
 +
 +
 +Designing the API
 +-
 +
 +A new system call forms part of the API of the kernel, and has to be 
 supported
 +indefinitely.  As such, it's a very good idea to explicitly discuss the
 +interface on the kernel mailing list, and to plan for future extensions of 
 the
 +interface.  In particular:
 +
 +  **Include a flags argument for every new system call**
 +
 +The syscall table is littered with historical examples where this wasn't 
 done,
 +together with the corresponding follow-up system calls (eventfd/eventfd2,
 +dup2/dup3, inotify_init/inotify_init1,  pipe/pipe2, renameat/renameat2), so
 +learn from the history of the kernel and include a flags argument from the
 +start.
 +
 +Also, to make sure that userspace programs can safely use flags between 
 kernel
 +versions, check whether the flags value holds any unknown flags, and reject 
 the
 +sycall (with EINVAL) if it does:
 +
 +if (flags  ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
 +return -EINVAL;
 +
 +(If no flags values are used yet, check that the flags argument is zero.)

Some things to add a this point, before talking about adding a new file
descriptor:

If you want to provide userspace with a handle to a kernel object, do so
in the form of a file descriptor.  Do not invent a new type of userspace
object handle with its own namespace.  The kernel already has numerous
system calls and well-defined semantics for managing and passing around
file descriptors, as well as for cleaning up when the file descriptor is
no longer in use.

If your new system call returns a file

Re: [PATCHv2 1/1] Documentation: describe how to add a system call

2015-07-30 Thread David Drysdale

On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar mi...@kernel.org wrote:

 * David Drysdale drysd...@google.com wrote:

 +Designing the API
 +-
 +
 +A new system call forms part of the API of the kernel, and has to be 
 supported
 +indefinitely.  As such, it's a very good idea to explicitly discuss the
 +interface on the kernel mailing list, and to plan for future extensions of 
 the
 +interface.  In particular:
 +
 +  **Include a flags argument for every new system call**

 Sorry, but I think that's bad avice, because even a 'flags' field is 
 inflexible
 and stupid in many cases - it fosters an 'ioctl' kind of design.

 +The syscall table is littered with historical examples where this wasn't 
 done,
 +together with the corresponding follow-up system calls (eventfd/eventfd2,
 +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so
 +learn from the history of the kernel and include a flags argument from the
 +start.

 The syscall table is also littered with system calls that have an argument 
 space
 considerably larger than what 6 parameters can express, where various 'flags' 
 are
 used to bring in different parts of new APIs, in a rather messy way.

 The right approach IMHO is to think about how extensible a system call is 
 expected
 to be, and to plan accordingly.

 If you are anywhere close to 6 parameters, you should not introduce 'flags' 
 but
 you should _reduce_ the number of parameters to a clean essential of 2 or 3
 parameters and should shuffle parameters out to a separate 
 'parameters/attributes'
 structure that is passed in by pointer:

 SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params);

 And it's the design of 'struct params' that determines future flexibility of 
 the
 interface. A very flexible approach is to not use flags but a 'size' argument:

 struct params {
 u32 size;
 u32 param_1;
 u64 param_2;
 u64 param_3;
 };

 Where 'size' is set by user-space to the size of 'struct params' known to it 
 at
 build time:

 params-size = sizeof(*params);

 In the normal case the kernel will get param-size == sizeof(*params) as 
 known to
 the kernel.

 When the system call is extended in the future on the kernel side, with 'u64
 param_4', then the structure expands from an old size of 24 to a new size of 
 32
 bytes. The following scenarios might occur:

  - the common case: new user-space calls the new kernel code, -size is 32 on 
 both
sides.

  - old binaries might call the kernel with params-size == 24, in which case 
 the
kernel sets the new fields to 0. The new feature should be written
accordingly, so that a value of 0 means the old behavior.

  - new binaries might run on old kernels, with params-size == 32. In this 
 case
the old kernel will check that all the new fields it does not know about 
 are
set to 0 - if they are nonzero (if the new feature is used) it returns with
-ENOSYS or -EINVAL.

 With this approach we have both backwards and forwards binary compatibility: 
 new
 binaries will run on old kernels just fine, even if they have -size set to 
 32, as
 long as they make use of the features.

 This design simplifies application design considerably: as new code can mostly
 forget about old ABIs, there's no multiple versions to be taken care of, 
 there's
 just a single 'struct param' known to both sides, and there's no version skew.

 We are using such a design in perf_event_open(), see perf_copy_attr() in
 kernel/events/core.c. And yes, ironically that system call still has a 
 historic
 'flags' argument, but it's not used anymore for extension: we've made over 30
 extensions to the ABI in the last 3 years, which would have been impossible 
 with a
 'flags' approach.

 Thanks,

 Ingo

Fair point, there are other ways to ensure extendability that just the
simple flags
approach -- and as you say, for more sophisticated interfaces flags might hinder
more than they help.

How about the attached text as an attempt to cover both bases?

Thanks,
David


--

Designing the API: Planning for Extension
-

A new system call forms part of the API of the kernel, and has to be supported
indefinitely.  As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls -- eventfd/eventfd2,
dup2/dup3, inotify_init/inotify_init1,  pipe/pipe2, renameat/renameat2 -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred way
to allow for future extensibility is to include a flags argument to the system
call.  To make sure that userspace

46 matches

Mail list logo