Re: [PATCHv2 1/1] Documentation: describe how to add a system call
Desnoyers ,"linux-...@vger.kernel.org" ,"linux-kernel@vger.kernel.org" ,Peter Zijlstra Message-ID: Let me see if I can unbury it - probably not until Monday, though. On July 31, 2015 11:18:06 PM PDT, Josh Triplett wrote: >On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote: >> On 07/31/2015 09:32 PM, Josh Triplett wrote: >> > >> > Sure, agreed. But I really hope we don't create new kernel ABIs >that >> > involve constructs like that. >> > >> >> It's worth noting I have pushed for auto-marshalling in general for a >> long time, not the least to get rid of the awful syscall(3) wrapper. >I >> even built a prototype but didn't have time to productize it. > >Interesting! Can you post the prototype? > >- Josh Triplett -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote: > On 07/31/2015 09:32 PM, Josh Triplett wrote: > > > > Sure, agreed. But I really hope we don't create new kernel ABIs that > > involve constructs like that. > > > > It's worth noting I have pushed for auto-marshalling in general for a > long time, not the least to get rid of the awful syscall(3) wrapper. I > even built a prototype but didn't have time to productize it. Interesting! Can you post the prototype? - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
Desnoyers mathieu.desnoy...@efficios.com,linux-...@vger.kernel.org linux-...@vger.kernel.org,linux-kernel@vger.kernel.org linux-kernel@vger.kernel.org,Peter Zijlstra a.p.zijls...@chello.nl Message-ID: d190a265-cbf0-460e-b12a-df0e802c5...@zytor.com Let me see if I can unbury it - probably not until Monday, though. On July 31, 2015 11:18:06 PM PDT, Josh Triplett j...@joshtriplett.org wrote: On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote: On 07/31/2015 09:32 PM, Josh Triplett wrote: Sure, agreed. But I really hope we don't create new kernel ABIs that involve constructs like that. It's worth noting I have pushed for auto-marshalling in general for a long time, not the least to get rid of the awful syscall(3) wrapper. I even built a prototype but didn't have time to productize it. Interesting! Can you post the prototype? - Josh Triplett -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 09:56:46PM -0700, H. Peter Anvin wrote: On 07/31/2015 09:32 PM, Josh Triplett wrote: Sure, agreed. But I really hope we don't create new kernel ABIs that involve constructs like that. It's worth noting I have pushed for auto-marshalling in general for a long time, not the least to get rid of the awful syscall(3) wrapper. I even built a prototype but didn't have time to productize it. Interesting! Can you post the prototype? - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On 07/31/2015 09:32 PM, Josh Triplett wrote: > > Sure, agreed. But I really hope we don't create new kernel ABIs that > involve constructs like that. > It's worth noting I have pushed for auto-marshalling in general for a long time, not the least to get rid of the awful syscall(3) wrapper. I even built a prototype but didn't have time to productize it. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 03:54:56PM -0700, Andy Lutomirski wrote: > On Fri, Jul 31, 2015 at 3:08 PM, wrote: > > On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote: > >> On Fri, Jul 31, 2015 at 1:59 PM, wrote: > >> > Agreed. I think the proposal above would be a net improvement, but > >> > ideally you'd want something that's annotated and generates automatic > >> > marshalling code. > >> > > >> > >> I assume this is idle musing. If, however, we were to actually do > >> this, I'd suggest we seriously consider speaking the Cap'n Proto > >> serialization format. It's quite nice, it encodes and decodes *very* > >> quickly and, unlike TLV schemes, you don't have to read it in order, > >> making the read-side code less awkward. > > > > That seems like *massive* overkill for a kernel<->userspace syscall > > interface. I was more thinking about having a few standardized marshal > > types, and incrementally adding more when more patterns show up. For a > > first pass, just automatically running copy_from_user and > > copy_param_struct on appropriate sets of __user parameters identified as > > such in a structured text file seems quite sufficient. (Plus > > automatically generating syscalls.h from that.) > > If a param struct does the trick, then I agree. It's when you start > having lists and other variable-size stuff that it gets messier. Sure, agreed. But I really hope we don't create new kernel ABIs that involve constructs like that. - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 3:08 PM, wrote: > On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote: >> On Fri, Jul 31, 2015 at 1:59 PM, wrote: >> > Agreed. I think the proposal above would be a net improvement, but >> > ideally you'd want something that's annotated and generates automatic >> > marshalling code. >> > >> >> I assume this is idle musing. If, however, we were to actually do >> this, I'd suggest we seriously consider speaking the Cap'n Proto >> serialization format. It's quite nice, it encodes and decodes *very* >> quickly and, unlike TLV schemes, you don't have to read it in order, >> making the read-side code less awkward. > > That seems like *massive* overkill for a kernel<->userspace syscall > interface. I was more thinking about having a few standardized marshal > types, and incrementally adding more when more patterns show up. For a > first pass, just automatically running copy_from_user and > copy_param_struct on appropriate sets of __user parameters identified as > such in a structured text file seems quite sufficient. (Plus > automatically generating syscalls.h from that.) If a param struct does the trick, then I agree. It's when you start having lists and other variable-size stuff that it gets messier. For reference, my Cap'n Proto parser (just the basic structure of messages, not the actual schema bits, although the latter is more or less just some structs) is about 300 lines of code. It's arguably simpler than nlattr, once you throw nesting in. --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote: > On Fri, Jul 31, 2015 at 1:59 PM, wrote: > > Agreed. I think the proposal above would be a net improvement, but > > ideally you'd want something that's annotated and generates automatic > > marshalling code. > > > > I assume this is idle musing. If, however, we were to actually do > this, I'd suggest we seriously consider speaking the Cap'n Proto > serialization format. It's quite nice, it encodes and decodes *very* > quickly and, unlike TLV schemes, you don't have to read it in order, > making the read-side code less awkward. That seems like *massive* overkill for a kernel<->userspace syscall interface. I was more thinking about having a few standardized marshal types, and incrementally adding more when more patterns show up. For a first pass, just automatically running copy_from_user and copy_param_struct on appropriate sets of __user parameters identified as such in a structured text file seems quite sufficient. (Plus automatically generating syscalls.h from that.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 1:59 PM, wrote: > On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote: >> On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett wrote: >> > On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: >> >> On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett >> >> wrote: >> >> > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: >> >> >> I like this, it's a good description of both options. I'm still biased >> >> >> about the approach: I prefer flags, since pointers to user structures >> >> >> complicate syscall filtering. ;) >> >> > >> >> > Seems like we should do two things to make that easier: >> >> > >> >> > 1) Create a standardized kernel mechanism for parameter-struct handling, >> >> >implementing the recommendations mentioned here. >> >> >> >> It's been suggested in the past that nlmsg is appropriate for such a >> >> thing, but I remain suspicious. :) >> > >> > Likewise. :) >> > >> >> > 2) Integrate into that mechanism a way to filter the resulting parameter >> >> >struct with BPF *after* it has been copied to kernel space (and thus >> >> >can no longer be tampered with). >> >> >> >> Yeah, this is a irritating part: the structures operated on are copied >> >> from userspace adhoc in each syscall. Doing argument checking would >> >> mean double copies initially, and perhaps teaching syscalls about >> >> optional "already copied" arguments or something as an optimization. >> > >> > No, double copies can't work for security reasons. Because otherwise >> > you could race the kernel from another thread, substituting different >> > values after the check and before the use. >> >> Right, the double copy method would require setting up a per-thread >> userspace memory mapping that was read-only from userspace but >> writable from kernel space. > > Which seems like a lot more trouble than just copying it once. > >> > I think the right API looks *roughly* like this: >> > >> > int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t >> > user_len, void __user *user_struct) >> > { >> > if (user_len > kernel_len) >> > return -EINVAL; >> > if (user_len && copy_from_user(kernel_struct, user_struct, >> > user_len)) >> > return -EFAULT; >> > if (user_len < kernel_len) >> > memset(kernel_struct + user_len, 0, kernel_len - user_len); >> > return 0; >> > } >> > >> > #define copy_param_struct(kernel_struct, user_len, user_struct) >> > _copy_param_struct( \ >> > sizeof(*kernel_struct) + >> > BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ >> > kernel_struct, user_len, user_struct) >> > >> > >> > Then the syscall looks like this: >> > >> > SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct >> > xyzzy_params __user *user_params) >> > { >> > int ret; >> > struct xyzzy_params params; >> > >> > ret = copy_param_struct(, user_params_len, user_params); >> > if (ret) >> > return ret; >> > ... >> > >> > >> > And you could then hook copy_params_struct to add arbitrary additional >> > syscall parameter validation. Bonus if there's some way to make the >> > copy and validation occur before the syscall is ever invoked, rather >> > than inside the syscall, but that would require adding fancier syscall >> > definition mechanisms that autogenerate such code. >> >> The trouble is that the hook for the syscall (both seccomp and ptrace) >> happens before the sys_* function executes. So the param extract >> suddenly becomes optional. As in, did ptrace/seccomp already extract >> the args? If so, use that copy, else copy them out myself now that I >> need them, etc. >> >> It's entirely doable, but it's going to require some careful design. > > Agreed. I think the proposal above would be a net improvement, but > ideally you'd want something that's annotated and generates automatic > marshalling code. > I assume this is idle musing. If, however, we were to actually do this, I'd suggest we seriously consider speaking the Cap'n Proto serialization format. It's quite nice, it encodes and decodes *very* quickly and, unlike TLV schemes, you don't have to read it in order, making the read-side code less awkward. (I wouldn't suggest that we use the official C++11 library, but the format might be a good idea.) --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote: > On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett wrote: > > On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: > >> On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett > >> wrote: > >> > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: > >> >> I like this, it's a good description of both options. I'm still biased > >> >> about the approach: I prefer flags, since pointers to user structures > >> >> complicate syscall filtering. ;) > >> > > >> > Seems like we should do two things to make that easier: > >> > > >> > 1) Create a standardized kernel mechanism for parameter-struct handling, > >> >implementing the recommendations mentioned here. > >> > >> It's been suggested in the past that nlmsg is appropriate for such a > >> thing, but I remain suspicious. :) > > > > Likewise. :) > > > >> > 2) Integrate into that mechanism a way to filter the resulting parameter > >> >struct with BPF *after* it has been copied to kernel space (and thus > >> >can no longer be tampered with). > >> > >> Yeah, this is a irritating part: the structures operated on are copied > >> from userspace adhoc in each syscall. Doing argument checking would > >> mean double copies initially, and perhaps teaching syscalls about > >> optional "already copied" arguments or something as an optimization. > > > > No, double copies can't work for security reasons. Because otherwise > > you could race the kernel from another thread, substituting different > > values after the check and before the use. > > Right, the double copy method would require setting up a per-thread > userspace memory mapping that was read-only from userspace but > writable from kernel space. Which seems like a lot more trouble than just copying it once. > > I think the right API looks *roughly* like this: > > > > int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t > > user_len, void __user *user_struct) > > { > > if (user_len > kernel_len) > > return -EINVAL; > > if (user_len && copy_from_user(kernel_struct, user_struct, > > user_len)) > > return -EFAULT; > > if (user_len < kernel_len) > > memset(kernel_struct + user_len, 0, kernel_len - user_len); > > return 0; > > } > > > > #define copy_param_struct(kernel_struct, user_len, user_struct) > > _copy_param_struct( \ > > sizeof(*kernel_struct) + > > BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ > > kernel_struct, user_len, user_struct) > > > > > > Then the syscall looks like this: > > > > SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct > > xyzzy_params __user *user_params) > > { > > int ret; > > struct xyzzy_params params; > > > > ret = copy_param_struct(, user_params_len, user_params); > > if (ret) > > return ret; > > ... > > > > > > And you could then hook copy_params_struct to add arbitrary additional > > syscall parameter validation. Bonus if there's some way to make the > > copy and validation occur before the syscall is ever invoked, rather > > than inside the syscall, but that would require adding fancier syscall > > definition mechanisms that autogenerate such code. > > The trouble is that the hook for the syscall (both seccomp and ptrace) > happens before the sys_* function executes. So the param extract > suddenly becomes optional. As in, did ptrace/seccomp already extract > the args? If so, use that copy, else copy them out myself now that I > need them, etc. > > It's entirely doable, but it's going to require some careful design. Agreed. I think the proposal above would be a net improvement, but ideally you'd want something that's annotated and generates automatic marshalling code. - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett wrote: > On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: >> On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett >> wrote: >> > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: >> >> I like this, it's a good description of both options. I'm still biased >> >> about the approach: I prefer flags, since pointers to user structures >> >> complicate syscall filtering. ;) >> > >> > Seems like we should do two things to make that easier: >> > >> > 1) Create a standardized kernel mechanism for parameter-struct handling, >> >implementing the recommendations mentioned here. >> >> It's been suggested in the past that nlmsg is appropriate for such a >> thing, but I remain suspicious. :) > > Likewise. :) > >> > 2) Integrate into that mechanism a way to filter the resulting parameter >> >struct with BPF *after* it has been copied to kernel space (and thus >> >can no longer be tampered with). >> >> Yeah, this is a irritating part: the structures operated on are copied >> from userspace adhoc in each syscall. Doing argument checking would >> mean double copies initially, and perhaps teaching syscalls about >> optional "already copied" arguments or something as an optimization. > > No, double copies can't work for security reasons. Because otherwise > you could race the kernel from another thread, substituting different > values after the check and before the use. Right, the double copy method would require setting up a per-thread userspace memory mapping that was read-only from userspace but writable from kernel space. > I think the right API looks *roughly* like this: > > int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t > user_len, void __user *user_struct) > { > if (user_len > kernel_len) > return -EINVAL; > if (user_len && copy_from_user(kernel_struct, user_struct, user_len)) > return -EFAULT; > if (user_len < kernel_len) > memset(kernel_struct + user_len, 0, kernel_len - user_len); > return 0; > } > > #define copy_param_struct(kernel_struct, user_len, user_struct) > _copy_param_struct( \ > sizeof(*kernel_struct) + > BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ > kernel_struct, user_len, user_struct) > > > Then the syscall looks like this: > > SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params > __user *user_params) > { > int ret; > struct xyzzy_params params; > > ret = copy_param_struct(, user_params_len, user_params); > if (ret) > return ret; > ... > > > And you could then hook copy_params_struct to add arbitrary additional > syscall parameter validation. Bonus if there's some way to make the > copy and validation occur before the syscall is ever invoked, rather > than inside the syscall, but that would require adding fancier syscall > definition mechanisms that autogenerate such code. The trouble is that the hook for the syscall (both seccomp and ptrace) happens before the sys_* function executes. So the param extract suddenly becomes optional. As in, did ptrace/seccomp already extract the args? If so, use that copy, else copy them out myself now that I need them, etc. It's entirely doable, but it's going to require some careful design. -Kees -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 2:06 PM, Josh Triplett wrote: > On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote: >> On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett wrote: >> > On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: >> >> +needs to be governed by the appropriate Linux capability bit (checked >> >> with a >> >> +call to capable()), as described in the capabilities(7) man page. >> >> + >> >> + - If there is an existing capability that governs related >> >> functionality, then >> >> + use that. However, avoid combining lots of only vaguely related >> >> functions >> >> + together under the same bit, as this goes against capabilities' >> >> purpose of >> >> + splitting the power of root. In particular, avoid adding new uses of >> >> the >> >> + already overly-general CAP_SYS_ADMIN capability. >> >> + - If there is no related capability, then consider adding a new >> >> capability >> >> + bit -- but bear in mind that the numbering space is limited, and each >> >> new >> >> + bit needs to be understood and administered by sysadmins. >> > >> > Many previous discussions on this topic seem to have concluded that it's >> > almost impossible to add a new capability without breaking existing >> > programs. I would suggest not even mentioning this possibility. >> >> I'm not particularly knowledgable about capabilities (at least, not the >> POSIX.1e/Linux kind), so I'll confess that I got this suggestion from >> Michael Kerrisk. >> >> Michael, do you agree that we should just drop the possibility of adding >> new capability bits? >> >> Also, Josh, do you have any references to the earlier discussions on the >> topic so I can update the Sources section? > > No direct links handy at the moment without some searching, but one > iteration of it came up when Matthew Garrett suggested adding > CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability > semantics suggested that the way the capability space was defined and > controlled by userspace meant that adding any new capability would > effectively break userspace ABI. The userspace ABI for capabilities is > not clear; some applications drop all possible capabilities and could > get surprised by a new capability being dropped, while other > applications drop only capabilities they know about and could get > surprised by a new capability *not* being dropped. BTW, I left this paragraph unchanged in the v3 version I just sent out -- I'll update it for v4 when I get back from vacation, depending on what discussion occurs in the meantime... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote: > On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett wrote: > > On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: > >> Add a document describing the process of adding a new system call, > >> including the need for a flags argument for future compatibility, and > >> covering 32-bit/64-bit concerns (albeit in an x86-centric way). > >> > >> Signed-off-by: David Drysdale > >> Reviewed-by: Michael Kerrisk > >> Reviewed-by: Eric B Munson > >> Reviewed-by: Kees Cook > >> Reviewed-by: Randy Dunlap > > > > A few comments below. I also agree with Ingo's comment about > > documenting the param-structure approach in addition to flags. > > Many thanks for looking through the doc in detail! > > I'll send out an updated doc today; I'm then on vacation for the next > week so I'll collate further feedback after I return. Thanks for the quick response and for incorporating these changes. > > Some things to add a this point, before talking about adding a new file > > descriptor: > > > > If you want to provide userspace with a handle to a kernel object, do so > > in the form of a file descriptor. Do not invent a new type of userspace > > object handle with its own namespace. The kernel already has numerous > > system calls and well-defined semantics for managing and passing around > > file descriptors, as well as for cleaning up when the file descriptor is > > no longer in use. > > > > If your new system call returns a file descriptor, think about what it > > means to use the poll(2) family of syscalls on that file descriptor. If > > you want to send an event to userspace associated with your kernel > > object, you should do so by making your file descriptor ready for > > reading or writing. > > Good points, I've added a couple of paragraphs along these lines: > > " > If your new system call allows userspace to refer to a kernel object, it > should use a file descriptor as the handle for that object -- don't invent a > new type of userspace object handle when the kernel already has mechanisms and > well-defined semantics for using file descriptors. > > [snip: existing text on CLOEXEC] > > If your system call returns a new file descriptor, you should also consider > what it means to use the poll(2) family of system calls on that file > descriptor. Making a file descriptor ready for reading or writing is the > normal way for the kernel to indicate to userspace that an event has > occurred on the corresponding kernel object. > " These look good to me. > >> +needs to be governed by the appropriate Linux capability bit (checked > >> with a > >> +call to capable()), as described in the capabilities(7) man page. > >> + > >> + - If there is an existing capability that governs related functionality, > >> then > >> + use that. However, avoid combining lots of only vaguely related > >> functions > >> + together under the same bit, as this goes against capabilities' > >> purpose of > >> + splitting the power of root. In particular, avoid adding new uses of > >> the > >> + already overly-general CAP_SYS_ADMIN capability. > >> + - If there is no related capability, then consider adding a new > >> capability > >> + bit -- but bear in mind that the numbering space is limited, and each > >> new > >> + bit needs to be understood and administered by sysadmins. > > > > Many previous discussions on this topic seem to have concluded that it's > > almost impossible to add a new capability without breaking existing > > programs. I would suggest not even mentioning this possibility. > > I'm not particularly knowledgable about capabilities (at least, not the > POSIX.1e/Linux kind), so I'll confess that I got this suggestion from > Michael Kerrisk. > > Michael, do you agree that we should just drop the possibility of adding > new capability bits? > > Also, Josh, do you have any references to the earlier discussions on the > topic so I can update the Sources section? No direct links handy at the moment without some searching, but one iteration of it came up when Matthew Garrett suggested adding CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability semantics suggested that the way the capability space was defined and controlled by userspace meant that adding any new capability would effectively break userspace ABI. The userspace ABI for capabilities is not clear; some applications drop all possible capabilities and could get surprised by a new capability being dropped, while other applications drop only capabilities they know about and could get surprised by a new capability *not* being dropped. - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett wrote: > On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: >> Add a document describing the process of adding a new system call, >> including the need for a flags argument for future compatibility, and >> covering 32-bit/64-bit concerns (albeit in an x86-centric way). >> >> Signed-off-by: David Drysdale >> Reviewed-by: Michael Kerrisk >> Reviewed-by: Eric B Munson >> Reviewed-by: Kees Cook >> Reviewed-by: Randy Dunlap > > A few comments below. I also agree with Ingo's comment about > documenting the param-structure approach in addition to flags. Many thanks for looking through the doc in detail! I'll send out an updated doc today; I'm then on vacation for the next week so I'll collate further feedback after I return. >> --- /dev/null >> +++ b/Documentation/adding-syscalls.txt >> @@ -0,0 +1,463 @@ >> +Adding a New System Call >> + >> + >> +This document describes what's involved in adding a new system call to the >> +Linux kernel, over and above the normal submission advice in >> +Documentation/SubmittingPatches. >> + >> + >> +System Call Alternatives >> + >> + >> +The first thing to consider when adding a new system call is whether one of >> +the alternatives might be suitable instead. Although system calls are the >> +most traditional and most obvious interaction points between userspace and >> the >> +kernel, there are other possibilities -- choose what fits best for your >> +interface. >> + >> + - If the operations involved can be made to look like a filesystem-like >> + object, it may make more sense to create a new filesystem or device. >> This >> + also makes it easier to encapsulate the new functionality in a kernel >> module >> + rather than requiring it to be built into the main kernel. >> + - If the new functionality involves operations where the kernel >> notifies >> + userspace that something has happened, then returning a new file >> + descriptor for the relevant object allows userspace to use >> + poll/select/epoll to receive that notification. >> + - However, operations that don't map to read(2)/write(2)-like >> operations >> + have to be implemented as ioctl(2) requests, which can lead to a >> + somewhat opaque API. >> + - If you're just exposing runtime system information, a new node in sysfs >> + (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be >> + more appropriate. However, access to these mechanisms requires that the >> + relevant filesystem is mounted, which might not always be the case (e.g. >> + in a namespaced/sandboxed/chrooted environment). Avoid adding anything >> to >> + debugfs, as this is not considered a 'production' interface to userspace. > > s/anything/any "API"/ Done. >> + - If the operation is specific to a particular file or file descriptor, >> then >> + an additional fcntl(2) command option may be more appropriate. However, >> + fcntl(2) is a multiplexing system call that hides a lot of complexity, so >> + this option is best for when the new function is closely analogous to >> + existing fcntl(2) functionality, or the new functionality is very simple >> + (for example, getting/setting a simple flag related to a file >> descriptor). >> + - If the operation is specific to a particular task or process, then an >> + additional prctl(2) command option may be more appropriate. As with >> + fcntl(2), this system call is a complicated multiplexor so is best >> reserved >> + for near-analogs of existing prctl() commands or getting/setting a simple >> + flag related to a process. >> + >> + >> +Designing the API >> +- >> + >> +A new system call forms part of the API of the kernel, and has to be >> supported >> +indefinitely. As such, it's a very good idea to explicitly discuss the >> +interface on the kernel mailing list, and to plan for future extensions of >> the >> +interface. In particular: >> + >> + **Include a flags argument for every new system call** >> + >> +The syscall table is littered with historical examples where this wasn't >> done, >> +together with the corresponding follow-up system calls (eventfd/eventfd2, >> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so >> +learn from the history of the kernel and include a flags argument from the >> +start. >> + >> +Also, to make sure that userspace programs can safely use flags between >> kernel >> +versions, check whether the flags value holds any unknown flags, and reject >> the >> +sycall (with EINVAL) if it does: >> + >> +if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) >> +return -EINVAL; >> + >> +(If no flags values are used yet, check that the flags argument is zero.) > > Some things to add a this point, before talking about adding a new file > descriptor: > > If you want to provide userspace with a handle to a
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 03:54:56PM -0700, Andy Lutomirski wrote: On Fri, Jul 31, 2015 at 3:08 PM, j...@joshtriplett.org wrote: On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote: On Fri, Jul 31, 2015 at 1:59 PM, j...@joshtriplett.org wrote: Agreed. I think the proposal above would be a net improvement, but ideally you'd want something that's annotated and generates automatic marshalling code. I assume this is idle musing. If, however, we were to actually do this, I'd suggest we seriously consider speaking the Cap'n Proto serialization format. It's quite nice, it encodes and decodes *very* quickly and, unlike TLV schemes, you don't have to read it in order, making the read-side code less awkward. That seems like *massive* overkill for a kernel-userspace syscall interface. I was more thinking about having a few standardized marshal types, and incrementally adding more when more patterns show up. For a first pass, just automatically running copy_from_user and copy_param_struct on appropriate sets of __user parameters identified as such in a structured text file seems quite sufficient. (Plus automatically generating syscalls.h from that.) If a param struct does the trick, then I agree. It's when you start having lists and other variable-size stuff that it gets messier. Sure, agreed. But I really hope we don't create new kernel ABIs that involve constructs like that. - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On 07/31/2015 09:32 PM, Josh Triplett wrote: Sure, agreed. But I really hope we don't create new kernel ABIs that involve constructs like that. It's worth noting I have pushed for auto-marshalling in general for a long time, not the least to get rid of the awful syscall(3) wrapper. I even built a prototype but didn't have time to productize it. -hpa -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote: On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: Add a document describing the process of adding a new system call, including the need for a flags argument for future compatibility, and covering 32-bit/64-bit concerns (albeit in an x86-centric way). Signed-off-by: David Drysdale drysd...@google.com Reviewed-by: Michael Kerrisk mtk.manpa...@gmail.com Reviewed-by: Eric B Munson emun...@akamai.com Reviewed-by: Kees Cook keesc...@chromium.org Reviewed-by: Randy Dunlap rdun...@infradead.org A few comments below. I also agree with Ingo's comment about documenting the param-structure approach in addition to flags. Many thanks for looking through the doc in detail! I'll send out an updated doc today; I'm then on vacation for the next week so I'll collate further feedback after I return. Thanks for the quick response and for incorporating these changes. Some things to add a this point, before talking about adding a new file descriptor: If you want to provide userspace with a handle to a kernel object, do so in the form of a file descriptor. Do not invent a new type of userspace object handle with its own namespace. The kernel already has numerous system calls and well-defined semantics for managing and passing around file descriptors, as well as for cleaning up when the file descriptor is no longer in use. If your new system call returns a file descriptor, think about what it means to use the poll(2) family of syscalls on that file descriptor. If you want to send an event to userspace associated with your kernel object, you should do so by making your file descriptor ready for reading or writing. Good points, I've added a couple of paragraphs along these lines: If your new system call allows userspace to refer to a kernel object, it should use a file descriptor as the handle for that object -- don't invent a new type of userspace object handle when the kernel already has mechanisms and well-defined semantics for using file descriptors. [snip: existing text on CLOEXEC] If your system call returns a new file descriptor, you should also consider what it means to use the poll(2) family of system calls on that file descriptor. Making a file descriptor ready for reading or writing is the normal way for the kernel to indicate to userspace that an event has occurred on the corresponding kernel object. These look good to me. +needs to be governed by the appropriate Linux capability bit (checked with a +call to capable()), as described in the capabilities(7) man page. + + - If there is an existing capability that governs related functionality, then + use that. However, avoid combining lots of only vaguely related functions + together under the same bit, as this goes against capabilities' purpose of + splitting the power of root. In particular, avoid adding new uses of the + already overly-general CAP_SYS_ADMIN capability. + - If there is no related capability, then consider adding a new capability + bit -- but bear in mind that the numbering space is limited, and each new + bit needs to be understood and administered by sysadmins. Many previous discussions on this topic seem to have concluded that it's almost impossible to add a new capability without breaking existing programs. I would suggest not even mentioning this possibility. I'm not particularly knowledgable about capabilities (at least, not the POSIX.1e/Linux kind), so I'll confess that I got this suggestion from Michael Kerrisk. Michael, do you agree that we should just drop the possibility of adding new capability bits? Also, Josh, do you have any references to the earlier discussions on the topic so I can update the Sources section? No direct links handy at the moment without some searching, but one iteration of it came up when Matthew Garrett suggested adding CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability semantics suggested that the way the capability space was defined and controlled by userspace meant that adding any new capability would effectively break userspace ABI. The userspace ABI for capabilities is not clear; some applications drop all possible capabilities and could get surprised by a new capability being dropped, while other applications drop only capabilities they know about and could get surprised by a new capability *not* being dropped. - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 2:06 PM, Josh Triplett j...@joshtriplett.org wrote: On Fri, Jul 31, 2015 at 10:48:32AM +0100, David Drysdale wrote: On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: +needs to be governed by the appropriate Linux capability bit (checked with a +call to capable()), as described in the capabilities(7) man page. + + - If there is an existing capability that governs related functionality, then + use that. However, avoid combining lots of only vaguely related functions + together under the same bit, as this goes against capabilities' purpose of + splitting the power of root. In particular, avoid adding new uses of the + already overly-general CAP_SYS_ADMIN capability. + - If there is no related capability, then consider adding a new capability + bit -- but bear in mind that the numbering space is limited, and each new + bit needs to be understood and administered by sysadmins. Many previous discussions on this topic seem to have concluded that it's almost impossible to add a new capability without breaking existing programs. I would suggest not even mentioning this possibility. I'm not particularly knowledgable about capabilities (at least, not the POSIX.1e/Linux kind), so I'll confess that I got this suggestion from Michael Kerrisk. Michael, do you agree that we should just drop the possibility of adding new capability bits? Also, Josh, do you have any references to the earlier discussions on the topic so I can update the Sources section? No direct links handy at the moment without some searching, but one iteration of it came up when Matthew Garrett suggested adding CAP_COMPROMISE_KERNEL, and the ensuing discussion of capability semantics suggested that the way the capability space was defined and controlled by userspace meant that adding any new capability would effectively break userspace ABI. The userspace ABI for capabilities is not clear; some applications drop all possible capabilities and could get surprised by a new capability being dropped, while other applications drop only capabilities they know about and could get surprised by a new capability *not* being dropped. BTW, I left this paragraph unchanged in the v3 version I just sent out -- I'll update it for v4 when I get back from vacation, depending on what discussion occurs in the meantime... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: I like this, it's a good description of both options. I'm still biased about the approach: I prefer flags, since pointers to user structures complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. It's been suggested in the past that nlmsg is appropriate for such a thing, but I remain suspicious. :) Likewise. :) 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). Yeah, this is a irritating part: the structures operated on are copied from userspace adhoc in each syscall. Doing argument checking would mean double copies initially, and perhaps teaching syscalls about optional already copied arguments or something as an optimization. No, double copies can't work for security reasons. Because otherwise you could race the kernel from another thread, substituting different values after the check and before the use. Right, the double copy method would require setting up a per-thread userspace memory mapping that was read-only from userspace but writable from kernel space. I think the right API looks *roughly* like this: int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, void __user *user_struct) { if (user_len kernel_len) return -EINVAL; if (user_len copy_from_user(kernel_struct, user_struct, user_len)) return -EFAULT; if (user_len kernel_len) memset(kernel_struct + user_len, 0, kernel_len - user_len); return 0; } #define copy_param_struct(kernel_struct, user_len, user_struct) _copy_param_struct( \ sizeof(*kernel_struct) + BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ kernel_struct, user_len, user_struct) Then the syscall looks like this: SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params __user *user_params) { int ret; struct xyzzy_params params; ret = copy_param_struct(params, user_params_len, user_params); if (ret) return ret; ... And you could then hook copy_params_struct to add arbitrary additional syscall parameter validation. Bonus if there's some way to make the copy and validation occur before the syscall is ever invoked, rather than inside the syscall, but that would require adding fancier syscall definition mechanisms that autogenerate such code. The trouble is that the hook for the syscall (both seccomp and ptrace) happens before the sys_* function executes. So the param extract suddenly becomes optional. As in, did ptrace/seccomp already extract the args? If so, use that copy, else copy them out myself now that I need them, etc. It's entirely doable, but it's going to require some careful design. -Kees -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote: On Fri, Jul 31, 2015 at 1:59 PM, j...@joshtriplett.org wrote: Agreed. I think the proposal above would be a net improvement, but ideally you'd want something that's annotated and generates automatic marshalling code. I assume this is idle musing. If, however, we were to actually do this, I'd suggest we seriously consider speaking the Cap'n Proto serialization format. It's quite nice, it encodes and decodes *very* quickly and, unlike TLV schemes, you don't have to read it in order, making the read-side code less awkward. That seems like *massive* overkill for a kernel-userspace syscall interface. I was more thinking about having a few standardized marshal types, and incrementally adding more when more patterns show up. For a first pass, just automatically running copy_from_user and copy_param_struct on appropriate sets of __user parameters identified as such in a structured text file seems quite sufficient. (Plus automatically generating syscalls.h from that.) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote: On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: I like this, it's a good description of both options. I'm still biased about the approach: I prefer flags, since pointers to user structures complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. It's been suggested in the past that nlmsg is appropriate for such a thing, but I remain suspicious. :) Likewise. :) 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). Yeah, this is a irritating part: the structures operated on are copied from userspace adhoc in each syscall. Doing argument checking would mean double copies initially, and perhaps teaching syscalls about optional already copied arguments or something as an optimization. No, double copies can't work for security reasons. Because otherwise you could race the kernel from another thread, substituting different values after the check and before the use. Right, the double copy method would require setting up a per-thread userspace memory mapping that was read-only from userspace but writable from kernel space. Which seems like a lot more trouble than just copying it once. I think the right API looks *roughly* like this: int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, void __user *user_struct) { if (user_len kernel_len) return -EINVAL; if (user_len copy_from_user(kernel_struct, user_struct, user_len)) return -EFAULT; if (user_len kernel_len) memset(kernel_struct + user_len, 0, kernel_len - user_len); return 0; } #define copy_param_struct(kernel_struct, user_len, user_struct) _copy_param_struct( \ sizeof(*kernel_struct) + BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ kernel_struct, user_len, user_struct) Then the syscall looks like this: SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params __user *user_params) { int ret; struct xyzzy_params params; ret = copy_param_struct(params, user_params_len, user_params); if (ret) return ret; ... And you could then hook copy_params_struct to add arbitrary additional syscall parameter validation. Bonus if there's some way to make the copy and validation occur before the syscall is ever invoked, rather than inside the syscall, but that would require adding fancier syscall definition mechanisms that autogenerate such code. The trouble is that the hook for the syscall (both seccomp and ptrace) happens before the sys_* function executes. So the param extract suddenly becomes optional. As in, did ptrace/seccomp already extract the args? If so, use that copy, else copy them out myself now that I need them, etc. It's entirely doable, but it's going to require some careful design. Agreed. I think the proposal above would be a net improvement, but ideally you'd want something that's annotated and generates automatic marshalling code. - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 1:59 PM, j...@joshtriplett.org wrote: On Fri, Jul 31, 2015 at 11:56:06AM -0700, Kees Cook wrote: On Thu, Jul 30, 2015 at 6:02 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: I like this, it's a good description of both options. I'm still biased about the approach: I prefer flags, since pointers to user structures complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. It's been suggested in the past that nlmsg is appropriate for such a thing, but I remain suspicious. :) Likewise. :) 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). Yeah, this is a irritating part: the structures operated on are copied from userspace adhoc in each syscall. Doing argument checking would mean double copies initially, and perhaps teaching syscalls about optional already copied arguments or something as an optimization. No, double copies can't work for security reasons. Because otherwise you could race the kernel from another thread, substituting different values after the check and before the use. Right, the double copy method would require setting up a per-thread userspace memory mapping that was read-only from userspace but writable from kernel space. Which seems like a lot more trouble than just copying it once. I think the right API looks *roughly* like this: int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, void __user *user_struct) { if (user_len kernel_len) return -EINVAL; if (user_len copy_from_user(kernel_struct, user_struct, user_len)) return -EFAULT; if (user_len kernel_len) memset(kernel_struct + user_len, 0, kernel_len - user_len); return 0; } #define copy_param_struct(kernel_struct, user_len, user_struct) _copy_param_struct( \ sizeof(*kernel_struct) + BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ kernel_struct, user_len, user_struct) Then the syscall looks like this: SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params __user *user_params) { int ret; struct xyzzy_params params; ret = copy_param_struct(params, user_params_len, user_params); if (ret) return ret; ... And you could then hook copy_params_struct to add arbitrary additional syscall parameter validation. Bonus if there's some way to make the copy and validation occur before the syscall is ever invoked, rather than inside the syscall, but that would require adding fancier syscall definition mechanisms that autogenerate such code. The trouble is that the hook for the syscall (both seccomp and ptrace) happens before the sys_* function executes. So the param extract suddenly becomes optional. As in, did ptrace/seccomp already extract the args? If so, use that copy, else copy them out myself now that I need them, etc. It's entirely doable, but it's going to require some careful design. Agreed. I think the proposal above would be a net improvement, but ideally you'd want something that's annotated and generates automatic marshalling code. I assume this is idle musing. If, however, we were to actually do this, I'd suggest we seriously consider speaking the Cap'n Proto serialization format. It's quite nice, it encodes and decodes *very* quickly and, unlike TLV schemes, you don't have to read it in order, making the read-side code less awkward. (I wouldn't suggest that we use the official C++11 library, but the format might be a good idea.) --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Fri, Jul 31, 2015 at 3:08 PM, j...@joshtriplett.org wrote: On Fri, Jul 31, 2015 at 02:19:29PM -0700, Andy Lutomirski wrote: On Fri, Jul 31, 2015 at 1:59 PM, j...@joshtriplett.org wrote: Agreed. I think the proposal above would be a net improvement, but ideally you'd want something that's annotated and generates automatic marshalling code. I assume this is idle musing. If, however, we were to actually do this, I'd suggest we seriously consider speaking the Cap'n Proto serialization format. It's quite nice, it encodes and decodes *very* quickly and, unlike TLV schemes, you don't have to read it in order, making the read-side code less awkward. That seems like *massive* overkill for a kernel-userspace syscall interface. I was more thinking about having a few standardized marshal types, and incrementally adding more when more patterns show up. For a first pass, just automatically running copy_from_user and copy_param_struct on appropriate sets of __user parameters identified as such in a structured text file seems quite sufficient. (Plus automatically generating syscalls.h from that.) If a param struct does the trick, then I agree. It's when you start having lists and other variable-size stuff that it gets messier. For reference, my Cap'n Proto parser (just the basic structure of messages, not the actual schema bits, although the latter is more or less just some structs) is about 300 lines of code. It's arguably simpler than nlattr, once you throw nesting in. --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 7:50 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: Add a document describing the process of adding a new system call, including the need for a flags argument for future compatibility, and covering 32-bit/64-bit concerns (albeit in an x86-centric way). Signed-off-by: David Drysdale drysd...@google.com Reviewed-by: Michael Kerrisk mtk.manpa...@gmail.com Reviewed-by: Eric B Munson emun...@akamai.com Reviewed-by: Kees Cook keesc...@chromium.org Reviewed-by: Randy Dunlap rdun...@infradead.org A few comments below. I also agree with Ingo's comment about documenting the param-structure approach in addition to flags. Many thanks for looking through the doc in detail! I'll send out an updated doc today; I'm then on vacation for the next week so I'll collate further feedback after I return. --- /dev/null +++ b/Documentation/adding-syscalls.txt @@ -0,0 +1,463 @@ +Adding a New System Call + + +This document describes what's involved in adding a new system call to the +Linux kernel, over and above the normal submission advice in +Documentation/SubmittingPatches. + + +System Call Alternatives + + +The first thing to consider when adding a new system call is whether one of +the alternatives might be suitable instead. Although system calls are the +most traditional and most obvious interaction points between userspace and the +kernel, there are other possibilities -- choose what fits best for your +interface. + + - If the operations involved can be made to look like a filesystem-like + object, it may make more sense to create a new filesystem or device. This + also makes it easier to encapsulate the new functionality in a kernel module + rather than requiring it to be built into the main kernel. + - If the new functionality involves operations where the kernel notifies + userspace that something has happened, then returning a new file + descriptor for the relevant object allows userspace to use + poll/select/epoll to receive that notification. + - However, operations that don't map to read(2)/write(2)-like operations + have to be implemented as ioctl(2) requests, which can lead to a + somewhat opaque API. + - If you're just exposing runtime system information, a new node in sysfs + (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be + more appropriate. However, access to these mechanisms requires that the + relevant filesystem is mounted, which might not always be the case (e.g. + in a namespaced/sandboxed/chrooted environment). Avoid adding anything to + debugfs, as this is not considered a 'production' interface to userspace. s/anything/any API/ Done. + - If the operation is specific to a particular file or file descriptor, then + an additional fcntl(2) command option may be more appropriate. However, + fcntl(2) is a multiplexing system call that hides a lot of complexity, so + this option is best for when the new function is closely analogous to + existing fcntl(2) functionality, or the new functionality is very simple + (for example, getting/setting a simple flag related to a file descriptor). + - If the operation is specific to a particular task or process, then an + additional prctl(2) command option may be more appropriate. As with + fcntl(2), this system call is a complicated multiplexor so is best reserved + for near-analogs of existing prctl() commands or getting/setting a simple + flag related to a process. + + +Designing the API +- + +A new system call forms part of the API of the kernel, and has to be supported +indefinitely. As such, it's a very good idea to explicitly discuss the +interface on the kernel mailing list, and to plan for future extensions of the +interface. In particular: + + **Include a flags argument for every new system call** + +The syscall table is littered with historical examples where this wasn't done, +together with the corresponding follow-up system calls (eventfd/eventfd2, +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so +learn from the history of the kernel and include a flags argument from the +start. + +Also, to make sure that userspace programs can safely use flags between kernel +versions, check whether the flags value holds any unknown flags, and reject the +sycall (with EINVAL) if it does: + +if (flags ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) +return -EINVAL; + +(If no flags values are used yet, check that the flags argument is zero.) Some things to add a this point, before talking about adding a new file descriptor: If you want to provide userspace with a handle to a kernel object, do so in the form of a file descriptor. Do not invent a new type of
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 06:02:34PM -0700, Josh Triplett wrote: > On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: > > On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett > > wrote: > > > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: > > >> I like this, it's a good description of both options. I'm still biased > > >> about the approach: I prefer flags, since pointers to user structures > > >> complicate syscall filtering. ;) > > > > > > Seems like we should do two things to make that easier: > > > > > > 1) Create a standardized kernel mechanism for parameter-struct handling, > > >implementing the recommendations mentioned here. > > > > It's been suggested in the past that nlmsg is appropriate for such a > > thing, but I remain suspicious. :) > > Likewise. :) > > > > 2) Integrate into that mechanism a way to filter the resulting parameter > > >struct with BPF *after* it has been copied to kernel space (and thus > > >can no longer be tampered with). > > > > Yeah, this is a irritating part: the structures operated on are copied > > from userspace adhoc in each syscall. Doing argument checking would > > mean double copies initially, and perhaps teaching syscalls about > > optional "already copied" arguments or something as an optimization. > > No, double copies can't work for security reasons. Because otherwise > you could race the kernel from another thread, substituting different > values after the check and before the use. > > I think the right API looks *roughly* like this: > > int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t > user_len, void __user *user_struct) > { > if (user_len > kernel_len) > return -EINVAL; > if (user_len && copy_from_user(kernel_struct, user_struct, user_len)) > return -EFAULT; > if (user_len < kernel_len) > memset(kernel_struct + user_len, 0, kernel_len - user_len); > return 0; > } > > #define copy_param_struct(kernel_struct, user_len, user_struct) > _copy_param_struct( \ > sizeof(*kernel_struct) + > BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ > kernel_struct, user_len, user_struct) > > > Then the syscall looks like this: > > SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params > __user *user_params) Missed a couple of commas here (after the types and before the names). > { > int ret; > struct xyzzy_params params; > > ret = copy_param_struct(, user_params_len, user_params); > if (ret) > return ret; > ... > > > And you could then hook copy_params_struct to add arbitrary additional > syscall parameter validation. Bonus if there's some way to make the > copy and validation occur before the syscall is ever invoked, rather > than inside the syscall, but that would require adding fancier syscall > definition mechanisms that autogenerate such code. > > - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: > On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett wrote: > > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: > >> I like this, it's a good description of both options. I'm still biased > >> about the approach: I prefer flags, since pointers to user structures > >> complicate syscall filtering. ;) > > > > Seems like we should do two things to make that easier: > > > > 1) Create a standardized kernel mechanism for parameter-struct handling, > >implementing the recommendations mentioned here. > > It's been suggested in the past that nlmsg is appropriate for such a > thing, but I remain suspicious. :) Likewise. :) > > 2) Integrate into that mechanism a way to filter the resulting parameter > >struct with BPF *after* it has been copied to kernel space (and thus > >can no longer be tampered with). > > Yeah, this is a irritating part: the structures operated on are copied > from userspace adhoc in each syscall. Doing argument checking would > mean double copies initially, and perhaps teaching syscalls about > optional "already copied" arguments or something as an optimization. No, double copies can't work for security reasons. Because otherwise you could race the kernel from another thread, substituting different values after the check and before the use. I think the right API looks *roughly* like this: int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, void __user *user_struct) { if (user_len > kernel_len) return -EINVAL; if (user_len && copy_from_user(kernel_struct, user_struct, user_len)) return -EFAULT; if (user_len < kernel_len) memset(kernel_struct + user_len, 0, kernel_len - user_len); return 0; } #define copy_param_struct(kernel_struct, user_len, user_struct) _copy_param_struct( \ sizeof(*kernel_struct) + BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ kernel_struct, user_len, user_struct) Then the syscall looks like this: SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params __user *user_params) { int ret; struct xyzzy_params params; ret = copy_param_struct(, user_params_len, user_params); if (ret) return ret; ... And you could then hook copy_params_struct to add arbitrary additional syscall parameter validation. Bonus if there's some way to make the copy and validation occur before the syscall is ever invoked, rather than inside the syscall, but that would require adding fancier syscall definition mechanisms that autogenerate such code. - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett wrote: > On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: >> I like this, it's a good description of both options. I'm still biased >> about the approach: I prefer flags, since pointers to user structures >> complicate syscall filtering. ;) > > Seems like we should do two things to make that easier: > > 1) Create a standardized kernel mechanism for parameter-struct handling, >implementing the recommendations mentioned here. It's been suggested in the past that nlmsg is appropriate for such a thing, but I remain suspicious. :) > 2) Integrate into that mechanism a way to filter the resulting parameter >struct with BPF *after* it has been copied to kernel space (and thus >can no longer be tampered with). Yeah, this is a irritating part: the structures operated on are copied from userspace adhoc in each syscall. Doing argument checking would mean double copies initially, and perhaps teaching syscalls about optional "already copied" arguments or something as an optimization. -Kees -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: > I like this, it's a good description of both options. I'm still biased > about the approach: I prefer flags, since pointers to user structures > complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: > Add a document describing the process of adding a new system call, > including the need for a flags argument for future compatibility, and > covering 32-bit/64-bit concerns (albeit in an x86-centric way). > > Signed-off-by: David Drysdale > Reviewed-by: Michael Kerrisk > Reviewed-by: Eric B Munson > Reviewed-by: Kees Cook > Reviewed-by: Randy Dunlap A few comments below. I also agree with Ingo's comment about documenting the param-structure approach in addition to flags. > --- /dev/null > +++ b/Documentation/adding-syscalls.txt > @@ -0,0 +1,463 @@ > +Adding a New System Call > + > + > +This document describes what's involved in adding a new system call to the > +Linux kernel, over and above the normal submission advice in > +Documentation/SubmittingPatches. > + > + > +System Call Alternatives > + > + > +The first thing to consider when adding a new system call is whether one of > +the alternatives might be suitable instead. Although system calls are the > +most traditional and most obvious interaction points between userspace and > the > +kernel, there are other possibilities -- choose what fits best for your > +interface. > + > + - If the operations involved can be made to look like a filesystem-like > + object, it may make more sense to create a new filesystem or device. This > + also makes it easier to encapsulate the new functionality in a kernel > module > + rather than requiring it to be built into the main kernel. > + - If the new functionality involves operations where the kernel notifies > + userspace that something has happened, then returning a new file > + descriptor for the relevant object allows userspace to use > + poll/select/epoll to receive that notification. > + - However, operations that don't map to read(2)/write(2)-like operations > + have to be implemented as ioctl(2) requests, which can lead to a > + somewhat opaque API. > + - If you're just exposing runtime system information, a new node in sysfs > + (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be > + more appropriate. However, access to these mechanisms requires that the > + relevant filesystem is mounted, which might not always be the case (e.g. > + in a namespaced/sandboxed/chrooted environment). Avoid adding anything to > + debugfs, as this is not considered a 'production' interface to userspace. s/anything/any "API"/ > + - If the operation is specific to a particular file or file descriptor, then > + an additional fcntl(2) command option may be more appropriate. However, > + fcntl(2) is a multiplexing system call that hides a lot of complexity, so > + this option is best for when the new function is closely analogous to > + existing fcntl(2) functionality, or the new functionality is very simple > + (for example, getting/setting a simple flag related to a file descriptor). > + - If the operation is specific to a particular task or process, then an > + additional prctl(2) command option may be more appropriate. As with > + fcntl(2), this system call is a complicated multiplexor so is best > reserved > + for near-analogs of existing prctl() commands or getting/setting a simple > + flag related to a process. > + > + > +Designing the API > +- > + > +A new system call forms part of the API of the kernel, and has to be > supported > +indefinitely. As such, it's a very good idea to explicitly discuss the > +interface on the kernel mailing list, and to plan for future extensions of > the > +interface. In particular: > + > + **Include a flags argument for every new system call** > + > +The syscall table is littered with historical examples where this wasn't > done, > +together with the corresponding follow-up system calls (eventfd/eventfd2, > +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so > +learn from the history of the kernel and include a flags argument from the > +start. > + > +Also, to make sure that userspace programs can safely use flags between > kernel > +versions, check whether the flags value holds any unknown flags, and reject > the > +sycall (with EINVAL) if it does: > + > +if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) > +return -EINVAL; > + > +(If no flags values are used yet, check that the flags argument is zero.) Some things to add a this point, before talking about adding a new file descriptor: If you want to provide userspace with a handle to a kernel object, do so in the form of a file descriptor. Do not invent a new type of userspace object handle with its own namespace. The kernel already has numerous system calls and well-defined semantics for managing and passing around file descriptors, as well as for cleaning up when the file descriptor is no longer in use. If your new system call returns a file
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 10:38:31AM +0200, Ingo Molnar wrote: > When the system call is extended in the future on the kernel side, with 'u64 > param_4', then the structure expands from an old size of 24 to a new size of > 32 > bytes. The following scenarios might occur: > > - the common case: new user-space calls the new kernel code, ->size is 32 on > both >sides. > > - old binaries might call the kernel with params->size == 24, in which case > the >kernel sets the new fields to 0. The new feature should be written >accordingly, so that a value of 0 means the old behavior. > > - new binaries might run on old kernels, with params->size == 32. In this > case >the old kernel will check that all the new fields it does not know about > are >set to 0 - if they are nonzero (if the new feature is used) it returns > with >-ENOSYS or -EINVAL. Nit: it seems easier, rather than having the kernel check this, to have userspace only use the minimum version of the structure that contains the features they use, and then have older kernels reject sizes bigger than they understand. If you don't need param_4, don't use the version of the structure that has param_4. That also means the kernel doesn't need to copy and read those values. Either approach works, though. - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 4:10 AM, David Drysdale wrote: > On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar wrote: >> >> * David Drysdale wrote: >> >>> +Designing the API >>> +- >>> + >>> +A new system call forms part of the API of the kernel, and has to be >>> supported >>> +indefinitely. As such, it's a very good idea to explicitly discuss the >>> +interface on the kernel mailing list, and to plan for future extensions of >>> the >>> +interface. In particular: >>> + >>> + **Include a flags argument for every new system call** >> >> Sorry, but I think that's bad avice, because even a 'flags' field is >> inflexible >> and stupid in many cases - it fosters an 'ioctl' kind of design. >> >>> +The syscall table is littered with historical examples where this wasn't >>> done, >>> +together with the corresponding follow-up system calls (eventfd/eventfd2, >>> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so >>> +learn from the history of the kernel and include a flags argument from the >>> +start. >> >> The syscall table is also littered with system calls that have an argument >> space >> considerably larger than what 6 parameters can express, where various >> 'flags' are >> used to bring in different parts of new APIs, in a rather messy way. >> >> The right approach IMHO is to think about how extensible a system call is >> expected >> to be, and to plan accordingly. >> >> If you are anywhere close to 6 parameters, you should not introduce 'flags' >> but >> you should _reduce_ the number of parameters to a clean essential of 2 or 3 >> parameters and should shuffle parameters out to a separate >> 'parameters/attributes' >> structure that is passed in by pointer: >> >> SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params); >> >> And it's the design of 'struct params' that determines future flexibility of >> the >> interface. A very flexible approach is to not use flags but a 'size' >> argument: >> >> struct params { >> u32 size; >> u32 param_1; >> u64 param_2; >> u64 param_3; >> }; >> >> Where 'size' is set by user-space to the size of 'struct params' known to it >> at >> build time: >> >> params->size = sizeof(*params); >> >> In the normal case the kernel will get param->size == sizeof(*params) as >> known to >> the kernel. >> >> When the system call is extended in the future on the kernel side, with 'u64 >> param_4', then the structure expands from an old size of 24 to a new size of >> 32 >> bytes. The following scenarios might occur: >> >> - the common case: new user-space calls the new kernel code, ->size is 32 >> on both >>sides. >> >> - old binaries might call the kernel with params->size == 24, in which case >> the >>kernel sets the new fields to 0. The new feature should be written >>accordingly, so that a value of 0 means the old behavior. >> >> - new binaries might run on old kernels, with params->size == 32. In this >> case >>the old kernel will check that all the new fields it does not know about >> are >>set to 0 - if they are nonzero (if the new feature is used) it returns >> with >>-ENOSYS or -EINVAL. >> >> With this approach we have both backwards and forwards binary compatibility: >> new >> binaries will run on old kernels just fine, even if they have ->size set to >> 32, as >> long as they make use of the features. >> >> This design simplifies application design considerably: as new code can >> mostly >> forget about old ABIs, there's no multiple versions to be taken care of, >> there's >> just a single 'struct param' known to both sides, and there's no version >> skew. >> >> We are using such a design in perf_event_open(), see perf_copy_attr() in >> kernel/events/core.c. And yes, ironically that system call still has a >> historic >> 'flags' argument, but it's not used anymore for extension: we've made over 30 >> extensions to the ABI in the last 3 years, which would have been impossible >> with a >> 'flags' approach. >> >> Thanks, >> >> Ingo > > Fair point, there are other ways to ensure extendability that just the > simple flags > approach -- and as you say, for more sophisticated interfaces flags might > hinder > more than they help. > > How about the attached text as an attempt to cover both bases? > > Thanks, > David > > > -- > > Designing the API: Planning for Extension > - > > A new system call forms part of the API of the kernel, and has to be supported > indefinitely. As such, it's a very good idea to explicitly discuss the > interface on the kernel mailing list, and it's important to plan for future > extensions of the interface. > > (The syscall table is littered with historical examples where this wasn't > done, > together with the corresponding follow-up system calls -- eventfd/eventfd2, > dup2/dup3,
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 06:30:07PM +0200, Cyril Hrubis wrote: > Hi! > > +Testing > > +--- > > + > > +A new system call should obviously be tested; it is also useful to provide > > +reviewers with a demonstration of how user space programs will use the > > system > > +call. A good way to combine these aims is to include a simple self-test > > +program in a new directory under tools/testing/selftests/. > > I know that this is a bit off topic, but since the selftest is now the > official place to add kernel testcases to let me rant about it a bit. > > It's a bit of a pain seeing people reinvent the wheel and trying to > figure out consistent test interface while most of the problems has been > solved in LTP[1] test library quite some time ago. Especially use of the > SAFE_MACROS[2] would simplify writing test setups quite a lot. > > I wonder if we can at least share the test library, pulling it out of > LTP, or at least the interesting parts, wouldn't be hard at all. > > [1] > https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure > > [2] > https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros That would be great, please send patches to do so to the linux-api mailing list and the test maintainer. thanks, greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
Hi! > +Testing > +--- > + > +A new system call should obviously be tested; it is also useful to provide > +reviewers with a demonstration of how user space programs will use the system > +call. A good way to combine these aims is to include a simple self-test > +program in a new directory under tools/testing/selftests/. I know that this is a bit off topic, but since the selftest is now the official place to add kernel testcases to let me rant about it a bit. It's a bit of a pain seeing people reinvent the wheel and trying to figure out consistent test interface while most of the problems has been solved in LTP[1] test library quite some time ago. Especially use of the SAFE_MACROS[2] would simplify writing test setups quite a lot. I wonder if we can at least share the test library, pulling it out of LTP, or at least the interesting parts, wouldn't be hard at all. [1] https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure [2] https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros -- Cyril Hrubis chru...@suse.cz -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar wrote: > > * David Drysdale wrote: > >> +Designing the API >> +- >> + >> +A new system call forms part of the API of the kernel, and has to be >> supported >> +indefinitely. As such, it's a very good idea to explicitly discuss the >> +interface on the kernel mailing list, and to plan for future extensions of >> the >> +interface. In particular: >> + >> + **Include a flags argument for every new system call** > > Sorry, but I think that's bad avice, because even a 'flags' field is > inflexible > and stupid in many cases - it fosters an 'ioctl' kind of design. > >> +The syscall table is littered with historical examples where this wasn't >> done, >> +together with the corresponding follow-up system calls (eventfd/eventfd2, >> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so >> +learn from the history of the kernel and include a flags argument from the >> +start. > > The syscall table is also littered with system calls that have an argument > space > considerably larger than what 6 parameters can express, where various 'flags' > are > used to bring in different parts of new APIs, in a rather messy way. > > The right approach IMHO is to think about how extensible a system call is > expected > to be, and to plan accordingly. > > If you are anywhere close to 6 parameters, you should not introduce 'flags' > but > you should _reduce_ the number of parameters to a clean essential of 2 or 3 > parameters and should shuffle parameters out to a separate > 'parameters/attributes' > structure that is passed in by pointer: > > SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params); > > And it's the design of 'struct params' that determines future flexibility of > the > interface. A very flexible approach is to not use flags but a 'size' argument: > > struct params { > u32 size; > u32 param_1; > u64 param_2; > u64 param_3; > }; > > Where 'size' is set by user-space to the size of 'struct params' known to it > at > build time: > > params->size = sizeof(*params); > > In the normal case the kernel will get param->size == sizeof(*params) as > known to > the kernel. > > When the system call is extended in the future on the kernel side, with 'u64 > param_4', then the structure expands from an old size of 24 to a new size of > 32 > bytes. The following scenarios might occur: > > - the common case: new user-space calls the new kernel code, ->size is 32 on > both >sides. > > - old binaries might call the kernel with params->size == 24, in which case > the >kernel sets the new fields to 0. The new feature should be written >accordingly, so that a value of 0 means the old behavior. > > - new binaries might run on old kernels, with params->size == 32. In this > case >the old kernel will check that all the new fields it does not know about > are >set to 0 - if they are nonzero (if the new feature is used) it returns with >-ENOSYS or -EINVAL. > > With this approach we have both backwards and forwards binary compatibility: > new > binaries will run on old kernels just fine, even if they have ->size set to > 32, as > long as they make use of the features. > > This design simplifies application design considerably: as new code can mostly > forget about old ABIs, there's no multiple versions to be taken care of, > there's > just a single 'struct param' known to both sides, and there's no version skew. > > We are using such a design in perf_event_open(), see perf_copy_attr() in > kernel/events/core.c. And yes, ironically that system call still has a > historic > 'flags' argument, but it's not used anymore for extension: we've made over 30 > extensions to the ABI in the last 3 years, which would have been impossible > with a > 'flags' approach. > > Thanks, > > Ingo Fair point, there are other ways to ensure extendability that just the simple flags approach -- and as you say, for more sophisticated interfaces flags might hinder more than they help. How about the attached text as an attempt to cover both bases? Thanks, David -- Designing the API: Planning for Extension - A new system call forms part of the API of the kernel, and has to be supported indefinitely. As such, it's a very good idea to explicitly discuss the interface on the kernel mailing list, and it's important to plan for future extensions of the interface. (The syscall table is littered with historical examples where this wasn't done, together with the corresponding follow-up system calls -- eventfd/eventfd2, dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2 -- so learn from the history of the kernel and plan for extensions from the start.) For simpler system calls that only take a couple of arguments, the preferred way to allow for future
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
* David Drysdale wrote: > +Designing the API > +- > + > +A new system call forms part of the API of the kernel, and has to be > supported > +indefinitely. As such, it's a very good idea to explicitly discuss the > +interface on the kernel mailing list, and to plan for future extensions of > the > +interface. In particular: > + > + **Include a flags argument for every new system call** Sorry, but I think that's bad avice, because even a 'flags' field is inflexible and stupid in many cases - it fosters an 'ioctl' kind of design. > +The syscall table is littered with historical examples where this wasn't > done, > +together with the corresponding follow-up system calls (eventfd/eventfd2, > +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so > +learn from the history of the kernel and include a flags argument from the > +start. The syscall table is also littered with system calls that have an argument space considerably larger than what 6 parameters can express, where various 'flags' are used to bring in different parts of new APIs, in a rather messy way. The right approach IMHO is to think about how extensible a system call is expected to be, and to plan accordingly. If you are anywhere close to 6 parameters, you should not introduce 'flags' but you should _reduce_ the number of parameters to a clean essential of 2 or 3 parameters and should shuffle parameters out to a separate 'parameters/attributes' structure that is passed in by pointer: SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params); And it's the design of 'struct params' that determines future flexibility of the interface. A very flexible approach is to not use flags but a 'size' argument: struct params { u32 size; u32 param_1; u64 param_2; u64 param_3; }; Where 'size' is set by user-space to the size of 'struct params' known to it at build time: params->size = sizeof(*params); In the normal case the kernel will get param->size == sizeof(*params) as known to the kernel. When the system call is extended in the future on the kernel side, with 'u64 param_4', then the structure expands from an old size of 24 to a new size of 32 bytes. The following scenarios might occur: - the common case: new user-space calls the new kernel code, ->size is 32 on both sides. - old binaries might call the kernel with params->size == 24, in which case the kernel sets the new fields to 0. The new feature should be written accordingly, so that a value of 0 means the old behavior. - new binaries might run on old kernels, with params->size == 32. In this case the old kernel will check that all the new fields it does not know about are set to 0 - if they are nonzero (if the new feature is used) it returns with -ENOSYS or -EINVAL. With this approach we have both backwards and forwards binary compatibility: new binaries will run on old kernels just fine, even if they have ->size set to 32, as long as they make use of the features. This design simplifies application design considerably: as new code can mostly forget about old ABIs, there's no multiple versions to be taken care of, there's just a single 'struct param' known to both sides, and there's no version skew. We are using such a design in perf_event_open(), see perf_copy_attr() in kernel/events/core.c. And yes, ironically that system call still has a historic 'flags' argument, but it's not used anymore for extension: we've made over 30 extensions to the ABI in the last 3 years, which would have been impossible with a 'flags' approach. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: I like this, it's a good description of both options. I'm still biased about the approach: I prefer flags, since pointers to user structures complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. It's been suggested in the past that nlmsg is appropriate for such a thing, but I remain suspicious. :) 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). Yeah, this is a irritating part: the structures operated on are copied from userspace adhoc in each syscall. Doing argument checking would mean double copies initially, and perhaps teaching syscalls about optional already copied arguments or something as an optimization. -Kees -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: I like this, it's a good description of both options. I'm still biased about the approach: I prefer flags, since pointers to user structures complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. It's been suggested in the past that nlmsg is appropriate for such a thing, but I remain suspicious. :) Likewise. :) 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). Yeah, this is a irritating part: the structures operated on are copied from userspace adhoc in each syscall. Doing argument checking would mean double copies initially, and perhaps teaching syscalls about optional already copied arguments or something as an optimization. No, double copies can't work for security reasons. Because otherwise you could race the kernel from another thread, substituting different values after the check and before the use. I think the right API looks *roughly* like this: int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, void __user *user_struct) { if (user_len kernel_len) return -EINVAL; if (user_len copy_from_user(kernel_struct, user_struct, user_len)) return -EFAULT; if (user_len kernel_len) memset(kernel_struct + user_len, 0, kernel_len - user_len); return 0; } #define copy_param_struct(kernel_struct, user_len, user_struct) _copy_param_struct( \ sizeof(*kernel_struct) + BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ kernel_struct, user_len, user_struct) Then the syscall looks like this: SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params __user *user_params) { int ret; struct xyzzy_params params; ret = copy_param_struct(params, user_params_len, user_params); if (ret) return ret; ... And you could then hook copy_params_struct to add arbitrary additional syscall parameter validation. Bonus if there's some way to make the copy and validation occur before the syscall is ever invoked, rather than inside the syscall, but that would require adding fancier syscall definition mechanisms that autogenerate such code. - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 06:02:34PM -0700, Josh Triplett wrote: On Thu, Jul 30, 2015 at 01:03:43PM -0700, Kees Cook wrote: On Thu, Jul 30, 2015 at 12:04 PM, Josh Triplett j...@joshtriplett.org wrote: On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: I like this, it's a good description of both options. I'm still biased about the approach: I prefer flags, since pointers to user structures complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. It's been suggested in the past that nlmsg is appropriate for such a thing, but I remain suspicious. :) Likewise. :) 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). Yeah, this is a irritating part: the structures operated on are copied from userspace adhoc in each syscall. Doing argument checking would mean double copies initially, and perhaps teaching syscalls about optional already copied arguments or something as an optimization. No, double copies can't work for security reasons. Because otherwise you could race the kernel from another thread, substituting different values after the check and before the use. I think the right API looks *roughly* like this: int _copy_param_struct(size_t kernel_len, void *kernel_struct, size_t user_len, void __user *user_struct) { if (user_len kernel_len) return -EINVAL; if (user_len copy_from_user(kernel_struct, user_struct, user_len)) return -EFAULT; if (user_len kernel_len) memset(kernel_struct + user_len, 0, kernel_len - user_len); return 0; } #define copy_param_struct(kernel_struct, user_len, user_struct) _copy_param_struct( \ sizeof(*kernel_struct) + BUILD_BUG_ON_ZERO(!__same_type(*kernel_struct, *user_struct)), \ kernel_struct, user_len, user_struct) Then the syscall looks like this: SYSCALL_DEFINEn(xyzzy, ..., ..., size_t user_params_len, struct xyzzy_params __user *user_params) Missed a couple of commas here (after the types and before the names). { int ret; struct xyzzy_params params; ret = copy_param_struct(params, user_params_len, user_params); if (ret) return ret; ... And you could then hook copy_params_struct to add arbitrary additional syscall parameter validation. Bonus if there's some way to make the copy and validation occur before the syscall is ever invoked, rather than inside the syscall, but that would require adding fancier syscall definition mechanisms that autogenerate such code. - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 11:21:54AM -0700, Kees Cook wrote: I like this, it's a good description of both options. I'm still biased about the approach: I prefer flags, since pointers to user structures complicate syscall filtering. ;) Seems like we should do two things to make that easier: 1) Create a standardized kernel mechanism for parameter-struct handling, implementing the recommendations mentioned here. 2) Integrate into that mechanism a way to filter the resulting parameter struct with BPF *after* it has been copied to kernel space (and thus can no longer be tampered with). - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
* David Drysdale drysd...@google.com wrote: +Designing the API +- + +A new system call forms part of the API of the kernel, and has to be supported +indefinitely. As such, it's a very good idea to explicitly discuss the +interface on the kernel mailing list, and to plan for future extensions of the +interface. In particular: + + **Include a flags argument for every new system call** Sorry, but I think that's bad avice, because even a 'flags' field is inflexible and stupid in many cases - it fosters an 'ioctl' kind of design. +The syscall table is littered with historical examples where this wasn't done, +together with the corresponding follow-up system calls (eventfd/eventfd2, +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so +learn from the history of the kernel and include a flags argument from the +start. The syscall table is also littered with system calls that have an argument space considerably larger than what 6 parameters can express, where various 'flags' are used to bring in different parts of new APIs, in a rather messy way. The right approach IMHO is to think about how extensible a system call is expected to be, and to plan accordingly. If you are anywhere close to 6 parameters, you should not introduce 'flags' but you should _reduce_ the number of parameters to a clean essential of 2 or 3 parameters and should shuffle parameters out to a separate 'parameters/attributes' structure that is passed in by pointer: SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params); And it's the design of 'struct params' that determines future flexibility of the interface. A very flexible approach is to not use flags but a 'size' argument: struct params { u32 size; u32 param_1; u64 param_2; u64 param_3; }; Where 'size' is set by user-space to the size of 'struct params' known to it at build time: params-size = sizeof(*params); In the normal case the kernel will get param-size == sizeof(*params) as known to the kernel. When the system call is extended in the future on the kernel side, with 'u64 param_4', then the structure expands from an old size of 24 to a new size of 32 bytes. The following scenarios might occur: - the common case: new user-space calls the new kernel code, -size is 32 on both sides. - old binaries might call the kernel with params-size == 24, in which case the kernel sets the new fields to 0. The new feature should be written accordingly, so that a value of 0 means the old behavior. - new binaries might run on old kernels, with params-size == 32. In this case the old kernel will check that all the new fields it does not know about are set to 0 - if they are nonzero (if the new feature is used) it returns with -ENOSYS or -EINVAL. With this approach we have both backwards and forwards binary compatibility: new binaries will run on old kernels just fine, even if they have -size set to 32, as long as they make use of the features. This design simplifies application design considerably: as new code can mostly forget about old ABIs, there's no multiple versions to be taken care of, there's just a single 'struct param' known to both sides, and there's no version skew. We are using such a design in perf_event_open(), see perf_copy_attr() in kernel/events/core.c. And yes, ironically that system call still has a historic 'flags' argument, but it's not used anymore for extension: we've made over 30 extensions to the ABI in the last 3 years, which would have been impossible with a 'flags' approach. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 4:10 AM, David Drysdale drysd...@google.com wrote: On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar mi...@kernel.org wrote: * David Drysdale drysd...@google.com wrote: +Designing the API +- + +A new system call forms part of the API of the kernel, and has to be supported +indefinitely. As such, it's a very good idea to explicitly discuss the +interface on the kernel mailing list, and to plan for future extensions of the +interface. In particular: + + **Include a flags argument for every new system call** Sorry, but I think that's bad avice, because even a 'flags' field is inflexible and stupid in many cases - it fosters an 'ioctl' kind of design. +The syscall table is littered with historical examples where this wasn't done, +together with the corresponding follow-up system calls (eventfd/eventfd2, +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so +learn from the history of the kernel and include a flags argument from the +start. The syscall table is also littered with system calls that have an argument space considerably larger than what 6 parameters can express, where various 'flags' are used to bring in different parts of new APIs, in a rather messy way. The right approach IMHO is to think about how extensible a system call is expected to be, and to plan accordingly. If you are anywhere close to 6 parameters, you should not introduce 'flags' but you should _reduce_ the number of parameters to a clean essential of 2 or 3 parameters and should shuffle parameters out to a separate 'parameters/attributes' structure that is passed in by pointer: SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params); And it's the design of 'struct params' that determines future flexibility of the interface. A very flexible approach is to not use flags but a 'size' argument: struct params { u32 size; u32 param_1; u64 param_2; u64 param_3; }; Where 'size' is set by user-space to the size of 'struct params' known to it at build time: params-size = sizeof(*params); In the normal case the kernel will get param-size == sizeof(*params) as known to the kernel. When the system call is extended in the future on the kernel side, with 'u64 param_4', then the structure expands from an old size of 24 to a new size of 32 bytes. The following scenarios might occur: - the common case: new user-space calls the new kernel code, -size is 32 on both sides. - old binaries might call the kernel with params-size == 24, in which case the kernel sets the new fields to 0. The new feature should be written accordingly, so that a value of 0 means the old behavior. - new binaries might run on old kernels, with params-size == 32. In this case the old kernel will check that all the new fields it does not know about are set to 0 - if they are nonzero (if the new feature is used) it returns with -ENOSYS or -EINVAL. With this approach we have both backwards and forwards binary compatibility: new binaries will run on old kernels just fine, even if they have -size set to 32, as long as they make use of the features. This design simplifies application design considerably: as new code can mostly forget about old ABIs, there's no multiple versions to be taken care of, there's just a single 'struct param' known to both sides, and there's no version skew. We are using such a design in perf_event_open(), see perf_copy_attr() in kernel/events/core.c. And yes, ironically that system call still has a historic 'flags' argument, but it's not used anymore for extension: we've made over 30 extensions to the ABI in the last 3 years, which would have been impossible with a 'flags' approach. Thanks, Ingo Fair point, there are other ways to ensure extendability that just the simple flags approach -- and as you say, for more sophisticated interfaces flags might hinder more than they help. How about the attached text as an attempt to cover both bases? Thanks, David -- Designing the API: Planning for Extension - A new system call forms part of the API of the kernel, and has to be supported indefinitely. As such, it's a very good idea to explicitly discuss the interface on the kernel mailing list, and it's important to plan for future extensions of the interface. (The syscall table is littered with historical examples where this wasn't done, together with the corresponding follow-up system calls -- eventfd/eventfd2, dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2 -- so learn from the history of the kernel and plan for extensions from the start.) For simpler system calls that only take a couple of arguments, the preferred way to allow
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 10:38:31AM +0200, Ingo Molnar wrote: When the system call is extended in the future on the kernel side, with 'u64 param_4', then the structure expands from an old size of 24 to a new size of 32 bytes. The following scenarios might occur: - the common case: new user-space calls the new kernel code, -size is 32 on both sides. - old binaries might call the kernel with params-size == 24, in which case the kernel sets the new fields to 0. The new feature should be written accordingly, so that a value of 0 means the old behavior. - new binaries might run on old kernels, with params-size == 32. In this case the old kernel will check that all the new fields it does not know about are set to 0 - if they are nonzero (if the new feature is used) it returns with -ENOSYS or -EINVAL. Nit: it seems easier, rather than having the kernel check this, to have userspace only use the minimum version of the structure that contains the features they use, and then have older kernels reject sizes bigger than they understand. If you don't need param_4, don't use the version of the structure that has param_4. That also means the kernel doesn't need to copy and read those values. Either approach works, though. - Josh Triplett -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 06:30:07PM +0200, Cyril Hrubis wrote: Hi! +Testing +--- + +A new system call should obviously be tested; it is also useful to provide +reviewers with a demonstration of how user space programs will use the system +call. A good way to combine these aims is to include a simple self-test +program in a new directory under tools/testing/selftests/. I know that this is a bit off topic, but since the selftest is now the official place to add kernel testcases to let me rant about it a bit. It's a bit of a pain seeing people reinvent the wheel and trying to figure out consistent test interface while most of the problems has been solved in LTP[1] test library quite some time ago. Especially use of the SAFE_MACROS[2] would simplify writing test setups quite a lot. I wonder if we can at least share the test library, pulling it out of LTP, or at least the interesting parts, wouldn't be hard at all. [1] https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure [2] https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros That would be great, please send patches to do so to the linux-api mailing list and the test maintainer. thanks, greg k-h -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
Hi! +Testing +--- + +A new system call should obviously be tested; it is also useful to provide +reviewers with a demonstration of how user space programs will use the system +call. A good way to combine these aims is to include a simple self-test +program in a new directory under tools/testing/selftests/. I know that this is a bit off topic, but since the selftest is now the official place to add kernel testcases to let me rant about it a bit. It's a bit of a pain seeing people reinvent the wheel and trying to figure out consistent test interface while most of the problems has been solved in LTP[1] test library quite some time ago. Especially use of the SAFE_MACROS[2] would simplify writing test setups quite a lot. I wonder if we can at least share the test library, pulling it out of LTP, or at least the interesting parts, wouldn't be hard at all. [1] https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#221-basic-test-structure [2] https://github.com/linux-test-project/ltp/wiki/Test-Writing-Guidelines#224-safe-macros -- Cyril Hrubis chru...@suse.cz -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 08:52:11AM +0100, David Drysdale wrote: Add a document describing the process of adding a new system call, including the need for a flags argument for future compatibility, and covering 32-bit/64-bit concerns (albeit in an x86-centric way). Signed-off-by: David Drysdale drysd...@google.com Reviewed-by: Michael Kerrisk mtk.manpa...@gmail.com Reviewed-by: Eric B Munson emun...@akamai.com Reviewed-by: Kees Cook keesc...@chromium.org Reviewed-by: Randy Dunlap rdun...@infradead.org A few comments below. I also agree with Ingo's comment about documenting the param-structure approach in addition to flags. --- /dev/null +++ b/Documentation/adding-syscalls.txt @@ -0,0 +1,463 @@ +Adding a New System Call + + +This document describes what's involved in adding a new system call to the +Linux kernel, over and above the normal submission advice in +Documentation/SubmittingPatches. + + +System Call Alternatives + + +The first thing to consider when adding a new system call is whether one of +the alternatives might be suitable instead. Although system calls are the +most traditional and most obvious interaction points between userspace and the +kernel, there are other possibilities -- choose what fits best for your +interface. + + - If the operations involved can be made to look like a filesystem-like + object, it may make more sense to create a new filesystem or device. This + also makes it easier to encapsulate the new functionality in a kernel module + rather than requiring it to be built into the main kernel. + - If the new functionality involves operations where the kernel notifies + userspace that something has happened, then returning a new file + descriptor for the relevant object allows userspace to use + poll/select/epoll to receive that notification. + - However, operations that don't map to read(2)/write(2)-like operations + have to be implemented as ioctl(2) requests, which can lead to a + somewhat opaque API. + - If you're just exposing runtime system information, a new node in sysfs + (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be + more appropriate. However, access to these mechanisms requires that the + relevant filesystem is mounted, which might not always be the case (e.g. + in a namespaced/sandboxed/chrooted environment). Avoid adding anything to + debugfs, as this is not considered a 'production' interface to userspace. s/anything/any API/ + - If the operation is specific to a particular file or file descriptor, then + an additional fcntl(2) command option may be more appropriate. However, + fcntl(2) is a multiplexing system call that hides a lot of complexity, so + this option is best for when the new function is closely analogous to + existing fcntl(2) functionality, or the new functionality is very simple + (for example, getting/setting a simple flag related to a file descriptor). + - If the operation is specific to a particular task or process, then an + additional prctl(2) command option may be more appropriate. As with + fcntl(2), this system call is a complicated multiplexor so is best reserved + for near-analogs of existing prctl() commands or getting/setting a simple + flag related to a process. + + +Designing the API +- + +A new system call forms part of the API of the kernel, and has to be supported +indefinitely. As such, it's a very good idea to explicitly discuss the +interface on the kernel mailing list, and to plan for future extensions of the +interface. In particular: + + **Include a flags argument for every new system call** + +The syscall table is littered with historical examples where this wasn't done, +together with the corresponding follow-up system calls (eventfd/eventfd2, +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so +learn from the history of the kernel and include a flags argument from the +start. + +Also, to make sure that userspace programs can safely use flags between kernel +versions, check whether the flags value holds any unknown flags, and reject the +sycall (with EINVAL) if it does: + +if (flags ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) +return -EINVAL; + +(If no flags values are used yet, check that the flags argument is zero.) Some things to add a this point, before talking about adding a new file descriptor: If you want to provide userspace with a handle to a kernel object, do so in the form of a file descriptor. Do not invent a new type of userspace object handle with its own namespace. The kernel already has numerous system calls and well-defined semantics for managing and passing around file descriptors, as well as for cleaning up when the file descriptor is no longer in use. If your new system call returns a file
Re: [PATCHv2 1/1] Documentation: describe how to add a system call
On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar mi...@kernel.org wrote: * David Drysdale drysd...@google.com wrote: +Designing the API +- + +A new system call forms part of the API of the kernel, and has to be supported +indefinitely. As such, it's a very good idea to explicitly discuss the +interface on the kernel mailing list, and to plan for future extensions of the +interface. In particular: + + **Include a flags argument for every new system call** Sorry, but I think that's bad avice, because even a 'flags' field is inflexible and stupid in many cases - it fosters an 'ioctl' kind of design. +The syscall table is littered with historical examples where this wasn't done, +together with the corresponding follow-up system calls (eventfd/eventfd2, +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so +learn from the history of the kernel and include a flags argument from the +start. The syscall table is also littered with system calls that have an argument space considerably larger than what 6 parameters can express, where various 'flags' are used to bring in different parts of new APIs, in a rather messy way. The right approach IMHO is to think about how extensible a system call is expected to be, and to plan accordingly. If you are anywhere close to 6 parameters, you should not introduce 'flags' but you should _reduce_ the number of parameters to a clean essential of 2 or 3 parameters and should shuffle parameters out to a separate 'parameters/attributes' structure that is passed in by pointer: SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params); And it's the design of 'struct params' that determines future flexibility of the interface. A very flexible approach is to not use flags but a 'size' argument: struct params { u32 size; u32 param_1; u64 param_2; u64 param_3; }; Where 'size' is set by user-space to the size of 'struct params' known to it at build time: params-size = sizeof(*params); In the normal case the kernel will get param-size == sizeof(*params) as known to the kernel. When the system call is extended in the future on the kernel side, with 'u64 param_4', then the structure expands from an old size of 24 to a new size of 32 bytes. The following scenarios might occur: - the common case: new user-space calls the new kernel code, -size is 32 on both sides. - old binaries might call the kernel with params-size == 24, in which case the kernel sets the new fields to 0. The new feature should be written accordingly, so that a value of 0 means the old behavior. - new binaries might run on old kernels, with params-size == 32. In this case the old kernel will check that all the new fields it does not know about are set to 0 - if they are nonzero (if the new feature is used) it returns with -ENOSYS or -EINVAL. With this approach we have both backwards and forwards binary compatibility: new binaries will run on old kernels just fine, even if they have -size set to 32, as long as they make use of the features. This design simplifies application design considerably: as new code can mostly forget about old ABIs, there's no multiple versions to be taken care of, there's just a single 'struct param' known to both sides, and there's no version skew. We are using such a design in perf_event_open(), see perf_copy_attr() in kernel/events/core.c. And yes, ironically that system call still has a historic 'flags' argument, but it's not used anymore for extension: we've made over 30 extensions to the ABI in the last 3 years, which would have been impossible with a 'flags' approach. Thanks, Ingo Fair point, there are other ways to ensure extendability that just the simple flags approach -- and as you say, for more sophisticated interfaces flags might hinder more than they help. How about the attached text as an attempt to cover both bases? Thanks, David -- Designing the API: Planning for Extension - A new system call forms part of the API of the kernel, and has to be supported indefinitely. As such, it's a very good idea to explicitly discuss the interface on the kernel mailing list, and it's important to plan for future extensions of the interface. (The syscall table is littered with historical examples where this wasn't done, together with the corresponding follow-up system calls -- eventfd/eventfd2, dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2 -- so learn from the history of the kernel and plan for extensions from the start.) For simpler system calls that only take a couple of arguments, the preferred way to allow for future extensibility is to include a flags argument to the system call. To make sure that userspace