Hi folks,

during this year's Summer of Code Charles Zhang and I implemented the posix_spawn syscall. With some further minor cleanup and debugging, it is now ready to commit. The main changes are fairly local, but to avoid code duplication a few of the existing file operations had to be modified, and this causes changes all over the tree.

Let's look at it from a userland perspective first: what use is posix_spawn? Historically the fork/exec model used in Unix has been pretty efficient, and later variants of it (vfork, which some call a hack) have made it even more efficient. Nowadays, with a lot of multithreaded applications, neither fits well. So posix_spawn was invented, and it is really simple to use. A minimalistic test program is:

    #include <stdio.h>
    #include <string.h>
    #include <spawn.h>

    int main(int argc, char **argv)
    {
        pid_t child = 0;
        int err;
        extern char **environ;
        char * const cav[3] = { "ls", "-l", NULL };

        printf("trying to spawn /bin/ls\n");
        /* posix_spawn returns 0 on success, an error number otherwise */
        err = posix_spawn(&child, "/bin/ls", NULL, NULL, cav, environ);
        printf("err: %d, child: %d\n", err, (int)child);

        return 0;
    }

If you don't want to hardcode /bin/ls, the posix_spawnp() variant is available, which uses the PATH environment variable to locate the binary:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <spawn.h>

    int main(int argc, char **argv)
    {
        pid_t child = 0;
        int err;
        extern char **environ;
        char * const cav[3] = { "ls", "-l", NULL };

        printf("trying to spawn any ls\n");
        err = posix_spawnp(&child, "ls", NULL, NULL, cav, environ);
        printf("err: %d, child: %d\n", err, (int)child);

        return 0;
    }

posix_spawnp() is implemented as a simple userland wrapper in libc, while posix_spawn itself is a real system call.

There are a few fancier things you can do by passing some of the arguments that are NULL in the simplistic examples above. They cause the kernel to close or dup file handles, adjust scheduler parameters, etc. In short: everything you would have done in the child process after a (v)fork. However, you do not have to go through the cloning of the VM space and manual userland adjustments before the exec; the kernel does all that for you.

Now let's look at the kernel changes: posix_spawn was implemented in a way that tries to avoid code duplication.
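
One more userland aside before going into the kernel: as a rough sketch of what the normally-NULL file actions argument mentioned above can do, the helper below spawns /bin/echo with its stdout redirected into a pipe and collects the output in the parent. The helper name spawn_echo_captured() and the choice of /bin/echo are made up for illustration; only the posix_spawn_file_actions_*() calls, posix_spawn(), pipe(), read() and waitpid() are standard interfaces. The attributes argument stays NULL here.

```c
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper (not part of the patch): run "/bin/echo <word>" with
 * stdout redirected into a pipe and copy its output into buf, NUL terminated.
 * Returns 0 on success, an errno value if setup or spawning failed, and -1
 * if the child did not exit with status 0.
 */
int
spawn_echo_captured(const char *word, char *buf, size_t buflen)
{
	posix_spawn_file_actions_t fa;
	char *const cav[3] = { "echo", (char *)word, NULL };
	int pfd[2], err, status;
	size_t used = 0;
	ssize_t n;
	pid_t child;

	if (buf == NULL || buflen == 0)
		return EINVAL;
	if (pipe(pfd) == -1)
		return errno;

	err = posix_spawn_file_actions_init(&fa);
	if (err != 0) {
		close(pfd[0]);
		close(pfd[1]);
		return err;
	}

	/* In the child: make the pipe's write end the new stdout ... */
	err = posix_spawn_file_actions_adddup2(&fa, pfd[1], STDOUT_FILENO);
	/* ... and close both original pipe descriptors afterwards. */
	if (err == 0)
		err = posix_spawn_file_actions_addclose(&fa, pfd[0]);
	if (err == 0)
		err = posix_spawn_file_actions_addclose(&fa, pfd[1]);
	if (err == 0)
		err = posix_spawn(&child, "/bin/echo", &fa, NULL, cav, environ);

	posix_spawn_file_actions_destroy(&fa);
	close(pfd[1]);		/* the parent only keeps the read end */
	if (err != 0) {
		close(pfd[0]);
		return err;
	}

	/* Collect whatever the child printed, leaving room for the NUL. */
	while (used < buflen - 1 &&
	    (n = read(pfd[0], buf + used, buflen - 1 - used)) > 0)
		used += (size_t)n;
	buf[used] = '\0';
	close(pfd[0]);

	if (waitpid(child, &status, 0) == -1)
		return errno;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A caller would do something like spawn_echo_captured("hello", buf, sizeof(buf)) and then look at buf. With fork/exec the dup2() and close() calls would have been made by hand in the child; here they are only described in the file actions object and carried out on the child's behalf.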
Basically it breaks down what used to be done in execve1() in NetBSD into two parts: execve_loadvm(), which handles all the VM space setup, and execve_runproc(), which deals with the exec part. Therefore the patch turns execve1() into:

     int
    +execve1(struct lwp *l, const char *path, char * const *args,
    +    char * const *envs, execve_fetch_element_t fetch_element)
    +{
    +        struct execve_data data;
    +        int error;
    +
    +        error = execve_loadvm(l, path, args, envs, fetch_element, &data);
    +        if (error)
    +                return error;
    +        error = execve_runproc(l, &data);
    +        return error;
    +}

The struct execve_data holds everything that used to be common local variables in execve1. It now needs to be stored explicitly in a common structure, as posix_spawn() will do the execve_loadvm() part in the parent process, but the execve_runproc() part inside the freshly created lwp.

So far this was all easy. There were a few unexpected stragglers, since we now create a fresh new VM space instead of a full-grown clone of the old one; basically a few more checks for NULL here and there. The only prominent one is in elf_load_file, where we need to decide between topdown and bottomup memory layout. This used to clone the setting from the parent VM space, but with posix_spawn we do not have a relevant old VM space around, so we just go with the default:

    +        if (p->p_vmspace)
    +                use_topdown = p->p_vmspace->vm_map.flags & VM_MAP_TOPDOWN;
    +        else
    +#ifdef __USING_TOPDOWN_VM
    +                use_topdown = true;
    +#else
    +                use_topdown = false;
    +#endif
    +

This could be considered a hack and I'm open to better suggestions.

Now what caused all the intrusiveness of the patch? Since posix_spawn, before doing the exec part, needs to manipulate file handles for the new process as requested by the user, we need to pass a "lwp" argument to a few of the already existing file descriptor manipulating functions.
This is the change to the header:

    Index: sys/sys/filedesc.h
    ===================================================================
    RCS file: /cvsroot/src/sys/sys/filedesc.h,v
    retrieving revision 1.61
    diff -c -u -p -r1.61 filedesc.h
    --- sys/sys/filedesc.h 26 Jun 2011 16:43:12 -0000 1.61
    +++ sys/sys/filedesc.h 18 Dec 2011 23:41:37 -0000
    @@ -181,10 +181,11 @@ struct proc;
      * Kernel global variables and routines.
      */
     void fd_sys_init(void);
    -int fd_dupopen(int, int *, int, int);
    +int fd_open(lwp_t *, const char *, int, int, int *);
    +int fd_dupopen(lwp_t *, int, int *, int, int);
     int fd_alloc(struct proc *, int, int *);
     void fd_tryexpand(struct proc *);
    -int fd_allocfile(file_t **, int *);
    +int fd_allocfile(lwp_t *, file_t **, int *);
     void fd_affix(struct proc *, file_t *, unsigned);
     void fd_abort(struct proc *, file_t *, unsigned);
     filedesc_t *fd_copy(void);
    @@ -192,19 +193,19 @@ filedesc_t *fd_init(filedesc_t *);
     void fd_share(proc_t *);
     void fd_hold(lwp_t *);
     void fd_free(void);
    -void fd_closeexec(void);
    +void fd_closeexec(lwp_t *);
     void fd_ktrexecfd(void);
     int fd_checkstd(void);
    -file_t *fd_getfile(unsigned);
    +file_t *fd_getfile(lwp_t *, unsigned);
     file_t *fd_getfile2(proc_t *, unsigned);
    -void fd_putfile(unsigned);
    +void fd_putfile(lwp_t *, unsigned);
     int fd_getvnode(unsigned, file_t **);
     int fd_getsock(unsigned, struct socket **);
     void fd_putvnode(unsigned);
     void fd_putsock(unsigned);
    -int fd_close(unsigned);
    -int fd_dup(file_t *, int, int *, bool);
    -int fd_dup2(file_t *, unsigned, int);
    +int fd_close(lwp_t *, unsigned);
    +int fd_dup(lwp_t *, file_t *, int, int *, bool);
    +int fd_dup2(lwp_t *, file_t *, unsigned, int);
     int fd_clone(file_t *, unsigned, int, const struct fileops *, void *);
     void fd_set_exclose(struct lwp *, int, bool);
     int pipe1(struct lwp *, register_t *, int);

And this made it all really nasty. Every emulation, everything touching file handles in the kernel, had to be adjusted.
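
To give an idea of why the extra argument matters, here is a toy userland model of the old and new calling conventions. It is only a sketch: all toy_* names are invented, the per-lwp descriptor array is a gross simplification of the real filedesc handling, and none of this is code from the patch.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_NFILES 8

/* Toy stand-ins; the real file_t and lwp_t carry far more state. */
typedef struct toy_file {
	int f_count;			/* reference count, toy only */
} toy_file_t;

typedef struct toy_lwp {
	toy_file_t *l_ofiles[TOY_NFILES];	/* toy descriptor table */
} toy_lwp_t;

/* Stand-in for the kernel's notion of "the calling lwp". */
toy_lwp_t *toy_curlwp;

/* Old convention: can only ever look at the caller's own descriptors. */
toy_file_t *
toy_fd_getfile_old(unsigned fd)
{
	if (toy_curlwp == NULL || fd >= TOY_NFILES)
		return NULL;
	return toy_curlwp->l_ofiles[fd];
}

/* New convention: the owner of the descriptors is passed in explicitly. */
toy_file_t *
toy_fd_getfile(toy_lwp_t *l, unsigned fd)
{
	if (l == NULL || fd >= TOY_NFILES)
		return NULL;
	return l->l_ofiles[fd];
}

/* With an explicit lwp, a descriptor of some other lwp can be closed. */
int
toy_fd_close(toy_lwp_t *l, unsigned fd)
{
	toy_file_t *fp = toy_fd_getfile(l, fd);

	if (fp == NULL)
		return EBADF;
	fp->f_count--;
	l->l_ofiles[fd] = NULL;
	return 0;
}
```

In this model, code running as the parent (toy_curlwp) can call toy_fd_close(&child, 3) on behalf of a child it has just set up, which the old single-argument form has no way to express. Existing callers that only care about their own descriptors just gain one extra argument.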

The code changes were mostly mechanical and trivial, but I am sure there will be fallout. We will fix it post commit ASAP, of course.

There are a few test cases for the new syscalls, but they are not yet included in the files below. Cleanup and ATF-ification of the test cases is not fully done yet.

Benchmark results are being prepared. Preliminary results show no noticeable performance difference (any difference would have to be blamed on the split of execve1 into two functions). I believe the final benchmark results will show no effect here either, but we will check before committing.

You can find the new files needed for this at:

    ftp.netbsd.org:/pub/NetBSD/misc/martin/posix_spawn/newfiles_20111219.tar.gz

and the diff for the existing files at:

    ftp.netbsd.org:/pub/NetBSD/misc/martin/posix_spawn/posix_spawn_20111219.diff

So, please have a look, all comments welcome.

Martin