[bcc tech-userlevel; followups to tech-kern, cc me; question carried over from PR kern/59081: Add close_range() system call]
Linux and FreeBSD have adopted a syscall close_range(min, max, flags)
that, depending on flags, either closes, marks cloexec, or `unshares'
every file descriptor d with min <= d <= max.  closefrom(d) is
equivalent to close_range(d, UINT_MAX, 0).

Should we adopt this syscall in NetBSD?

I would expect most use-cases to happen between fork and exec, just
like closefrom(), and the second argument will almost always be
UINT_MAX anyway.  (I'll ignore `unshare' for now, which does not really
have first-class semantics in NetBSD anyway.)

So why would you want to use close_range instead of closefrom?

Suppose you want to create a process with a specific fd mapping.  It
is not necessarily contiguous: for example, with librumphijack, we
deliberately use two separate ranges of file descriptors, one for
`host' fds (e.g., the socket to talk to the rump server) and one for
`rump' fds (interpreted by the rump server); these are separated by a
large number to reduce the chance of collision.  So, the fd mapping
might look like this:

	parent              child
	------              -----
	0 (stdin)           0 (stdin)
	3 (output file)     1 (stdout)
	3 (output file)     2 (stderr)
	4 (rump socket)     65536

This shape of mapping is, really, the right interface for a program
running a subprocess, and I was always disappointed that
posix_spawn(2) had an imperative sequence of open/dup2/close actions
instead of a declarative mapping.

How do you effect this mapping?  With closefrom(2), you might do
something like this:

	bitmap_t keepopen = {0}
	int maxfd = -1

	for (entry in map) {
		bitmap_set(&keepopen, entry.child)
		/* Track the highest fd to keep, even if it stays in place. */
		maxfd = MAX(maxfd, entry.child)
		if (entry.child == entry.parent)
			continue
		/* If target entry.child is needed as a source, dup. */
		for (entry1 in map) {
			if (entry.child == entry1.parent)
				entry1.parent = dup(entry1.parent)
		}
		dup2(entry.parent, entry.child)
	}
	for (fd = 0; fd < maxfd; fd++) {
		if (!bitmap_isset(&keepopen, fd))
			close(fd)
	}
	closefrom(maxfd + 1)

With close_range(2), you can instead do:

	close_range(0, UINT_MAX, CLOSE_RANGE_CLOEXEC)
	for (entry in map) {
		if (entry.child == entry.parent)
			goto nixcloexec
		/* If target entry.child is needed as a source, dup. */
		for (entry1 in map) {
			if (entry.child == entry1.parent)
				entry1.parent = dup_cloexec(entry.child)
		}
		dup2(entry.parent, entry.child)
	nixcloexec:
		/* Clear FD_CLOEXEC, i.e., keep it open on exec. */
		fcntl(entry.child, F_SETFD,
		    fcntl(entry.child, F_GETFD) & ~FD_CLOEXEC)
	}

(The inner loop could be eliminated, of course, by first indexing the
parent sources in linear time and then updating a parent->replacement
map as we go, so the whole thing runs in linear rather than quadratic
time and never dups the same source repeatedly.  But this is the same
for both algorithms; it doesn't distinguish closefrom(2) from
close_range(2).)

Here's an example of the second algorithm in the real world:

https://github.com/GNOME/vte/blob/b23aaaeeca588439d4579f4ed06c1f4850219fc5/src/spawn.cc#L68-L86
https://github.com/GNOME/vte/blob/b23aaaeeca588439d4579f4ed06c1f4850219fc5/src/spawn.cc#L380-L385
https://github.com/GNOME/vte/blob/b23aaaeeca588439d4579f4ed06c1f4850219fc5/src/spawn.cc#L437-L505

One advantage of the second algorithm with close_range(2) is that it
doesn't require computing any auxiliary data structure for a
(potentially sparse) bit map in userland, and doesn't require userland
to iterate over a (potentially large and sparse) range of file
descriptors below the first one passed to closefrom(2).

One advantage of the first algorithm with closefrom(2) is that it
makes only one traversal over the whole fd table (userland loop +
closefrom), while the second algorithm with close_range(2) makes two --
close_range(2) traverses it once to set CLOEXEC, and then in the
subsequent exec, the kernel traverses it once more to interpret
CLOEXEC.  Maybe the kernel traversal is cheaper so that doesn't
matter.
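
For concreteness, here is a minimal C sketch of the second algorithm, under stated assumptions rather than as a definitive implementation. The names struct fd_map_entry, mark_all_cloexec, clear_cloexec, apply_fd_map, and FALLBACK_SCAN_LIMIT are invented for illustration. fcntl(F_DUPFD_CLOEXEC) stands in for the dup_cloexec of the pseudocode. Where close_range(2) with CLOSE_RANGE_CLOEXEC is not available or fails (as on NetBSD today), the sketch falls back to a bounded userland scan, which is exactly the cost close_range(2) is meant to avoid. It also assumes that no child fd appears twice in the map, and it only redirects sources of entries that have not been processed yet.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc: expose close_range() and CLOSE_RANGE_CLOEXEC */
#endif
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical map entry: fd `parent' here becomes fd `child' after exec. */
struct fd_map_entry {
	int parent;
	int child;
};

/* Assumed cap on the userland fallback scan; fds above it are not marked. */
#define FALLBACK_SCAN_LIMIT 65536L

/* Mark every open fd close-on-exec, preferably with one syscall. */
static int
mark_all_cloexec(void)
{
	long fd, maxfd;
	int flags;

#ifdef CLOSE_RANGE_CLOEXEC
	if (close_range(0, UINT_MAX, CLOSE_RANGE_CLOEXEC) == 0)
		return 0;
	/* Kernel lacks the call or the flag: fall through to the slow path. */
#endif
	maxfd = sysconf(_SC_OPEN_MAX);
	if (maxfd < 0 || maxfd > FALLBACK_SCAN_LIMIT)
		maxfd = FALLBACK_SCAN_LIMIT;
	for (fd = 0; fd < maxfd; fd++) {
		flags = fcntl((int)fd, F_GETFD);	/* -1 if fd is not open */
		if (flags != -1 &&
		    fcntl((int)fd, F_SETFD, flags | FD_CLOEXEC) == -1)
			return -1;
	}
	return 0;
}

/* Clear FD_CLOEXEC, i.e., keep fd open across exec. */
static int
clear_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Apply the mapping in place; map[].parent may be rewritten to temporaries. */
static int
apply_fd_map(struct fd_map_entry *map, size_t n)
{
	size_t i, j;
	int moved;

	assert(map != NULL || n == 0);
	if (mark_all_cloexec() == -1)
		return -1;
	for (i = 0; i < n; i++) {
		if (map[i].child != map[i].parent) {
			/*
			 * If the target is still needed as a source by a
			 * later entry, move that source to a fresh cloexec
			 * fd first (dup'ing it at most once).
			 */
			moved = -1;
			for (j = i + 1; j < n; j++) {
				if (map[j].parent != map[i].child)
					continue;
				if (moved == -1) {
					moved = fcntl(map[i].child,
					    F_DUPFD_CLOEXEC, 0);
					if (moved == -1)
						return -1;
				}
				map[j].parent = moved;
			}
			if (dup2(map[i].parent, map[i].child) == -1)
				return -1;
		}
		if (clear_cloexec(map[i].child) == -1)
			return -1;
	}
	return 0;
}
```

A caller would run apply_fd_map() in the child between fork and exec. The moved temporaries and every fd not named as a child stay close-on-exec, so the exec itself does the closing.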