[RFC/PATCH 2/5] revoke: core code
From: Pekka Enberg [EMAIL PROTECTED]

The revokeat(2) and frevoke(2) system calls invalidate open file
descriptors and shared mappings of an inode. After a successful
revocation, operations on file descriptors fail with the EBADF or ENXIO
error code for regular and device files, respectively. Attempting to read
from or write to a revoked mapping causes SIGBUS.

The actual operation is done in two passes:

 1. Revoke all file descriptors that point to the given inode. We do
    this under tasklist_lock so that after this pass, we don't need
    to worry about racing with close(2) or dup(2).

 2. Take down shared memory mappings of the inode and close all file
    pointers.

The file descriptors and memory mapping ranges are preserved until the
owning task does close(2) and munmap(2), respectively.

You use revoke() (with chown, for example) to gain exclusive access to an
inode that might be in use by other processes. This means that we must
make sure that:

  - operations on opened file descriptors pointing to that inode fail
  - there are no shared mappings visible to other processes
  - in-progress system calls are either completed (writes) or aborted
    (reads)

After the revoke() system call returns, you are guaranteed to have revoked
access to an inode for any processes that had access to it when you
started the operation. The caller is responsible for blocking any future
open(2) calls that might occur while revoke() takes care of fork(2) and
dup(2) during the operation.
Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 fs/Makefile                  |    1
 fs/revoke.c                  |  777 +++
 fs/revoked_inode.c           |  417 +++
 include/linux/fs.h           |    8
 include/linux/magic.h        |    1
 include/linux/mm.h           |    1
 include/linux/revoked_fs_i.h |   18
 include/linux/syscalls.h     |    3
 mm/mmap.c                    |   11
 9 files changed, 1237 insertions(+)

Index: 2.6/fs/Makefile
===
--- 2.6.orig/fs/Makefile	2007-05-21 15:38:14.0 +0300
+++ 2.6/fs/Makefile	2007-07-11 11:48:35.0 +0300
@@ -19,6 +19,7 @@ else
 obj-y +=	no-block.o
 endif
 
+obj-$(CONFIG_MMU)		+= revoke.o revoked_inode.o
 obj-$(CONFIG_INOTIFY)		+= inotify.o
 obj-$(CONFIG_INOTIFY_USER)	+= inotify_user.o
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
Index: 2.6/fs/revoke.c
===
--- /dev/null	1970-01-01 00:00:00.0 +
+++ 2.6/fs/revoke.c	2007-07-11 11:48:35.0 +0300
@@ -0,0 +1,777 @@
+/*
+ * fs/revoke.c - Invalidate all current open file descriptors of an inode.
+ *
+ * Copyright (C) 2006-2007  Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/magic.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/sched.h>
+#include <linux/revoked_fs_i.h>
+#include <linux/syscalls.h>
+
+/**
+ * fileset - an array of file pointers.
+ * @files:	the array of file pointers
+ * @nr:		number of elements in the array
+ * @end:	index to next unused file pointer
+ */
+struct fileset {
+	struct file	**files;
+	unsigned long	nr;
+	unsigned long	end;
+};
+
+/**
+ * revoke_details - details of the revoke operation
+ * @inode:		invalidate open file descriptors of this inode
+ * @fset:		set of files that point to a revoked inode
+ * @restore_start:	index to the first file pointer that is currently in
+ *			use by a file descriptor but the real file has not
+ *			been revoked
+ */
+struct revoke_details {
+	struct fileset *fset;
+	unsigned long restore_start;
+};
+
+static struct kmem_cache *revokefs_inode_cache;
+
+static inline bool fset_is_full(struct fileset *set)
+{
+	return set->nr == set->end;
+}
+
+static inline struct file *fset_get_filp(struct fileset *set)
+{
+	return set->files[set->end++];
+}
+
+static struct fileset *alloc_fset(unsigned long size)
+{
+	struct fileset *fset;
+
+	fset = kzalloc(sizeof *fset, GFP_KERNEL);
+	if (!fset)
+		return NULL;
+
+	fset->files = kcalloc(size, sizeof(struct file *), GFP_KERNEL);
+	if (!fset->files) {
+		kfree(fset);
+		return NULL;
+	}
+	fset->nr = size;
+	return fset;
+}
+
+static void free_fset(struct fileset *fset)
+{
+	int i;
+
+	for (i = fset->end; i < fset->nr; i++)
+		fput(fset->files[i]);
+
+	kfree(fset->files);
+	kfree(fset);
+}
+
+/*
+ * Revoked file descriptors point to inodes in the revokefs filesystem.
+ */
+static struct vfsmount *revokefs_mnt;
+
+static struct file *get_revoked_file(void)
+{
+	struct dentry
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
> From: Pekka Enberg [EMAIL PROTECTED]
>
> The revokeat(2) and frevoke(2) system calls invalidate open file
> descriptors and shared mappings of an inode. After a successful
> revocation, operations on file descriptors fail with the EBADF or ENXIO
> error code for regular and device files, respectively. Attempting to read
> from or write to a revoked mapping causes SIGBUS.
>
> The actual operation is done in two passes:
>
>  1. Revoke all file descriptors that point to the given inode. We do
>     this under tasklist_lock so that after this pass, we don't need
>     to worry about racing with close(2) or dup(2).

How does that deal with the following:
	task A gets its descriptor table cleansed
	task B sends a descriptor to task A via SCM_RIGHTS datagram
	task B gets its descriptor table cleansed
	task A receives the datagram and gets the descriptor installed
	task A does any kind of IO on that descriptor
	->f_mapping gets replaced at the most inconvenient time.

Come to think of that, what do you do if I create a socketpair, stuff the
descriptor into SCM_RIGHTS datagram and send it to myself? Then receive
it a day after you've called revoke() and voila - I've got an almost
undamaged struct file back... At the very least, it allows me to
fchmod(). Or fchdir() if that had been a directory...

BTW, read() or write() in progress might get rather unhappy if your
live replacement of ->f_mapping races with them...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
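The interleaving in the first scenario can be replayed as a single-threaded toy model. This is only a sketch under stated assumptions: `toy_file`, `toy_task`, `toy_cleanse()` and `scm_rights_race_demo()` are made-up names standing in for struct file, a task's descriptor table and the first revoke pass. None of it is kernel code, and the "datagram" is just a pointer held outside both tables.

```c
#include <stddef.h>

#define TOY_NFDS 4

/* Hypothetical stand-in for struct file. */
struct toy_file {
	int id;
};

/* Hypothetical stand-in for a task and its descriptor table. */
struct toy_task {
	struct toy_file *fds[TOY_NFDS];
};

/* Model of pass 1 for one task: drop every slot that refers to target. */
static void toy_cleanse(struct toy_task *t, const struct toy_file *target)
{
	size_t i;

	for (i = 0; i < TOY_NFDS; i++)
		if (t->fds[i] == target)
			t->fds[i] = NULL;
}

/* Does any slot of this task still refer to f? */
static int toy_holds(const struct toy_task *t, const struct toy_file *f)
{
	size_t i;

	for (i = 0; i < TOY_NFDS; i++)
		if (t->fds[i] == f)
			return 1;
	return 0;
}

/* Returns 1 if task A ends up holding the file after the whole pass. */
int scm_rights_race_demo(void)
{
	struct toy_file file = { 1 };
	struct toy_task a = { { NULL } }, b = { { NULL } };
	struct toy_file *in_flight;

	b.fds[0] = &file;		/* B starts with the only descriptor */

	toy_cleanse(&a, &file);		/* task A gets its table cleansed */
	in_flight = b.fds[0];		/* B sends the descriptor to A */
	toy_cleanse(&b, &file);		/* task B gets its table cleansed */
	a.fds[0] = in_flight;		/* A receives it and installs it */

	return toy_holds(&a, &file) && !toy_holds(&b, &file);
}
```

Calling `scm_rights_race_demo()` from a small `main()` returns 1: the table walk visited both tasks, yet the reference that was in flight between the two visits ends up installed in a table the walk had already finished with.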
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote:
> On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
> > From: Pekka Enberg [EMAIL PROTECTED]
> >
> > The revokeat(2) and frevoke(2) system calls invalidate open file
> > descriptors and shared mappings of an inode. After a successful
> > revocation, operations on file descriptors fail with the EBADF or ENXIO
> > error code for regular and device files, respectively. Attempting to read
> > from or write to a revoked mapping causes SIGBUS.
> >
> > The actual operation is done in two passes:
> >
> >  1. Revoke all file descriptors that point to the given inode. We do
> >     this under tasklist_lock so that after this pass, we don't need
> >     to worry about racing with close(2) or dup(2).
>
> How does that deal with the following:
> 	task A gets its descriptor table cleansed
> 	task B sends a descriptor to task A via SCM_RIGHTS datagram
> 	task B gets its descriptor table cleansed
> 	task A receives the datagram and gets the descriptor installed
> 	task A does any kind of IO on that descriptor
> 	->f_mapping gets replaced at the most inconvenient time.
>
> Come to think of that, what do you do if I create a socketpair, stuff the
> descriptor into SCM_RIGHTS datagram and send it to myself? Then receive
> it a day after you've called revoke() and voila - I've got an almost
> undamaged struct file back... At the very least, it allows me to
> fchmod(). Or fchdir() if that had been a directory...
>
> BTW, read() or write() in progress might get rather unhappy if your
> live replacement of ->f_mapping races with them...

Better: I have the only opened descriptor for foo. I send it to myself
as described above. I close it. revoke() is called, finds no opened
instances of foo in any descriptor tables and cheerfully does nothing.
I call recvmsg() and I have completely undamaged opened file back.
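The send-to-yourself trick is easy to reproduce from userspace with standard calls (socketpair(2), sendmsg(2), recvmsg(2) and the cmsg(3) macros). The following is a minimal sketch, not code from the patch: the helper names `send_fd()`, `recv_fd()` and `park_and_recover()` and the temporary file path are assumptions made for illustration. It shows that between close() and recvmsg() no descriptor table refers to the file, and that the descriptor that comes back is the same open file, with its offset intact.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for mkstemp() and pread() declarations */
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over UNIX socket sock; returns 0 on success. */
static int send_fd(int sock, int fd)
{
	char byte = 'x';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns it, or -1 on failure. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Returns 0 if the received descriptor is the same open file, else -1.
 * Error paths leak descriptors for brevity. */
int park_and_recover(void)
{
	char path[] = "/tmp/revoke-demo-XXXXXX";	/* assumed scratch path */
	char buf[4] = { 0 };
	int sv[2], fd, back, ok;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0 ||
	    write(fd, "foo", 3) != 3 || send_fd(sv[0], fd) < 0)
		return -1;

	close(fd);	/* no descriptor table refers to the file now */
	back = recv_fd(sv[1]);
	if (back < 0)
		return -1;

	ok = lseek(back, 0, SEEK_CUR) == 3 &&	/* file offset survived */
	     pread(back, buf, 3, 0) == 3 && memcmp(buf, "foo", 3) == 0;
	close(back);
	close(sv[0]);
	close(sv[1]);
	return ok ? 0 : -1;
}
```

A walk over descriptor tables performed between the `close(fd)` and the `recv_fd()` call would have nothing to find, because the only reference is held by the queued datagram.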
Re: [RFC/PATCH 2/5] revoke: core code
Hi Al,

On Wed, 11 Jul 2007, Al Viro wrote:
> Better: I have the only opened descriptor for foo. I send it to myself
> as described above. I close it. revoke() is called, finds no opened
> instances of foo in any descriptor tables and cheerfully does nothing.
> I call recvmsg() and I have completely undamaged opened file back.

Uhm, nice. So, revoke() needs a proper inode -> struct files mapping
somewhere. Can we add a list of files to struct inode? Are there other
cases where a file can point to an inode but the file is not attached to
any file descriptor?

			Pekka
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, 11 Jul 2007, Al Viro wrote:
> BTW, read() or write() in progress might get rather unhappy if your
> live replacement of ->f_mapping races with them...

For writes, we (1) never start any new operations after we've cleaned up
the file descriptor tables so (2) after we're done with do_fsync() we
never touch ->f_mapping again. But for reads, I think there's a problem:
if we're in do_generic_mapping_read(), doing invalidate_inode_pages2() is
not enough because we're hanging on to the real mapping. Hmm.

			Pekka
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote:
> Hi Al,
>
> On Wed, 11 Jul 2007, Al Viro wrote:
> > Better: I have the only opened descriptor for foo. I send it to myself
> > as described above. I close it. revoke() is called, finds no opened
> > instances of foo in any descriptor tables and cheerfully does nothing.
> > I call recvmsg() and I have completely undamaged opened file back.
>
> Uhm, nice. So, revoke() needs a proper inode -> struct files mapping
> somewhere. Can we add a list of files to struct inode? Are there other
> cases where a file can point to an inode but the file is not attached to
> any file descriptor?

Umm... Any number, really - it might be in the middle of syscall while
another task sharing descriptor table has closed the descriptor. Then
there's quota, then there's process accounting, then there's execve()
in progress, then there's knfsd working with that struct file, etc.

The fundamental issue here is that even if you do find struct file, you
can't blindly rip its ->f_mapping since it can be in the middle of
->read(), ->write(), pageout, etc. And even if you do manage that, you
still have the ability to do fchmod() later.

I don't see how the ability to find all instances in SCM_RIGHTS datagrams
(for example) will help you with the race I've described first. Original
state: task B has the only reference to file. revoke() is called, passes
task A. B sends datagram to A and closes file. A receives datagram. Now
the only reference is in A's table and you've already passed that.

So you can't avoid processes keeping pointers to struct file. If you
could find all struct file over given inode (which, I suspect, will lead
to interesting locking), you could call something on that struct file,
but you'd have zero exclusion with processes calling methods on it.
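One more case that is visible from plain userspace, and is not on the list above, is a shared mapping: mmap(2) keeps its own reference to the open file, so the file stays in use after the descriptor is closed. The sketch below is an illustration under assumptions (the function name `mapping_outlives_descriptor()` and the scratch path are invented here); it stores through a mapping after close() and then checks the file contents through an independent open().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for the mkstemp() declaration */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a store made through the mapping after close() reached
 * the file, -1 otherwise. */
int mapping_outlives_descriptor(void)
{
	char path[] = "/tmp/revoke-map-XXXXXX";	/* assumed scratch path */
	char buf[4] = { 0 };
	char *map;
	int fd, check, ok = 0;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (write(fd, "data", 4) != 4) {
		close(fd);
		unlink(path);
		return -1;
	}

	map = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the descriptor is gone, the mapping is not */
	if (map == MAP_FAILED) {
		unlink(path);
		return -1;
	}

	map[0] = 'D';			/* store through the mapping */
	msync(map, 4, MS_SYNC);		/* push the change to the file */

	check = open(path, O_RDONLY);	/* independent look at the inode */
	if (check >= 0) {
		ok = read(check, buf, 4) == 4 &&
		     memcmp(buf, "Data", 4) == 0;
		close(check);
	}

	munmap(map, 4);
	unlink(path);
	return ok ? 0 : -1;
}
```

This is the reason the patch description treats shared mappings separately from descriptors: after the `close(fd)` there is nothing left in the descriptor table to revoke, yet the process can still modify the inode's data.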
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote:
> On Wed, 11 Jul 2007, Al Viro wrote:
> > BTW, read() or write() in progress might get rather unhappy if your
> > live replacement of ->f_mapping races with them...
>
> For writes, we (1) never start any new operations after we've cleaned up
> the file descriptor tables so (2) after we're done with do_fsync() we
> never touch ->f_mapping again.

Er, no. do_fsync() won't hit the sys_write() that is yet to enter
->write(). And you can't get rid of new callers _anyway_ (see previous
mail).
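The ordering problem with the late writer can be shown with a deterministic toy model. Everything here is hypothetical: `model_file`, `model_fdtable` and the `model_*()` helpers are invented stand-ins for struct file, the descriptor table, the table cleanup, do_fsync() and ->write(). It is a sketch of the sequence of events only, not of the kernel's data structures or locking.

```c
#include <stddef.h>

#define MODEL_NFDS 4

/* Hypothetical stand-in for struct file. */
struct model_file {
	long dirty;	/* writes that have not been flushed yet */
};

/* Hypothetical stand-in for a descriptor table. */
struct model_fdtable {
	struct model_file *slot[MODEL_NFDS];
};

/* Look up a descriptor, as a syscall does before calling into the file. */
static struct model_file *model_fget(struct model_fdtable *t, int fd)
{
	return t->slot[fd];
}

/* Model of the table cleanup: remove every descriptor for target. */
static void model_cleanse(struct model_fdtable *t, struct model_file *target)
{
	size_t i;

	for (i = 0; i < MODEL_NFDS; i++)
		if (t->slot[i] == target)
			t->slot[i] = NULL;
}

/* Stand-in for do_fsync(): flush whatever is dirty right now. */
static void model_fsync(struct model_file *f)
{
	f->dirty = 0;
}

/* Stand-in for ->write(): dirties the file. */
static void model_write(struct model_file *f)
{
	f->dirty++;
}

/* Returns the number of unflushed writes left after the "revoke". */
long late_writer_demo(void)
{
	struct model_file file = { 0 };
	struct model_fdtable table = { { &file } };
	struct model_file *held;

	held = model_fget(&table, 0);	/* writer has looked the file up ... */
	model_cleanse(&table, &file);	/* revoke: new lookups now fail */
	model_fsync(&file);		/* revoke: flush and assume we are done */
	model_write(held);		/* ... and only now enters ->write() */

	return file.dirty;
}
```

`late_writer_demo()` returns 1: cleaning the table stops new lookups, but a caller that obtained its pointer before the cleanup still writes after the flush has completed.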
Re: [RFC/PATCH 2/5] revoke: core code
Hi,

On Wed, 11 Jul 2007, Al Viro wrote:
> The fundamental issue here is that even if you do find struct file, you
> can't blindly rip its ->f_mapping since it can be in the middle of
> ->read(), ->write(), pageout, etc. And even if you do manage that, you
> still have the ability to do fchmod() later.

Then we would need to change the VFS and relevant parts so that we can
take down ->f_mapping. I don't see how we could do that without affecting
current hotpaths. Hmm.

I suppose what we really need to do is cannibalize the actual inode
(remove from inode cache, detach from dentry and take down the mapping)
so that we don't have to touch existing struct file pointers at all.

			Pekka