[RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg
From: Pekka Enberg [EMAIL PROTECTED]

The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
and shared mappings of an inode.  After a successful revocation, operations
on file descriptors fail with the EBADF or ENXIO error code for regular and
device files, respectively.  Attempting to read from or write to a revoked
mapping causes SIGBUS.

The actual operation is done in two passes:

 1. Revoke all file descriptors that point to the given inode. We do
    this under tasklist_lock so that after this pass, we don't need
    to worry about racing with close(2) or dup(2).

 2. Take down shared memory mappings of the inode and close all file
    pointers.

The file descriptors and memory mapping ranges are preserved until the
owning task does close(2) and munmap(2), respectively.


You use revoke() (with chown, for example) to gain exclusive access to
an inode that might be in use by other processes. This means that we must
make sure that:

  - operations on opened file descriptors pointing to that inode fail
  - there are no shared mappings visible to other processes
  - in-progress system calls are either completed (writes) or aborted
    (reads)

After the revoke() system call returns, you are guaranteed to have revoked
access to the inode for any processes that had access to it when you
started the operation. The caller is responsible for blocking any future
open(2) calls; revoke() itself takes care of fork(2) and dup(2) calls that
happen during the operation.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---

 fs/Makefile                  |    1
 fs/revoke.c                  |  777 +++
 fs/revoked_inode.c           |  417 +++
 include/linux/fs.h           |    8
 include/linux/magic.h        |    1
 include/linux/mm.h           |    1
 include/linux/revoked_fs_i.h |   18
 include/linux/syscalls.h     |    3
 mm/mmap.c                    |   11
 9 files changed, 1237 insertions(+)

Index: 2.6/fs/Makefile
===
--- 2.6.orig/fs/Makefile 2007-05-21 15:38:14.0 +0300
+++ 2.6/fs/Makefile 2007-07-11 11:48:35.0 +0300
@@ -19,6 +19,7 @@ else
 obj-y +=   no-block.o
 endif
 
+obj-$(CONFIG_MMU)		+= revoke.o revoked_inode.o
 obj-$(CONFIG_INOTIFY)		+= inotify.o
 obj-$(CONFIG_INOTIFY_USER)	+= inotify_user.o
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
Index: 2.6/fs/revoke.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ 2.6/fs/revoke.c 2007-07-11 11:48:35.0 +0300
@@ -0,0 +1,777 @@
+/*
+ * fs/revoke.c - Invalidate all current open file descriptors of an inode.
+ *
+ * Copyright (C) 2006-2007  Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/magic.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/sched.h>
+#include <linux/revoked_fs_i.h>
+#include <linux/syscalls.h>
+
+/**
+ * fileset - an array of file pointers.
+ * @files:	the array of file pointers
+ * @nr:		number of elements in the array
+ * @end:	index to next unused file pointer
+ */
+struct fileset {
+	struct file	**files;
+	unsigned long	nr;
+	unsigned long	end;
+};
+
+/**
+ * revoke_details - details of the revoke operation
+ * @inode:		invalidate open file descriptors of this inode
+ * @fset:		set of files that point to a revoked inode
+ * @restore_start:	index to the first file pointer that is currently in
+ *			use by a file descriptor but the real file has not
+ *			been revoked
+ */
+struct revoke_details {
+	struct fileset	*fset;
+	unsigned long	restore_start;
+};
+
+static struct kmem_cache *revokefs_inode_cache;
+
+static inline bool fset_is_full(struct fileset *set)
+{
+	return set->nr == set->end;
+}
+
+static inline struct file *fset_get_filp(struct fileset *set)
+{
+	return set->files[set->end++];
+}
+
+static struct fileset *alloc_fset(unsigned long size)
+{
+	struct fileset *fset;
+
+	fset = kzalloc(sizeof *fset, GFP_KERNEL);
+	if (!fset)
+		return NULL;
+
+	fset->files = kcalloc(size, sizeof(struct file *), GFP_KERNEL);
+	if (!fset->files) {
+		kfree(fset);
+		return NULL;
+	}
+	fset->nr = size;
+	return fset;
+}
+
+static void free_fset(struct fileset *fset)
+{
+	int i;
+
+	for (i = fset->end; i < fset->nr; i++)
+		fput(fset->files[i]);
+
+	kfree(fset->files);
+	kfree(fset);
+}
+
+/*
+ * Revoked file descriptors point to inodes in the revokefs filesystem.
+ */
+static struct vfsmount *revokefs_mnt;
+
+static struct file *get_revoked_file(void)
+{
+   struct dentry 

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
> From: Pekka Enberg [EMAIL PROTECTED]
>
> The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
> and shared mappings of an inode.  After a successful revocation, operations
> on file descriptors fail with the EBADF or ENXIO error code for regular and
> device files, respectively.  Attempting to read from or write to a revoked
> mapping causes SIGBUS.
>
> The actual operation is done in two passes:
>
>  1. Revoke all file descriptors that point to the given inode. We do
>     this under tasklist_lock so that after this pass, we don't need
>     to worry about racing with close(2) or dup(2).

How does that deal with the following:

task A gets its descriptor table cleansed
task B sends a descriptor to task A via SCM_RIGHTS datagram
task B gets its descriptor table cleansed
task A receives the datagram and gets the descriptor installed
task A does any kind of IO on that descriptor
->f_mapping gets replaced in the most inconvenient time.

Come to think of that, what do you do if I create a socketpair,
stuff the descriptor into SCM_RIGHTS datagram and send it to
myself?  Then receive it a day after you've called revoke() and
voila - I've got an almost undamaged struct file back...  At
the very least, it allows me to fchmod().  Or fchdir() if that
had been a directory...

BTW, read() or write() in progress might get rather unhappy if your
live replacement of ->f_mapping races with them...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote:
> On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
>> From: Pekka Enberg [EMAIL PROTECTED]
>>
>> The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
>> and shared mappings of an inode.  After a successful revocation, operations
>> on file descriptors fail with the EBADF or ENXIO error code for regular and
>> device files, respectively.  Attempting to read from or write to a revoked
>> mapping causes SIGBUS.
>>
>> The actual operation is done in two passes:
>>
>>  1. Revoke all file descriptors that point to the given inode. We do
>>     this under tasklist_lock so that after this pass, we don't need
>>     to worry about racing with close(2) or dup(2).
>
> How does that deal with the following:
>
> task A gets its descriptor table cleansed
> task B sends a descriptor to task A via SCM_RIGHTS datagram
> task B gets its descriptor table cleansed
> task A receives the datagram and gets the descriptor installed
> task A does any kind of IO on that descriptor
> ->f_mapping gets replaced in the most inconvenient time.
>
> Come to think of that, what do you do if I create a socketpair,
> stuff the descriptor into SCM_RIGHTS datagram and send it to
> myself?  Then receive it a day after you've called revoke() and
> voila - I've got an almost undamaged struct file back...  At
> the very least, it allows me to fchmod().  Or fchdir() if that
> had been a directory...
>
> BTW, read() or write() in progress might get rather unhappy if your
> live replacement of ->f_mapping races with them...

Better: I have the only opened descriptor for foo.  I send it to myself
as described above.  I close it.  revoke() is called, finds no opened
instances of foo in any descriptor tables and cheerfully does nothing.
I call recvmsg() and I have completely undamaged opened file back.
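
For illustration (not part of the original message): the sequence described above can be reproduced from userspace with standard calls. The sketch below parks a descriptor in an AF_UNIX socketpair through an SCM_RIGHTS control message, closes it, and fetches it back. The helper names (`init_msg`, `send_fd`, `recv_fd`, `park_and_recover`) are made up for this example, and no revoke call is made, since revokeat(2) and frevoke(2) exist only in this patch series; a comment marks the window in which a descriptor-table walk would find nothing.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Point msg at a one-byte payload and a control buffer for one fd. */
static void init_msg(struct msghdr *msg, struct iovec *iov, char *byte,
		     union fd_cmsg *u)
{
	memset(u, 0, sizeof(*u));
	memset(msg, 0, sizeof(*msg));
	iov->iov_base = byte;
	iov->iov_len = 1;
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = u->buf;
	msg->msg_controllen = sizeof(u->buf);
}

/* Queue fd on sock inside an SCM_RIGHTS control message. */
static int send_fd(int sock, int fd)
{
	char byte = 'x';
	struct iovec iov;
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	init_msg(&msg, &iov, &byte, &u);
	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);	/* the buffer holds one header by construction */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Pull one descriptor back out of sock; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov;
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	init_msg(&msg, &iov, &byte, &u);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/*
 * Park fd in a socketpair, close it, and fetch it back later.
 * Returns the recovered descriptor, or -1 (fd is left open if the
 * send itself failed).
 */
static int park_and_recover(int fd)
{
	int sv[2], back = -1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	if (send_fd(sv[0], fd) == 0) {
		close(fd);	/* no descriptor table refers to the file now */

		/* A revoke() that only walks descriptor tables would run here. */

		back = recv_fd(sv[1]);
	}
	close(sv[0]);
	close(sv[1]);
	return back;
}
```

The descriptor returned by `park_and_recover()` refers to the same open file as the one passed in, so reads, writes, fchmod() and friends work on it exactly as before it was parked.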


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg
Hi Al,

On Wed, 11 Jul 2007, Al Viro wrote:
> Better: I have the only opened descriptor for foo.  I send it to myself
> as described above.  I close it.  revoke() is called, finds no opened
> instances of foo in any descriptor tables and cheerfully does nothing.
> I call recvmsg() and I have completely undamaged opened file back.

Uhm, nice. So, revoke() needs a proper inode -> struct files mapping
somewhere. Can we add a list of files to struct inode? Are there other 
cases where a file can point to an inode but the file is not attached to 
any file descriptor?

Pekka


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg
On Wed, 11 Jul 2007, Al Viro wrote:
> BTW, read() or write() in progress might get rather unhappy if your
> live replacement of ->f_mapping races with them...

For writes, we (1) never start any new operations after we've cleaned up
the file descriptor tables so (2) after we're done with do_fsync() we
never touch ->f_mapping again.

But for reads, I think there's a problem: if we're in
do_generic_mapping_read(), doing invalidate_inode_pages2() is not enough
because we're hanging on to the real mapping. Hmm.

Pekka


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote:
> Hi Al,
>
> On Wed, 11 Jul 2007, Al Viro wrote:
>> Better: I have the only opened descriptor for foo.  I send it to myself
>> as described above.  I close it.  revoke() is called, finds no opened
>> instances of foo in any descriptor tables and cheerfully does nothing.
>> I call recvmsg() and I have completely undamaged opened file back.
>
> Uhm, nice. So, revoke() needs a proper inode -> struct files mapping
> somewhere. Can we add a list of files to struct inode? Are there other
> cases where a file can point to an inode but the file is not attached to
> any file descriptor?

Umm...  Any number, really - it might be in the middle of syscall
while another task sharing descriptor table has closed the descriptor.
Then there's quota, then there's process accounting, then there's
execve() in progress, then there's knfsd working with that struct
file, etc.
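
For illustration (not part of the original message): one more case that is easy to reproduce from userspace is a memory mapping, which keeps the file in use after its descriptor has been closed. The sketch below maps a scratch file, closes the only descriptor, and then reads the data back through the mapping. The function name and the `/tmp` scratch path are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a scratch file, close its only descriptor, then read through the
 * mapping.  Returns 0 if the data is still reachable, -1 otherwise.
 */
static int mapping_outlives_fd(void)
{
	static const char msg[] = "still mapped";
	char path[] = "/tmp/revoke_map_XXXXXX";
	char *map;
	int fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* scratch file, no name needed */
	if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);
	close(fd);		/* the descriptor table entry is gone ... */
	if (map == MAP_FAILED)
		return -1;
	if (memcmp(map, msg, sizeof(msg)) == 0)	/* ... the file is not */
		ret = 0;
	munmap(map, sizeof(msg));
	return ret;
}
```

A caller can check the behaviour with `assert(mapping_outlives_fd() == 0);`. The mapping stays valid until munmap(2), which is why a walk over descriptor tables alone cannot see every user of the file.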

The fundamental issue here is that even if you do find struct file,
you can't blindly rip its ->f_mapping since it can be in the middle
of ->read(), ->write(), pageout, etc.  And even if you do manage
that, you still have the ability to do fchmod() later.

I don't see how the ability to find all instances in SCM_RIGHTS
datagrams (for example) will help you with the race I've described
first.  Original state: task B has the only reference to file.
revoke() is called, passes task A.  B sends datagram to A and closes
file.  A receives datagram.  Now the only reference is in A's table
and you've already passed that.

So you can't avoid processes keeping pointers to struct file.
If you could find all struct file over given inode (which, I suspect,
will lead to interesting locking), you could call something on
that struct file, but you'd have zero exclusion with processes
calling methods on it.


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote:
> On Wed, 11 Jul 2007, Al Viro wrote:
>> BTW, read() or write() in progress might get rather unhappy if your
>> live replacement of ->f_mapping races with them...
>
> For writes, we (1) never start any new operations after we've cleaned up
> the file descriptor tables so (2) after we're done with do_fsync() we
> never touch ->f_mapping again.

Er, no.  do_fsync() won't hit the sys_write() that is yet to enter
->write().  And you can't get rid of new callers _anyway_ (see
previous mail).


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg
Hi,

On Wed, 11 Jul 2007, Al Viro wrote:
> The fundamental issue here is that even if you do find struct file,
> you can't blindly rip its ->f_mapping since it can be in the middle
> of ->read(), ->write(), pageout, etc.  And even if you do manage
> that, you still have the ability to do fchmod() later.

Then we would need to change the VFS and relevant parts so that we can 
take down ->f_mapping. I don't see how we could do that without affecting
current hotpaths. Hmm. I suppose what we really need to do is cannibalize 
the actual inode (remove from inode cache, detach from dentry and take 
down the mapping) so that we don't have to touch existing struct file 
pointers at all.

Pekka