Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Andrew Morton
On Fri, 16 Mar 2007 22:37:57 +0200 "Pekka Enberg" <[EMAIL PROTECTED]> wrote:

> What we could do is add a "I am revoked" flag to struct file which
> blocks any future ->readpage, ->readpages, and ->direct_IO on the
> file. Alternatively, we could change the ->f_mapping to point to an
> address space that has "revoked address space" operations. Hmm.

iirc, that was part of the rationale for the introduction of f_mapping.

I'll cc he-who-never-replies, who did that work.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Pekka Enberg

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> However, modifying i_size like this might be a problem - the inode could be
> dirty and it'll get written to disk!  Perhaps we could change i_size_read()
> to cheat and to return zero if there's a revoke in progress.


I don't think we can actually abuse i_size_read() in any sane manner
because the inode needs to be usable for anyone who just did open(2)
after revoke or whoever called frevoke(2).

What we could do is add a "I am revoked" flag to struct file which
blocks any future ->readpage, ->readpages, and ->direct_IO on the
file. Alternatively, we could change the ->f_mapping to point to an
address space that has "revoked address space" operations. Hmm.
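
The second alternative above can be sketched in userspace C, as a rough model only. Every name here (toy_file, toy_aops, toy_open, toy_revoke_file) and the choice of -EBADF as the error are hypothetical stand-ins for illustration, not the kernel's structures or anything taken from the patch. The idea shown is that reads go through an operations table, and revocation swaps the table pointer so later reads fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct toy_file;

/* Hypothetical, cut-down stand-in for address_space_operations. */
struct toy_aops {
	ssize_t (*readpage)(struct toy_file *f, char *buf, size_t len);
};

/* Hypothetical stand-in for struct file; f_mapping models the ops lookup. */
struct toy_file {
	const struct toy_aops *f_mapping;
	const char *data;
};

/* Normal path: copy out up to len bytes of the backing data. */
static ssize_t normal_readpage(struct toy_file *f, char *buf, size_t len)
{
	size_t n = strlen(f->data);

	if (n > len)
		n = len;
	memcpy(buf, f->data, n);
	return (ssize_t)n;
}

/* Revoked path: never touches the data. -EBADF is an assumed error code. */
static ssize_t revoked_readpage(struct toy_file *f, char *buf, size_t len)
{
	(void)f;
	(void)buf;
	(void)len;
	return -EBADF;
}

static const struct toy_aops normal_aops = { .readpage = normal_readpage };
static const struct toy_aops revoked_aops = { .readpage = revoked_readpage };

static void toy_open(struct toy_file *f, const char *data)
{
	f->f_mapping = &normal_aops;
	f->data = data;
}

/* Revocation only swaps the pointer; a freshly opened file is unaffected. */
static void toy_revoke_file(struct toy_file *f)
{
	f->f_mapping = &revoked_aops;
}

static ssize_t toy_read(struct toy_file *f, char *buf, size_t len)
{
	return f->f_mapping->readpage(f, buf, len);
}
```

Because only the revoked file's pointer changes, a later toy_open() on the same data still reads normally, which is the property the i_size_read() trick could not offer. The sketch does not deal with a reader that is already inside readpage when the pointer is swapped.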


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Pekka Enberg

On 3/16/07, Alan Cox <[EMAIL PROTECTED]> wrote:

> Serious question - do we actually need revoke() on a normal file ? BSD
> has never had this, SYS5 has never had this.


On 3/16/07, Pekka Enberg <[EMAIL PROTECTED]> wrote:

> It's needed for forced unmount (bits of it anyway) and
> partial-revocation in SLIM.


And btw, you do need support for tearing down mmap for device files too, right?


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Pekka J Enberg
On Fri, 16 Mar 2007, Andrew Morton wrote:
> What you're trying to do here is very similar to truncate(), and truncate()
> has had a lot of work put into it, and it does work.

Indeed. revoke() is the same as truncate() without, well, the truncation 
part.

On Fri, 16 Mar 2007, Andrew Morton wrote:
> However, modifying i_size like this might be a problem - the inode could be
> dirty and it'll get written to disk!  Perhaps we could change i_size_read()
> to cheat and to return zero if there's a revoke in progress.

Hmph, it's probably going to get too painful so I'll look at adding some 
hooks to do_generic_mapping_read()...

Pekka


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Pekka Enberg

On 3/16/07, Alan Cox <[EMAIL PROTECTED]> wrote:

> Serious question - do we actually need revoke() on a normal file ? BSD
> has never had this, SYS5 has never had this.


It's needed for forced unmount (bits of it anyway) and
partial-revocation in SLIM.


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Pekka Enberg

On 3/16/07, Alan Cox <[EMAIL PROTECTED]> wrote:

> > I'm not sure that running do_fsync() will guarantee that all sys_write()
> > callers will have finished their syscall.  Probably they will have, in
> > practice.  But there is logic in the sync paths to deliberately bale out
> > if we're competing with ongoing dirtyings, to avoid livelocking.
>
> For device files you really need to call into the device driver for this
> (->flush etc).


Sure but the do_fsync() bits are part of generic_file_revoke() which
is not meant for device files at all.


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Alan Cox
> I'm not sure that running do_fsync() will guarantee that all sys_write()
> callers will have finished their syscall.  Probably they will have, in
> practice.  But there is logic in the sync paths to deliberately bale out
> if we're competing with ongoing dirtyings, to avoid livelocking.

For device files you really need to call into the device driver for this
(->flush etc).

> However, modifying i_size like this might be a problem - the inode could be
> dirty and it'll get written to disk!  Perhaps we could change i_size_read()
> to cheat and to return zero if there's a revoke in progress.

The cheating is a bit messier than that - you might be revoking on a
cluster file system and I'm still trying to get my head around what the
semantics for that are. Lying about sizes will break the coherency
protocols I think

Serious question - do we actually need revoke() on a normal file ? BSD
has never had this, SYS5 has never had this.

Alan


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Andrew Morton
On Fri, 16 Mar 2007 13:44:27 +0200 (EET) Pekka J Enberg <[EMAIL PROTECTED]> wrote:

> On Fri, 16 Mar 2007, Andrew Morton wrote:
> > I assume that any future callers to sys_read() will reliably do the right
> > thing at this stage, so we are concerned with threads which are presently
> > partway through a read from this inode?
> 
> Yes. We first revoke the file descriptors under tasklist_lock after which 
> any new operations on the revoked inode descriptors fail.

OK.

> On Fri, 16 Mar 2007, Andrew Morton wrote:
> > If it _is_ accurate then hm, tricky.  It all rather depends upon how the
> > relevant filesystem implements reading (and writing?).  Which is why you
> > made it a file_operation, fair enough.
> 
> Yeah, filesystem dependent. Writes are not a problem as do_sync() will 
> wait until the writers are done. For that part, generic_file_revoke() 
> should be fine although suboptimal for most filesystems.

I'm not sure that running do_fsync() will guarantee that all sys_write()
callers will have finished their syscall.  Probably they will have, in
practice.  But there is logic in the sync paths to deliberately bale out
if we're competing with ongoing dirtyings, to avoid livelocking.

> On Fri, 16 Mar 2007, Andrew Morton wrote:
> > But even for ext2 and ext3 (please keep ext4 in sync with ext3 changes,
> > btw), if some process is partway through a big page_cache_readahead()
> > operation then a concurrent invalidate_inode_pages2() call won't worry it
> > at all: the pagecache will be reinstantiated and do_generic_mapping_read()
> > will proceed to copy that pagecache out to the user after the revoke() has
> > returned.  I think.
> 
> That's bad. Can we perhaps wait until readers are done?

Gee.  The only way I can think of guaranteeing this is with a full-on
freeze_processes/thaw_processes, which is the biggest synchronisation
barrier we have, apart from sys_reboot().

Now it just so happens that for ext2/3/4-style filesystems, we re-check
i_size inside the inner loop to handle concurrent truncates (see
i_size_read() calls in do_generic_mapping_read()).

Perhaps the revoke() code can hook into there in some fashion to tell the
process-which-is-presently-running-read() that it is now reading crap.

That still doesn't give us a way of making revoke() wait until all
read()ers have finished their reads.

What you're trying to do here is very similar to truncate(), and truncate()
has had a lot of work put into it, and it does work.  So if revoke() were
to

a) grab i_mutex to keep write()s away
b) set i_size to zero
c) run truncate_inode_pages()

then I expect that you'll have the guarantees which you need.  This is
because the read() caller will synchronise against revoke() at each
lock_page(), and the read() caller will check i_size prior to locking each
page.

However, modifying i_size like this might be a problem - the inode could be
dirty and it'll get written to disk!  Perhaps we could change i_size_read()
to cheat and to return zero if there's a revoke in progress.

> On Fri, 16 Mar 2007, Andrew Morton wrote: 
> > I'm afraid I haven't paid any attention to this revoke proposal before, I
> > don't understand the usecases nor the implementation details so things
> > which are implicitly-obvious-to-you must be explained to me.  But others
> > will benefit from that explanation too ;)  What, exactly, are we trying to do
> > with the already-opened files and the currently-in-progress syscalls?
> 
> You use revoke() (with chown, for example) to gain exclusive access to 
> an inode that might be in use by other processes. This means that we must 
> make sure that:
> 
>   - operations on opened file descriptors pointing to that inode fail
>   - there are no shared mappings visible to other processes
>   - in-progress system calls are either completed (writes) or aborted (reads)
> 
> After revoke() system call returns, you are guaranteed to have revoked 
> access to an inode for any processes that had access to it when you 
> started the operation. The caller is responsible for blocking any future 
> open(2) calls that might occur while revoke() takes care of fork(2) and 
> dup(2) during the operation.

<adds that to the changelog>

> On Fri, 16 Mar 2007, Andrew Morton wrote: 
> > (A concurrent direct-io read might be a problem too?)
> 
> Good point. We would need to take down those too.
> 

direct-io caches i_size for the whole operation and sometimes re-reads it. 
It does funky things to handle concurrent reads and writes.  Probably for
the DIO_LOCKING case we're OK, as direct-io has to re-check i_size_read()
after reacquiring i_mutex.  It's complex, but again, our path to success
here will be to piggyback on the existing handling of truncation.
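
Steps (a) to (c) and the per-page re-check of i_size can be modelled in userspace C as a rough sketch. All of the names (toy_inode, toy_revoke, toy_read_page, toy_read_with_revoke) are hypothetical, a single pthread mutex stands in for lock_page(), and the size check is done under that lock for simplicity. This is not kernel code and not the patch's implementation; it only shows why a reader that re-checks the size at each page stops copying data once the revoke has run.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4
#define NPAGES  4

/* Hypothetical stand-in for an inode plus its page cache. */
struct toy_inode {
	pthread_mutex_t i_mutex;   /* keeps writers away, as in step (a) */
	pthread_mutex_t page_lock; /* one lock standing in for lock_page() */
	size_t i_size;             /* readers re-check this for every page */
	char pages[NPAGES][PAGE_SZ];
};

static void toy_inode_init(struct toy_inode *ino)
{
	pthread_mutex_init(&ino->i_mutex, NULL);
	pthread_mutex_init(&ino->page_lock, NULL);
	ino->i_size = NPAGES * PAGE_SZ;
	memset(ino->pages, 'x', sizeof(ino->pages));
}

/* (a) take i_mutex, (b) set i_size to zero, (c) drop the cached pages. */
static void toy_revoke(struct toy_inode *ino)
{
	pthread_mutex_lock(&ino->i_mutex);
	pthread_mutex_lock(&ino->page_lock);
	ino->i_size = 0;
	memset(ino->pages, 0, sizeof(ino->pages));
	pthread_mutex_unlock(&ino->page_lock);
	pthread_mutex_unlock(&ino->i_mutex);
}

/* Copy one page out; returns 0 once the page lies beyond i_size. */
static size_t toy_read_page(struct toy_inode *ino, size_t index, char *buf)
{
	size_t copied = 0;

	pthread_mutex_lock(&ino->page_lock);
	if (index * PAGE_SZ < ino->i_size) {
		memcpy(buf, ino->pages[index], PAGE_SZ);
		copied = PAGE_SZ;
	}
	pthread_mutex_unlock(&ino->page_lock);
	return copied;
}

/*
 * Read the pages in order and run the revoke just before page
 * `revoke_after` is read. Returns the number of bytes the reader got.
 */
static size_t toy_read_with_revoke(struct toy_inode *ino, size_t revoke_after)
{
	char buf[PAGE_SZ];
	size_t total = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		size_t n;

		if (i == revoke_after)
			toy_revoke(ino);
		n = toy_read_page(ino, i, buf);
		if (n == 0)
			break;
		total += n;
	}
	return total;
}
```

The model keeps the size change in memory only, so it sidesteps the concern raised above about a zeroed i_size reaching the disk through a dirty inode.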




Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Pekka J Enberg
On Fri, 16 Mar 2007, Andrew Morton wrote:
> I assume that any future callers to sys_read() will reliably do the right
> thing at this stage, so we are concerned with threads which are presently
> partway through a read from this inode?

Yes. We first revoke the file descriptors under tasklist_lock after which 
any new operations on the revoked inode descriptors fail.

On Fri, 16 Mar 2007, Andrew Morton wrote:
> If it _is_ accurate then hm, tricky.  It all rather depends upon how the
> relevant filesystem implements reading (and writing?).  Which is why you
> made it a file_operation, fair enough.

Yeah, filesystem dependent. Writes are not a problem as do_sync() will 
wait until the writers are done. For that part, generic_file_revoke() 
should be fine although suboptimal for most filesystems.

On Fri, 16 Mar 2007, Andrew Morton wrote:
> But even for ext2 and ext3 (please keep ext4 in sync with ext3 changes,
> btw), if some process is partway through a big page_cache_readahead()
> operation then a concurrent invalidate_inode_pages2() call won't worry it
> at all: the pagecache will be reinstantiated and do_generic_mapping_read()
> will proceed to copy that pagecache out to the user after the revoke() has
> returned.  I think.

That's bad. Can we perhaps wait until readers are done?

On Fri, 16 Mar 2007, Andrew Morton wrote: 
> I'm afraid I haven't paid any attention to this revoke proposal before, I
> don't understand the usecases nor the implementation details so things
> which are implicitly-obvious-to-you must be explained to me.  But others
> will benefit from that explanation too ;)  What, exactly, are we trying to do
> with the already-opened files and the currently-in-progress syscalls?

You use revoke() (with chown, for example) to gain exclusive access to 
an inode that might be in use by other processes. This means that we must 
make sure that:

  - operations on opened file descriptors pointing to that inode fail
  - there are no shared mappings visible to other processes
  - in-progress system calls are either completed (writes) or aborted (reads)

After revoke() system call returns, you are guaranteed to have revoked 
access to an inode for any processes that had access to it when you 
started the operation. The caller is responsible for blocking any future 
open(2) calls that might occur while revoke() takes care of fork(2) and 
dup(2) during the operation.

On Fri, 16 Mar 2007, Andrew Morton wrote: 
> (A concurrent direct-io read might be a problem too?)

Good point. We would need to take down those too.

Pekka


Re: [PATCH 2/5] revoke: core code

2007-03-16 Thread Andrew Morton
On Fri, 16 Mar 2007 08:44:46 +0200 "Pekka Enberg" <[EMAIL PROTECTED]> wrote:

> On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > Why is this code using invalidate_inode_pages2()?  That function keeps on
> > breaking, has ill-defined semantics and will probably change in the future.
> >
> > Exactly what semantics are you looking for here, and why?
> 
> What the comment says "make pending reads fail." When revoking an
> inode, we need to make sure there are no pending I/O that will
> complete after revocation and thus leak information.

hm, let's define "pending".

I assume that any future callers to sys_read() will reliably do the right
thing at this stage, so we are concerned with threads which are presently
partway through a read from this inode?

If that's not accurate then please describe with some detail exactly what
semantics you're looking for here.

If it _is_ accurate then hm, tricky.  It all rather depends upon how the
relevant filesystem implements reading (and writing?).  Which is why you
made it a file_operation, fair enough.

But even for ext2 and ext3 (please keep ext4 in sync with ext3 changes,
btw), if some process is partway through a big page_cache_readahead()
operation then a concurrent invalidate_inode_pages2() call won't worry it
at all: the pagecache will be reinstantiated and do_generic_mapping_read()
will proceed to copy that pagecache out to the user after the revoke() has
returned.  I think.

I'm afraid I haven't paid any attention to this revoke proposal before, I
don't understand the usecases nor the implementation details so things
which are implicitly-obvious-to-you must be explained to me.  But others
will benefit from that explanation too ;)  What, exactly, are we trying to do
with the already-opened files and the currently-in-progress syscalls?

(A concurrent direct-io read might be a problem too?)


Re: [PATCH 2/5] revoke: core code

2007-03-15 Thread Pekka Enberg

Hi Andrew,

On Sun, 11 Mar 2007 13:30:49 +0200 (EET) Pekka J Enberg
<[EMAIL PROTECTED]> wrote:

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

n all system calls must return long.


Fixed.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

so  the modification of vm_flags is racy?

> + smp_mb();

Please always document barriers.  There's presumably some vm_flags reader
we're concerned about here, but how is the code reader to know what the
code writer was thinking?


We need to watch out for page faults after the shared mappings have
been taken down and mmap(2) trying to remap. I'll add a comment here.
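
A minimal userspace sketch of what a documented barrier for this case could look like, using C11 atomics rather than the kernel's smp_mb(). The names (struct mapping_model, MODEL_REVOKED, revoke_mapping_model(), fault_model()) are hypothetical and do not come from the patch; the sketch only shows a flag store paired with a commented full fence so that a later fault path cannot repopulate a revoked mapping.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_REVOKED 0x1u	/* hypothetical stand-in for VM_REVOKED */

struct mapping_model {
	atomic_uint flags;	/* stand-in for vma->vm_flags */
	atomic_bool populated;	/* true while pages are mapped */
};

/* Writer side: mark the mapping revoked, then tear it down. */
static void revoke_mapping_model(struct mapping_model *m)
{
	atomic_fetch_or_explicit(&m->flags, MODEL_REVOKED,
				 memory_order_relaxed);
	/*
	 * Full fence: order the flag store before the teardown store
	 * below. A fault path that observes the teardown and issues the
	 * matching fence is then guaranteed to observe MODEL_REVOKED.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&m->populated, false, memory_order_relaxed);
}

/*
 * Reader side: returns true if the access may proceed, false if it
 * must fail (the SIGBUS case in the patch description).
 */
static bool fault_model(struct mapping_model *m)
{
	if (atomic_load_explicit(&m->populated, memory_order_relaxed))
		return true;
	/* Pairs with the fence in revoke_mapping_model(). */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&m->flags, memory_order_relaxed) &
	    MODEL_REVOKED)
		return false;
	/* Not revoked: repopulate, as an ordinary fault would. */
	atomic_store_explicit(&m->populated, true, memory_order_relaxed);
	return true;
}
```

The point is the comment on each fence: it names the store being ordered and the reader it pairs with, which is the information the review asks for.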

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

This all looks very strange.  If the calling process expires its timeslice,
the entire system call fails?

What's happening here?


Me being stupid. I followed what unmap_mapping_range_vma is doing but
failed to see what its callers are doing. I'll fix it up.
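
For reference, a simplified userspace sketch of the caller-side pattern: the early return after a lock break is treated as "rescan from the top", not as an error to hand back to userspace. Everything here (struct vma_model, CHUNK, zap_some(), revoke_one(), revoke_all()) is a hypothetical stand-in, not code from the patch or from mm/memory.c.

```c
#include <stddef.h>

#define CHUNK 4	/* hypothetical: pages zapped per call before a break */

struct vma_model {
	unsigned long start, end;	/* page range still to be zapped */
};

/* Zap up to CHUNK pages; returns the address to resume from. */
static unsigned long zap_some(unsigned long start, unsigned long end)
{
	unsigned long n = end - start;

	return start + (n < CHUNK ? n : CHUNK);
}

/*
 * Returns 0 when the vma is fully zapped, -1 when the "lock" had to be
 * dropped part-way. Progress is recorded in the vma so that a rescan
 * resumes where the previous attempt stopped.
 */
static int revoke_one(struct vma_model *vma)
{
	vma->start = zap_some(vma->start, vma->end);
	return vma->start < vma->end ? -1 : 0;
}

/*
 * Caller: a break restarts the scan instead of failing the operation.
 * Returns the number of restarts that were needed.
 */
static int revoke_all(struct vma_model *vmas, size_t n)
{
	int restarts = 0;
	size_t i;

restart:
	for (i = 0; i < n; i++) {
		if (vmas[i].start == vmas[i].end)
			continue;	/* already fully zapped */
		if (revoke_one(&vmas[i]) < 0) {
			restarts++;
			goto restart;
		}
	}
	return restarts;
}
```

With this shape, expiring a timeslice only costs a rescan; the operation still completes and returns success.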

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

do_fsync() is seriously suboptimal - it will run an ext3 commit.
do_sync_file_range(...,
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)
will run maybe five times quicker.

But otoh, do_sync_file_range() will fail to write back the pages for a
data=journal ext3 file, I expect (oops).


But it's good enough for generic_file_revoke, no? Ext3 should probably
implement its own revoke hook so you can drop the ext2 and ext3 hooks
if you're worried, I did them mostly for testing.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

Why is this code using invalidate_inode_pages2()?  That function keeps on
breaking, has ill-defined semantics and will probably change in the future.

Exactly what semantics are you looking for here, and why?


What the comment says: "make pending reads fail." When revoking an
inode, we need to make sure there is no pending I/O that will
complete after revocation and thus leak information.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

The blank line before the EXPORT_SYMBOL() is a waste of space.


I'll fix that up.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> +static struct inode *revokefs_alloc_inode(struct super_block *sb)
> +{
> + struct revokefs_inode_info *info;
> +
> + info = kmem_cache_alloc(revokefs_inode_cache, GFP_NOFS);
> + if (!info)
> + return NULL;
> +
> + return &info->vfs_inode;
> +}

Why GFP_NOFS?


GFP_KERNEL should be sufficient. I'll fix that up.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> ===
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ uml-2.6/include/linux/revoked_fs_i.h  2007-03-11 13:09:20.0 +0200
> @@ -0,0 +1,20 @@
> +#ifndef _LINUX_REVOKED_FS_I_H
> +#define _LINUX_REVOKED_FS_I_H
> +
> +#define REVOKEFS_MAGIC 0x5245564B  /* REVK */

This is supposed to go into magic.h.


Will do. Thank you Andrew.


Re: [PATCH 2/5] revoke: core code

2007-03-15 Thread Andrew Morton
On Sun, 11 Mar 2007 13:30:49 +0200 (EET) Pekka J Enberg <[EMAIL PROTECTED]> 
wrote:

> From: Pekka Enberg <[EMAIL PROTECTED]>
> 
> The revokeat(2) and frevoke(2) system calls invalidate open file
> descriptors and shared mappings of an inode. After a successful
> revocation, operations on file descriptors fail with the EBADF or
> ENXIO error code for regular and device files,
> respectively. Attempting to read from or write to a revoked mapping
> causes SIGBUS.
> 
> The actual operation is done in two passes:
> 
>  1. Revoke all file descriptors that point to the given inode. We do
> this under tasklist_lock so that after this pass, we don't need
> to worry about racing with close(2) or dup(2).
>
>  2. Take down shared memory mappings of the inode and close all file
> pointers.
> 
> The file descriptors and memory mapping ranges are preserved until the
> owning task does close(2) and munmap(2), respectively.
> 
> ...
>
> +asmlinkage int sys_revokeat(int dfd, const char __user *filename);
> +asmlinkage int sys_frevoke(unsigned int fd);

n all system calls must return long.

> +static int revoke_vma(struct vm_area_struct *vma, struct zap_details *details)
> +{
> + unsigned long restart_addr, start_addr, end_addr;
> + int need_break;
> +
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
> +
> + /*
> +  * Not holding ->mmap_sem here.
> +  */
> + vma->vm_flags |= VM_REVOKED;

so  the modification of vm_flags is racy?

> + smp_mb();

Please always document barriers.  There's presumably some vm_flags reader
we're concerned about here, but how is the code reader to know what the
code writer was thinking?


> +  again:
> + restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
> +   details);
> +
> + need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
> + if (need_break)
> + goto out_need_break;
> +
> + if (restart_addr < end_addr) {
> + start_addr = restart_addr;
> + goto again;
> + }
> + return 0;
> +
> +  out_need_break:
> + spin_unlock(details->i_mmap_lock);
> + cond_resched();
> + spin_lock(details->i_mmap_lock);
> + return -EINTR;
> +}
> +
> +static int revoke_mapping(struct address_space *mapping, struct file *to_exclude)
> +{
> + struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
> + struct zap_details details;
> + int err = 0;
> +
> + details.i_mmap_lock = &mapping->i_mmap_lock;
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
> + if ((vma->vm_flags & VM_SHARED) && vma->vm_file != to_exclude) {
> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> +
> + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
> + if ((vma->vm_flags & VM_SHARED) && vma->vm_file != to_exclude) {
> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> +  out:
> + spin_unlock(&mapping->i_mmap_lock);
> + return err;
> +}

This all looks very strange.  If the calling process expires its timeslice,
the entire system call fails?

What's happening here?


> +
> +int generic_file_revoke(struct file *file)
> +{
> + int err;
> +
> + /*
> +  * Flush pending writes.
> +  */
> + err = do_fsync(file, 1);
> + if (err)
> + goto out;
> +
> + /*
> +  * Make pending reads fail.
> +  */
> + err = invalidate_inode_pages2(file->f_mapping);
> +
> +  out:
> + return err;
> +}
> +
> +EXPORT_SYMBOL(generic_file_revoke);

do_fsync() is seriously suboptimal - it will run an ext3 commit. 
do_sync_file_range(...,
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)
will run maybe five times quicker.

But otoh, do_sync_file_range() will fail to write back the pages for a
data=journal ext3 file, I expect (oops).


Why is this code using invalidate_inode_pages2()?  That function keeps on
breaking, has ill-defined semantics and will probably change in the future.

Exactly what semantics are you looking for here, and why?

The blank line before the EXPORT_SYMBOL() is a waste of space.

> +/*
> + *   Filesystem for revoked files.
> + */
> +
> +static struct inode *revokefs_alloc_inode(struct super_block *sb)
> +{
> + struct revokefs_inode_info *info;
> +
> + info = kmem_cache_alloc(revokefs_inode_cache, GFP_NOFS);
> + if (!info)
> + return NULL;
> +
> + return &info->vfs_inode;
> +}

Why GFP_NOFS?

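> ===
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ uml-2.6/include/linux/revoked_fs_i.h  2007-03-11 13:09:20.0 +0200
> @@ -0,0 +1,20 @@
> +#ifndef _LINUX_REVOKED_FS_I_H
> +#define _LINUX_REVOKED_FS_I_H
> +
> +#define REVOKEFS_MAGIC 0x5245564B  /* REVK */

This is supposed to go into magic.h.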


[PATCH 2/5] revoke: core code

2007-03-11 Thread Pekka J Enberg
From: Pekka Enberg <[EMAIL PROTECTED]>

The revokeat(2) and frevoke(2) system calls invalidate open file
descriptors and shared mappings of an inode. After a successful
revocation, operations on file descriptors fail with the EBADF or
ENXIO error code for regular and device files,
respectively. Attempting to read from or write to a revoked mapping
causes SIGBUS.

The actual operation is done in two passes:

 1. Revoke all file descriptors that point to the given inode. We do
this under tasklist_lock so that after this pass, we don't need
to worry about racing with close(2) or dup(2).
   
 2. Take down shared memory mappings of the inode and close all file
pointers.

The file descriptors and memory mapping ranges are preserved until the
owning task does close(2) and munmap(2), respectively.
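
To make the intended semantics concrete, here is a small userspace model, not kernel code. Each descriptor slot points at a file object, and revocation repoints every slot that refers to the target inode at a placeholder whose operations fail with EBADF (the regular-file case; the ENXIO device case is not modelled). All names (file_model, fd_table, revoke_inode_model(), fd_read()) are hypothetical, and using one shared placeholder is a simplification made for this sketch.

```c
#include <errno.h>
#include <stddef.h>

#define NR_FDS 8

struct file_model;

struct file_ops_model {
	long (*read)(struct file_model *f);
};

struct file_model {
	int inode;	/* inode number this file refers to */
	const struct file_ops_model *ops;
};

static long normal_read(struct file_model *f) { (void)f; return 0; }
static long revoked_read(struct file_model *f) { (void)f; return -EBADF; }

static const struct file_ops_model normal_ops = { .read = normal_read };
static const struct file_ops_model revoked_ops = { .read = revoked_read };

/* Shared placeholder that every revoked slot points to. */
static struct file_model revoked_file = { .inode = -1, .ops = &revoked_ops };

struct fd_table {
	struct file_model *fds[NR_FDS];	/* NULL means a closed slot */
};

/*
 * Pass 1 of the description: repoint every descriptor of @inode.
 * Returns the number of descriptors revoked.
 */
static int revoke_inode_model(struct fd_table *t, int inode)
{
	int i, revoked = 0;

	for (i = 0; i < NR_FDS; i++) {
		if (t->fds[i] && t->fds[i]->inode == inode) {
			t->fds[i] = &revoked_file;	/* slot stays occupied */
			revoked++;
		}
	}
	return revoked;
}

/* Dispatch a read through whatever file the slot currently points at. */
static long fd_read(struct fd_table *t, int fd)
{
	if (fd < 0 || fd >= NR_FDS || !t->fds[fd])
		return -EBADF;
	return t->fds[fd]->ops->read(t->fds[fd]);
}
```

The revoked slot is not freed: it remains allocated until the owner closes it, which is the "preserved until close(2)" behaviour described above.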

Signed-off-by: Pekka Enberg <[EMAIL PROTECTED]>
---
 fs/Makefile  |2 
 fs/revoke.c  |  588 +++
 fs/revoked_inode.c   |  378 +++
 include/linux/fs.h   |4 
 include/linux/revoked_fs_i.h |   20 +
 include/linux/syscalls.h |3 
 6 files changed, 994 insertions(+), 1 deletion(-)

Index: uml-2.6/fs/Makefile
===
--- uml-2.6.orig/fs/Makefile 2007-03-11 13:07:57.0 +0200
+++ uml-2.6/fs/Makefile 2007-03-11 13:09:20.0 +0200
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o
+   stack.o revoke.o revoked_inode.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
Index: uml-2.6/include/linux/syscalls.h
===
--- uml-2.6.orig/include/linux/syscalls.h   2007-03-11 13:07:57.0 +0200
+++ uml-2.6/include/linux/syscalls.h 2007-03-11 13:09:20.0 +0200
@@ -605,4 +605,7 @@ asmlinkage long sys_getcpu(unsigned __us
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
+asmlinkage int sys_revokeat(int dfd, const char __user *filename);
+asmlinkage int sys_frevoke(unsigned int fd);
+
 #endif
Index: uml-2.6/include/linux/fs.h
===
--- uml-2.6.orig/include/linux/fs.h 2007-03-11 13:07:57.0 +0200
+++ uml-2.6/include/linux/fs.h  2007-03-11 13:09:20.0 +0200
@@ -1100,6 +1100,7 @@ struct file_operations {
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
+   int (*revoke)(struct file *);
 };
 
 struct inode_operations {
@@ -1739,6 +1740,9 @@ extern ssize_t generic_splice_sendpage(s
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
size_t len, unsigned int flags);
 
+/* fs/revoke.c */
+extern int generic_file_revoke(struct file *);
+
 extern void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
Index: uml-2.6/fs/revoke.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ uml-2.6/fs/revoke.c 2007-03-11 13:14:42.0 +0200
@@ -0,0 +1,588 @@
+/*
+ * fs/revoke.c - Invalidate all current open file descriptors of an inode.
+ *
+ * Copyright (C) 2006-2007  Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ */
+
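+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/sched.h>
+#include <linux/revoked_fs_i.h>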
+
+/*
+ * This is used for pre-allocating an array of file pointers so that we don't
+ * have to do memory allocation under tasklist_lock.
+ */
+struct revoke_table {
+   struct file **files;
+   unsigned long size;
+   unsigned long end;
+   unsigned long restore_start;
+};
+
+struct kmem_cache *revokefs_inode_cache;
+
+/*
+ * Revoked file descriptors point to inodes in the revokefs filesystem.
+ */
+static struct vfsmount *revokefs_mnt;
+
+static struct file *get_revoked_file(void)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+   struct file *filp;
+   struct qstr name;
+
+   filp = get_empty_filp();
+   if (!filp)
+   goto err;
+
+   inode = new_inode(revokefs_mnt->mnt_sb);
+   if (!inode)
+   goto err_inode;
+
+   name.name = "revoked_file";
+   name.len = strlen(name.name);
+   dentry = d_alloc(revokefs_mnt->mnt_sb->s_root, &name);
+   if (!dentry)
+   goto err_dentry;
+
+   
