extending VFS leases for NFSv4

2007-04-12 Thread david m. richter

hello,

	we're looking for some input regarding expanding fcntl(2) file leases 
somewhat, in order to implement NFSv4 file delegations.


	somewhat similar to Samba and OPLOCKs, NFSv4 file delegations are 
implemented with leases.  however, the current lease subsystem only breaks 
leases (and thereby delegations) when a file is truncated or is opened with a 
mode that conflicts with an existing lease.  for NFSv4, delegations must also 
be revoked when a file is renamed, unlinked, or has a metadata change from a 
chown, e.g.


	thus far, we've come up with three ways -- all with trade-offs -- to do 
this that we think avoid races between lease-granting and lease-breaking.  a 
very brief** summary of each:


- modify struct inode by adding a counter -- callers increment to block 
lease-granting; grant leases iff the counter == 0.


- create a new flavor of struct file_lock that blocks lease-granting.

- modify struct file_lock by adding a counter, similar to the above.  a 
lease-breaker increments the counters on lease file_locks it's breaking; 
lease-granting is disallowed if any file_lock's counter is nonzero.  the 
lease-breaker becomes responsible for freeing these locks.


	clearly, these aren't minor changes.  we would greatly appreciate any 
advice, criticism, or commentary.



thanks much,

d
.

**: for more information, please see some dev notes at: 
http://www.citi.umich.edu/u/richterd/vfs_leases.html

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On setting a lease across a cluster

2008-01-04 Thread david m. richter
On Fri, 4 Jan 2008, Matthew Wilcox wrote:

> 
> Hi Bruce,
> 
> The current implementation of vfs_setlease/generic_setlease/etc is a
> bit quirky.  I've been thinking it over for the past couple of days,
> and I think we need to refactor it to work sensibly.
> 
> As you may have noticed, I've been mulling over getting rid of the
> BKL in fs/locks.c and the lease interface is particularly problematic.
> Here's one reason why:
> 
> fcntl_setlease
>lock_kernel
>vfs_setlease
>   lock_kernel
>   generic_setlease
>   unlock_kernel
>fasync_helper
>unlock_kernel
> 
> This is perfectly legitimate for the BKL of course, but other spinlocks
> can't be acquired recursively in this way.  At first I thought I'd just
> push the call to generic_setlease into __vfs_setlease and have two ways
> of calling it:
> 
> fcntl_setlease
>lock_kernel
>__vfs_setlease
>fasync_helper
>unlock_kernel
> 
> vfs_setlease
>   lock_kernel
>   __vfs_setlease
>   unlock_kernel
> 
> But generic_setlease allocates memory (GFP_KERNEL) under the lock, so
> that's bad for converting to spnlocks later.  Then I thought I'd move
> the spinlock acquisition into generic_setlease().  But I can't do that
> either as fcntl_setlease() needs to hold the spinlock around the call
> to generic_setlease() and fasync_helper().
> 
> Then I started to wonder about the current split of functionality between
> fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
> other process having this file open should be common to all filesystems.
> And it should be honoured by NFS (it shouldn't hand out a delegation if
> a local user has the file open), so it needs to be in vfs_setlease().
> 
> Now I'm worrying about the mechanics of calling out to a filesystem to
> perform a lock.  Obviously, a network filesystem is going to have to
> sleep waiting for a response from the fileserver, so that precludes the
> use of a spinlock held over the call to f_op->setlease.  I'm not keen on
> the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
> hard enough without worrying what a filesystem might be doing with it.
> 
> So I think we need to refactor the interface, and I'd like to hear your
> thoughts on my ideas of how to handle this.
> 
> First, have clients of vfs_setlease call lease_alloc() and pass it in,
> rather than allocate it on the stack and have this horrible interface
> where we may pass back an allocated lock.  This allows us to not allocate
> memory (hence sleep) during generic_setlease().
> 
> Second, move some of the functionality of generic_setlease() to
> vfs_setlease(), as mentioned above.
> 
> Third, change the purpose of f_op->setlease.  We can rename it if you
> like to catch any out of tree users.  I'd propose using it like this:

fwiw, i've done some work on extending the lease subsystem to help 
support the full range of requirements for NFSv4 file and directory 
delegations (e.g., breaking a lease when unlinking a file) and we ended up 
actually doing most of what you've just suggested here, which i take to be 
a good sign.

most of my refactoring came out of trying to simplify locking and 
avoid holding locks too long (rather than specifically focusing on 
cluster-oriented stuff, but the goals dovetail) and your work on getting 
the BKL out of locks.c entirely is something i really like and look 
forward to.

thanks,

d
.
 
> vfs_setlease()
>if (f_op->setlease())
>   res = f_op->setlease()
>   if (res)
>  return res;
>lock_kernel()
>generic_setlease()
>unlock_kernel()
> 
> fcntl_setlease
>if (f_op->setlease())
>   res = f_op->setlease()
>   if (res)
>  return res;
>lock_kernel
>generic_setlease()
>... fasync ...
>unlock_kernel
> 
> So 'f_op->setlease' now means "Try to get a lease from the fileserver".
> We can optimise this a bit to not even call setlease if, say, we already
> have a read lease and we're trying to get a second read lease.  But we
> have to record our local leases (that way they'll show up in /proc/locks).
> 
> I think the combination of these three ideas gives us a sane interface
> to the various setlease functions, and let us convert from lock_kernel()
> to spin_lock() later.  Have I missed something?  I don't often think
> about the needs of cluster filesystems, so I may misunderstand how they
> need this operation to work.
> 
> At some point, we need to revisit the logic for 'is this file open
> by another process' -- it's clearly buggy since it doesn't hold the
> inode_lock while checking i_count, so it could race against an __iget().
> 
> -- 
> Intel are signing my paycheques ... these opinions are still mine
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> 

Re: On setting a lease across a cluster

2008-01-05 Thread david m. richter
> > actually doing most of what you've just suggested here, which i take to be 
> > a good sign.
> 
> As long as it's great minds thinking alike and not fools seldom
> differing ;-)

ooh, good phrase, one with which i wasn't familiar  :)

 
> > most of my refactoring came out of trying to simplify locking and 
> > avoid holding locks too long (rather than specifically focusing on 
> > cluster-oriented stuff, but the goals dovetail) and your work on getting 
> > the BKL out of locks.c entirely is something i really like and look 
> > forward to.
> 
> Excellent.  Shall I make the patch myself, or did you want to post a
> patch based on working code?  ;-)

please, by all means, keep going -- i want your code!  :)  my 
wording was poor and may've sounded like this was already a fait accompli, 
when basically what i was trying to say was that the locking ended up 
being a hassle and your approach would also help solve that, in addition 
to your extra cluster-related needs/goals.

thanks, 

d
.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS client hang on attempt to do async blocking posix lock enqueue

2008-02-08 Thread david m. richter
On Fri, 8 Feb 2008, J. Bruce Fields wrote:

> On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> > On Thu, 7 Feb 2008 18:26:18 -0500
> > "J. Bruce Fields" <[EMAIL PROTECTED]> wrote:
> > 
> > > On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > > > Hello!
> > > >
> > > > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> > > >
> > > >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> > > >>> The problem seems to be with the fact that the client and server are 
> > > >>> on
> > > >>> the same machine. This test work fine with or without an underlaying 
> > > >>> fs
> > > >>> that supports locking when the client and the server are on a  
> > > >>> different
> > > >>> machines. Like you said the server is trying to send the grant  
> > > >>> message to
> > > >>> the client but for some reason it fails when the client is on the  
> > > >>> same
> > > >>> machine.
> > > >> That *shouldn't* make a difference, so we need to take another look at
> > > >> this--Oleg, this problem is still unfixed, right?
> > > >
> > > > Yes, I just pulled your latest nfs tree and I still can reproduce the  
> > > > problem.
> > > 
> > > OK, we have finally reproduced this problem here, and David's working on
> > > debugging.  It does indeed seem to only be reproduceable with client and
> > > server on the same machine.  Thanks for the report
> > > 
> > > --b.
> > 
> > It might be worth testing this both with and without the patchset I
> > posted to linux-nfs recently to take care of the lockd hang. If
> > lockd is stuck trying to rpc_ping itself then it probably would hang
> > like this, wouldn't it?
> 
> Of course!  Yes, that fits.
> 
> --b.

right on, jeff, good catch and thanks for directing my attention 
to your patches.

i applied them on top of 2.6.23.1 and tested them on a cluster 
exporting GFS2 over NFS, using oleg's reproducer code.  your patches fix 
that lockd hang.

in a bit more detail, oleg's reproducer basically gets a 
whole-file read lock, tests the lock, upgrades to a whole-file exclusive 
lock, tests the lock, then unlocks.  the problem was that when getting 
that exclusive lock things would hang.  this only happened when the client 
and server were on the same machine, and i could reproduce it with NFS 
exporting GFS2 but not NFS exporting EXT3.


thanks,

d
.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html