Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Tue, Dec 05, 2006 at 11:41:28AM -0500, Nikolai Joukov wrote:
> > > As we promised on the linux-ext4 list on October 31, here is the patch
> > > that adds secure deletion via a trash-bin functionality for ext4. It is a
> > > compromise solution that combines secure deletion with the trash-bin support
> > > (the latter had been requested by even more people than the former :-).
> >
> > Given that almost all of the code for this uses vfs interfaces and
> > only a couple of simple filesystem hooks, is there any reason for
> > this being ext4 specific? i.e. why aren't you hooking the vfs layer
> > so we get a single undelete/secure delete implementation for all
> > filesystems?
>
> You are right. Actually, we mentioned it as benefit number 4 of this
> approach in the original post. Most of the code is not
> file-system-specific and can be used by any other (all other?) file
> system(s). The only complication is that only the ext2/3/4 and reiser
> filesystems already support the per-file secure deletion and undelete
> attributes.

They are defined but unused in 2.6.19, right? I can't see anywhere in the 2.6.19 ext2/3/4/reiser trees that actually uses those flags, including setting and retrieving them from disk. JFS, I can see, sets, clears and retrieves them, but not the filesystems you mention. Though I might just be blind. ;)

If all we need to add to XFS is support for those flags, then XFS support would be trivial to add. Oh, damn. I take that back. We're almost out of flag space in the on-disk inode - these two flags would use the last 2 flag bits, so this may require an on-disk inode format change in XFS. This will be a little more complex than I first thought, but not impossible, as we already support two on-disk inode format versions.

> Since ext4 is in early development now, we believe it'd be easier to get
> such code into ext4 than into the main-line VFS.
> If there's enough
> interest among the kernel maintainers, we'd be happy to move this
> functionality to the VFS and provide f/s hooks for
> secure-deletion/trash-bin.

I'd certainly like to see new functionality like this added at the VFS level rather than in any particular filesystem. Don't know about anyone else, though.

> I guess the right thing to do would be to move the common trash-bin
> (tb.c and tb.h) code into the fs/ and include/linux/ directories.

And call them trashbin.[ch] ;)

> Also, VFS would require just a couple of trivial changes to support
> something like a '-o trashbin' mount-time option for all file systems.
> In addition, individual file systems may support per-file attributes for
> this (e.g., ext2/3/4).

Sounds like a good idea to me.

Cheers, Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:
> On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I agree that it is not feasible to add new system calls every time
> > somebody has a problem, and we don't take adding system calls lightly.
> > However, in this case we're talking about an entire *community* of
> > people (high-end computing), not just one or two people. Of course it
> > may still be the case that that community is not important enough to
> > justify the addition of system calls; that's obviously not my call to make!
>
> I have the feeling that the openg stuff is rushed without looking into all
> solutions that don't require changes to the current interface.

I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle

for the libhandle man page. Basically:

openg == path_to_handle
sutoc == open_by_handle

And here for the userspace code:

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/

Cheers, Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: NFSv4/pNFS possible POSIX I/O API standards
On Dec 05, 2006 15:55 -0800, Ulrich Drepper wrote:
> I don't think an accuracy flag is useful at all. Programs don't want to
> use fuzzy information. If you want a fast 'ls -l' then add a mode which
> doesn't print the fields which are not provided. Don't provide outdated
> information. Similarly for other programs.

Does this mean you are against the statlite() API entirely, or only against the document's use of the flag as a vague "accuracy" value instead of a hard "valid" value?

> Christoph Hellwig wrote:
> > I think having a stat lite variant is pretty much consensus, we just need
> > to fine tune the actual API - and of course get a reference implementation.
> > So if you want to get this going try to implement it based on
> > http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2.
> > Bonus points for actually making use of the flags in some filesystems.
>
> I don't like that approach. The flag parameter should be exclusively an
> output parameter. By default the kernel should fill in all the fields
> it has access to. If access is not easily possible then set the bit and
> clear the field. There are of course certain fields which always should
> be added. In the proposed man page these are already identified (i.e.,
> those before the st_litemask member).

IMHO, if the application doesn't need a particular field (e.g. "ls -i" doesn't need the size, "ls -s" doesn't need the inode number, etc.) why should these be filled in if they are not easily accessible? As for what is easily accessible, that needs to be determined by the filesystem itself. It is of course fine if the filesystem fills in values that it has at hand, even if they are not requested, but it shouldn't have to do extra work to fill in values that will not be needed.

"ls --color" and "ls -F" are prime examples. They do a stat on files only to get the file mode (the file type is already part of many dirent structs).
But a clustered filesystem may need to do a lot of work to also get the file size, because it has no way of knowing which stat() fields are needed.

> And one last point: I haven't seen any discussion why readdirplus should
> do the equivalent of stat and there is no 'statlite' variant. Are all
> places where readdir is used non-critical for performance or depend on
> accurate information?

That was previously suggested by me already. IMHO, there should ONLY be a statlite variant of readdirplus(), and I think most people agree with that part of it (though there is contention on whether readdirplus() is needed at all).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: readdirplus() as possible POSIX I/O API
On Dec 05, 2006 10:23 -0500, Trond Myklebust wrote:
> On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote:
> > I think the "barrier semantics" are something that have just crept
> > into this discussion and is confusing the issue.
>
> It is the _only_ concept that is of interest for something like NFS or
> CIFS. We already have the ability to cache the information.

Actually, wouldn't the ability for readdirplus() (with a valid flag) be useful for NFS if only to indicate that it does not need to flush the cache because the ctime/mtime isn't needed by the caller?

> 'find' should be quite happy with the existing readdir(). It does not
> need to use stat() or readdirplus() in order to recurse because
> readdir() provides d_type.

It does in any but the most simplistic invocations, like "find -mtime" or "find -mode" or "find -uid", etc.

> If the application is able to select a statlite()-type of behaviour with
> the fadvise() hints, your filesystem could be told to serve up cached
> information instead of regrabbing locks. In fact that is a much more
> flexible scheme, since it also allows the filesystem to background the
> actual inode lookups, or to defer them altogether if that is more
> efficient.

I guess I just don't understand how fadvise() on a directory file handle (used for readdir()) can be used to affect later stat operations (which definitely will NOT be using that file handle)? If you mean that the application should actually open() each file, fadvise(), fstat(), close(), instead of just a stat() call, then we are WAY into negative improvements here due to the overhead of doing open+close.
> > The filesystem can't always do "stat-ahead" on the files because that
> > requires instantiating an inode on the client which may be stale (lock
> > revoked) by the time the app gets to it, and the app (and the VFS) have
> > no idea just how stale it is, and whether the stat is a "real" stat or
> > "only" the readdir stat (because the fadvise would only be useful on
> > the directory, and not all of the child entries), so it would need to
> > re-stat the file.
>
> Then provide hints that allow the app to select which behaviour it
> prefers. Most (all?) apps don't _care_, and so would be quite happy with
> cached information. That is why the current NFS caching model exists in
> the first place.

Most clustered filesystems have strong cache semantics, so that isn't a problem. IMHO, the mechanism to pass the hint to the filesystem IS the readdirplus_lite() that tells the filesystem exactly which data is needed on each directory entry.

> > Also, this would potentially blow the client's real
> > working set of inodes out of cache.
>
> Why?

Because in many cases it is desirable to limit the number of DLM locks on a given client (e.g. the GFS2 thread with AKPM about clients with millions of DLM locks due to lack of memory pressure on large-memory systems). That means a finite-size lock LRU on the client that risks being wiped out by a few thousand files in a directory doing "readdir() + 5000*stat()".

Consider a system like BlueGene/L with 128k compute cores. Jobs that run on that system will periodically (e.g. every hour) create up to 128K checkpoint+restart files to avoid losing a lot of computation if a node crashes. Even if each one of the checkpoints is in a separate directory (I wish all users were so nice :-) it means 128K inodes+DLM locks for doing an "ls" in the directory.
> > Doing things en masse with readdirplus() also allows the filesystem to
> > do the stat() operations in parallel internally (which is a net win if
> > there are many servers involved) instead of serially as the application
> > would do.
>
> If your application really cared, it could add threading to 'ls' to
> achieve the same result. You can also have the filesystem preload that
> information based on fadvise hints.

But it would still need 128K RPCs to get that information, and 128K new inodes on that client. And what is the chance that I can get a multi-threaded "ls" into the upstream GNU ls code?

In the case of local filesystems, multi-threading ls would be a net loss due to seeking. But even for local filesystems readdirplus_lite() would allow them to fill in stat information they already have (either in cache or on disk), and may avoid doing extra work that isn't needed. For filesystems that don't care, readdirplus_lite() can just be readdir()+stat() internally.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
openg
On Tue, Dec 05, 2006 at 03:44:31PM -0600, Rob Ross wrote:
> The openg() really just does the lookup and permission checking. The
> openfh() creates the file descriptor and starts that context if the
> particular FS tracks that sort of thing.
...
> Well you've caught me. I don't want to cache the values, because I
> fundamentally believe that sharing state between clients and servers is
> braindead (to use Christoph's phrase) in systems of this scale
> (thousands to tens of thousands of clients). So I don't want locks, so I
> can't keep the cache consistent, ... So someone else will have to run
> the tests you propose :)...

Besides the whole ugliness, you miss a few points about the fundamental architecture of the unix filesystem permission model, unfortunately. Say you want to look up a path /foo/bar/baz; then the access permission is based on the following things:

- the credentials of the user. Let's only take traditional uid/gid for this example, although credentials are much more complex these days
- the kind of operation you want to perform
- the access permission of the actual object the path points to (inode)
- the lookup permission (x bit) for every object on the way to your object

In your proposal sutoc is a simple conversion operation, which means openg needs to perform all these access checks and encode them in the fh_t. That means an fh_t must fundamentally be an object that is kept in the kernel, aka a capability as defined by Henry Levy. This does imply you _do_ need to keep state. And because it needs kernel support, your fh_t is more or less equivalent to a file descriptor, with sutoc equivalent to a dup variant that really duplicates the backing object instead of just the userspace index into it.

Note that somewhat similar open-by-filehandle APIs, like open by inode number as used by Lustre or the XFS *_by_handle APIs, are privileged operations because of exactly this problem.
What is, according to your mail, the most important bit in this proposal -- that you think the filehandles should be easily shared with other systems in a cluster -- is not mentioned in the actual proposal at all, and is in fact the hardest part, because of the inherent statefulness of the API.

> What's the etiquette on changing subject lines here? It might be useful
> to separate the openg() etc. discussion from the readdirplus() etc.
> discussion.

Changing subject lines is fine.
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
Hi,

This looks good to me, and I'm copying in Dave & Wendy who have both done previous work in this area for further comment. Provided we can get this tested, I'd be happy to accept the patch in its current form.

Steve.

On Wed, 2006-12-06 at 00:34 -0500, J. Bruce Fields wrote:
> From: J. Bruce Fields <[EMAIL PROTECTED]>
>
> From: Marc Eshel <[EMAIL PROTECTED]>
>
> Add NFS lock support to GFS2. (Untested.)
>
> Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>
> ---
>  fs/gfs2/lm.c                   |   10
>  fs/gfs2/lm.h                   |    2 +
>  fs/gfs2/locking/dlm/lock_dlm.h |    2 +
>  fs/gfs2/locking/dlm/mount.c    |    1 +
>  fs/gfs2/locking/dlm/plock.c    |   95 +++-
>  fs/gfs2/ops_export.c           |   52 ++
>  include/linux/lm_interface.h   |    3 +
>  include/linux/lock_dlm_plock.h |    3 +
>  8 files changed, 166 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/lm.c b/fs/gfs2/lm.c
> index effe4a3..cf7fd52 100644
> --- a/fs/gfs2/lm.c
> +++ b/fs/gfs2/lm.c
> @@ -197,6 +197,16 @@ int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  	return error;
>  }
>  
> +int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
> +			struct file *file, int cmd, struct file_lock *fl)
> +{
> +	int error = -EIO;
> +	if (likely(!test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
> +		error = sdp->sd_lockstruct.ls_ops->lm_plock_async(
> +			sdp->sd_lockstruct.ls_lockspace, name, file, cmd, fl);
> +	return error;
> +}
> +
>  int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		    struct file *file, struct file_lock *fl)
>  {
> diff --git a/fs/gfs2/lm.h b/fs/gfs2/lm.h
> index 21cdc30..1ddd1fd 100644
> --- a/fs/gfs2/lm.h
> +++ b/fs/gfs2/lm.h
> @@ -34,6 +34,8 @@ int gfs2_lm_plock_get(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		      struct file *file, struct file_lock *fl);
>  int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		  struct file *file, int cmd, struct file_lock *fl);
> +int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
> +			struct file *file, int cmd, struct file_lock *fl);
>  int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		    struct file *file, struct file_lock *fl);
>  void gfs2_lm_recovery_done(struct gfs2_sbd *sdp, unsigned int jid,
> diff --git a/fs/gfs2/locking/dlm/lock_dlm.h b/fs/gfs2/locking/dlm/lock_dlm.h
> index 33af707..82af860 100644
> --- a/fs/gfs2/locking/dlm/lock_dlm.h
> +++ b/fs/gfs2/locking/dlm/lock_dlm.h
> @@ -179,6 +179,8 @@ int gdlm_plock_init(void);
>  void gdlm_plock_exit(void);
>  int gdlm_plock(void *, struct lm_lockname *, struct file *, int,
>  	       struct file_lock *);
> +int gdlm_plock_async(void *, struct lm_lockname *, struct file *, int,
> +		     struct file_lock *);
>  int gdlm_plock_get(void *, struct lm_lockname *, struct file *,
>  		   struct file_lock *);
>  int gdlm_punlock(void *, struct lm_lockname *, struct file *,
> diff --git a/fs/gfs2/locking/dlm/mount.c b/fs/gfs2/locking/dlm/mount.c
> index cdd1694..4339e3f 100644
> --- a/fs/gfs2/locking/dlm/mount.c
> +++ b/fs/gfs2/locking/dlm/mount.c
> @@ -244,6 +244,7 @@ const struct lm_lockops gdlm_ops = {
>  	.lm_lock = gdlm_lock,
>  	.lm_unlock = gdlm_unlock,
>  	.lm_plock = gdlm_plock,
> +	.lm_plock_async = gdlm_plock_async,
>  	.lm_punlock = gdlm_punlock,
>  	.lm_plock_get = gdlm_plock_get,
>  	.lm_cancel = gdlm_cancel,
> diff --git a/fs/gfs2/locking/dlm/plock.c b/fs/gfs2/locking/dlm/plock.c
> index 7365aec..c21e667 100644
> --- a/fs/gfs2/locking/dlm/plock.c
> +++ b/fs/gfs2/locking/dlm/plock.c
> @@ -102,6 +102,93 @@ int gdlm_plock(void *lockspace, struct lm_lockname *name,
>  	return rv;
>  }
>  
> +int gdlm_plock_async(void *lockspace, struct lm_lockname *name,
> +		     struct file *file, int cmd, struct file_lock *fl)
> +{
> +	struct gdlm_ls *ls = lockspace;
> +	struct plock_op *op;
> +	int rv;
> +
> +	op = kzalloc(sizeof(*op), GFP_KERNEL);
> +	if (!op)
> +		return -ENOMEM;
> +
> +	op->info.optype = GDLM_PLOCK_OP_LOCK;
> +	op->info.pid = fl->fl_pid;
> +	op->info.ex = (fl->fl_type == F_WRLCK);
> +	op->info.wait = IS_SETLKW(cmd);
> +	op->info.fsid = ls->id;
> +	op->info.number = name->ln_number;
> +	op->info.start = fl->fl_start;
> +	op->info.end = fl->fl_end;
> +	op->info.owner = (__u64)(long) fl->fl_owner;
> +	if (fl->fl_lmops) {
> +		op->info.callback = fl->fl_lmops->fl_notify;
> +		/* might need to make a copy */
> +		op->info.fl = fl;
> +		op->info.file = file;
> +	} else
> +		op->info.callback
Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 09:55:16AM -0500, Trond Myklebust wrote:
> > Then again, statlite and
> > readdirplus really are the most sane bits of these proposals as they
> > fit nicely into the existing set of APIs. The filehandle idiocy on
> > the other hand is way off into crackpipe land.
>
> ...
>
> a) networked filesystem specific. The mask stuff etc adds no
> value whatsoever to actual "posix" filesystems. In fact it is
> telling the kernel that it can violate posix semantics.

I don't see what's network-filesystem specific about it. Correct me if I'm wrong, but today ls -l on a local filesystem will first do readdir and then n stat calls. In the worst-case scenario this will generate n+1 disk seeks.

Local filesystems go through a lot of trouble to try to make the disk layout of the directory entries and the inodes optimal, so that readahead and caching reduce the number of seeks. With readdirplus, on the other hand, the filesystem would be able to send all the requests to the block layer, and it would be free to optimize through disk elevators and whatnot.

And this is not simply an "ls -l" optimization. Although I can no longer remember why, I think this is exactly what imap servers are doing when opening up big imap folders stored in maildir.

--
Ragnar Kjørstad
Software Engineer
Scali - http://www.scali.com
Scaling the Linux Datacenter
Re: [NFS] [PATCH 10/10] gfs2: nfs lock support for gfs2
Wendy Cheng wrote:

J. Bruce Fields wrote:
> From: J. Bruce Fields <[EMAIL PROTECTED]>
>
> From: Marc Eshel <[EMAIL PROTECTED]>
>
> Add NFS lock support to GFS2. (Untested.)

Untested? Trying to keep us busy? Sorry, forgot this is on external mailing lists. Nice piece of work! This solves a long-standing issue for our NFS servers. Thank you for doing it - we will test them out from our end too.

--
Wendy
Re: NFSv4/pNFS possible POSIX I/O API standards
Matthew Wilcox wrote:
> On Tue, Dec 05, 2006 at 10:07:48AM +0000, Christoph Hellwig wrote:
> > The filehandle idiocy on the other hand is way off into crackpipe land.
>
> Right, and it needs to be discarded. Of course, there was a real problem
> that it addressed, so we need to come up with an acceptable alternative.
> The scenario is a cluster-wide application doing simultaneous opens of
> the same file. So thousands of nodes all hitting the same DLM locks (for
> read) all at once.
>
> The openg() non-solution implies that all nodes in the cluster share the
> same filehandle space, so I think a reasonable solution can be implemented
> entirely within the clusterfs, with an extra flag to open(), say
> O_CLUSTER_WIDE. When the clusterfs sees this flag set (in ->lookup), it
> can treat it as a hint that this pathname component is likely to be
> opened again on other nodes and broadcast that fact to the other nodes
> within the cluster.
>
> Other nodes on seeing that hint (which could be structured as "The child
> "bin" of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has
> filehandle ff51a98799931256b555446b2f5675db08de6229") can keep a record
> of that fact. When they see their own open, they can populate the path
> to that file without asking the server for extra metadata. There's
> obviously security issues there (why I say 'hint' rather than 'command'),
> but there's also security problems with open-by-filehandle.
>
> Note that this solution requires no syscall changes, no application
> changes, and also helps a scenario where each node opens a different
> file in the same directory. I've never worked on a clusterfs, so there
> may be some gotchas (eg, how do you invalidate the caches of nodes when
> you do a rename). But this has to be preferable to open-by-fh.

The openg() solution has the following advantages over what you propose. First, it places the burden of the communication of the file handle on the application process, not the file system. That means less work for the file system. Second, it does not require that clients respond to unexpected network traffic. Third, the network traffic is deterministic -- one client interacts with the file system and then explicitly performs the broadcast. Fourth, it does not require that the file system store additional state on clients.

In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing the flag) would likely cause a storm of network traffic if clients were closely synchronized (which they are likely to be). We could work around this by having one process open early, then barrier, then have everyone else open, but then we might as well have just sent the handle as the barrier operation, and we've made the use of the O_CLUSTER_WIDE open() significantly more complicated for the application.

However, the application change issue is actually moot; we will make whatever changes inside our MPI-IO implementation, and many users will get the benefits for free.

The readdirplus(), readx()/writex(), and openg()/openfh() calls were all designed to allow our applications to explain exactly what they want and to allow for explicit communication. I understand that there is a tendency toward solutions where the FS guesses what the app is going to do or is passed a hint (e.g. fadvise) about what is going to happen, because these things don't require interface changes. But these solutions just aren't as effective as actually spelling out what the application wants.

Regards, Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, 2006-12-06 at 13:22 +0100, Ragnar Kjørstad wrote:
> On Tue, Dec 05, 2006 at 09:55:16AM -0500, Trond Myklebust wrote:
> > > Then again, statlite and
> > > readdirplus really are the most sane bits of these proposals as they
> > > fit nicely into the existing set of APIs. The filehandle idiocy on
> > > the other hand is way off into crackpipe land.
> >
> > ...
> >
> > a) networked filesystem specific. The mask stuff etc adds no
> > value whatsoever to actual "posix" filesystems. In fact it is
> > telling the kernel that it can violate posix semantics.
>
> I don't see what's network filesystem specific about it. Correct me if
> I'm wrong, but today ls -l on a local filesystem will first do readdir
> and then n stat calls. In the worst case scenario this will generate n+1
> disk seeks.

I was referring to the caching mask.

> Local filesystems go through a lot of trouble to try to make the disk
> layout of the directory entries and the inodes optimal so that readahead
> and caching reduces the number of seeks.
>
> With readdirplus on the other hand, the filesystem would be able to send
> all the requests to the block layer and it would be free to optimize
> through disk elevators and what not.

As far as local filesystems are concerned, the procedure is still the same: read the directory contents, then do lookup() of the files returned to you by the directory contents, then do getattr(). There is no way to magically queue up the getattr() calls before you have done the lookups, nor is there a way to magically queue up the lookups before you have read the directory contents.

Trond
Re: readdirplus() as possible POSIX I/O API
On Wed, 2006-12-06 at 03:28 -0700, Andreas Dilger wrote:
> On Dec 05, 2006 10:23 -0500, Trond Myklebust wrote:
> > On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote:
> > > I think the "barrier semantics" are something that have just crept
> > > into this discussion and is confusing the issue.
> >
> > It is the _only_ concept that is of interest for something like NFS or
> > CIFS. We already have the ability to cache the information.
>
> Actually, wouldn't the ability for readdirplus() (with a valid flag) be
> useful for NFS if only to indicate that it does not need to flush the
> cache because the ctime/mtime isn't needed by the caller?

That is why statlite() might be useful. I'd prefer something more generic, though.

> > 'find' should be quite happy with the existing readdir(). It does not
> > need to use stat() or readdirplus() in order to recurse because
> > readdir() provides d_type.
>
> It does in any but the most simplistic invocations, like "find -mtime"
> or "find -mode" or "find -uid", etc.

The only 'win' a readdirplus may give you there, as far as NFS is concerned, is the sysenter overhead that you would have for calling stat() (or statlite()) many times.

> > If the application is able to select a statlite()-type of behaviour with
> > the fadvise() hints, your filesystem could be told to serve up cached
> > information instead of regrabbing locks. In fact that is a much more
> > flexible scheme, since it also allows the filesystem to background the
> > actual inode lookups, or to defer them altogether if that is more
> > efficient.
>
> I guess I just don't understand how fadvise() on a directory file handle
> (used for readdir()) can be used to affect later stat operations (which
> definitely will NOT be using that file handle)? If you mean that the
> application should actually open() each file, fadvise(), fstat(), close(),
> instead of just a stat() call, then we are WAY into negative improvements
> here due to the overhead of doing open+close.

On the contrary, the readdir descriptor is used in all those funky new statat() calls. Ditto for readlinkat() and faccessat(). You could even have openat() turn off the close-to-open GETATTR if the readdir descriptor contained a hint that told it that was unnecessary. Furthermore, since the fadvise-like caching operation works on filehandles, you could have it work both on readdir() for the benefit of the above *at() calls, and also on the regular file descriptor for the benefit of fstat() and fgetxattr().

> > > The filesystem can't always do "stat-ahead" on the files because that
> > > requires instantiating an inode on the client which may be stale (lock
> > > revoked) by the time the app gets to it, and the app (and the VFS) have
> > > no idea just how stale it is, and whether the stat is a "real" stat or
> > > "only" the readdir stat (because the fadvise would only be useful on
> > > the directory, and not all of the child entries), so it would need to
> > > re-stat the file.
> >
> > Then provide hints that allow the app to select which behaviour it
> > prefers. Most (all?) apps don't _care_, and so would be quite happy with
> > cached information. That is why the current NFS caching model exists in
> > the first place.
>
> Most clustered filesystems have strong cache semantics, so that isn't
> a problem. IMHO, the mechanism to pass the hint to the filesystem IS
> the readdirplus_lite() that tells the filesystem exactly which data is
> needed on each directory entry.
>
> > > Also, this would potentially blow the client's real
> > > working set of inodes out of cache.
> >
> > Why?
>
> Because in many cases it is desirable to limit the number of DLM locks
> on a given client (e.g. the GFS2 thread with AKPM about clients with
> millions of DLM locks due to lack of memory pressure on large-memory
> systems). That means a finite-size lock LRU on the client that risks
> being wiped out by a few thousand files in a directory doing
> "readdir() + 5000*stat()".
>
> Consider a system like BlueGene/L with 128k compute cores. Jobs that
> run on that system will periodically (e.g. every hour) create up to 128K
> checkpoint+restart files to avoid losing a lot of computation if a node
> crashes. Even if each one of the checkpoints is in a separate directory
> (I wish all users were so nice :-) it means 128K inodes+DLM locks for
> doing an "ls" in the directory.

That is precisely the sort of situation where knowing when you can cache and when you cannot would be a plus. An ls call may not need 128k DLM locks, because it only cares about the state of the inodes as they were at the start of the opendir() call.

> > > Doing things en masse with readdirplus() also allows the filesystem to
> > > do the stat() operations in parallel internally (which is a net win if
> > > there are many servers involved) instead of serially as the application
> > > would do.
> >
> > If your application really cared, it could add threading to 'ls' to
> > achieve the same result
Re: openg
On Wed, 2006-12-06 at 11:01 +0000, Christoph Hellwig wrote:
> Say you want to look up a path /foo/bar/baz; then the access permission
> is based on the following things:
>
> - the credentials of the user. Let's only take traditional uid/gid
>   for this example, although credentials are much more complex these days
> - the kind of operation you want to perform
> - the access permission of the actual object the path points to (inode)
> - the lookup permission (x bit) for every object on the way to your object

- your private namespace particularities (submounts etc)

Trond
Re: openg
Christoph Hellwig wrote: On Tue, Dec 05, 2006 at 03:44:31PM -0600, Rob Ross wrote: The openg() really just does the lookup and permission checking. The openfh() creates the file descriptor and starts that context if the particular FS tracks that sort of thing. ... Well you've caught me. I don't want to cache the values, because I fundamentally believe that sharing state between clients and servers is braindead (to use Christoph's phrase) in systems of this scale (thousands to tens of thousands of clients). So I don't want locks, so I can't keep the cache consistent, ... So someone else will have to run the tests you propose :)... Besides the whole ugliness, you miss a few points about the fundamental architecture of the unix filesystem permission model, unfortunately. Say you want to lookup a path /foo/bar/baz, then the access permission is based on the following things: - the credentials of the user. let's only take traditional uid/gid for this example although credentials are much more complex these days - the kind of operation you want to perform - the access permission of the actual object the path points to (inode) - the lookup permission (x bit) for every object on the way to your object In your proposal sutoc is a simple conversion operation, which means openg needs to perform all these access checks and encode them in the fh_t. This is exactly right and is the intention of the call. That means an fh_t must fundamentally be an object that is kept in the kernel, aka a capability as defined by Henry Levy. This does imply you _do_ need to keep state. The fh_t is indeed a type of capability. fh_t, properly protected, could be passed into user space and validated by the file system when presented back to the file system. There is state here, clearly. I feel ok about that because we allow servers to forget that they handed out these fh_ts if they feel like it; there is no guaranteed lifetime in the current proposal.
This allows servers to come and go without needing to persistently store these. Likewise, clients can forget them with no real penalty. This approach is ok because of the use case. Because we expect the fh_t to be used relatively soon after its creation, servers will not need to hold onto these long before the openfh() is performed and we're back into a normal "everyone has a valid fd" use case. > And because it needs kernel support your fh_t is more or less equivalent to a file descriptor with sutoc equivalent to a dup variant that really duplicates the backing object instead of just the userspace index into it. Well, a FD has some additional state associated with it (position, etc.), but yes there are definitely similarities to dup(). Note somewhat similar open-by-filehandle APIs like open by inode number as used by lustre or the XFS *_by_handle APIs are privileged operations because of exactly this problem. I'm not sure why a properly protected fh_t couldn't be passed back into user space and handed around, but I'm not a security expert. What am I missing? What according to your mail is the most important bit in this proposal is that you think the filehandles should be easily shared with other systems in a cluster. That fact is not mentioned in the actual proposal at all, and is in fact the hardest part because of the inherent statefulness of the API. The documentation of the calls is complicated by the way POSIX calls are described. We need to have a second document describing use cases also available, so that we can avoid misunderstandings as best we can and get straight to the real issues. Sorry that document wasn't available. I think I've addressed the statefulness of the API above? What's the etiquette on changing subject lines here? It might be useful to separate the openg() etc. discussion from the readdirplus() etc. discussion. Changing subject lines is fine. Thanks.
Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote: > The openg() solution has the following advantages to what you propose. > First, it places the burden of the communication of the file handle on > the application process, not the file system. That means less work for > the file system. Second, it does not require that clients respond to > unexpected network traffic. Third, the network traffic is deterministic > -- one client interacts with the file system and then explicitly > performs the broadcast. Fourth, it does not require that the file system > store additional state on clients. You didn't address the disadvantages I pointed out on December 1st in a mail to the posix mailing list: : I now understand this not so much as a replacement for dup() but in : terms of being able to open by NFS filehandle, or inode number. The : fh_t is presumably generated by the underlying cluster filesystem, and : is a handle that has meaning on all nodes that are members of the : cluster. : : I think we need to consider security issues (that have also come up : when open-by-inode-number was proposed). For example, how long is the : fh_t intended to be valid for? Forever? Until the cluster is rebooted? : Could the fh_t be used by any user, or only those with credentials to : access the file? What happens if we revoke() the original fd? : : I'm a little concerned about the generation of a suitable fh_t. : In the implementation of sutoc(), how does the kernel know which : filesystem to ask to translate it? It's not impossible (though it is : implausible) that an fh_t could be meaningful to more than one : filesystem. 
: : One possibility of fixing this could be to use a magic number at the : beginning of the fh_t to distinguish which filesystem this belongs : to (a list of currently-used magic numbers in Linux can be found at : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h) Christoph has also touched on some of these points, and added some I missed. > In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing > the flag) would likely cause a storm of network traffic if clients were > closely synchronized (which they are likely to be). I think you're referring to a naive application, rather than a naive cluster filesystem, right? There's several ways to fix that problem, including throttling broadcasts of information, having nodes ask their immediate neighbours if they have a cache of the information, and having the server not respond (wait for a retransmit) if it's recently sent out a broadcast. > However, the application change issue is actually moot; we will make > whatever changes inside our MPI-IO implementation, and many users will > get the benefits for free. That's good. > The readdirplus(), readx()/writex(), and openg()/openfh() were all > designed to allow our applications to explain exactly what they wanted > and to allow for explicit communication. I understand that there is a > tendency toward solutions where the FS guesses what the app is going to > do or is passed a hint (e.g. fadvise) about what is going to happen, > because these things don't require interface changes. But these > solutions just aren't as effective as actually spelling out what the > application wants. Sure, but I think you're emphasising "these interfaces let us get our job done" over the legitimate concerns that we have. I haven't really looked at the readdirplus() or readx()/writex() interfaces, but the security problems with openg() makes me think you haven't really looked at it from the "what could go wrong" perspective. 
I'd be interested in reviewing the readx()/writex() interfaces, but still don't see a document for them anywhere.
Re: readdirplus() as possible POSIX I/O API
Sage Weil wrote: On Mon, 4 Dec 2006, Peter Staubach wrote: I think that there are several points which are missing here. First, readdirplus(), without any sort of caching, is going to be _very_ expensive, performance-wise, for _any_ size directory. You can see this by instrumenting any NFS server which already supports the NFSv3 READDIRPLUS semantics. Are you referring to the work the server must do to gather stat information for each inode? Yes, and the fact that the client will be forced to go over the wire for each readdirplus() call, whereas it can use cached information today. An application actually waiting on the response to a READDIRPLUS will not be pleased at the resulting performance. Second, the NFS client side readdirplus() implementation is going to be _very_ expensive as well. The NFS client does write-behind and all this data _must_ be flushed to the server _before_ the over the wire READDIRPLUS can be issued. This means that the client will have to step through every inode which is associated with the directory inode being readdirplus()'d and ensure that all modified data has been successfully written out. This part of the operation, for a sufficiently large directory and a sufficiently large page cache, could take significant time in itself. Why can't the client send the over the wire READDIRPLUS without flushing inode data, and then simply ignore the stat portion of the server's response in instances where its locally cached (and dirty) inode data is newer than the server's? This would seem to minimize the value as far as I understand the requirements here. These overheads may make this new operation expensive enough that no applications will end up using it. If the application calls readdirplus() only when it would otherwise do readdir()+stat(), the flushing you mention would happen anyway (from the stat()). Wouldn't this at least allow that to happen in parallel for the whole directory? I don't see where the parallelism comes from.
Before issuing the READDIRPLUS over the wire, the client would have to ensure that each and every one of those flushes was completed. I suppose that a sufficiently clever and complex implementation could figure out how to schedule all those flushes asynchronously and then wait for all of them to complete, but there will be a performance cost. Walking the caches for all of those inodes, perhaps using several or all of the cpus in the system, smacking the server with all of those WRITE operations simultaneously with all of the associated network bandwidth usage, all adds up to other applications on the client and potentially the network not doing much at the same time. All of this cost to the system and to the network for the benefit of a single application? That seems like a tough sell to me. This is an easy problem to look at from the application viewpoint. The solution seems obvious. Give it the fastest possible way to read the directory and retrieve stat information about every entry in the directory. However, when viewed from a systemic level, this becomes a very different problem with many more aspects. Perhaps flow controlling this one application in favor of many other applications, running network wide, may be the better thing to continue to do. I dunno. ps
openg and path_to_handle
David Chinner wrote: On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: Hi, I agree that it is not feasible to add new system calls every time somebody has a problem, and we don't take adding system calls lightly. However, in this case we're talking about an entire *community* of people (high-end computing), not just one or two people. Of course it may still be the case that that community is not important enough to justify the addition of system calls; that's obviously not my call to make! I have the feeling that the openg stuff is rushed without looking into all the solutions that don't require changes to the current interface. I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle for the libhandle man page. Basically: openg == path_to_handle sutoc == open_by_handle And here for the userspace code: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ Cheers, Dave. Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. Regards, Rob
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 12:34:20AM -0500, J. Bruce Fields wrote:

> +int gdlm_plock_callback(struct plock_op *op)
> +{
> +	struct file *file;
> +	struct file_lock *fl;
> +	int rv;
> +
> +	spin_lock(&ops_lock);
> +	if (!list_empty(&op->list)) {
> +		printk(KERN_INFO "plock op on list\n");
> +		list_del(&op->list);
> +	}
> +	spin_unlock(&ops_lock);
> +
> +	rv = op->info.rv;
> +
> +	if (!rv) {
> +		/* check if the following are still valid or make a copy */
> +		file = op->info.file;
> +		fl = op->info.fl;
> +
> +		if (posix_lock_file_wait(file, fl) < 0)
> +			log_error("gdlm_plock: vfs lock error file %p fl %p",
> +				  file, fl);
> +	}
> +
> +	kfree(op);
> +	return rv;
> +}

..

> +	if (found) {
> +		if (op->info.callback)
> +			gdlm_plock_callback(op);
> +		else
> +			wake_up(&recv_wq);
> +	}

The gfs side looks fine to me. Did you forget to call fl_notify from gdlm_plock_callback() or am I missing something?

Dave
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > David Chinner wrote: > >Does anyone here know about the XFS libhandle API? This has been > >around for years and it does _exactly_ what these proposed syscalls > >are supposed to do (and more). > > Thanks for pointing these out Dave. These are indeed along the same > lines as the openg()/openfh() approach. > > One difference is that they appear to perform permission checking on the > open_by_handle(), which means that the entire path needs to be encoded > in the handle, and makes it difficult to eliminate the path traversal > overhead on N open_by_handle() operations. Another (and highly important) difference is that usage is restricted to root:

	xfs_open_by_handle(...)
	...
		if (!capable(CAP_SYS_ADMIN))
			return -XFS_ERROR(EPERM);
Re: NFSv4/pNFS possible POSIX I/O API standards
Matthew Wilcox wrote: On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote: The openg() solution has the following advantages to what you propose. First, it places the burden of the communication of the file handle on the application process, not the file system. That means less work for the file system. Second, it does not require that clients respond to unexpected network traffic. Third, the network traffic is deterministic -- one client interacts with the file system and then explicitly performs the broadcast. Fourth, it does not require that the file system store additional state on clients. You didn't address the disadvantages I pointed out on December 1st in a mail to the posix mailing list: I coincidentally just wrote about some of this in another email. Wasn't trying to avoid you... : I now understand this not so much as a replacement for dup() but in : terms of being able to open by NFS filehandle, or inode number. The : fh_t is presumably generated by the underlying cluster filesystem, and : is a handle that has meaning on all nodes that are members of the : cluster. Exactly. : I think we need to consider security issues (that have also come up : when open-by-inode-number was proposed). For example, how long is the : fh_t intended to be valid for? Forever? Until the cluster is rebooted? : Could the fh_t be used by any user, or only those with credentials to : access the file? What happens if we revoke() the original fd? The fh_t would be validated either (a) when the openfh() is called, or on accesses using the associated capability. As Christoph pointed out, this really is a capability and encapsulates everything necessary for a particular user to access a particular file. It can be handed to others, and in fact that is a critical feature for our use case. After the openfh(), the access model is identical to a previously open()ed file. So the question is what happens between the openg() and the openfh(). 
Our intention was to allow servers to "forget" these fh_ts at will. So a revoke between openg() and openfh() would kill the fh_t, and the subsequent openfh() would fail, or subsequent accesses would fail (depending on when the FS chose to validate). Does this help? : I'm a little concerned about the generation of a suitable fh_t. : In the implementation of sutoc(), how does the kernel know which : filesystem to ask to translate it? It's not impossible (though it is : implausible) that an fh_t could be meaningful to more than one : filesystem. > : : One possibility of fixing this could be to use a magic number at the : beginning of the fh_t to distinguish which filesystem this belongs : to (a list of currently-used magic numbers in Linux can be found at : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h) Christoph has also touched on some of these points, and added some I missed. We could use advice on this point. Certainly it's possible to encode information about the FS from which the fh_t originated, but we haven't tried to spell out exactly how that would happen. Your approach described here sounds good to me. In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing the flag) would likely cause a storm of network traffic if clients were closely synchronized (which they are likely to be). I think you're referring to a naive application, rather than a naive cluster filesystem, right? There's several ways to fix that problem, including throttling broadcasts of information, having nodes ask their immediate neighbours if they have a cache of the information, and having the server not respond (wait for a retransmit) if it's recently sent out a broadcast. Yes, naive application. You're right that the file system could adapt to this, but on the other hand if we were explicitly passing the fh_t in user space, we could just use MPI_Bcast and be done with it, with an algorithm that is well-matched to the system, etc. 
However, the application change issue is actually moot; we will make whatever changes inside our MPI-IO implementation, and many users will get the benefits for free. That's good. Absolutely. Same goes for readx()/writex() also, BTW, at least for MPI-IO users. We will build the input parameters inside MPI-IO using existing information from users, rather than applying data sieving or using multiple POSIX calls. The readdirplus(), readx()/writex(), and openg()/openfh() were all designed to allow our applications to explain exactly what they wanted and to allow for explicit communication. I understand that there is a tendency toward solutions where the FS guesses what the app is going to do or is passed a hint (e.g. fadvise) about what is going to happen, because these things don't require interface changes. But these solutions just aren't as effective as actually spelling out what the application wants. Sure, but I think you're emphasising "these interfaces let us get our job done" over the legitimate concerns that we have.
Re: openg and path_to_handle
Matthew Wilcox wrote: On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: David Chinner wrote: Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. Another (and highly important) difference is that usage is restricted to root:

	xfs_open_by_handle(...)
	...
		if (!capable(CAP_SYS_ADMIN))
			return -XFS_ERROR(EPERM);

I assume that this is because the implementation chose not to do the path encoding in the handle? Because if they did, they could do full path permission checking as part of the open_by_handle. Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
Trond Myklebust wrote: On Tue, 2006-12-05 at 16:11 -0600, Rob Ross wrote: Trond Myklebust wrote: b) quite unnatural to impose caching semantics on all the directory _entries_ using a syscall that refers to the directory itself (see the explanations by both myself and Peter Staubach of the synchronisation difficulties). Consider in particular that it is quite possible for directory contents to change in between readdirplus calls. I want to make sure that I understand this correctly. NFS semantics dictate that if someone stat()s a file, all changes from that client need to be propagated to the server? And this call complicates that semantic because now there's an operation on a different object (the directory) that would cause this flush on the files? The only way for an NFS client to obey the POSIX requirement that write() immediately updates the mtime/ctime is to flush out all cached writes whenever the user requests stat() information. Thanks for explaining this. I've never understood how it is decided where the line is drawn with respect to where NFS does obey POSIX semantics for a particular implementation. Writes on other nodes wouldn't necessarily have updated mtime/ctime, right? i.e. the "strict posix caching model" is pretty much impossible to implement on something like NFS or CIFS using these semantics. Why then even bother to have "masks" to tell you when it is OK to violate said strict model. We're trying to obtain improved performance for distributed file systems with stronger consistency guarantees than these two. So you're saying I should ignore this thread. Fine... No, I'm trying to explain when the calls might be useful. But if you are only interested in NFS and CIFS, then I guess the thread might not be very interesting. c) Says nothing about what should happen to non-stat() metadata such as ACL information and other extended attributes (for example future selinux context info). You would think that the 'ls -l' application would care about this.
Honestly, we hadn't thought about other non-stat() metadata because we didn't think it was part of the use case, and we were trying to stay close to the flavor of POSIX. If you have ideas here, we'd like to hear them. See my previous postings. I'll do that. Thanks. Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
Andreas Dilger wrote: Does this mean you are against the statlite() API entirely, or only against the document's use of the flag as a vague "accuracy" value instead of a hard "valid" value? I'm against fuzzy values. I've no problems with a bitmap specifying that certain members are not wanted or wanted (probably the latter, zero meaning the optional fields are not wanted). IMHO, if the application doesn't need a particular field (e.g. "ls -i" doesn't need size, "ls -s" doesn't need the inode number, etc) why should these be filled in if they are not easily accessible? As for what is easily accessible, that needs to be determined by the filesystem itself. Is the size not easily accessible? It would surprise me. If yes, then, by all means add it to the list. I'm not against extending the list of members which are optional if it makes sense. But certain information is certainly always easily accessible. That was previously suggested by me already. IMHO, there should ONLY be a statlite variant of readdirplus(), and I think most people agree with that part of it (though there is contention on whether readdirplus() is needed at all). Indeed. Given there is statlite and we have d_type information, in most situations we won't need more complete stat information. Outside of programs like ls that is. Part of why I wished the lab guys had submitted the draft to the OpenGroup first is that this way they would have to be more detailed on why each and every interface they propose for adding is really needed. Maybe they can do it now and here. What programs really require readdirplus? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: NFSv4/pNFS possible POSIX I/O API standards
Ulrich Drepper wrote: Andreas Dilger wrote: Does this mean you are against the statlite() API entirely, or only against the document's use of the flag as a vague "accuracy" value instead of a hard "valid" value? I'm against fuzzy values. I've no problems with a bitmap specifying that certain members are not wanted or wanted (probably the latter, zero meaning the optional fields are not wanted). Thanks for clarifying. IMHO, if the application doesn't need a particular field (e.g. "ls -i" doesn't need size, "ls -s" doesn't need the inode number, etc) why should these be filled in if they are not easily accessible? As for what is easily accessible, that needs to be determined by the filesystem itself. Is the size not easily accessible? It would surprise me. If yes, then, by all means add it to the list. I'm not against extending the list of members which are optional if it makes sense. But certain information is certainly always easily accessible. File size is definitely one of the more difficult of the parameters, either because (a) it isn't stored in one place but is instead derived, or (b) because a lock has to be obtained to guarantee consistency of the returned value. That was previously suggested by me already. IMHO, there should ONLY be a statlite variant of readdirplus(), and I think most people agree with that part of it (though there is contention on whether readdirplus() is needed at all). Indeed. Given there is statlite and we have d_type information, in most situations we won't need more complete stat information. Outside of programs like ls that is. > Part of why I wished the lab guys had submitted the draft to the OpenGroup first is that this way they would have to be more detailed on why each and every interface they propose for adding is really needed. Maybe they can do it now and here. What programs really require readdirplus? I can't speak for everyone, but "ls" is the #1 consumer as far as I am concerned.
Regards, Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
Rob Ross wrote: File size is definitely one of the more difficult of the parameters, either because (a) it isn't stored in one place but is instead derived, or (b) because a lock has to be obtained to guarantee consistency of the returned value. OK, and looking at the man page again, it is already on the list in the old proposal and hence optional. I've no problem with that. I can't speak for everyone, but "ls" is the #1 consumer as far as I am concerned. So a syscall for ls alone? I think this is more a user problem. For normal plain old 'ls' you get by with readdir. For 'ls -F' and 'ls --color' you mostly get by with readdir+d_type. If you cannot provide d_type info the readdirplus extension does you no good. For the cases when an additional stat is needed (for symlinks, for instance, to test whether they are dangling) readdirplus won't help. So, readdirplus is really only useful for 'ls -l'. But then you need st_size and st_?time. So what is gained with readdirplus? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Dec 06, 2006 at 09:42:55AM -0800, Ulrich Drepper wrote: > >I can't speak for everyone, but "ls" is the #1 consumer as far as I am > >concerned. > > So a syscall for ls alone? I guess the code needs to be checked, but I would think that: * ls * find * rm -r * chown -R * chmod -R * rsync * various backup software * imap servers are all likely users of readdirplus. Of course the ones that spend the majority of the time doing stat are the ones that would benefit more. -- Ragnar Kjørstad Software Engineer Scali - http://www.scali.com Scaling the Linux Datacenter
Re: NFSv4/pNFS possible POSIX I/O API standards
Ragnar Kjørstad wrote: I guess the code needs to be checked, but I would think that: * ls * find * rm -r * chown -R * chmod -R * rsync * various backup software * imap servers Then somebody do the analysis. And please provide an analysis which takes into account that some programs might need to be adapted to take advantage of d_type or non-optional data from the proposed statlite. Plus, how often are these commands really used on such filesystems? I'd hope that chown -R or so is a once in a lifetime thing on such filesystems and not worth optimizing for. I'd suggest until such data is provided the readdirplus plans are put on hold. statlite I have no problems with if the semantics are changed as I explained. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote: > The gfs side looks fine to me. Did you forget to call fl_notify from > gdlm_plock_callback() or am I missing something? Yes, looks like we missed something, thanks. This code's an rfc (not even tested), so don't apply it yet! What we should have there is something like:

	rv = op->info.rv;

	if (fl_notify(fl, NULL, rv)) {
		/* XXX: We need to cancel the lock here: */
		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
	}

	if (!rv) {
		/* check if the following are still valid or make a copy */
		file = op->info.file;
		fl = op->info.fl;

		if (posix_lock_file_wait(file, fl) < 0)
			log_error("gdlm_plock: vfs lock error file %p fl %p",
				  file, fl);
	}

Note there's a race condition--that calls fl_notify before actually getting the lock locally. I don't *think* that's a problem, as long as it's the filesystem and not the local lock list that's authoritative when it comes to who gets a posix lock. The more annoying problem is the need to cancel the GFS lock when fl_notify fails; is that something that it's possible for GFS to do? It can fail because lockd has a timeout--it waits a few seconds for the callback, then gives up and returns a failure to the user. If that happens after your userspace posix lock manager acquires the lock (but before fl_notify is called) then you've got to cancel it somehow. --b.
Re: [NFS] [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote:
> > The gfs side looks fine to me. Did you forget to call fl_notify from
> > gdlm_plock_callback() or am I missing something?
>
> Yes, looks like we missed something, thanks. This code's an rfc (not
> even tested), so don't apply it yet! What we should have there is
> something like:
>
> 	rv = op->info.rv;
>
> 	if (fl_notify(fl, NULL, rv)) {
> 		/* XXX: We need to cancel the lock here: */
> 		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
> 	}

(And note in the patch I sent out I stuck this in the wrong place--in the synchronous instead of the asynchronous code. So the patch should have been something closer to the following.)

--b.

diff --git a/fs/gfs2/lm.c b/fs/gfs2/lm.c
index effe4a3..cf7fd52 100644
--- a/fs/gfs2/lm.c
+++ b/fs/gfs2/lm.c
@@ -197,6 +197,16 @@ int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 	return error;
 }
 
+int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
+			struct file *file, int cmd, struct file_lock *fl)
+{
+	int error = -EIO;
+	if (likely(!test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
+		error = sdp->sd_lockstruct.ls_ops->lm_plock_async(
+			sdp->sd_lockstruct.ls_lockspace, name, file, cmd, fl);
+	return error;
+}
+
 int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		    struct file *file, struct file_lock *fl)
 {
diff --git a/fs/gfs2/lm.h b/fs/gfs2/lm.h
index 21cdc30..1ddd1fd 100644
--- a/fs/gfs2/lm.h
+++ b/fs/gfs2/lm.h
@@ -34,6 +34,8 @@ int gfs2_lm_plock_get(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		      struct file *file, struct file_lock *fl);
 int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		  struct file *file, int cmd, struct file_lock *fl);
+int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
+			struct file *file, int cmd, struct file_lock *fl);
 int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		    struct file *file, struct file_lock *fl);
 void gfs2_lm_recovery_done(struct gfs2_sbd *sdp, unsigned int jid,
diff --git a/fs/gfs2/locking/dlm/lock_dlm.h b/fs/gfs2/locking/dlm/lock_dlm.h
index 33af707..82af860 100644
--- a/fs/gfs2/locking/dlm/lock_dlm.h
+++ b/fs/gfs2/locking/dlm/lock_dlm.h
@@ -179,6 +179,8 @@ int gdlm_plock_init(void);
 void gdlm_plock_exit(void);
 int gdlm_plock(void *, struct lm_lockname *, struct file *, int,
 	       struct file_lock *);
+int gdlm_plock_async(void *, struct lm_lockname *, struct file *, int,
+		     struct file_lock *);
 int gdlm_plock_get(void *, struct lm_lockname *, struct file *,
 		   struct file_lock *);
 int gdlm_punlock(void *, struct lm_lockname *, struct file *,
diff --git a/fs/gfs2/locking/dlm/mount.c b/fs/gfs2/locking/dlm/mount.c
index cdd1694..4339e3f 100644
--- a/fs/gfs2/locking/dlm/mount.c
+++ b/fs/gfs2/locking/dlm/mount.c
@@ -244,6 +244,7 @@ const struct lm_lockops gdlm_ops = {
 	.lm_lock = gdlm_lock,
 	.lm_unlock = gdlm_unlock,
 	.lm_plock = gdlm_plock,
+	.lm_plock_async = gdlm_plock_async,
 	.lm_punlock = gdlm_punlock,
 	.lm_plock_get = gdlm_plock_get,
 	.lm_cancel = gdlm_cancel,
diff --git a/fs/gfs2/locking/dlm/plock.c b/fs/gfs2/locking/dlm/plock.c
index 7365aec..f91a18a 100644
--- a/fs/gfs2/locking/dlm/plock.c
+++ b/fs/gfs2/locking/dlm/plock.c
@@ -102,6 +102,94 @@ int gdlm_plock(void *lockspace, struct lm_lockname *name,
 	return rv;
 }
 
+int gdlm_plock_async(void *lockspace, struct lm_lockname *name,
+		     struct file *file, int cmd, struct file_lock *fl)
+{
+	struct gdlm_ls *ls = lockspace;
+	struct plock_op *op;
+	int rv;
+
+	op = kzalloc(sizeof(*op), GFP_KERNEL);
+	if (!op)
+		return -ENOMEM;
+
+	op->info.optype = GDLM_PLOCK_OP_LOCK;
+	op->info.pid = fl->fl_pid;
+	op->info.ex = (fl->fl_type == F_WRLCK);
+	op->info.wait = IS_SETLKW(cmd);
+	op->info.fsid = ls->id;
+	op->info.number = name->ln_number;
+	op->info.start = fl->fl_start;
+	op->info.end = fl->fl_end;
+	op->info.owner = (__u64)(long) fl->fl_owner;
+	if (fl->fl_lmops) {
+		op->info.callback = fl->fl_lmops->fl_notify;
+		/* might need to make a copy */
+		op->info.fl = fl;
+		op->info.file = file;
+	} else
+		op->info.callback = NULL;
+
+	send_op(op);
+
+	if (op->info.callback == NULL)
+		wait_event(recv_wq, (op->done != 0));
+	else
+		return -EINPROGRESS;
+
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > David Chinner wrote: > >On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: > >>On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: > >>>Hi, > >>> > >>>I agree that it is not feasible to add new system calls every time > >>>somebody has a problem, and we don't take adding system calls lightly. > >>>However, in this case we're talking about an entire *community* of people > >>>(high-end computing), not just one or two people. Of course it may still > >>>be the case that that community is not important enough to justify the > >>>addition of system calls; that's obviously not my call to make! > >>I have the feeling that openg stuff is rushed without looking into all > >>solutions, that don't require changes to the current interface. > > > >I also get the feeling that interfaces that already do this open-by-handle > >stuff haven't been explored either. > > > >Does anyone here know about the XFS libhandle API? This has been around for > >years and it does _exactly_ what these proposed syscalls are supposed to do > >(and more). > > > >See: > > > >http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle > > > >For the libhandle man page. Basically: > > > >openg == path_to_handle sutoc == open_by_handle > > > >And here for the userspace code: > > > >http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ > > > >Cheers, > > > >Dave. > > Thanks for pointing these out Dave. These are indeed along the same lines as > the openg()/openfh() approach. > > One difference is that they appear to perform permission checking on the > open_by_handle(), which means that the entire path needs to be encoded in > the handle, and makes it difficult to eliminate the path traversal overhead > on N open_by_handle() operations. 
open_by_handle() is checking the inode flags for things like immutability and whether the inode is writable to determine if the open mode is valid given these flags. It's not actually checking permissions. IOWs, open_by_handle() has the same overhead as NFS filehandle to inode translation; i.e. no path traversal on open.

Permission checks are done on the path_to_handle(), so in reality only root or CAP_SYS_ADMIN users can currently use the open_by_handle interface because of this lack of checking. Given that our current users of this interface need root permissions to do other things (data migration), this has never been an issue.

This is an implementation detail - it is possible that the file handle, being opaque, could encode a UID/GID of the user that constructed the handle and then allow any process with the same UID/GID to use open_by_handle() on that handle. (I think hch has already pointed this out.)

Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: openg and path_to_handle
On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > Permission checks are done on the path_to_handle(), so in reality > only root or CAP_SYS_ADMIN users can currently use the > open_by_handle interface because of this lack of checking. Given > that our current users of this interface need root permissions to do > other things (data migration), this has never been an issue. > > This is an implementation detail - it is possible that file handle, > being opaque, could encode a UID/GID of the user that constructed > the handle and then allow any process with the same UID/GID to use > open_by_handle() on that handle. (I think hch has already pointed > this out.) While it could do that, I'd be interested to see how you'd construct the handle such that it's immune to a malicious user tampering with it, or saving it across a reboot, or constructing one from scratch. I suspect any real answer to this would have to involve cryptographical techniques (say, creating a secure hash of the information plus a boot-time generated nonce). Now you're starting to use a lot of bits, and compute time, and you'll need to be sure to keep the nonce secret.
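Matthew's tamper-resistance point can be made concrete with a small user-space sketch: seal the handle payload with a keyed digest over the payload plus a boot-time nonce, and reject any handle whose digest no longer matches. This is illustrative only -- the "MAC" below is a keyed FNV-1a, not a cryptographic primitive (a real implementation would use HMAC with a proper hash), and all names (struct handle, handle_seal, handle_verify) are hypothetical.

```c
/*
 * Illustrative sketch only: a tamper-evident handle of the form
 * payload || digest(payload, nonce).  The keyed FNV-1a used here is
 * NOT cryptographically secure; it just shows the structure.
 */
#include <assert.h>
#include <stdint.h>

struct handle {
	uint64_t ino;	/* what the handle refers to */
	uint32_t uid;	/* who constructed it */
	uint64_t mac;	/* keyed digest over the fields above */
};

/* toy keyed digest: FNV-1a over the payload bytes, then the nonce bytes */
static uint64_t handle_digest(uint64_t ino, uint32_t uid, uint64_t nonce)
{
	uint64_t h = 1469598103934665603ULL;
	int i;

	for (i = 0; i < 8; i++) { h ^= (ino >> (8 * i)) & 0xff; h *= 1099511628211ULL; }
	for (i = 0; i < 4; i++) { h ^= (uid >> (8 * i)) & 0xff; h *= 1099511628211ULL; }
	for (i = 0; i < 8; i++) { h ^= (nonce >> (8 * i)) & 0xff; h *= 1099511628211ULL; }
	return h;
}

static void handle_seal(struct handle *h, uint64_t nonce)
{
	h->mac = handle_digest(h->ino, h->uid, nonce);
}

/* only handles sealed with the same (secret) boot-time nonce verify */
static int handle_verify(const struct handle *h, uint64_t nonce)
{
	return h->mac == handle_digest(h->ino, h->uid, nonce);
}
```

Note how the nonce in the digest makes both attacks Matthew lists visible: tampering with the payload changes the expected digest, and a reboot (new nonce) invalidates every previously issued handle.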
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 10:20:23AM -0600, Rob Ross wrote: > Matthew Wilcox wrote: > >On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > >>David Chinner wrote: > >>>Does anyone here know about the XFS libhandle API? This has been > >>>around for years and it does _exactly_ what these proposed syscalls > >>>are supposed to do (and more). > >>Thanks for pointing these out Dave. These are indeed along the same > >>lines as the openg()/openfh() approach. > >> > >>One difference is that they appear to perform permission checking on the > >>open_by_handle(), which means that the entire path needs to be encoded > >>in the handle, and makes it difficult to eliminate the path traversal > >>overhead on N open_by_handle() operations. > > > >Another (and highly important) difference is that usage is restricted to > >root: > > > >xfs_open_by_handle(...) > >... > >if (!capable(CAP_SYS_ADMIN)) > > return -XFS_ERROR(EPERM); > > I assume that this is because the implementation chose not to do the > path encoding in the handle? Because if they did, they could do full > path permission checking as part of the open_by_handle. The original use of this interface (if I understand the Irix history correctly - this is way before my time at SGI) was a userspace NFS server and so permission checks were done after the filehandle was opened and a stat could be done on the fd and mode/uid/gid could be compared to what was in the NFS request. Paths were never needed for this because everything needed could be obtained directly from the inode. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
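The post-open check Dave describes for the old Irix userspace NFS server -- open by handle first, then fstat() the fd and compare ownership and mode bits against the credentials carried in the request -- might look roughly like this user-space sketch (creds_allowed is a hypothetical helper, not code from that server):

```c
/*
 * Sketch (hypothetical) of a post-open permission check: no path is
 * needed -- everything comes from the already-open fd via fstat(),
 * compared against the uid/gid carried in the NFS request.
 */
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int creds_allowed(int fd, uid_t req_uid, gid_t req_gid, int want_write)
{
	struct stat st;
	unsigned cls;

	if (fstat(fd, &st) < 0)
		return 0;
	/* pick the owner, group, or other permission bits, like the VFS does */
	if (st.st_uid == req_uid)
		cls = st.st_mode >> 6;
	else if (st.st_gid == req_gid)
		cls = st.st_mode >> 3;
	else
		cls = st.st_mode;
	return want_write ? (cls & 02) != 0 : (cls & 04) != 0;
}
```

This is why "paths were never needed": the inode behind the fd already carries everything the check requires.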
Re: openg and path_to_handle
David Chinner wrote: On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: David Chinner wrote: On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: Hi, I agree that it is not feasible to add new system calls every time somebody has a problem, and we don't take adding system calls lightly. However, in this case we're talking about an entire *community* of people (high-end computing), not just one or two people. Of course it may still be the case that that community is not important enough to justify the addition of system calls; that's obviously not my call to make! I have the feeling that openg stuff is rushed without looking into all solutions, that don't require changes to the current interface. I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle For the libhandle man page. Basically: openg == path_to_handle sutoc == open_by_handle And here for the userspace code: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ Cheers, Dave. Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. open_by_handle() is checking the inode flags for things like immutibility and whether the inode is writable to determine if the open mode is valid given these flags. It's not actually checking permissions. 
IOWs, open_by_handle() has the same overhead as NFS filehandle to inode translation; i.e. no path traversal on open. Permission checks are done on the path_to_handle(), so in reality only root or CAP_SYS_ADMIN users can currently use the open_by_handle interface because of this lack of checking. Given that our current users of this interface need root permissions to do other things (data migration), this has never been an issue. This is an implementation detail - it is possible that file handle, being opaque, could encode a UID/GID of the user that constructed the handle and then allow any process with the same UID/GID to use open_by_handle() on that handle. (I think hch has already pointed this out.) Cheers, Dave. Thanks for the clarification Dave. So I take it that you would be interested in this type of functionality then? Regards, Rob
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote:
> > The gfs side looks fine to me. Did you forget to call fl_notify from
> > gdlm_plock_callback() or am I missing something?
>
> Yes, looks like we missed something, thanks. This code's an rfc (not
> even tested), so don't apply it yet! What we should have there is
> something like:
>
> 	rv = op->info.rv;
>
> 	if (fl_notify(fl, NULL, rv)) {
> 		/* XXX: We need to cancel the lock here: */
> 		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
> 	}
>
> 	if (!rv) {
> 		/* check if the following are still valid or make a copy */
> 		file = op->info.file;
> 		fl = op->info.fl;
>
> 		if (posix_lock_file_wait(file, fl) < 0)
> 			log_error("gdlm_plock: vfs lock error file %p fl %p",
> 				  file, fl);
> 	}
>
> Note there's a race condition--that calls fl_notify before actually
> getting the lock locally. I don't *think* that's a problem, as long as
> it's the filesystem and not the local lock list that's authoritative
> when it comes to who gets a posix lock.

agree

> The more annoying problem is the need to cancel the GFS lock when
> fl_notify fails; is that something that it's possible for GFS to do?
>
> It can fail because lockd has a timeout--it waits a few seconds for the
> callback, then gives up and returns a failure to the user. If that
> happens after your userspace posix lock manager acquires the lock (but
> before fl_notify is called) then you've got to cancel it somehow.

I'd think we could just send an unlock for it at that point. Set up an op with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing and call send_op(). We probably need to flag this internal-unlock op so that when the result arrives, device_write() delists and frees it itself.

(I wouldn't call this "canceling", I think of cancel as trying to force a blocked request to return/fail prematurely.)
Dave
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 02:50:49PM -0600, Rob Ross wrote: > David Chinner wrote: > >On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > >>David Chinner wrote: > >>>Does anyone here know about the XFS libhandle API? This has been around > >>>for > >>>years and it does _exactly_ what these proposed syscalls are supposed to > >>>do > >>>(and more). > >>Thanks for pointing these out Dave. These are indeed along the same lines > >>as > >>the openg()/openfh() approach. > >> > >>One difference is that they appear to perform permission checking on the > >>open_by_handle(), which means that the entire path needs to be encoded in > >>the handle, and makes it difficult to eliminate the path traversal > >>overhead > >>on N open_by_handle() operations. > > > >open_by_handle() is checking the inode flags for things like > >immutibility and whether the inode is writable to determine if the > >open mode is valid given these flags. It's not actually checking > >permissions. IOWs, open_by_handle() has the same overhead as NFS > >filehandle to inode translation; i.e. no path traversal on open. > > > >Permission checks are done on the path_to_handle(), so in reality > >only root or CAP_SYS_ADMIN users can currently use the > >open_by_handle interface because of this lack of checking. Given > >that our current users of this interface need root permissions to do > >other things (data migration), this has never been an issue. > > > >This is an implementation detail - it is possible that file handle, > >being opaque, could encode a UID/GID of the user that constructed > >the handle and then allow any process with the same UID/GID to use > >open_by_handle() on that handle. (I think hch has already pointed > >this out.) > > Thanks for the clarification Dave. So I take it that you would be > interested in this type of functionality then? Not really - just trying to help by pointing out something no-one seemed to know about Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 01:50:24PM -0700, Matthew Wilcox wrote: > On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > > Permission checks are done on the path_to_handle(), so in reality > > only root or CAP_SYS_ADMIN users can currently use the > > open_by_handle interface because of this lack of checking. Given > > that our current users of this interface need root permissions to do > > other things (data migration), this has never been an issue. > > > > This is an implementation detail - it is possible that file handle, > > being opaque, could encode a UID/GID of the user that constructed > > the handle and then allow any process with the same UID/GID to use > > open_by_handle() on that handle. (I think hch has already pointed > > this out.) > > While it could do that, I'd be interested to see how you'd construct > the handle such that it's immune to a malicious user tampering with it, > or saving it across a reboot, or constructing one from scratch. > > I suspect any real answer to this would have to involve cryptographical > techniques (say, creating a secure hash of the information plus a > boot-time generated nonce). Now you're starting to use a lot of bits, > and compute time, and you'll need to be sure to keep the nonce secret. An auth header and GSS-API integration would probably be the way to go here if you really care. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 02:58:22PM -0600, David Teigland wrote: > On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote: > > The more annoying problem is the need to cancel the GFS lock when > > fl_notify fails; is that something that it's possible for GFS to do? > > > > It can fail because lockd has a timeout--it waits a few seconds for the > > callback, then gives up and returns a failure to the user. If that > > happens after your userspace posix lock manager acquires the lock (but > > before fl_notify is called) then you've got to cancel it somehow. > > I'd think we could just send an unlock for it at that point. Set up an op > with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing > and call send_op(). We probably need to flag this internal-unlock op so > that when the result arrives, device_write() delists and frees it itself. > > (I wouldn't call this "canceling", I think of cancel as trying to force a > blocked request to return/fail prematurely.) I call it a cancel because it should leave us in the same state we were in if we hadn't done the lock. An unlock doesn't do that, because the original lock may have coalesced and/or downgraded existing locks. We've got a similar problem with NFSv4 blocking locks. I've got unsubmitted patches http://linux-nfs.org/cgi-bin/gitweb.cgi?p=bfields-2.6.git;a=shortlog;h=fair-queueing which introduce a new "provisional" lock type that is identical to a posix lock in every way except that a provisional lock doesn't coalesce or downgrade existing locks--it just sits there on the lock list blocking conflicting requests until somebody comes along and upgrades it to a real lock or cancels it. Maybe there's a better solution in this case. I can't think of anything other than just giving up on the whole idea of timing out. (Maybe that wouldn't be so bad?) --b. 
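Bruce's point about why cancel cannot be implemented as unlock is easy to demonstrate with a toy range-lock list in plain C: once a new lock has coalesced with an existing one, unlocking the new range cannot restore the pre-request state. These are hypothetical helpers for a single owner, not the kernel's posix_lock_file() code.

```c
/*
 * Toy model of POSIX lock coalescing for a single owner, to illustrate
 * that merging is lossy: "undo by unlock" does not recover old state.
 */
#include <assert.h>

#define MAXLOCKS 16

struct range { long start, end; };		/* inclusive, like fl_start/fl_end */
struct locklist { struct range r[MAXLOCKS]; int n; };

/* add a lock, merging overlapping or adjacent ranges (coalescing) */
static void lock_range(struct locklist *l, long start, long end)
{
	int i;

restart:
	for (i = 0; i < l->n; i++) {
		if (l->r[i].end + 1 >= start && l->r[i].start <= end + 1) {
			if (l->r[i].start < start) start = l->r[i].start;
			if (l->r[i].end > end)     end = l->r[i].end;
			l->r[i] = l->r[--l->n];	/* drop merged entry */
			goto restart;		/* grown range may absorb others */
		}
	}
	l->r[l->n].start = start;
	l->r[l->n].end = end;
	l->n++;
}

/* drop [start,end], splitting ranges that straddle the boundary */
static void unlock_range(struct locklist *l, long start, long end)
{
	int i = 0;

	while (i < l->n) {
		struct range cur = l->r[i];

		if (cur.end < start || cur.start > end) { i++; continue; }
		l->r[i] = l->r[--l->n];		/* remove, then re-add leftovers */
		if (cur.start < start) lock_range(l, cur.start, start - 1);
		if (cur.end > end)     lock_range(l, end + 1, cur.end);
	}
}
```

Taking an existing lock [0,20] and a new request [10,30]: the list coalesces to [0,30], and unlocking [10,30] afterwards leaves [0,9] -- the original [0,20] is unrecoverable, which is exactly why a true cancel (or a provisional, non-coalescing lock) is needed.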
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 04:23:47PM -0500, J. Bruce Fields wrote: > On Wed, Dec 06, 2006 at 02:58:22PM -0600, David Teigland wrote: > > On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote: > > > The more annoying problem is the need to cancel the GFS lock when > > > fl_notify fails; is that something that it's possible for GFS to do? > > > > > > It can fail because lockd has a timeout--it waits a few seconds for the > > > callback, then gives up and returns a failure to the user. If that > > > happens after your userspace posix lock manager acquires the lock (but > > > before fl_notify is called) then you've got to cancel it somehow. > > > > I'd think we could just send an unlock for it at that point. Set up an op > > with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing > > and call send_op(). We probably need to flag this internal-unlock op so > > that when the result arrives, device_write() delists and frees it itself. > > > > (I wouldn't call this "canceling", I think of cancel as trying to force a > > blocked request to return/fail prematurely.) > > I call it a cancel because it should leave us in the same state we were > in if we hadn't done the lock. An unlock doesn't do that, because the > original lock may have coalesced and/or downgraded existing locks. Oh yeah, that's painful, I knew it sounded too easy. Dave
Re: [NFS] [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 03:42:31PM -0600, David Teigland wrote: > Oh yeah, that's painful, I knew it sounded too easy. Yeah. Well, we could try to teach GFS2 to reliably cancel posix locks. I think that may turn out to be necessary some day anyway. Or we could look at why we're timing out and figure out whether there's something else we should be doing instead in that case. In what situations is the GFS2 lock call likely to take overly long? --b.
Re: openg and path_to_handle
On Dec 06, 2006 13:50 -0700, Matthew Wilcox wrote: > On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > > This is an implementation detail - it is possible that file handle, > > being opaque, could encode a UID/GID of the user that constructed > > the handle and then allow any process with the same UID/GID to use > > open_by_handle() on that handle. (I think hch has already pointed > > this out.) > > While it could do that, I'd be interested to see how you'd construct > the handle such that it's immune to a malicious user tampering with it, > or saving it across a reboot, or constructing one from scratch. If the server has to have processed a real "open" request, say within the preceding 30s, then it would have a handle for openfh() to match against. If the server reboots, or a client tries to construct a new handle from scratch, or even tries to use the handle after the file is closed then the handle would be invalid. It isn't just an encoding for "open-by-inum", but rather a handle that references some just-created open file handle on the server. That the handle might contain the UID/GID is mostly irrelevant - either the process + network is trusted to pass the handle around without snooping, or a malicious client which intercepts the handle can spoof the UID/GID just as easily. Make the handle sufficiently large to avoid guessing and it is "secure enough" until the whole filesystem is using kerberos to avoid any number of other client/user spoofing attacks. Considering that filesystems like GFS and OCFS allow clients DIRECT ACCESS to the block device itself (which no amount of authentication will fix, unless it is in the disks themselves), the risk of passing a file handle around is pretty minimal. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
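Andreas's model -- a handle that is merely a short-lived key into server-side open state, not an open-by-inum encoding -- can be sketched as a small in-memory table. All names here are hypothetical (a real server would hash, lock, and garbage-collect), and the 30s TTL follows his example:

```c
/*
 * Sketch: the handle only resolves while the server still remembers the
 * open that produced it.  Expiry, a reboot (table cleared), or a handle
 * constructed from scratch all fail to resolve.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define TABLE_SIZE 64
#define HANDLE_TTL 30			/* seconds, as in the 30s example */

struct open_state {
	uint64_t handle;		/* opaque key handed to the client */
	uint64_t ino;			/* resolved file */
	time_t   created;
	int      in_use;
};

static struct open_state table[TABLE_SIZE];
static uint64_t next_nonce = 0x9e3779b97f4a7c15ULL;

/* "openg" side: record resolved open state, hand back an opaque key */
static uint64_t handle_issue(uint64_t ino, time_t now)
{
	int i;

	next_nonce = next_nonce * 6364136223846793005ULL + 1442695040888963407ULL;
	for (i = 0; i < TABLE_SIZE; i++) {
		if (!table[i].in_use) {
			table[i] = (struct open_state){ next_nonce, ino, now, 1 };
			return next_nonce;
		}
	}
	return 0;			/* table full */
}

/* "openfh" side: only remembered, young-enough handles resolve */
static int handle_resolve(uint64_t handle, time_t now, uint64_t *ino)
{
	int i;

	for (i = 0; i < TABLE_SIZE; i++) {
		if (table[i].in_use && table[i].handle == handle) {
			if (now - table[i].created > HANDLE_TTL) {
				table[i].in_use = 0;	/* expired */
				return 0;
			}
			*ino = table[i].ino;
			return 1;
		}
	}
	return 0;			/* forgotten, rebooted, or made up */
}

/* reboot: all open state is simply gone */
static void server_reboot(void)
{
	memset(table, 0, sizeof(table));
}
```

Since validity lives entirely in the server's table, no cryptography is needed for the lifetime properties Andreas describes -- only for protecting the key against snooping, which is the separate trust argument he makes.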
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote: > Considering that filesystems like GFS and OCFS allow clients DIRECT > ACCESS to the block device itself (which no amount of authentication > will fix, unless it is in the disks themselves), the risk of passing a > file handle around is pretty minimal. That's either disingenuous, or missing the point. OCFS/GFS allow the kernel direct access to the block device. openg()&sutoc() are about passing around file handles to untrusted users.
Re: openg and path_to_handle
On Dec 06, 2006 15:17 -0700, Matthew Wilcox wrote: > On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote: > > Considering that filesystems like GFS and OCFS allow clients DIRECT > > ACCESS to the block device itself (which no amount of authentication > > will fix, unless it is in the disks themselves), the risk of passing a > > file handle around is pretty minimal. > > That's either disingenuous, or missing the point. OCFS/GFS allow the > kernel direct access to the block device. openg()&sutoc() are about > passing around file handles to untrusted users. Consider - in order to intercept the file handle on the network one would have to be root on a trusted client. The same is true for direct block access. If the network isn't to be trusted or the clients aren't to be trusted, then in the absence of strong external authentication like kerberos the whole thing just falls down (i.e. root on any client can su to an arbitrary UID/GID to access files to avoid root squash, or could intercept all of the traffic on the network anyways). With some network filesystems it is at least possible to get strong authentication and crypto, but with shared block device filesystems like OCFS/GFS/GPFS they completely rely on the fact that the network and all of the clients attached thereon are secure. If the server that did the original file open and generates the unique per-open file handle can do basic sanity checking (i.e. user doing the new open is the same, the file handle isn't stale) then that is no additional security hole. Similarly, NFS passes file handles to clients that are also used to get access to the open file without traversing the whole path each time. Those file handles are even (supposed to be) persistent over reboots. Don't get me wrong - I understand that what I propose is not secure. I'm just saying it is no LESS secure than a number of other things which already exist. 
Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On 12/5/06, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Tue, Dec 05, 2006 at 05:55:14PM +0100, Latchesar Ionkov wrote: > What is your opinion on giving the file system an option to lookup a > file more than one name/directory at a time? I think that all remote > file systems can benefit from that? Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP routine gets the entire remaining path and is allowed to resolve as much of it as it can (or wants)? While this allows remote filesystems to optimize deep tree traversals it creates a pretty big mess about state that is kept on lookup operations. For Linux in particular it would mean doing large parts of __link_path_walk in the filesystem, which I can't think of a sane way to do. The way I was thinking of implementing it is leaving all the hard parts of the name resolution in __link_path_walk and modifying the inode's lookup operation to accept an array of qstrs (and its size). lookup would also check and revalidate the dentries if necessary (usually the same operation as looking up a name for the remote filesystems). lookup will check if it reaches a symbolic link or mountpoint and will stop resolving any further. __link_path_walk will use the name to fill an array of qstrs (we can choose some sane size of the array, like 8 or 16), then call (directly or indirectly) ->lookup (nd->flags will reflect the flags for the last element in the array), check if the inode of the last dentry is a symlink, and do what it currently does for symlinks. Does that make sense? Am I missing anything? Thanks, Lucho
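The qstr-batching Lucho sketches can be prototyped in user space. A minimal component splitter (illustrative names only; BATCH follows his "8 or 16" suggestion, and struct qstr is mimicked rather than taken from the kernel) shows what __link_path_walk would hand to ->lookup per call:

```c
/*
 * Rough user-space sketch of batched lookup: the caller splits a path
 * into up to BATCH components per ->lookup call; the filesystem would
 * resolve as many as it can before hitting a symlink or mountpoint.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BATCH 8				/* "some sane size, like 8 or 16" */

struct qstr { const char *name; size_t len; };

/*
 * Fill q[] with up to BATCH components starting at *pos; returns the
 * number filled and advances *pos past them (and any separators).
 */
static int next_batch(const char *path, size_t *pos, struct qstr *q)
{
	int n = 0;
	size_t i = *pos;

	while (path[i] == '/')
		i++;			/* skip separators */
	while (path[i] != '\0' && n < BATCH) {
		size_t start = i;

		while (path[i] != '\0' && path[i] != '/')
			i++;
		q[n].name = path + start;
		q[n].len = i - start;
		n++;
		while (path[i] == '/')
			i++;
	}
	*pos = i;
	return n;
}
```

A ten-component path would thus reach the filesystem in two ->lookup calls instead of ten, which is where the round-trip savings for remote filesystems would come from.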
Re: openg and path_to_handle
On 12/6/06, Rob Ross <[EMAIL PROTECTED]> wrote: David Chinner wrote: > On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: >> On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: >>> Hi, >>> >>> I agree that it is not feasible to add new system calls every time >>> somebody has a problem, and we don't take adding system calls lightly. >>> However, in this case we're talking about an entire *community* of >>> people (high-end computing), not just one or two people. Of course it >>> may still be the case that that community is not important enough to >>> justify the addition of system calls; that's obviously not my call to make! >> I have the feeling that openg stuff is rushed without looking into all >> solutions, that don't require changes to the current interface. > > I also get the feeling that interfaces that already do this > open-by-handle stuff haven't been explored either. > > Does anyone here know about the XFS libhandle API? This has been > around for years and it does _exactly_ what these proposed syscalls > are supposed to do (and more). > > See: > > http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle > > For the libhandle man page. Basically: > > openg == path_to_handle > sutoc == open_by_handle > > And here for the userspace code: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ > > Cheers, > > Dave. Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. The open-by-handle makes a little more sense, because the "handle" is not opened, it only points to a resolved file. As I mentioned before, it doesn't make much sense to bundle in openg name resolution and file open. Still I am not convinced that we need two ways of "finding" files. 
Thanks,
    Lucho
Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote:
> I unfortunately don't have data to show exactly where the time was spent,
> but it's a good guess that it is all the network traffic in the open() case.

Is it hard to repeat the test and check what requests the PVFS server receives (and how much time they take)?

> Well you've caught me. I don't want to cache the values, because I
> fundamentally believe that sharing state between clients and servers is
> braindead (to use Christoph's phrase) in systems of this scale (thousands
> to tens of thousands of clients). So I don't want locks, so I can't keep
> the cache consistent, ...

Having file handles in the server looks like a cache to me :) What are the properties of a cache that it lacks?

Thanks,
    Lucho
Re: openg
On Wed, Dec 06, 2006 at 09:42:47AM -0600, Rob Ross wrote:
> The fh_t is indeed a type of capability. fh_t, properly protected, could
> be passed into user space and validated by the file system when
> presented back to the file system.

Well, there's quite a lot of papers on how to implement properly secure capabilities. The only performant way to do it is to implement them in kernel space or with hardware support. As soon as you pass them to userspace the user can manipulate them, and doing a cheap enough verification is non-trivial (e.g. it buys you nothing if you spend the time you previously saved on lookup round-trip latency on expensive cryptography instead).

> There is state here, clearly. I feel ok about that because we allow
> servers to forget that they handed out these fh_ts if they feel like it;
> there is no guaranteed lifetime in the current proposal. This allows
> servers to come and go without needing to persistently store these.
> Likewise, clients can forget them with no real penalty.

Objects without defined lifetime rules are not something we're very keen on. Particularly in userspace interfaces they will cause all kinds of trouble, because people will expect the lifetime rules they get from their normal filesystems.

> The documentation of the calls is complicated by the way POSIX calls are
> described. We need to have a second document describing use cases also
> available, so that we can avoid misunderstandings as best we can and get
> straight to the real issues. Sorry that document wasn't available.

The real problem is that you want to do something in a POSIX spec that is fundamentally out of scope. POSIX.1 deals with system interfaces on a single system. You want to specify semantics over multiple systems in a cluster.
Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, 2006-12-06 at 18:12 -0500, Latchesar Ionkov wrote:
> On 12/5/06, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > On Tue, Dec 05, 2006 at 05:55:14PM +0100, Latchesar Ionkov wrote:
> > > What is your opinion on giving the file system an option to lookup a
> > > file more than one name/directory at a time? I think that all remote
> > > file systems can benefit from that?
> >
> > Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP
> > routine gets the entire remaining path and is allowed to resolve as much of
> > it as it can (or wants)?
> >
> > While this allows remote filesystems to optimize deep tree traversals it
> > creates a pretty big mess about state that is kept on lookup operations.
> >
> > For Linux in particular it would mean doing large parts of __link_path_walk
> > in the filesystem, which I can't think of a sane way to do.
>
> The way I was thinking of implementing it is leaving all the hard
> parts of the name resolution in __link_path_walk and modifying inode's
> lookup operation to accept an array of qstrs (and its size). lookup
> would also check and revalidate the dentries if necessary (usually the
> same operation as looking up a name for the remote filesystems).

I beg to differ. Revalidation is not the same as looking up: the locking rules are _very_ different.

> lookup will check if it reaches symbolic link or mountpoint and will
> stop resolving any further. __link_path_walk will use the name to fill
> an array of qstrs (we can choose some sane size of the array, like 8
> or 16), then call (directly or indirectly) ->lookup (nd->flags will
> reflect the flags for the last element in the array), check if the
> inode of the last dentry is symlink, and do what it currently does for
> symlinks.
>
> Does that make sense? Am I missing anything?

Again: locking. How do you keep the dcache sane while the filesystem is doing a jumble of revalidation and new lookups?
Trond
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
> > While it could do that, I'd be interested to see how you'd construct
> > the handle such that it's immune to a malicious user tampering with it,
> > or saving it across a reboot, or constructing one from scratch.
>
> If the server has to have processed a real "open" request, say within
> the preceding 30s, then it would have a handle for openfh() to match
> against. If the server reboots, or a client tries to construct a new
> handle from scratch, or even tries to use the handle after the file is
> closed, then the handle would be invalid.
>
> It isn't just an encoding for "open-by-inum", but rather a handle that
> references some just-created open file handle on the server. That the
> handle might contain the UID/GID is mostly irrelevant - either the
> process + network is trusted to pass the handle around without snooping,
> or a malicious client which intercepts the handle can spoof the UID/GID
> just as easily. Make the handle sufficiently large to avoid guessing
> and it is "secure enough" until the whole filesystem is using kerberos
> to avoid any number of other client/user spoofing attacks.

That would be fine as long as the file handle were a kernel-level concept. The issue here is that they intend to make the whole file handle visible to userspace, for example to pass it around via MPI. As soon as an untrusted user can tamper with the file handle we're in trouble.
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 08:11:00PM +1100, David Chinner wrote:
> They are defined but unused in 2.6.19, right? I can't see anywhere
> in the 2.6.19 ext2/3/4/reiser trees that actually use those flags,
> including setting and retrieving them from disk. JFS I can see
> sets, clears and retrieves them, but not the filesystems you
> mention. Though I might just be blind. ;)
>
> If all we need to add to XFS is support for those flags, then XFS
> support would be trivial to add.
>
> Oh, damn. I take that back. We're almost out of flag space in the on
> disk inode - these two flags would use the last 2 flag bits so this
> may require an on disk inode format change in XFS. This will be
> a little more complex than I first thought, but not impossible
> as we already support two on-disk inode format versions.

Hrm. I was toying around with the idea of using a flag to mark inodes as whiteouts (similar to what BSD does) for Unionfs. I remember that Jan Blunck tried a similar thing in his implementation of VFS union mounts.

I am not entirely convinced that a whiteout inode flag is the right way to do things, but I'm raising this now as I wouldn't want to wait for a new on-disk format for XFS to say that Unionfs supports XFS. (Assuming that it is the right approach.) ;-)

Josef "Jeff" Sipek.
--
I already backed up the box once, I can do it again!
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 07:56:19PM -0500, Josef Sipek wrote:
> On Wed, Dec 06, 2006 at 08:11:00PM +1100, David Chinner wrote:
> > They are defined but unused in 2.6.19, right? I can't see anywhere
> > in the 2.6.19 ext2/3/4/reiser trees that actually use those flags,
> > including setting and retrieving them from disk. JFS I can see
> > sets, clears and retrieves them, but not the filesystems you
> > mention. Though I might just be blind. ;)
> >
> > If all we need to add to XFS is support for those flags, then XFS
> > support would be trivial to add.
> >
> > Oh, damn. I take that back. We're almost out of flag space in the on
> > disk inode - these two flags would use the last 2 flag bits so this
> > may require an on disk inode format change in XFS. This will be
> > a little more complex than I first thought, but not impossible
> > as we already support two on-disk inode format versions.
>
> Hrm. I was toying around with the idea of using a flag to mark inodes as
> whiteouts (similar to what BSD does) for Unionfs. I remember that Jan Blunck
> tried a similar thing in his implementation of VFS union mounts.
>
> I am not entirely convinced that a whiteout inode flag is the right way to do
> things, but I'm just raising this now as I wouldn't want to wait for a new
> on-disk format for XFS to say that Unionfs supports XFS. (Assuming that it is
> the right approach.) ;-)

Maybe we should be using EAs for this sort of thing instead of flags on the inode? If we keep adding inode flags for generic features then we are going to force more than just XFS into inode format changes eventually.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote: > Maybe we should be using EAs for this sort of thing instead of flags > on the inode? If we keep adding inode flags for generic features > then we are going to force more than just XFS into inode format > changes eventually Aren't EAs slow? Maybe not on XFS but on other filesystems... Josef "Jeff" Sipek. -- Linux, n.: Generous programmers from around the world all join forces to help you shoot yourself in the foot for free. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote: > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote: > > Maybe we should be using EAs for this sort of thing instead of flags > > on the inode? If we keep adding inode flags for generic features > > then we are going to force more than just XFS into inode format > > changes eventually > > Aren't EAs slow? Maybe not on XFS but on other filesystems... Only when they don't fit in the inode itself and extra disk seeks are needed to retrieve them. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Thu, 2006-12-07 at 13:49 +1100, David Chinner wrote: > On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote: > > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote: > > > Maybe we should be using EAs for this sort of thing instead of flags > > > on the inode? If we keep adding inode flags for generic features > > > then we are going to force more than just XFS into inode format > > > changes eventually > > > > Aren't EAs slow? Maybe not on XFS but on other filesystems... > > Only when they don't fit in the inode itself and extra > disk seeks are needed to retrieve them. > > Cheers, > > Dave. Also keep in mind that the EA doesn't actually have to have a physical representation on disk (or, rather, it does, but it doesn't need to be the same representation used by EAs in the user namespace). This means that if one of those slow EA filesystems still has room for flags in the inode, it can synthesize the EA on demand. This is even preferable to ioctls for the interface to new filesystem metadata -- if a backup or archive program knows how to store EAs, it will be able to save and restore any new exotic metadata without any extra effort. -- Nicholas Miell <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, 2006-12-06 at 20:11 +1100, David Chinner wrote:
> ...
> If all we need to add to XFS is support for those flags, then XFS
> support would be trivial to add.
>
> Oh, damn. I take that back. We're almost out of flag space in the on
> disk inode - these two flags would use the last 2 flag bits so this
> may require an on disk inode format change in XFS. This will be
> a little more complex than I first thought, ...

It should be OK - you can do it without an inode version revision if you take a second 16 bits for "di_flags2" from here...

xfs_dinode_core {
	...
	__uint8_t	di_pad[8];	/* unused, zeroed space */

It's guaranteed zeroed initially (i.e. all flags unset) and the XFS get/set flags APIs are 32 bits, so you should be OK there.

It may also be possible to reclaim di_onlink at some point (maybe now, since 16 bits would be good here) if mkfs.xfs is changed to always create v2 inodes (dynamic conversion ATM IIRC)... not 100% sure though, needs more code analysis.

cheers.
--
Nathan
Re: [PATCH 26/35] Unionfs: Privileged operations workqueue
On Tue, Dec 05, 2006 at 02:50:13PM -0500, Josef Sipek wrote:
> On Tue, Dec 05, 2006 at 08:27:51PM +0100, Jan Engelhardt wrote:
> >
> > On Dec 4 2006 07:30, Josef 'Jeff' Sipek wrote:
> > >+#include "union.h"
> > >+
> > >+struct workqueue_struct *sioq;
> > >+
> > >+int __init init_sioq(void)
> >
> > Although it's just me, I'd prefer sioq_init(), sioq_exit(),
> > sioq_run(), etc. to go in hand with what most drivers use as naming
> > (i.e. "_" ).
>
> That makes sense.

Hrm. Looking at the code, I noticed that the opposite is true:

	destroy_filldir_cache();
	destroy_inode_cache();
	destroy_dentry_cache();
	unregister_filesystem(&unionfs_fs_type);

The last one in particular...

> > >+void __unionfs_mknod(void *data)
> > >+{
> > >+	struct sioq_args *args = data;
> > >+	struct mknod_args *m = &args->mknod;
> >
> > Care to make that: const struct mknod_args *m = &args->mknod;?
> > (Same for other places)
>
> Right.

If I make the *args = data line const, then gcc (4.1) yells about modifying a const variable 3 lines down:

	args->err = vfs_mknod(m->parent, m->dentry, m->mode, m->dev);

Sure, I could cast, but that seems like adding cruft for no good reason.

Josef "Jeff" Sipek.
--
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former.
 - Albert Einstein
Re: [PATCH 26/35] Unionfs: Privileged operations workqueue
On Dec 6 2006 12:32, Josef Sipek wrote:
>> > >+int __init init_sioq(void)
>> >
>> > Although it's just me, I'd prefer sioq_init(), sioq_exit(),
>> > sioq_run(), etc. to go in hand with what most drivers use as naming
>> > (i.e. "_" ).
>>
>> That makes sense.
>
>Hrm. Looking at the code, I noticed that the opposite is true:
>
>destroy_filldir_cache();
>destroy_inode_cache();
>destroy_dentry_cache();
>unregister_filesystem(&unionfs_fs_type);
>
>The last one in particular...

I smell a big conspiracy! So yet again it's mixed:

fs$ grep __init */*.c | grep -v ' init_'
sysfs/mount.c:int __init sysfs_init(void)
sysv/inode.c:int __init sysv_init_icache(void)
proc/vmcore.c:static int __init vmcore_init(void)
proc/nommu.c:static int __init proc_nommu_init(void)
proc/proc_misc.c:void __init proc_misc_init(void)
proc/proc_tty.c:void __init proc_tty_init(void)
proc/root.c:void __init proc_root_init(void)

>
>> > >+void __unionfs_mknod(void *data)
>> > >+{
>> > >+ struct sioq_args *args = data;
>> > >+ struct mknod_args *m = &args->mknod;
>> >
>> > Care to make that: const struct mknod_args *m = &args->mknod;?
>> > (Same for other places)
>>
>> Right.
>
>If I make the *args = data line const, then gcc (4.1) yells about modifying
>a const variable 3 lines down:
>
>args->err = vfs_mknod(m->parent, m->dentry, m->mode, m->dev);
>
>Sure, I could cast, but that seems like adding cruft for no good reason.

No, I despise casts more than missing consts. Why would gcc throw a warning? Let's take this super simple program:

<<<
struct inode;
struct dentry;

struct mknod_args {
	struct inode *parent;
	struct dentry *dentry;
	int mode;
	int dev;
};

extern int vfs_mknod(struct inode *, struct dentry *, int, int /*dev_t*/);

int main(void)
{
	const struct mknod_args *m;
	vfs_mknod(m->parent, m->dentry, m->mode, m->dev);
	return 0;
}
>>>

As undefined-behavior as it looks, it's got the const and vfs_mknod, as well as an approximation of dev_t.
It throws no warnings when compiled with `gcc -Wall -c test.c`. Did I miss something?

-`J'
--
Re: NFSv4/pNFS possible POSIX I/O API standards
On Dec 06, 2006 09:42 -0800, Ulrich Drepper wrote: > Rob Ross wrote: > >File size is definitely one of the more difficult of the parameters, > >either because (a) it isn't stored in one place but is instead derived, > >or (b) because a lock has to be obtained to guarantee consistency of the > >returned value. > > OK, and looking at the man page again, it is already on the list in the > old proposal and hence optional. I've no problem with that. IMHO, once part of the information is optional, why bother making ANY of it required? Consider "ls -s" on a distributed filesystem that has UID+GID mapping. It doesn't actually NEED to return the UID+GID to ls for each file, since it won't be shown, but if that is part of the "required" fields then the filesystem would have to remap each UID+GID on each file in the directory. Similar arguments can be made for "find" with various options (-atime, -mtime, etc) where any one of the "required" parameters isn't needed. I don't think it is _harmful_ to fill in unrequested values if they are readily available (it might in fact avoid a lot of conditional branches) but why not let the caller request only the minimum information it needs? > >I can't speak for everyone, but "ls" is the #1 consumer as far as I am > >concerned. That is my opinion also. Lustre can do incredibly fast IO, but it isn't very good at "ls" at all because it has to do way more work than you would think (and I've stared at a lot of straces from ls, rm, etc). > I think this is more a user problem. For normal plain old 'ls' you get > by with readdir. For 'ls -F' and 'ls --color' you mostly get by with > readdir+d_type. If you cannot provide d_type info the readdirplus > extension does you no good. For the cases when an additional stat is > needed (for symlinks, for instance, to test whether they are dangling) > readdirplus won't help. 
I used to think this also, but even though Lustre supplies d_type info GNU ls will still do stat operations, because "ls --color" depends on st_mode in order to color executable files differently. Since virtually all distros alias ls to "ls --color" this is pretty much default behaviour. Another popular alias is "ls -F", which also uses st_mode for executables.

$ strace ls --color=yes    # this is on an ext3 filesystem
:
:
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getdents64(3, /* 53 entries */, 4096) = 1840
lstat64("ChangeLog", {st_mode=S_IFREG|0660, st_size=48, ...}) = 0
lstat64("install-sh", {st_mode=S_IFREG|0755, st_size=7122, ...}) = 0
lstat64("config.sub", {st_mode=S_IFREG|0755, st_size=30221, ...}) = 0
lstat64("autogen.sh", {st_mode=S_IFREG|0660, st_size=41, ...}) = 0
lstat64("config.h", {st_mode=S_IFREG|0664, st_size=7177, ...}) = 0
lstat64("COPYING", {st_mode=S_IFREG|0660, st_size=18483, ...}) = 0
:
:

Similarly, GNU rm will stat all of the files (when run as a regular user) to ask the "rm: remove write-protected regular file `foo.orig'?" question, which also depends on st_mode.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
Here is a rewrite of gdlm_plock_callback(). We still need to add the lock cancel.

Marc.

int gdlm_plock_callback(struct plock_op *op)
{
	struct file *file;
	struct file_lock *fl;
	int (*notify)(void *, void *, int) = NULL;
	int rv;

	spin_lock(&ops_lock);
	if (!list_empty(&op->list)) {
		printk(KERN_INFO "plock op on list\n");
		list_del(&op->list);
	}
	spin_unlock(&ops_lock);

	rv = op->info.rv;

	/* check if the following 2 are still valid or make a copy */
	file = op->info.file;
	fl = op->info.fl;
	notify = op->info.callback;

	if (!rv) {
		/* got fs lock */
		rv = posix_lock_file(file, fl);
		if (rv) {
			/* did not get posix lock */
			notify(fl, NULL, rv);
			log_error("gdlm_plock: vfs lock error file %p fl %p",
				  file, fl);
			/* XXX: We need to cancel the fs lock here: */
			printk("gfs2 lock posix lock request failed\n");
		} else {
			/* got posix lock */
			if (notify(fl, NULL, 0)) {
				/* XXX: We need to cancel the fs lock here: */
				printk("gfs2 lock granted after lock request failed; dangling lock!\n");
			}
		}
	} else {
		/* did not get fs lock */
		notify(fl, NULL, rv);
	}

	kfree(op);
	return rv;
}

David Teigland wrote:
On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote:
On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote:

The gfs side looks fine to me. Did you forget to call fl_notify from gdlm_plock_callback() or am I missing something?

Yes, looks like we missed something, thanks. This code's an rfc (not even tested), so don't apply it yet! What we should have there is something like:

	rv = op->info.rv;

	if (fl_notify(fl, NULL, rv)) {
		/* XXX: We need to cancel the lock here: */
		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
	}

	if (!rv) {
		/* check if the following are still valid or make a copy */
		file = op->info.file;
		fl = op->info.fl;

		if (posix_lock_file_wait(file, fl) < 0)
			log_error("gdlm_plock: vfs lock error file %p fl %p",
				  file, fl);
	}

Note there's a race condition--that calls fl_notify before actually getting the lock locally.
I don't *think* that's a problem, as long as it's the filesystem and not the local lock list that's authoritative when it comes to who gets a posix lock. agree The more annoying problem is the need to cancel the GFS lock when fl_notify fails; is that something that it's possible for GFS to do? It can fail because lockd has a timeout--it waits a few seconds for the callback, then gives up and returns a failure to the user. If that happens after your userspace posix lock manager acquires the lock (but before fl_notify is called) then you've got to cancel it somehow. I'd think we could just send an unlock for it at that point. Set up an op with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing and call send_op(). We probably need to flag this internal-unlock op so that when the result arrives, device_write() delists and frees it itself. (I wouldn't call this "canceling", I think of cancel as trying to force a blocked request to return/fail prematurely.) Dave - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 07:14:58PM -0800, Nicholas Miell wrote:
> On Thu, 2006-12-07 at 13:49 +1100, David Chinner wrote:
> > On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote:
> > > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote:
> > > > Maybe we should be using EAs for this sort of thing instead of flags
> > > > on the inode? If we keep adding inode flags for generic features
> > > > then we are going to force more than just XFS into inode format
> > > > changes eventually
> > >
> > > Aren't EAs slow? Maybe not on XFS but on other filesystems...
> >
> > Only when they don't fit in the inode itself and extra
> > disk seeks are needed to retrieve them.
> >
> > Cheers,
> >
> > Dave.
>
> Also keep in mind that the EA doesn't actually have to have a physical
> representation on disk (or, rather, it does, but it doesn't need to be
> the same representation used by EAs in the user namespace).
>
> This means that if one of those slow EA filesystems still has room for
> flags in the inode, it can synthesize the EA on demand.

Interesting point. Filesystems such as XFS would be unaffected (at least with attr v1; not sure how attr2 does things in XFS wrt EAs).

Another question: suppose filesystem X is one of the slow EA filesystems; exposing this whiteout inode flag as an EA would require fs X to be compiled with EA support. As a matter of fact, it would require all the filesystems to have EA support. Sure, overall it is less work for me, but I am more interested in doing the Right Thing(tm) - which may be something completely different from an EA or inode flag (e.g., what Ted Ts'o suggested at OLS - effectively a metadata-only on-disk format for unionfs).

> This is even preferable to ioctls for the interface to new filesystem
> metadata -- if a backup or archive program knows how to store EAs, it
> will be able to save and restore any new exotic metadata without any
> extra effort.

Agreed.

Josef "Jeff" Sipek.
--
UNIX is user-friendly ...
it's just selective about who its friends are
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Thu, 2006-12-07 at 12:44 +1100, David Chinner wrote: > Maybe we should be using EAs for this sort of thing instead of flags > on the inode? If we keep adding inode flags for generic features > then we are going to force more than just XFS into inode format > changes eventually You do need to be judicious in what you add, but we got up to here (well over 10 years) with 16 bits worth of space - increasing to 32 bits is going to last a fairly long time... I wouldn't switch to a different scheme until that point (which is also the point where the system calls are going to have to change anyway). cheers. -- Nathan - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Relative atime (was Re: What's in ocfs2.git)
On Tue, Dec 05, 2006 at 08:58:02PM -0800, Andrew Morton wrote:
> On Mon, 4 Dec 2006 16:36:20 -0800 Valerie Henson <[EMAIL PROTECTED]> wrote:
> > Add "relatime" (relative atime) support. Relative atime only updates
> > the atime if the previous atime is older than the mtime or ctime.
> > Like noatime, but useful for applications like mutt that need to know
> > when a file has been read since it was last modified.
>
> That seems like a good idea.
>
> I found touch_atime() to be rather putrid, so I hacked it around a bit. The
> end result:

I like that rather better - add my:

Signed-off-by: Valerie Henson <[EMAIL PROTECTED]>

> That's the easy part. How are we going to get mount(8) patched?

Well, the nodiratime documentation got in. (I was going to add that as part of this patch, but lo and behold.)

-VAL
Re: Relative atime (was Re: What's in ocfs2.git)
On Wednesday 06 December 2006 05:58, Andrew Morton wrote:
> > On Mon, 4 Dec 2006 16:36:20 -0800 Valerie Henson <[EMAIL PROTECTED]> wrote:
> > Add "relatime" (relative atime) support. Relative atime only updates
> > the atime if the previous atime is older than the mtime or ctime.
> > Like noatime, but useful for applications like mutt that need to know
> > when a file has been read since it was last modified.
>
> That seems like a good idea.
>
> I found touch_atime() to be rather putrid, so I hacked it around a bit.

I find this function full of tests...

> The end result:
>
> void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
> {
>	struct inode *inode = dentry->d_inode;
>	struct timespec now;
>
>	if (IS_RDONLY(inode))
>		return;

While we are adding new tests, we could try to be smart here, testing both MS_RDONLY and MS_NOATIME:

	if (__IS_FLG(inode, MS_RDONLY | MS_NOATIME))
		return;

>	if (inode->i_flags & S_NOATIME)
>		return;
>	if (inode->i_sb->s_flags & MS_NOATIME)
>		return;

So that this one can be deleted.

>	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
>		return;

	if (__IS_FLG(inode, MS_NODIRATIME) && S_ISDIR(inode->i_mode))
		return;

Eric
Re: Relative atime (was Re: What's in ocfs2.git)
> > if (inode->i_sb->s_flags & MS_NOATIME)
> >	return;
>
> So that this one can be deleted.

Hi,

I would mostly expect the compiler to be relatively smart about this and group a bunch of these tests together... so I'd rather see readable code than hand-optimized code for something the compiler should be doing for us ;)

Greetings,
Arjan van de Ven