Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Tue, Dec 05, 2006 at 11:41:28AM -0500, Nikolai Joukov wrote:
> > > As we promised on the linux-ext4 list on October 31, here is the patch
> > > that adds secure deletion via a trash-bin functionality for ext4. It is a
> > > compromise solution that combines secure deletion with the trash-bin support
> > > (the latter had been requested by even more people than the former :-).
> >
> > Given that almost all of the code for this uses vfs interfaces and
> > only a couple of simple filesystem hooks, is there any reason for
> > this being ext4 specific? i.e. why aren't you hooking the vfs layer
> > so we get a single undelete/secure delete implementation for all
> > filesystems?
>
> You are right. Actually, we mentioned it as benefit number 4 of this
> approach in the original post. Most of the code is not
> file-system-specific and can be used by any other (all other?) file
> system(s). The only complication is that only the ext2/3/4 and reiser
> filesystems already support the per-file secure deletion and undelete
> attributes.

They are defined but unused in 2.6.19, right? I can't see anywhere in the 2.6.19 ext2/3/4/reiser trees that actually uses those flags, including setting and retrieving them from disk. JFS, I can see, sets, clears and retrieves them, but not the filesystems you mention. Though I might just be blind. ;)

If all we need to add to XFS is support for those flags, then XFS support would be trivial to add. Oh, damn. I take that back. We're almost out of flag space in the on-disk inode - these two flags would use the last 2 flag bits, so this may require an on-disk inode format change in XFS. This will be a little more complex than I first thought, but not impossible, as we already support two on-disk inode format versions.

> Since ext4 is in early development now, we believe it'd be easier to get
> such code into ext4 than into the main-line VFS.
> If there's enough
> interest among the kernel maintainers, we'd be happy to move this
> functionality to the VFS and provide f/s hooks for
> secure-deletion/trash-bin.

I'd certainly like to see new functionality like this added at the VFS level rather than in any particular filesystem. Don't know about anyone else, though.

> I guess the right thing to do would be to move the common trash-bin
> (tb.c and tb.h) code into the fs/ and include/linux/ directories.

And call them trashbin.[ch] ;)

> Also, VFS would require just a couple of trivial changes to support
> something like a '-o trashbin' mount-time option for all file systems.
> In addition, individual file systems may support per-file attributes for
> this (e.g., ext2/3/4).

Sounds like a good idea to me.

Cheers, Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:
> On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I agree that it is not feasible to add new system calls every time
> > somebody has a problem, and we don't take adding system calls lightly.
> > However, in this case we're talking about an entire *community* of
> > people (high-end computing), not just one or two people. Of course it
> > may still be the case that that community is not important enough to
> > justify the addition of system calls; that's obviously not my call to make!
>
> I have the feeling that the openg stuff is rushed without looking into all
> solutions that don't require changes to the current interface.

I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle

for the libhandle man page. Basically:

openg == path_to_handle
sutoc == open_by_handle

And here for the userspace code:

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/

Cheers, Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: NFSv4/pNFS possible POSIX I/O API standards
On Dec 05, 2006 15:55 -0800, Ulrich Drepper wrote:
> I don't think an accuracy flag is useful at all. Programs don't want to
> use fuzzy information. If you want a fast 'ls -l' then add a mode which
> doesn't print the fields which are not provided. Don't provide outdated
> information. Similarly for other programs.

Does this mean you are against the statlite() API entirely, or only against the document's use of the flag as a vague "accuracy" value instead of a hard "valid" value?

> Christoph Hellwig wrote:
> > I think having a stat lite variant is pretty much consensus, we just need
> > to fine tune the actual API - and of course get a reference implementation.
> > So if you want to get this going try to implement it based on
> > http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2.
> > Bonus points for actually making use of the flags in some filesystems.
>
> I don't like that approach. The flag parameter should be exclusively an
> output parameter. By default the kernel should fill in all the fields
> it has access to. If access is not easily possible then set the bit and
> clear the field. There are of course certain fields which always should
> be added. In the proposed man page these are already identified (i.e.,
> those before the st_litemask member).

IMHO, if the application doesn't need a particular field (e.g. "ls -i" doesn't need the size, "ls -s" doesn't need the inode number, etc.) why should these be filled in if they are not easily accessible? As for what is easily accessible, that needs to be determined by the filesystem itself. It is of course fine if the filesystem fills in values that it has at hand, even if they are not requested, but it shouldn't have to do extra work to fill in values that will not be needed.

"ls --color" and "ls -F" are prime examples. They do a stat on files only to get the file mode (the file type is already part of many dirent structs).
But a clustered filesystem may need to do a lot of work to also get the file size, because it has no way of knowing which stat() fields are needed.

> And one last point: I haven't seen any discussion why readdirplus should
> do the equivalent of stat and there is no 'statlite' variant. Are all
> places where readdir is used non-critical for performance or depend on
> accurate information?

That was previously suggested by me already. IMHO, there should ONLY be a statlite variant of readdirplus(), and I think most people agree with that part of it (though there is contention on whether readdirplus() is needed at all).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: readdirplus() as possible POSIX I/O API
On Dec 05, 2006 10:23 -0500, Trond Myklebust wrote:
> On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote:
> > I think the "barrier semantics" are something that have just crept
> > into this discussion and is confusing the issue.
>
> It is the _only_ concept that is of interest for something like NFS or
> CIFS. We already have the ability to cache the information.

Actually, wouldn't the ability for readdirplus() (with a valid flag) be useful for NFS if only to indicate that it does not need to flush the cache because the ctime/mtime isn't needed by the caller?

> 'find' should be quite happy with the existing readdir(). It does not
> need to use stat() or readdirplus() in order to recurse because
> readdir() provides d_type.

It does in any but the most simplistic invocations, like "find -mtime" or "find -mode" or "find -uid", etc.

> If the application is able to select a statlite()-type of behaviour with
> the fadvise() hints, your filesystem could be told to serve up cached
> information instead of regrabbing locks. In fact that is a much more
> flexible scheme, since it also allows the filesystem to background the
> actual inode lookups, or to defer them altogether if that is more
> efficient.

I guess I just don't understand how fadvise() on a directory file handle (used for readdir()) can be used to affect later stat operations (which definitely will NOT be using that file handle)? If you mean that the application should actually open() each file, fadvise(), fstat(), close(), instead of just a stat() call, then we are WAY into negative improvements here due to the overhead of doing open+close.
> > The filesystem can't always do "stat-ahead" on the files because that
> > requires instantiating an inode on the client which may be stale (lock
> > revoked) by the time the app gets to it, and the app (and the VFS) have
> > no idea just how stale it is, and whether the stat is a "real" stat or
> > "only" the readdir stat (because the fadvise would only be useful on
> > the directory, and not all of the child entries), so it would need to
> > re-stat the file.
>
> Then provide hints that allow the app to select which behaviour it
> prefers. Most (all?) apps don't _care_, and so would be quite happy with
> cached information. That is why the current NFS caching model exists in
> the first place.

Most clustered filesystems have strong cache semantics, so that isn't a problem. IMHO, the mechanism to pass the hint to the filesystem IS the readdirplus_lite() that tells the filesystem exactly which data is needed on each directory entry.

> > Also, this would potentially blow the client's real
> > working set of inodes out of cache.
>
> Why?

Because in many cases it is desirable to limit the number of DLM locks on a given client (e.g. the GFS2 thread with AKPM about clients with millions of DLM locks due to lack of memory pressure on large-memory systems). That means a finite-size lock LRU on the client that risks being wiped out by a few thousand files in a directory doing "readdir() + 5000*stat()".

Consider a system like BlueGene/L with 128k compute cores. Jobs that run on that system will periodically (e.g. every hour) create up to 128K checkpoint+restart files to avoid losing a lot of computation if a node crashes. Even if each one of the checkpoints is in a separate directory (I wish all users were so nice :-) it means 128K inodes+DLM locks for doing an "ls" in the directory.
> > Doing things en masse with readdirplus() also allows the filesystem to
> > do the stat() operations in parallel internally (which is a net win if
> > there are many servers involved) instead of serially as the application
> > would do.
>
> If your application really cared, it could add threading to 'ls' to
> achieve the same result. You can also have the filesystem preload that
> information based on fadvise hints.

But it would still need 128K RPCs to get that information, and 128K new inodes on that client. And what is the chance that I can get a multi-threaded "ls" into the upstream GNU ls code?

In the case of local filesystems, multi-threading ls would be a net loss due to seeking. But even for local filesystems readdirplus_lite() would allow them to fill in stat information they already have (either in cache or on disk), and may avoid doing extra work that isn't needed. For filesystems that don't care, readdirplus_lite() can just be readdir()+stat() internally.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
openg
On Tue, Dec 05, 2006 at 03:44:31PM -0600, Rob Ross wrote:
> The openg() really just does the lookup and permission checking. The
> openfh() creates the file descriptor and starts that context if the
> particular FS tracks that sort of thing.
...
> Well you've caught me. I don't want to cache the values, because I
> fundamentally believe that sharing state between clients and servers is
> braindead (to use Christoph's phrase) in systems of this scale
> (thousands to tens of thousands of clients). So I don't want locks, so I
> can't keep the cache consistent, ... So someone else will have to run
> the tests you propose :)...

Besides the whole ugliness, you miss a few points about the fundamental architecture of the unix filesystem permission model, unfortunately. Say you want to look up a path /foo/bar/baz; then the access permission is based on the following things:

- the credentials of the user. Let's only take traditional uid/gid for this example, although credentials are much more complex these days
- the kind of operation you want to perform
- the access permission of the actual object the path points to (inode)
- the lookup permission (x bit) for every object on the way to your object

In your proposal sutoc is a simple conversion operation, which means openg needs to perform all these access checks and encode them in the fh_t. That means an fh_t must fundamentally be an object that is kept in the kernel, aka a capability as defined by Henry Levy. This does imply you _do_ need to keep state. And because it needs kernel support, your fh_t is more or less equivalent to a file descriptor, with sutoc equivalent to a dup variant that really duplicates the backing object instead of just the userspace index into it.

Note that somewhat similar open-by-filehandle APIs, like open by inode number as used by Lustre or the XFS *_by_handle APIs, are privileged operations because of exactly this problem.
What is, according to your mail, the most important bit in this proposal -- that you think the filehandles should be easily shared with other systems in a cluster -- is not mentioned in the actual proposal at all, and is in fact the hardest part, because of the inherent statefulness of the API.

> What's the etiquette on changing subject lines here? It might be useful
> to separate the openg() etc. discussion from the readdirplus() etc.
> discussion.

Changing subject lines is fine.
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
Hi,

This looks good to me, and I'm copying in Dave & Wendy who have both done previous work in this area for further comment. Provided we can get this tested, I'd be happy to accept the patch in its current form.

Steve.

On Wed, 2006-12-06 at 00:34 -0500, J. Bruce Fields wrote:
> From: J. Bruce Fields <[EMAIL PROTECTED]>
>
> From: Marc Eshel <[EMAIL PROTECTED]>
>
> Add NFS lock support to GFS2. (Untested.)
>
> Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>
> ---
>  fs/gfs2/lm.c                   |   10
>  fs/gfs2/lm.h                   |    2 +
>  fs/gfs2/locking/dlm/lock_dlm.h |    2 +
>  fs/gfs2/locking/dlm/mount.c    |    1 +
>  fs/gfs2/locking/dlm/plock.c    |   95 +++-
>  fs/gfs2/ops_export.c           |   52 ++
>  include/linux/lm_interface.h   |    3 +
>  include/linux/lock_dlm_plock.h |    3 +
>  8 files changed, 166 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/lm.c b/fs/gfs2/lm.c
> index effe4a3..cf7fd52 100644
> --- a/fs/gfs2/lm.c
> +++ b/fs/gfs2/lm.c
> @@ -197,6 +197,16 @@ int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  	return error;
>  }
>  
> +int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
> +			struct file *file, int cmd, struct file_lock *fl)
> +{
> +	int error = -EIO;
> +	if (likely(!test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
> +		error = sdp->sd_lockstruct.ls_ops->lm_plock_async(
> +			sdp->sd_lockstruct.ls_lockspace, name, file, cmd, fl);
> +	return error;
> +}
> +
>  int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		    struct file *file, struct file_lock *fl)
>  {
> diff --git a/fs/gfs2/lm.h b/fs/gfs2/lm.h
> index 21cdc30..1ddd1fd 100644
> --- a/fs/gfs2/lm.h
> +++ b/fs/gfs2/lm.h
> @@ -34,6 +34,8 @@ int gfs2_lm_plock_get(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		      struct file *file, struct file_lock *fl);
>  int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		  struct file *file, int cmd, struct file_lock *fl);
> +int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
> +			struct file *file, int cmd, struct file_lock *fl);
>  int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
>  		    struct file *file, struct file_lock *fl);
>  void gfs2_lm_recovery_done(struct gfs2_sbd *sdp, unsigned int jid,
> diff --git a/fs/gfs2/locking/dlm/lock_dlm.h b/fs/gfs2/locking/dlm/lock_dlm.h
> index 33af707..82af860 100644
> --- a/fs/gfs2/locking/dlm/lock_dlm.h
> +++ b/fs/gfs2/locking/dlm/lock_dlm.h
> @@ -179,6 +179,8 @@ int gdlm_plock_init(void);
>  void gdlm_plock_exit(void);
>  int gdlm_plock(void *, struct lm_lockname *, struct file *, int,
>  	       struct file_lock *);
> +int gdlm_plock_async(void *, struct lm_lockname *, struct file *, int,
> +		     struct file_lock *);
>  int gdlm_plock_get(void *, struct lm_lockname *, struct file *,
>  		   struct file_lock *);
>  int gdlm_punlock(void *, struct lm_lockname *, struct file *,
> diff --git a/fs/gfs2/locking/dlm/mount.c b/fs/gfs2/locking/dlm/mount.c
> index cdd1694..4339e3f 100644
> --- a/fs/gfs2/locking/dlm/mount.c
> +++ b/fs/gfs2/locking/dlm/mount.c
> @@ -244,6 +244,7 @@ const struct lm_lockops gdlm_ops = {
>  	.lm_lock = gdlm_lock,
>  	.lm_unlock = gdlm_unlock,
>  	.lm_plock = gdlm_plock,
> +	.lm_plock_async = gdlm_plock_async,
>  	.lm_punlock = gdlm_punlock,
>  	.lm_plock_get = gdlm_plock_get,
>  	.lm_cancel = gdlm_cancel,
> diff --git a/fs/gfs2/locking/dlm/plock.c b/fs/gfs2/locking/dlm/plock.c
> index 7365aec..c21e667 100644
> --- a/fs/gfs2/locking/dlm/plock.c
> +++ b/fs/gfs2/locking/dlm/plock.c
> @@ -102,6 +102,93 @@ int gdlm_plock(void *lockspace, struct lm_lockname *name,
>  	return rv;
>  }
>  
> +int gdlm_plock_async(void *lockspace, struct lm_lockname *name,
> +		     struct file *file, int cmd, struct file_lock *fl)
> +{
> +	struct gdlm_ls *ls = lockspace;
> +	struct plock_op *op;
> +	int rv;
> +
> +	op = kzalloc(sizeof(*op), GFP_KERNEL);
> +	if (!op)
> +		return -ENOMEM;
> +
> +	op->info.optype = GDLM_PLOCK_OP_LOCK;
> +	op->info.pid = fl->fl_pid;
> +	op->info.ex = (fl->fl_type == F_WRLCK);
> +	op->info.wait = IS_SETLKW(cmd);
> +	op->info.fsid = ls->id;
> +	op->info.number = name->ln_number;
> +	op->info.start = fl->fl_start;
> +	op->info.end = fl->fl_end;
> +	op->info.owner = (__u64)(long) fl->fl_owner;
> +	if (fl->fl_lmops) {
> +		op->info.callback = fl->fl_lmops->fl_notify;
> +		/* might need to make a copy */
> +		op->info.fl = fl;
> +		op->info.file = file;
> +	} else
> +		op->info.callback
Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 09:55:16AM -0500, Trond Myklebust wrote:
> > Then again, statlite and
> > readdirplus really are the most sane bits of these proposals as they
> > fit nicely into the existing set of APIs. The filehandle idiocy on
> > the other hand is way off into crackpipe land.
>
> ...
>
> a) networked filesystem specific. The mask stuff etc adds no
> value whatsoever to actual "posix" filesystems. In fact it is
> telling the kernel that it can violate posix semantics.

I don't see what's network-filesystem specific about it. Correct me if I'm wrong, but today ls -l on a local filesystem will first do readdir and then n stat calls. In the worst-case scenario this will generate n+1 disk seeks.

Local filesystems go through a lot of trouble to try to make the disk layout of the directory entries and the inodes optimal, so that readahead and caching reduce the number of seeks. With readdirplus, on the other hand, the filesystem would be able to send all the requests to the block layer, and it would be free to optimize through disk elevators and whatnot.

And this is not simply an "ls -l" optimization. Although I can no longer remember why, I think this is exactly what imap servers are doing when opening up big imap folders stored in maildir.

--
Ragnar Kjørstad
Software Engineer
Scali - http://www.scali.com
Scaling the Linux Datacenter
Re: [NFS] [PATCH 10/10] gfs2: nfs lock support for gfs2
Wendy Cheng wrote:

J. Bruce Fields wrote:
> From: J. Bruce Fields <[EMAIL PROTECTED]>
>
> From: Marc Eshel <[EMAIL PROTECTED]>
>
> Add NFS lock support to GFS2. (Untested.)

Untested? Trying to keep us busy? Sorry, forgot this is on external mailing lists. Nice piece of work! This solves a long-standing issue for our NFS servers. Thank you for doing it - we will test them out from our end too.

--
Wendy
Re: NFSv4/pNFS possible POSIX I/O API standards
Matthew Wilcox wrote:
> On Tue, Dec 05, 2006 at 10:07:48AM +0000, Christoph Hellwig wrote:
> > The filehandle idiocy on the other hand is way off into crackpipe land.
>
> Right, and it needs to be discarded. Of course, there was a real problem
> that it addressed, so we need to come up with an acceptable alternative.
> The scenario is a cluster-wide application doing simultaneous opens of
> the same file. So thousands of nodes all hitting the same DLM locks (for
> read) all at once.
>
> The openg() non-solution implies that all nodes in the cluster share the
> same filehandle space, so I think a reasonable solution can be implemented
> entirely within the clusterfs, with an extra flag to open(), say
> O_CLUSTER_WIDE. When the clusterfs sees this flag set (in ->lookup), it
> can treat it as a hint that this pathname component is likely to be
> opened again on other nodes and broadcast that fact to the other nodes
> within the cluster.
>
> Other nodes on seeing that hint (which could be structured as "The child
> "bin" of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has
> filehandle ff51a98799931256b555446b2f5675db08de6229") can keep a record
> of that fact. When they see their own open, they can populate the path
> to that file without asking the server for extra metadata. There's
> obviously security issues there (why I say 'hint' rather than 'command'),
> but there's also security problems with open-by-filehandle.
>
> Note that this solution requires no syscall changes, no application
> changes, and also helps a scenario where each node opens a different
> file in the same directory. I've never worked on a clusterfs, so there
> may be some gotchas (eg, how do you invalidate the caches of nodes when
> you do a rename). But this has to be preferable to open-by-fh.

The openg() solution has the following advantages over what you propose. First, it places the burden of the communication of the file handle on the application process, not the file system. That means less work for the file system. Second, it does not require that clients respond to unexpected network traffic. Third, the network traffic is deterministic -- one client interacts with the file system and then explicitly performs the broadcast. Fourth, it does not require that the file system store additional state on clients.

In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing the flag) would likely cause a storm of network traffic if clients were closely synchronized (which they are likely to be). We could work around this by having one process open early, then barrier, then have everyone else open, but then we might as well have just sent the handle as the barrier operation, and we've made the use of the O_CLUSTER_WIDE open() significantly more complicated for the application.

However, the application change issue is actually moot; we will make whatever changes inside our MPI-IO implementation, and many users will get the benefits for free.

The readdirplus(), readx()/writex(), and openg()/openfh() calls were all designed to allow our applications to explain exactly what they want and to allow for explicit communication. I understand that there is a tendency toward solutions where the FS guesses what the app is going to do or is passed a hint (e.g. fadvise) about what is going to happen, because these things don't require interface changes. But these solutions just aren't as effective as actually spelling out what the application wants.

Regards, Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, 2006-12-06 at 13:22 +0100, Ragnar Kjørstad wrote:
> On Tue, Dec 05, 2006 at 09:55:16AM -0500, Trond Myklebust wrote:
> > > Then again, statlite and
> > > readdirplus really are the most sane bits of these proposals as they
> > > fit nicely into the existing set of APIs. The filehandle idiocy on
> > > the other hand is way off into crackpipe land.
> >
> > ...
> >
> > a) networked filesystem specific. The mask stuff etc adds no
> > value whatsoever to actual "posix" filesystems. In fact it is
> > telling the kernel that it can violate posix semantics.
>
> I don't see what's network filesystem specific about it. Correct me if
> I'm wrong, but today ls -l on a local filesystem will first do readdir
> and then n stat calls. In the worst case scenario this will generate n+1
> disk seeks.

I was referring to the caching mask.

> Local filesystems go through a lot of trouble to try to make the disk
> layout of the directory entries and the inodes optimal so that readahead
> and caching reduces the number of seeks.
>
> With readdirplus on the other hand, the filesystem would be able to send
> all the requests to the block layer and it would be free to optimize
> through disk elevators and what not.

As far as local filesystems are concerned, the procedure is still the same: read the directory contents, then do lookup() of the files returned to you by the directory contents, then do getattr(). There is no way to magically queue up the getattr() calls before you have done the lookups, nor is there a way to magically queue up the lookups before you have read the directory contents.

Trond
Re: readdirplus() as possible POSIX I/O API
On Wed, 2006-12-06 at 03:28 -0700, Andreas Dilger wrote:
> On Dec 05, 2006 10:23 -0500, Trond Myklebust wrote:
> > On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote:
> > > I think the "barrier semantics" are something that have just crept
> > > into this discussion and is confusing the issue.
> >
> > It is the _only_ concept that is of interest for something like NFS or
> > CIFS. We already have the ability to cache the information.
>
> Actually, wouldn't the ability for readdirplus() (with a valid flag) be
> useful for NFS if only to indicate that it does not need to flush the
> cache because the ctime/mtime isn't needed by the caller?

That is why statlite() might be useful. I'd prefer something more generic, though.

> > 'find' should be quite happy with the existing readdir(). It does not
> > need to use stat() or readdirplus() in order to recurse because
> > readdir() provides d_type.
>
> It does in any but the most simplistic invocations, like "find -mtime"
> or "find -mode" or "find -uid", etc.

The only 'win' a readdirplus may give you there, as far as NFS is concerned, is the sysenter overhead that you would have for calling stat() (or statlite()) many times.

> > If the application is able to select a statlite()-type of behaviour with
> > the fadvise() hints, your filesystem could be told to serve up cached
> > information instead of regrabbing locks. In fact that is a much more
> > flexible scheme, since it also allows the filesystem to background the
> > actual inode lookups, or to defer them altogether if that is more
> > efficient.
>
> I guess I just don't understand how fadvise() on a directory file handle
> (used for readdir()) can be used to affect later stat operations (which
> definitely will NOT be using that file handle)? If you mean that the
> application should actually open() each file, fadvise(), fstat(), close(),
> instead of just a stat() call, then we are WAY into negative improvements
> here due to the overhead of doing open+close.

On the contrary, the readdir descriptor is used in all those funky new statat() calls. Ditto for readlinkat() and faccessat(). You could even have openat() turn off the close-to-open GETATTR if the readdir descriptor contained a hint that told it that was unnecessary. Furthermore, since the fadvise-like caching operation works on filehandles, you could have it work both on readdir() for the benefit of the above *at() calls, and also on the regular file descriptor for the benefit of fstat() and fgetxattr().

> > > The filesystem can't always do "stat-ahead" on the files because that
> > > requires instantiating an inode on the client which may be stale (lock
> > > revoked) by the time the app gets to it, and the app (and the VFS) have
> > > no idea just how stale it is, and whether the stat is a "real" stat or
> > > "only" the readdir stat (because the fadvise would only be useful on
> > > the directory, and not all of the child entries), so it would need to
> > > re-stat the file.
> >
> > Then provide hints that allow the app to select which behaviour it
> > prefers. Most (all?) apps don't _care_, and so would be quite happy with
> > cached information. That is why the current NFS caching model exists in
> > the first place.
>
> Most clustered filesystems have strong cache semantics, so that isn't
> a problem. IMHO, the mechanism to pass the hint to the filesystem IS
> the readdirplus_lite() that tells the filesystem exactly which data is
> needed on each directory entry.
>
> > > Also, this would potentially blow the client's real
> > > working set of inodes out of cache.
> >
> > Why?
>
> Because in many cases it is desirable to limit the number of DLM locks
> on a given client (e.g. the GFS2 thread with AKPM about clients with
> millions of DLM locks due to lack of memory pressure on large-memory
> systems). That means a finite-size lock LRU on the client that risks
> being wiped out by a few thousand files in a directory doing
> "readdir() + 5000*stat()".
>
> Consider a system like BlueGene/L with 128k compute cores. Jobs that
> run on that system will periodically (e.g. every hour) create up to 128K
> checkpoint+restart files to avoid losing a lot of computation if a node
> crashes. Even if each one of the checkpoints is in a separate directory
> (I wish all users were so nice :-) it means 128K inodes+DLM locks for
> doing an "ls" in the directory.

That is precisely the sort of situation where knowing when you can cache and when you cannot would be a plus. An ls call may not need 128k DLM locks, because it only cares about the state of the inodes as they were at the start of the opendir() call.

> > > Doing things en masse with readdirplus() also allows the filesystem to
> > > do the stat() operations in parallel internally (which is a net win if
> > > there are many servers involved) instead of serially as the application
> > > would do.
> >
> > If your application really cared, it could add threading to 'ls' to
> > achieve the same result
Re: openg
On Wed, 2006-12-06 at 11:01 +0000, Christoph Hellwig wrote:
> Say you want to look up a path /foo/bar/baz; then the access permission
> is based on the following things:
>
> - the credentials of the user. Let's only take traditional uid/gid
>   for this example, although credentials are much more complex these days
> - the kind of operation you want to perform
> - the access permission of the actual object the path points to (inode)
> - the lookup permission (x bit) for every object on the way to your object

- your private namespace particularities (submounts etc)

Trond
Re: openg
Christoph Hellwig wrote: On Tue, Dec 05, 2006 at 03:44:31PM -0600, Rob Ross wrote: The openg() really just does the lookup and permission checking. The openfh() creates the file descriptor and starts that context if the particular FS tracks that sort of thing. ... Well you've caught me. I don't want to cache the values, because I fundamentally believe that sharing state between clients and servers is braindead (to use Christoph's phrase) in systems of this scale (thousands to tens of thousands of clients). So I don't want locks, so I can't keep the cache consistent, ... So someone else will have to run the tests you propose :)... Besides the whole ugliness, you miss a few points about the fundamental architecture of the unix filesystem permission model, unfortunately. Say you want to lookup a path /foo/bar/baz, then the access permission is based on the following things: - the credentials of the user. let's only take traditional uid/gid for this example although credentials are much more complex these days - the kind of operation you want to perform - the access permission of the actual object the path points to (inode) - the lookup permission (x bit) for every object on the way to your object In your proposal sutoc is a simple conversion operation, which means openg needs to perform all these access checks and encode them in the fh_t. This is exactly right and is the intention of the call. That means an fh_t must fundamentally be an object that is kept in the kernel, aka a capability as defined by Henry Levy. This does imply you _do_ need to keep state. The fh_t is indeed a type of capability. fh_t, properly protected, could be passed into user space and validated by the file system when presented back to the file system. There is state here, clearly. I feel ok about that because we allow servers to forget that they handed out these fh_ts if they feel like it; there is no guaranteed lifetime in the current proposal.
This allows servers to come and go without needing to persistently store these. Likewise, clients can forget them with no real penalty. This approach is ok because of the use case. Because we expect the fh_t to be used relatively soon after its creation, servers will not need to hold onto these long before the openfh() is performed and we're back into a normal "everyone has a valid fd" use case. > And because it needs kernel support your fh_t is more or less equivalent to a file descriptor with sutoc equivalent to a dup variant that really duplicates the backing object instead of just the userspace index into it. Well, a FD has some additional state associated with it (position, etc.), but yes there are definitely similarities to dup(). Note somewhat similar open-by-filehandle APIs like open by inode number as used by lustre or the XFS *_by_handle APIs are privileged operations because of exactly this problem. I'm not sure why a properly protected fh_t couldn't be passed back into user space and handed around, but I'm not a security expert. What am I missing? What according to your mail is the most important bit in this proposal is that you think the filehandles should be easily shared with other systems in a cluster. That fact is not mentioned in the actual proposal at all, and is in fact the hardest part because of the inherent statefulness of the API. The documentation of the calls is complicated by the way POSIX calls are described. We need to have a second document describing use cases also available, so that we can avoid misunderstandings as best we can and get straight to the real issues. Sorry that document wasn't available. I think I've addressed the statefulness of the API above? What's the etiquette on changing subject lines here? It might be useful to separate the openg() etc. discussion from the readdirplus() etc. discussion. Changing subject lines is fine. Thanks.
Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote: > The openg() solution has the following advantages to what you propose. > First, it places the burden of the communication of the file handle on > the application process, not the file system. That means less work for > the file system. Second, it does not require that clients respond to > unexpected network traffic. Third, the network traffic is deterministic > -- one client interacts with the file system and then explicitly > performs the broadcast. Fourth, it does not require that the file system > store additional state on clients. You didn't address the disadvantages I pointed out on December 1st in a mail to the posix mailing list: : I now understand this not so much as a replacement for dup() but in : terms of being able to open by NFS filehandle, or inode number. The : fh_t is presumably generated by the underlying cluster filesystem, and : is a handle that has meaning on all nodes that are members of the : cluster. : : I think we need to consider security issues (that have also come up : when open-by-inode-number was proposed). For example, how long is the : fh_t intended to be valid for? Forever? Until the cluster is rebooted? : Could the fh_t be used by any user, or only those with credentials to : access the file? What happens if we revoke() the original fd? : : I'm a little concerned about the generation of a suitable fh_t. : In the implementation of sutoc(), how does the kernel know which : filesystem to ask to translate it? It's not impossible (though it is : implausible) that an fh_t could be meaningful to more than one : filesystem. 
: : One possibility of fixing this could be to use a magic number at the : beginning of the fh_t to distinguish which filesystem this belongs : to (a list of currently-used magic numbers in Linux can be found at : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h) Christoph has also touched on some of these points, and added some I missed. > In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing > the flag) would likely cause a storm of network traffic if clients were > closely synchronized (which they are likely to be). I think you're referring to a naive application, rather than a naive cluster filesystem, right? There's several ways to fix that problem, including throttling broadcasts of information, having nodes ask their immediate neighbours if they have a cache of the information, and having the server not respond (wait for a retransmit) if it's recently sent out a broadcast. > However, the application change issue is actually moot; we will make > whatever changes inside our MPI-IO implementation, and many users will > get the benefits for free. That's good. > The readdirplus(), readx()/writex(), and openg()/openfh() were all > designed to allow our applications to explain exactly what they wanted > and to allow for explicit communication. I understand that there is a > tendency toward solutions where the FS guesses what the app is going to > do or is passed a hint (e.g. fadvise) about what is going to happen, > because these things don't require interface changes. But these > solutions just aren't as effective as actually spelling out what the > application wants. Sure, but I think you're emphasising "these interfaces let us get our job done" over the legitimate concerns that we have. I haven't really looked at the readdirplus() or readx()/writex() interfaces, but the security problems with openg() makes me think you haven't really looked at it from the "what could go wrong" perspective. 
I'd be interested in reviewing the readx()/writex() interfaces, but still don't see a document for them anywhere.
Re: readdirplus() as possible POSIX I/O API
Sage Weil wrote: On Mon, 4 Dec 2006, Peter Staubach wrote: I think that there are several points which are missing here. First, readdirplus(), without any sort of caching, is going to be _very_ expensive, performance-wise, for _any_ size directory. You can see this by instrumenting any NFS server which already supports the NFSv3 READDIRPLUS semantics. Are you referring to the work the server must do to gather stat information for each inode? Yes, and the fact that the client will be forced to go over the wire for each readdirplus() call, whereas it can use cached information today. An application actually waiting on the response to a READDIRPLUS will not be pleased at the resulting performance. Second, the NFS client side readdirplus() implementation is going to be _very_ expensive as well. The NFS client does write-behind and all this data _must_ be flushed to the server _before_ the over the wire READDIRPLUS can be issued. This means that the client will have to step through every inode which is associated with the directory inode being readdirplus()'d and ensure that all modified data has been successfully written out. This part of the operation, for a sufficiently large directory and a sufficiently large page cache, could take significant time in itself. Why can't the client send the over the wire READDIRPLUS without flushing inode data, and then simply ignore the stat portion of the server's response in instances where its locally cached (and dirty) inode data is newer than the server's? This would seem to minimize the value as far as I understand the requirements here. These overheads may make this new operation expensive enough that no applications will end up using it. If the application calls readdirplus() only when it would otherwise do readdir()+stat(), the flushing you mention would happen anyway (from the stat()). Wouldn't this at least allow that to happen in parallel for the whole directory? I don't see where the parallelism comes from.
Before issuing the READDIRPLUS over the wire, the client would have to ensure that each and every one of those flushes was completed. I suppose that a sufficiently clever and complex implementation could figure out how to schedule all those flushes asynchronously and then wait for all of them to complete, but there will be a performance cost. Walking the caches for all of those inodes, perhaps using several or all of the cpus in the system, smacking the server with all of those WRITE operations simultaneously with all of the associated network bandwidth usage, all adds up to other applications on the client and potentially the network not doing much at the same time. All of this cost to the system and to the network for the benefit of a single application? That seems like a tough sell to me. This is an easy problem to look at from the application viewpoint. The solution seems obvious. Give it the fastest possible way to read the directory and retrieve stat information about every entry in the directory. However, when viewed from a systemic level, this becomes a very different problem with many more aspects. Perhaps flow controlling this one application in favor of many other applications, running network wide, may be the better thing to continue to do. I dunno. ps
openg and path_to_handle
David Chinner wrote: On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: Hi, I agree that it is not feasible to add new system calls every time somebody has a problem, and we don't take adding system calls lightly. However, in this case we're talking about an entire *community* of people (high-end computing), not just one or two people. Of course it may still be the case that that community is not important enough to justify the addition of system calls; that's obviously not my call to make! I have the feeling that the openg stuff is rushed without looking into all the solutions that don't require changes to the current interface. I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle for the libhandle man page. Basically: openg == path_to_handle sutoc == open_by_handle And here for the userspace code: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ Cheers, Dave. Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. Regards, Rob
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 12:34:20AM -0500, J. Bruce Fields wrote:

> +int gdlm_plock_callback(struct plock_op *op)
> +{
> +	struct file *file;
> +	struct file_lock *fl;
> +	int rv;
> +
> +	spin_lock(&ops_lock);
> +	if (!list_empty(&op->list)) {
> +		printk(KERN_INFO "plock op on list\n");
> +		list_del(&op->list);
> +	}
> +	spin_unlock(&ops_lock);
> +
> +	rv = op->info.rv;
> +
> +	if (!rv) {
> +		/* check if the following are still valid or make a copy */
> +		file = op->info.file;
> +		fl = op->info.fl;
> +
> +		if (posix_lock_file_wait(file, fl) < 0)
> +			log_error("gdlm_plock: vfs lock error file %p fl %p",
> +				  file, fl);
> +	}
> +
> +	kfree(op);
> +	return rv;
> +}

..

> +	if (found) {
> +		if (op->info.callback)
> +			gdlm_plock_callback(op);
> +		else
> +			wake_up(&recv_wq);
> +	}

The gfs side looks fine to me. Did you forget to call fl_notify from gdlm_plock_callback() or am I missing something?

Dave
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > David Chinner wrote: > >Does anyone here know about the XFS libhandle API? This has been > >around for years and it does _exactly_ what these proposed syscalls > >are supposed to do (and more). > > Thanks for pointing these out Dave. These are indeed along the same > lines as the openg()/openfh() approach. > > One difference is that they appear to perform permission checking on the > open_by_handle(), which means that the entire path needs to be encoded > in the handle, and makes it difficult to eliminate the path traversal > overhead on N open_by_handle() operations. Another (and highly important) difference is that usage is restricted to root:

	xfs_open_by_handle(...)
	...
		if (!capable(CAP_SYS_ADMIN))
			return -XFS_ERROR(EPERM);
Re: NFSv4/pNFS possible POSIX I/O API standards
Matthew Wilcox wrote: On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote: The openg() solution has the following advantages to what you propose. First, it places the burden of the communication of the file handle on the application process, not the file system. That means less work for the file system. Second, it does not require that clients respond to unexpected network traffic. Third, the network traffic is deterministic -- one client interacts with the file system and then explicitly performs the broadcast. Fourth, it does not require that the file system store additional state on clients. You didn't address the disadvantages I pointed out on December 1st in a mail to the posix mailing list: I coincidentally just wrote about some of this in another email. Wasn't trying to avoid you... : I now understand this not so much as a replacement for dup() but in : terms of being able to open by NFS filehandle, or inode number. The : fh_t is presumably generated by the underlying cluster filesystem, and : is a handle that has meaning on all nodes that are members of the : cluster. Exactly. : I think we need to consider security issues (that have also come up : when open-by-inode-number was proposed). For example, how long is the : fh_t intended to be valid for? Forever? Until the cluster is rebooted? : Could the fh_t be used by any user, or only those with credentials to : access the file? What happens if we revoke() the original fd? The fh_t would be validated either (a) when the openfh() is called, or on accesses using the associated capability. As Christoph pointed out, this really is a capability and encapsulates everything necessary for a particular user to access a particular file. It can be handed to others, and in fact that is a critical feature for our use case. After the openfh(), the access model is identical to a previously open()ed file. So the question is what happens between the openg() and the openfh(). 
Our intention was to allow servers to "forget" these fh_ts at will. So a revoke between openg() and openfh() would kill the fh_t, and the subsequent openfh() would fail, or subsequent accesses would fail (depending on when the FS chose to validate). Does this help? : I'm a little concerned about the generation of a suitable fh_t. : In the implementation of sutoc(), how does the kernel know which : filesystem to ask to translate it? It's not impossible (though it is : implausible) that an fh_t could be meaningful to more than one : filesystem. > : : One possibility of fixing this could be to use a magic number at the : beginning of the fh_t to distinguish which filesystem this belongs : to (a list of currently-used magic numbers in Linux can be found at : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h) Christoph has also touched on some of these points, and added some I missed. We could use advice on this point. Certainly it's possible to encode information about the FS from which the fh_t originated, but we haven't tried to spell out exactly how that would happen. Your approach described here sounds good to me. In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing the flag) would likely cause a storm of network traffic if clients were closely synchronized (which they are likely to be). I think you're referring to a naive application, rather than a naive cluster filesystem, right? There's several ways to fix that problem, including throttling broadcasts of information, having nodes ask their immediate neighbours if they have a cache of the information, and having the server not respond (wait for a retransmit) if it's recently sent out a broadcast. Yes, naive application. You're right that the file system could adapt to this, but on the other hand if we were explicitly passing the fh_t in user space, we could just use MPI_Bcast and be done with it, with an algorithm that is well-matched to the system, etc. 
However, the application change issue is actually moot; we will make whatever changes inside our MPI-IO implementation, and many users will get the benefits for free. That's good. Absolutely. Same goes for readx()/writex() also, BTW, at least for MPI-IO users. We will build the input parameters inside MPI-IO using existing information from users, rather than applying data sieving or using multiple POSIX calls. The readdirplus(), readx()/writex(), and openg()/openfh() were all designed to allow our applications to explain exactly what they wanted and to allow for explicit communication. I understand that there is a tendency toward solutions where the FS guesses what the app is going to do or is passed a hint (e.g. fadvise) about what is going to happen, because these things don't require interface changes. But these solutions just aren't as effective as actually spelling out what the application wants. Sure, but I think you're emphasising "these interfaces let us get our job done" over the legitimate concerns that we have.
Re: openg and path_to_handle
Matthew Wilcox wrote: On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: David Chinner wrote: Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. Another (and highly important) difference is that usage is restricted to root:

	xfs_open_by_handle(...)
	...
		if (!capable(CAP_SYS_ADMIN))
			return -XFS_ERROR(EPERM);

I assume that this is because the implementation chose not to do the path encoding in the handle? Because if they did, they could do full path permission checking as part of the open_by_handle. Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
Trond Myklebust wrote: On Tue, 2006-12-05 at 16:11 -0600, Rob Ross wrote: Trond Myklebust wrote: b) quite unnatural to impose caching semantics on all the directory _entries_ using a syscall that refers to the directory itself (see the explanations by both myself and Peter Staubach of the synchronisation difficulties). Consider in particular that it is quite possible for directory contents to change in between readdirplus calls. I want to make sure that I understand this correctly. NFS semantics dictate that if someone stat()s a file, all changes from that client need to be propagated to the server? And this call complicates that semantic because now there's an operation on a different object (the directory) that would cause this flush on the files? The only way for an NFS client to obey the POSIX requirement that write() immediately updates the mtime/ctime is to flush out all cached writes whenever the user requests stat() information. Thanks for explaining this. I've never understood how it is decided where the line is drawn with respect to where NFS does obey POSIX semantics for a particular implementation. Writes on other nodes wouldn't necessarily have updated mtime/ctime, right? i.e. the "strict posix caching model" is pretty much impossible to implement on something like NFS or CIFS using these semantics. Why then even bother to have "masks" to tell you when it is OK to violate said strict model. We're trying to obtain improved performance for distributed file systems with stronger consistency guarantees than these two. So you're saying I should ignore this thread. Fine... No, I'm trying to explain when the calls might be useful. But if you are only interested in NFS and CIFS, then I guess the thread might not be very interesting. c) Says nothing about what should happen to non-stat() metadata such as ACL information and other extended attributes (for example future selinux context info). You would think that the 'ls -l' application would care about this.
Honestly, we hadn't thought about other non-stat() metadata because we didn't think it was part of the use case, and we were trying to stay close to the flavor of POSIX. If you have ideas here, we'd like to hear them. See my previous postings. I'll do that. Thanks. Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
Andreas Dilger wrote: Does this mean you are against the statlite() API entirely, or only against the document's use of the flag as a vague "accuracy" value instead of a hard "valid" value? I'm against fuzzy values. I've no problems with a bitmap specifying that certain members are not wanted or wanted (probably the latter, zero meaning the optional fields are not wanted). IMHO, if the application doesn't need a particular field (e.g. "ls -i" doesn't need size, "ls -s" doesn't need the inode number, etc) why should these be filled in if they are not easily accessible? As for what is easily accessible, that needs to be determined by the filesystem itself. Is the size not easily accessible? It would surprise me. If yes, then, by all means add it to the list. I'm not against extending the list of members which are optional if it makes sense. But certain information is certainly always easily accessible. That was previously suggested by me already. IMHO, there should ONLY be a statlite variant of readdirplus(), and I think most people agree with that part of it (though there is contention on whether readdirplus() is needed at all). Indeed. Given there is statlite and we have d_type information, in most situations we won't need more complete stat information. Outside of programs like ls that is. Part of why I wished the lab guys had submitted the draft to the OpenGroup first is that this way they would have to be more detailed on why each and every interface they propose for adding is really needed. Maybe they can do it now and here. What programs really require readdirplus? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: NFSv4/pNFS possible POSIX I/O API standards
Ulrich Drepper wrote: Andreas Dilger wrote: Does this mean you are against the statlite() API entirely, or only against the document's use of the flag as a vague "accuracy" value instead of a hard "valid" value? I'm against fuzzy values. I've no problems with a bitmap specifying that certain members are not wanted or wanted (probably the latter, zero meaning the optional fields are not wanted). Thanks for clarifying. IMHO, if the application doesn't need a particular field (e.g. "ls -i" doesn't need size, "ls -s" doesn't need the inode number, etc) why should these be filled in if they are not easily accessible? As for what is easily accessible, that needs to be determined by the filesystem itself. Is the size not easily accessible? It would surprise me. If yes, then, by all means add it to the list. I'm not against extending the list of members which are optional if it makes sense. But certain information is certainly always easily accessible. File size is definitely one of the more difficult of the parameters, either because (a) it isn't stored in one place but is instead derived, or (b) because a lock has to be obtained to guarantee consistency of the returned value. That was previously suggested by me already. IMHO, there should ONLY be a statlite variant of readdirplus(), and I think most people agree with that part of it (though there is contention on whether readdirplus() is needed at all). Indeed. Given there is statlite and we have d_type information, in most situations we won't need more complete stat information. Outside of programs like ls that is. > Part of why I wished the lab guys had submitted the draft to the OpenGroup first is that this way they would have to be more detailed on why each and every interface they propose for adding is really needed. Maybe they can do it now and here. What programs really require readdirplus? I can't speak for everyone, but "ls" is the #1 consumer as far as I am concerned.
Regards, Rob
Re: NFSv4/pNFS possible POSIX I/O API standards
Rob Ross wrote: File size is definitely one of the more difficult of the parameters, either because (a) it isn't stored in one place but is instead derived, or (b) because a lock has to be obtained to guarantee consistency of the returned value. OK, and looking at the man page again, it is already on the list in the old proposal and hence optional. I've no problem with that. I can't speak for everyone, but "ls" is the #1 consumer as far as I am concerned. So a syscall for ls alone? I think this is more a user problem. For normal plain old 'ls' you get by with readdir. For 'ls -F' and 'ls --color' you mostly get by with readdir+d_type. If you cannot provide d_type info the readdirplus extension does you no good. For the cases when an additional stat is needed (for symlinks, for instance, to test whether they are dangling) readdirplus won't help. So, readdirplus is really only useful for 'ls -l'. But then you need st_size and st_?time. So what is gained with readdirplus? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Dec 06, 2006 at 09:42:55AM -0800, Ulrich Drepper wrote: > >I can't speak for everyone, but "ls" is the #1 consumer as far as I am > >concerned. > > So a syscall for ls alone? I guess the code needs to be checked, but I would think that: * ls * find * rm -r * chown -R * chmod -R * rsync * various backup software * imap servers are all likely users of readdirplus. Of course the ones that spend the majority of the time doing stat are the ones that would benefit more. -- Ragnar Kjørstad Software Engineer Scali - http://www.scali.com Scaling the Linux Datacenter
Re: NFSv4/pNFS possible POSIX I/O API standards
Ragnar Kjørstad wrote: I guess the code needs to be checked, but I would think that: * ls * find * rm -r * chown -R * chmod -R * rsync * various backup software * imap servers Then somebody do the analysis. And please provide an analysis which takes into account that some programs might need to be adapted to take advantage of d_type or non-optional data from the proposed statlite. Plus, how often are these commands really used on such filesystems? I'd hope that chown -R or so is a once in a lifetime thing on such filesystems and not worth optimizing for. I'd suggest until such data is provided the readdirplus plans are put on hold. statlite I have no problems with if the semantics are changed as I explained. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote: > The gfs side looks fine to me. Did you forget to call fl_notify from > gdlm_plock_callback() or am I missing something? Yes, looks like we missed something, thanks. This code's an rfc (not even tested), so don't apply it yet! What we should have there is something like:

	rv = op->info.rv;

	if (fl_notify(fl, NULL, rv)) {
		/* XXX: We need to cancel the lock here: */
		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
	}

	if (!rv) {
		/* check if the following are still valid or make a copy */
		file = op->info.file;
		fl = op->info.fl;

		if (posix_lock_file_wait(file, fl) < 0)
			log_error("gdlm_plock: vfs lock error file %p fl %p",
				  file, fl);
	}

Note there's a race condition--that calls fl_notify before actually getting the lock locally. I don't *think* that's a problem, as long as it's the filesystem and not the local lock list that's authoritative when it comes to who gets a posix lock. The more annoying problem is the need to cancel the GFS lock when fl_notify fails; is that something that it's possible for GFS to do? It can fail because lockd has a timeout--it waits a few seconds for the callback, then gives up and returns a failure to the user. If that happens after your userspace posix lock manager acquires the lock (but before fl_notify is called) then you've got to cancel it somehow. --b.
Re: [NFS] [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote:
> > The gfs side looks fine to me. Did you forget to call fl_notify from
> > gdlm_plock_callback() or am I missing something?
>
> Yes, looks like we missed something, thanks. This code's an rfc (not
> even tested), so don't apply it yet! What we should have there is
> something like:
>
> 	rv = op->info.rv;
>
> 	if (fl_notify(fl, NULL, rv)) {
> 		/* XXX: We need to cancel the lock here: */
> 		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
> 	}

(And note in the patch I sent out I stuck this in the wrong place--in the synchronous instead of the asynchronous code. So the patch should have been something closer to the following.)

--b.

diff --git a/fs/gfs2/lm.c b/fs/gfs2/lm.c
index effe4a3..cf7fd52 100644
--- a/fs/gfs2/lm.c
+++ b/fs/gfs2/lm.c
@@ -197,6 +197,16 @@ int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 	return error;
 }
 
+int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
+			struct file *file, int cmd, struct file_lock *fl)
+{
+	int error = -EIO;
+	if (likely(!test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
+		error = sdp->sd_lockstruct.ls_ops->lm_plock_async(
+			sdp->sd_lockstruct.ls_lockspace, name, file, cmd, fl);
+	return error;
+}
+
 int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		    struct file *file, struct file_lock *fl)
 {
diff --git a/fs/gfs2/lm.h b/fs/gfs2/lm.h
index 21cdc30..1ddd1fd 100644
--- a/fs/gfs2/lm.h
+++ b/fs/gfs2/lm.h
@@ -34,6 +34,8 @@ int gfs2_lm_plock_get(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		      struct file *file, struct file_lock *fl);
 int gfs2_lm_plock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		  struct file *file, int cmd, struct file_lock *fl);
+int gfs2_lm_plock_async(struct gfs2_sbd *sdp, struct lm_lockname *name,
+			struct file *file, int cmd, struct file_lock *fl);
 int gfs2_lm_punlock(struct gfs2_sbd *sdp, struct lm_lockname *name,
 		    struct file *file, struct file_lock *fl);
 void gfs2_lm_recovery_done(struct gfs2_sbd *sdp, unsigned int jid,
diff --git a/fs/gfs2/locking/dlm/lock_dlm.h b/fs/gfs2/locking/dlm/lock_dlm.h
index 33af707..82af860 100644
--- a/fs/gfs2/locking/dlm/lock_dlm.h
+++ b/fs/gfs2/locking/dlm/lock_dlm.h
@@ -179,6 +179,8 @@ int gdlm_plock_init(void);
 void gdlm_plock_exit(void);
 int gdlm_plock(void *, struct lm_lockname *, struct file *, int,
 	       struct file_lock *);
+int gdlm_plock_async(void *, struct lm_lockname *, struct file *, int,
+		     struct file_lock *);
 int gdlm_plock_get(void *, struct lm_lockname *, struct file *,
 		   struct file_lock *);
 int gdlm_punlock(void *, struct lm_lockname *, struct file *,
diff --git a/fs/gfs2/locking/dlm/mount.c b/fs/gfs2/locking/dlm/mount.c
index cdd1694..4339e3f 100644
--- a/fs/gfs2/locking/dlm/mount.c
+++ b/fs/gfs2/locking/dlm/mount.c
@@ -244,6 +244,7 @@ const struct lm_lockops gdlm_ops = {
 	.lm_lock = gdlm_lock,
 	.lm_unlock = gdlm_unlock,
 	.lm_plock = gdlm_plock,
+	.lm_plock_async = gdlm_plock_async,
 	.lm_punlock = gdlm_punlock,
 	.lm_plock_get = gdlm_plock_get,
 	.lm_cancel = gdlm_cancel,
diff --git a/fs/gfs2/locking/dlm/plock.c b/fs/gfs2/locking/dlm/plock.c
index 7365aec..f91a18a 100644
--- a/fs/gfs2/locking/dlm/plock.c
+++ b/fs/gfs2/locking/dlm/plock.c
@@ -102,6 +102,94 @@ int gdlm_plock(void *lockspace, struct lm_lockname *name,
 	return rv;
 }
 
+int gdlm_plock_async(void *lockspace, struct lm_lockname *name,
+		     struct file *file, int cmd, struct file_lock *fl)
+{
+	struct gdlm_ls *ls = lockspace;
+	struct plock_op *op;
+	int rv;
+
+	op = kzalloc(sizeof(*op), GFP_KERNEL);
+	if (!op)
+		return -ENOMEM;
+
+	op->info.optype = GDLM_PLOCK_OP_LOCK;
+	op->info.pid = fl->fl_pid;
+	op->info.ex = (fl->fl_type == F_WRLCK);
+	op->info.wait = IS_SETLKW(cmd);
+	op->info.fsid = ls->id;
+	op->info.number = name->ln_number;
+	op->info.start = fl->fl_start;
+	op->info.end = fl->fl_end;
+	op->info.owner = (__u64)(long) fl->fl_owner;
+	if (fl->fl_lmops) {
+		op->info.callback = fl->fl_lmops->fl_notify;
+		/* might need to make a copy */
+		op->info.fl = fl;
+		op->info.file = file;
+	} else
+		op->info.callback = NULL;
+
+	send_op(op);
+
+	if (op->info.callback == NULL)
+		wait_event(recv_wq, (op->done != 0));
+	else
+		return -EINPROGRESS;
+
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > David Chinner wrote: > >On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: > >>On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: > >>>Hi, > >>> > >>>I agree that it is not feasible to add new system calls every time > >>>somebody has a problem, and we don't take adding system calls lightly. > >>>However, in this case we're talking about an entire *community* of people > >>>(high-end computing), not just one or two people. Of course it may still > >>>be the case that that community is not important enough to justify the > >>>addition of system calls; that's obviously not my call to make! > >>I have the feeling that openg stuff is rushed without looking into all > >>solutions, that don't require changes to the current interface. > > > >I also get the feeling that interfaces that already do this open-by-handle > >stuff haven't been explored either. > > > >Does anyone here know about the XFS libhandle API? This has been around for > >years and it does _exactly_ what these proposed syscalls are supposed to do > >(and more). > > > >See: > > > >http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle > > > >For the libhandle man page. Basically: > > > >openg == path_to_handle sutoc == open_by_handle > > > >And here for the userspace code: > > > >http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ > > > >Cheers, > > > >Dave. > > Thanks for pointing these out Dave. These are indeed along the same lines as > the openg()/openfh() approach. > > One difference is that they appear to perform permission checking on the > open_by_handle(), which means that the entire path needs to be encoded in > the handle, and makes it difficult to eliminate the path traversal overhead > on N open_by_handle() operations. 
open_by_handle() is checking the inode flags for things like immutability and whether the inode is writable to determine if the open mode is valid given these flags. It's not actually checking permissions. IOWs, open_by_handle() has the same overhead as NFS filehandle to inode translation; i.e. no path traversal on open.

Permission checks are done on the path_to_handle(), so in reality only root or CAP_SYS_ADMIN users can currently use the open_by_handle interface because of this lack of checking. Given that our current users of this interface need root permissions to do other things (data migration), this has never been an issue.

This is an implementation detail - it is possible that the file handle, being opaque, could encode a UID/GID of the user that constructed the handle and then allow any process with the same UID/GID to use open_by_handle() on that handle. (I think hch has already pointed this out.)

Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: openg and path_to_handle
On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > Permission checks are done on the path_to_handle(), so in reality > only root or CAP_SYS_ADMIN users can currently use the > open_by_handle interface because of this lack of checking. Given > that our current users of this interface need root permissions to do > other things (data migration), this has never been an issue. > > This is an implementation detail - it is possible that file handle, > being opaque, could encode a UID/GID of the user that constructed > the handle and then allow any process with the same UID/GID to use > open_by_handle() on that handle. (I think hch has already pointed > this out.) While it could do that, I'd be interested to see how you'd construct the handle such that it's immune to a malicious user tampering with it, or saving it across a reboot, or constructing one from scratch. I suspect any real answer to this would have to involve cryptographical techniques (say, creating a secure hash of the information plus a boot-time generated nonce). Now you're starting to use a lot of bits, and compute time, and you'll need to be sure to keep the nonce secret.
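Matthew's tamper-resistance point can be made concrete with a small user-space sketch: seal the handle payload with a keyed digest over the payload plus a boot-time nonce, and reject any handle whose digest no longer matches. This is illustrative only -- the "MAC" below is a keyed FNV-1a, not a cryptographic primitive (a real implementation would use HMAC with a proper hash), and all names (struct handle, handle_seal, handle_verify) are hypothetical.

```c
/*
 * Illustrative sketch only: a tamper-evident handle of the form
 * payload || digest(payload, nonce).  The keyed FNV-1a used here is
 * NOT cryptographically secure; it just shows the structure.
 */
#include <assert.h>
#include <stdint.h>

struct handle {
	uint64_t ino;	/* what the handle refers to */
	uint32_t uid;	/* who constructed it */
	uint64_t mac;	/* keyed digest over the fields above */
};

/* toy keyed digest: FNV-1a over the payload bytes, then the nonce bytes */
static uint64_t handle_digest(uint64_t ino, uint32_t uid, uint64_t nonce)
{
	uint64_t h = 1469598103934665603ULL;
	int i;

	for (i = 0; i < 8; i++) { h ^= (ino >> (8 * i)) & 0xff; h *= 1099511628211ULL; }
	for (i = 0; i < 4; i++) { h ^= (uid >> (8 * i)) & 0xff; h *= 1099511628211ULL; }
	for (i = 0; i < 8; i++) { h ^= (nonce >> (8 * i)) & 0xff; h *= 1099511628211ULL; }
	return h;
}

static void handle_seal(struct handle *h, uint64_t nonce)
{
	h->mac = handle_digest(h->ino, h->uid, nonce);
}

/* only handles sealed with the same (secret) boot-time nonce verify */
static int handle_verify(const struct handle *h, uint64_t nonce)
{
	return h->mac == handle_digest(h->ino, h->uid, nonce);
}
```

Note how the nonce in the digest makes both attacks Matthew lists visible: tampering with the payload changes the expected digest, and a reboot (new nonce) invalidates every previously issued handle.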
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 10:20:23AM -0600, Rob Ross wrote: > Matthew Wilcox wrote: > >On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > >>David Chinner wrote: > >>>Does anyone here know about the XFS libhandle API? This has been > >>>around for years and it does _exactly_ what these proposed syscalls > >>>are supposed to do (and more). > >>Thanks for pointing these out Dave. These are indeed along the same > >>lines as the openg()/openfh() approach. > >> > >>One difference is that they appear to perform permission checking on the > >>open_by_handle(), which means that the entire path needs to be encoded > >>in the handle, and makes it difficult to eliminate the path traversal > >>overhead on N open_by_handle() operations. > > > >Another (and highly important) difference is that usage is restricted to > >root: > > > >xfs_open_by_handle(...) > >... > >if (!capable(CAP_SYS_ADMIN)) > > return -XFS_ERROR(EPERM); > > I assume that this is because the implementation chose not to do the > path encoding in the handle? Because if they did, they could do full > path permission checking as part of the open_by_handle. The original use of this interface (if I understand the Irix history correctly - this is way before my time at SGI) was a userspace NFS server and so permission checks were done after the filehandle was opened and a stat could be done on the fd and mode/uid/gid could be compared to what was in the NFS request. Paths were never needed for this because everything needed could be obtained directly from the inode. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
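The post-open check Dave describes for the old Irix userspace NFS server -- open by handle first, then fstat() the fd and compare ownership and mode bits against the credentials carried in the request -- might look roughly like this user-space sketch (creds_allowed is a hypothetical helper, not code from that server):

```c
/*
 * Sketch (hypothetical) of a post-open permission check: no path is
 * needed -- everything comes from the already-open fd via fstat(),
 * compared against the uid/gid carried in the NFS request.
 */
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int creds_allowed(int fd, uid_t req_uid, gid_t req_gid, int want_write)
{
	struct stat st;
	unsigned cls;

	if (fstat(fd, &st) < 0)
		return 0;
	/* pick the owner, group, or other permission bits, like the VFS does */
	if (st.st_uid == req_uid)
		cls = st.st_mode >> 6;
	else if (st.st_gid == req_gid)
		cls = st.st_mode >> 3;
	else
		cls = st.st_mode;
	return want_write ? (cls & 02) != 0 : (cls & 04) != 0;
}
```

This is why "paths were never needed": the inode behind the fd already carries everything the check requires.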
Re: openg and path_to_handle
David Chinner wrote: On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: David Chinner wrote: On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: Hi, I agree that it is not feasible to add new system calls every time somebody has a problem, and we don't take adding system calls lightly. However, in this case we're talking about an entire *community* of people (high-end computing), not just one or two people. Of course it may still be the case that that community is not important enough to justify the addition of system calls; that's obviously not my call to make! I have the feeling that openg stuff is rushed without looking into all solutions, that don't require changes to the current interface. I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle For the libhandle man page. Basically: openg == path_to_handle sutoc == open_by_handle And here for the userspace code: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ Cheers, Dave. Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. open_by_handle() is checking the inode flags for things like immutibility and whether the inode is writable to determine if the open mode is valid given these flags. It's not actually checking permissions. 
IOWs, open_by_handle() has the same overhead as NFS filehandle to inode translation; i.e. no path traversal on open. Permission checks are done on the path_to_handle(), so in reality only root or CAP_SYS_ADMIN users can currently use the open_by_handle interface because of this lack of checking. Given that our current users of this interface need root permissions to do other things (data migration), this has never been an issue. This is an implementation detail - it is possible that file handle, being opaque, could encode a UID/GID of the user that constructed the handle and then allow any process with the same UID/GID to use open_by_handle() on that handle. (I think hch has already pointed this out.) Cheers, Dave. Thanks for the clarification Dave. So I take it that you would be interested in this type of functionality then? Regards, Rob
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote:
> > The gfs side looks fine to me. Did you forget to call fl_notify from
> > gdlm_plock_callback() or am I missing something?
>
> Yes, looks like we missed something, thanks. This code's an rfc (not
> even tested), so don't apply it yet! What we should have there is
> something like:
>
> 	rv = op->info.rv;
>
> 	if (fl_notify(fl, NULL, rv)) {
> 		/* XXX: We need to cancel the lock here: */
> 		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
> 	}
>
> 	if (!rv) {
> 		/* check if the following are still valid or make a copy */
> 		file = op->info.file;
> 		fl = op->info.fl;
>
> 		if (posix_lock_file_wait(file, fl) < 0)
> 			log_error("gdlm_plock: vfs lock error file %p fl %p",
> 				  file, fl);
> 	}
>
> Note there's a race condition--that calls fl_notify before actually
> getting the lock locally. I don't *think* that's a problem, as long as
> it's the filesystem and not the local lock list that's authoritative
> when it comes to who gets a posix lock.

agree

> The more annoying problem is the need to cancel the GFS lock when
> fl_notify fails; is that something that it's possible for GFS to do?
>
> It can fail because lockd has a timeout--it waits a few seconds for the
> callback, then gives up and returns a failure to the user. If that
> happens after your userspace posix lock manager acquires the lock (but
> before fl_notify is called) then you've got to cancel it somehow.

I'd think we could just send an unlock for it at that point. Set up an op with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing and call send_op(). We probably need to flag this internal-unlock op so that when the result arrives, device_write() delists and frees it itself.

(I wouldn't call this "canceling", I think of cancel as trying to force a blocked request to return/fail prematurely.)
Dave
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 02:50:49PM -0600, Rob Ross wrote: > David Chinner wrote: > >On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > >>David Chinner wrote: > >>>Does anyone here know about the XFS libhandle API? This has been around > >>>for > >>>years and it does _exactly_ what these proposed syscalls are supposed to > >>>do > >>>(and more). > >>Thanks for pointing these out Dave. These are indeed along the same lines > >>as > >>the openg()/openfh() approach. > >> > >>One difference is that they appear to perform permission checking on the > >>open_by_handle(), which means that the entire path needs to be encoded in > >>the handle, and makes it difficult to eliminate the path traversal > >>overhead > >>on N open_by_handle() operations. > > > >open_by_handle() is checking the inode flags for things like > >immutibility and whether the inode is writable to determine if the > >open mode is valid given these flags. It's not actually checking > >permissions. IOWs, open_by_handle() has the same overhead as NFS > >filehandle to inode translation; i.e. no path traversal on open. > > > >Permission checks are done on the path_to_handle(), so in reality > >only root or CAP_SYS_ADMIN users can currently use the > >open_by_handle interface because of this lack of checking. Given > >that our current users of this interface need root permissions to do > >other things (data migration), this has never been an issue. > > > >This is an implementation detail - it is possible that file handle, > >being opaque, could encode a UID/GID of the user that constructed > >the handle and then allow any process with the same UID/GID to use > >open_by_handle() on that handle. (I think hch has already pointed > >this out.) > > Thanks for the clarification Dave. So I take it that you would be > interested in this type of functionality then? Not really - just trying to help by pointing out something no-one seemed to know about Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 01:50:24PM -0700, Matthew Wilcox wrote: > On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > > Permission checks are done on the path_to_handle(), so in reality > > only root or CAP_SYS_ADMIN users can currently use the > > open_by_handle interface because of this lack of checking. Given > > that our current users of this interface need root permissions to do > > other things (data migration), this has never been an issue. > > > > This is an implementation detail - it is possible that file handle, > > being opaque, could encode a UID/GID of the user that constructed > > the handle and then allow any process with the same UID/GID to use > > open_by_handle() on that handle. (I think hch has already pointed > > this out.) > > While it could do that, I'd be interested to see how you'd construct > the handle such that it's immune to a malicious user tampering with it, > or saving it across a reboot, or constructing one from scratch. > > I suspect any real answer to this would have to involve cryptographical > techniques (say, creating a secure hash of the information plus a > boot-time generated nonce). Now you're starting to use a lot of bits, > and compute time, and you'll need to be sure to keep the nonce secret. An auth header and GSS-API integration would probably be the way to go here if you really care. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 02:58:22PM -0600, David Teigland wrote: > On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote: > > The more annoying problem is the need to cancel the GFS lock when > > fl_notify fails; is that something that it's possible for GFS to do? > > > > It can fail because lockd has a timeout--it waits a few seconds for the > > callback, then gives up and returns a failure to the user. If that > > happens after your userspace posix lock manager acquires the lock (but > > before fl_notify is called) then you've got to cancel it somehow. > > I'd think we could just send an unlock for it at that point. Set up an op > with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing > and call send_op(). We probably need to flag this internal-unlock op so > that when the result arrives, device_write() delists and frees it itself. > > (I wouldn't call this "canceling", I think of cancel as trying to force a > blocked request to return/fail prematurely.) I call it a cancel because it should leave us in the same state we were in if we hadn't done the lock. An unlock doesn't do that, because the original lock may have coalesced and/or downgraded existing locks. We've got a similar problem with NFSv4 blocking locks. I've got unsubmitted patches http://linux-nfs.org/cgi-bin/gitweb.cgi?p=bfields-2.6.git;a=shortlog;h=fair-queueing which introduce a new "provisional" lock type that is identical to a posix lock in every way except that a provisional lock doesn't coalesce or downgrade existing locks--it just sits there on the lock list blocking conflicting requests until somebody comes along and upgrades it to a real lock or cancels it. Maybe there's a better solution in this case. I can't think of anything other than just giving up on the whole idea of timing out. (Maybe that wouldn't be so bad?) --b. 
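Bruce's point about why cancel cannot be implemented as unlock is easy to demonstrate with a toy range-lock list in plain C: once a new lock has coalesced with an existing one, unlocking the new range cannot restore the pre-request state. These are hypothetical helpers for a single owner, not the kernel's posix_lock_file() code.

```c
/*
 * Toy model of POSIX lock coalescing for a single owner, to illustrate
 * that merging is lossy: "undo by unlock" does not recover old state.
 */
#include <assert.h>

#define MAXLOCKS 16

struct range { long start, end; };		/* inclusive, like fl_start/fl_end */
struct locklist { struct range r[MAXLOCKS]; int n; };

/* add a lock, merging overlapping or adjacent ranges (coalescing) */
static void lock_range(struct locklist *l, long start, long end)
{
	int i;

restart:
	for (i = 0; i < l->n; i++) {
		if (l->r[i].end + 1 >= start && l->r[i].start <= end + 1) {
			if (l->r[i].start < start) start = l->r[i].start;
			if (l->r[i].end > end)     end = l->r[i].end;
			l->r[i] = l->r[--l->n];	/* drop merged entry */
			goto restart;		/* grown range may absorb others */
		}
	}
	l->r[l->n].start = start;
	l->r[l->n].end = end;
	l->n++;
}

/* drop [start,end], splitting ranges that straddle the boundary */
static void unlock_range(struct locklist *l, long start, long end)
{
	int i = 0;

	while (i < l->n) {
		struct range cur = l->r[i];

		if (cur.end < start || cur.start > end) { i++; continue; }
		l->r[i] = l->r[--l->n];		/* remove, then re-add leftovers */
		if (cur.start < start) lock_range(l, cur.start, start - 1);
		if (cur.end > end)     lock_range(l, end + 1, cur.end);
	}
}
```

Taking an existing lock [0,20] and a new request [10,30]: the list coalesces to [0,30], and unlocking [10,30] afterwards leaves [0,9] -- the original [0,20] is unrecoverable, which is exactly why a true cancel (or a provisional, non-coalescing lock) is needed.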
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 04:23:47PM -0500, J. Bruce Fields wrote: > On Wed, Dec 06, 2006 at 02:58:22PM -0600, David Teigland wrote: > > On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote: > > > The more annoying problem is the need to cancel the GFS lock when > > > fl_notify fails; is that something that it's possible for GFS to do? > > > > > > It can fail because lockd has a timeout--it waits a few seconds for the > > > callback, then gives up and returns a failure to the user. If that > > > happens after your userspace posix lock manager acquires the lock (but > > > before fl_notify is called) then you've got to cancel it somehow. > > > > I'd think we could just send an unlock for it at that point. Set up an op > > with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing > > and call send_op(). We probably need to flag this internal-unlock op so > > that when the result arrives, device_write() delists and frees it itself. > > > > (I wouldn't call this "canceling", I think of cancel as trying to force a > > blocked request to return/fail prematurely.) > > I call it a cancel because it should leave us in the same state we were > in if we hadn't done the lock. An unlock doesn't do that, because the > original lock may have coalesced and/or downgraded existing locks. Oh yeah, that's painful, I knew it sounded too easy. Dave
Re: [NFS] [PATCH 10/10] gfs2: nfs lock support for gfs2
On Wed, Dec 06, 2006 at 03:42:31PM -0600, David Teigland wrote: > Oh yeah, that's painful, I knew it sounded too easy. Yeah. Well, we could try to teach GFS2 to reliably cancel posix locks. I think that may turn out to be necessary some day anyway. Or we could look at why we're timing out and figure out whether there's something else we should be doing instead in that case. In what situations is the GFS2 lock call likely to take overly long? --b.
Re: openg and path_to_handle
On Dec 06, 2006 13:50 -0700, Matthew Wilcox wrote: > On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > > This is an implementation detail - it is possible that file handle, > > being opaque, could encode a UID/GID of the user that constructed > > the handle and then allow any process with the same UID/GID to use > > open_by_handle() on that handle. (I think hch has already pointed > > this out.) > > While it could do that, I'd be interested to see how you'd construct > the handle such that it's immune to a malicious user tampering with it, > or saving it across a reboot, or constructing one from scratch. If the server has to have processed a real "open" request, say within the preceding 30s, then it would have a handle for openfh() to match against. If the server reboots, or a client tries to construct a new handle from scratch, or even tries to use the handle after the file is closed then the handle would be invalid. It isn't just an encoding for "open-by-inum", but rather a handle that references some just-created open file handle on the server. That the handle might contain the UID/GID is mostly irrelevant - either the process + network is trusted to pass the handle around without snooping, or a malicious client which intercepts the handle can spoof the UID/GID just as easily. Make the handle sufficiently large to avoid guessing and it is "secure enough" until the whole filesystem is using kerberos to avoid any number of other client/user spoofing attacks. Considering that filesystems like GFS and OCFS allow clients DIRECT ACCESS to the block device itself (which no amount of authentication will fix, unless it is in the disks themselves), the risk of passing a file handle around is pretty minimal. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
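Andreas's model -- a handle that is merely a short-lived key into server-side open state, not an open-by-inum encoding -- can be sketched as a small in-memory table. All names here are hypothetical (a real server would hash, lock, and garbage-collect), and the 30s TTL follows his example:

```c
/*
 * Sketch: the handle only resolves while the server still remembers the
 * open that produced it.  Expiry, a reboot (table cleared), or a handle
 * constructed from scratch all fail to resolve.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define TABLE_SIZE 64
#define HANDLE_TTL 30			/* seconds, as in the 30s example */

struct open_state {
	uint64_t handle;		/* opaque key handed to the client */
	uint64_t ino;			/* resolved file */
	time_t   created;
	int      in_use;
};

static struct open_state table[TABLE_SIZE];
static uint64_t next_nonce = 0x9e3779b97f4a7c15ULL;

/* "openg" side: record resolved open state, hand back an opaque key */
static uint64_t handle_issue(uint64_t ino, time_t now)
{
	int i;

	next_nonce = next_nonce * 6364136223846793005ULL + 1442695040888963407ULL;
	for (i = 0; i < TABLE_SIZE; i++) {
		if (!table[i].in_use) {
			table[i] = (struct open_state){ next_nonce, ino, now, 1 };
			return next_nonce;
		}
	}
	return 0;			/* table full */
}

/* "openfh" side: only remembered, young-enough handles resolve */
static int handle_resolve(uint64_t handle, time_t now, uint64_t *ino)
{
	int i;

	for (i = 0; i < TABLE_SIZE; i++) {
		if (table[i].in_use && table[i].handle == handle) {
			if (now - table[i].created > HANDLE_TTL) {
				table[i].in_use = 0;	/* expired */
				return 0;
			}
			*ino = table[i].ino;
			return 1;
		}
	}
	return 0;			/* forgotten, rebooted, or made up */
}

/* reboot: all open state is simply gone */
static void server_reboot(void)
{
	memset(table, 0, sizeof(table));
}
```

Since validity lives entirely in the server's table, no cryptography is needed for the lifetime properties Andreas describes -- only for protecting the key against snooping, which is the separate trust argument he makes.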
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote: > Considering that filesystems like GFS and OCFS allow clients DIRECT > ACCESS to the block device itself (which no amount of authentication > will fix, unless it is in the disks themselves), the risk of passing a > file handle around is pretty minimal. That's either disingenuous, or missing the point. OCFS/GFS allow the kernel direct access to the block device. openg()&sutoc() are about passing around file handles to untrusted users.
Re: openg and path_to_handle
On Dec 06, 2006 15:17 -0700, Matthew Wilcox wrote: > On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote: > > Considering that filesystems like GFS and OCFS allow clients DIRECT > > ACCESS to the block device itself (which no amount of authentication > > will fix, unless it is in the disks themselves), the risk of passing a > > file handle around is pretty minimal. > > That's either disingenuous, or missing the point. OCFS/GFS allow the > kernel direct access to the block device. openg()&sutoc() are about > passing around file handles to untrusted users. Consider - in order to intercept the file handle on the network one would have to be root on a trusted client. The same is true for direct block access. If the network isn't to be trusted or the clients aren't to be trusted, then in the absence of strong external authentication like kerberos the whole thing just falls down (i.e. root on any client can su to an arbitrary UID/GID to access files to avoid root squash, or could intercept all of the traffic on the network anyways). With some network filesystems it is at least possible to get strong authentication and crypto, but with shared block device filesystems like OCFS/GFS/GPFS they completely rely on the fact that the network and all of the clients attached thereon are secure. If the server that did the original file open and generates the unique per-open file handle can do basic sanity checking (i.e. user doing the new open is the same, the file handle isn't stale) then that is no additional security hole. Similarly, NFS passes file handles to clients that are also used to get access to the open file without traversing the whole path each time. Those file handles are even (supposed to be) persistent over reboots. Don't get me wrong - I understand that what I propose is not secure. I'm just saying it is no LESS secure than a number of other things which already exist. 
Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On 12/5/06, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Tue, Dec 05, 2006 at 05:55:14PM +0100, Latchesar Ionkov wrote: > What is your opinion on giving the file system an option to lookup a > file more than one name/directory at a time? I think that all remote > file systems can benefit from that? Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP routine gets the entire remaining path and is allowed to resolve as much of it as it can (or wants)? While this allows remote filesystems to optimize deep tree traversals it creates a pretty big mess about state that is kept on lookup operations. For Linux in particular it would mean doing large parts of __link_path_walk in the filesystem, which I can't think of a sane way to do. The way I was thinking of implementing it is leaving all the hard parts of the name resolution in __link_path_walk and modifying the inode's lookup operation to accept an array of qstrs (and its size). lookup would also check and revalidate the dentries if necessary (usually the same operation as looking up a name for the remote filesystems). lookup will check if it reaches a symbolic link or mountpoint and will stop resolving any further. __link_path_walk will use the name to fill an array of qstrs (we can choose some sane size of the array, like 8 or 16), then call (directly or indirectly) ->lookup (nd->flags will reflect the flags for the last element in the array), check if the inode of the last dentry is a symlink, and do what it currently does for symlinks. Does that make sense? Am I missing anything? Thanks, Lucho
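The qstr-batching Lucho sketches can be prototyped in user space. A minimal component splitter (illustrative names only; BATCH follows his "8 or 16" suggestion, and struct qstr is mimicked rather than taken from the kernel) shows what __link_path_walk would hand to ->lookup per call:

```c
/*
 * Rough user-space sketch of batched lookup: the caller splits a path
 * into up to BATCH components per ->lookup call; the filesystem would
 * resolve as many as it can before hitting a symlink or mountpoint.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BATCH 8				/* "some sane size, like 8 or 16" */

struct qstr { const char *name; size_t len; };

/*
 * Fill q[] with up to BATCH components starting at *pos; returns the
 * number filled and advances *pos past them (and any separators).
 */
static int next_batch(const char *path, size_t *pos, struct qstr *q)
{
	int n = 0;
	size_t i = *pos;

	while (path[i] == '/')
		i++;			/* skip separators */
	while (path[i] != '\0' && n < BATCH) {
		size_t start = i;

		while (path[i] != '\0' && path[i] != '/')
			i++;
		q[n].name = path + start;
		q[n].len = i - start;
		n++;
		while (path[i] == '/')
			i++;
	}
	*pos = i;
	return n;
}
```

A ten-component path would thus reach the filesystem in two ->lookup calls instead of ten, which is where the round-trip savings for remote filesystems would come from.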
Re: openg and path_to_handle
On 12/6/06, Rob Ross <[EMAIL PROTECTED]> wrote: David Chinner wrote: > On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: >> On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote: >>> Hi, >>> >>> I agree that it is not feasible to add new system calls every time >>> somebody has a problem, and we don't take adding system calls lightly. >>> However, in this case we're talking about an entire *community* of >>> people (high-end computing), not just one or two people. Of course it >>> may still be the case that that community is not important enough to >>> justify the addition of system calls; that's obviously not my call to make! >> I have the feeling that openg stuff is rushed without looking into all >> solutions, that don't require changes to the current interface. > > I also get the feeling that interfaces that already do this > open-by-handle stuff haven't been explored either. > > Does anyone here know about the XFS libhandle API? This has been > around for years and it does _exactly_ what these proposed syscalls > are supposed to do (and more). > > See: > > http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle > > For the libhandle man page. Basically: > > openg == path_to_handle > sutoc == open_by_handle > > And here for the userspace code: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ > > Cheers, > > Dave. Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. The open-by-handle makes a little more sense, because the "handle" is not opened, it only points to a resolved file. As I mentioned before, it doesn't make much sense to bundle in openg name resolution and file open. Still I am not convinced that we need two ways of "finding" files. 
Thanks,
    Lucho
Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On 12/5/06, Rob Ross <[EMAIL PROTECTED]> wrote:
> I unfortunately don't have data to show exactly where the time was spent,
> but it's a good guess that it is all the network traffic in the open() case.

Is it hard to repeat the test and check what requests the PVFS server receives (and how much time they take)?

> Well you've caught me. I don't want to cache the values, because I
> fundamentally believe that sharing state between clients and servers is
> braindead (to use Christoph's phrase) in systems of this scale (thousands
> to tens of thousands of clients). So I don't want locks, so I can't keep
> the cache consistent, ...

Having file handles in the server looks like a cache to me :) What are the properties of a cache that it lacks?

Thanks,
    Lucho
Re: openg
On Wed, Dec 06, 2006 at 09:42:47AM -0600, Rob Ross wrote:
> The fh_t is indeed a type of capability. fh_t, properly protected, could
> be passed into user space and validated by the file system when
> presented back to the file system.

Well, there's quite a lot of papers on how to implement properly secure capabilities. The only performant way to do it is to implement them in kernel space or with hardware support. As soon as you pass them to userspace the user can manipulate them, and doing a cheap enough verification is non-trivial (e.g. it buys you nothing if you spend the time you previously saved on lookup round-trip latency on expensive cryptography instead).

> There is state here, clearly. I feel ok about that because we allow
> servers to forget that they handed out these fh_ts if they feel like it;
> there is no guaranteed lifetime in the current proposal. This allows
> servers to come and go without needing to persistently store these.
> Likewise, clients can forget them with no real penalty.

Objects without defined lifetime rules are not something we're very keen on. Particularly in userspace interfaces they will cause all kinds of trouble, because people will expect the lifetime rules they get from their normal filesystems.

> The documentation of the calls is complicated by the way POSIX calls are
> described. We need to have a second document describing use cases also
> available, so that we can avoid misunderstandings as best we can and get
> straight to the real issues. Sorry that document wasn't available.

The real problem is that you want to do something in a POSIX spec that is fundamentally out of scope. POSIX.1 deals with system interfaces on a single system. You want to specify semantics over multiple systems in a cluster.
Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, 2006-12-06 at 18:12 -0500, Latchesar Ionkov wrote:
> On 12/5/06, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > On Tue, Dec 05, 2006 at 05:55:14PM +0100, Latchesar Ionkov wrote:
> > > What is your opinion on giving the file system an option to lookup a
> > > file more than one name/directory at a time? I think that all remote
> > > file systems can benefit from that?
> >
> > Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP
> > routine gets the entire remaining path and is allowed to resolve as much of
> > it as it can (or wants)?
> >
> > While this allows remote filesystems to optimize deep tree traversals it
> > creates a pretty big mess about state that is kept on lookup operations.
> >
> > For Linux in particular it would mean doing large parts of __link_path_walk
> > in the filesystem, which I can't think of a sane way to do.
>
> The way I was thinking of implementing it is leaving all the hard
> parts of the name resolution in __link_path_walk and modifying inode's
> lookup operation to accept an array of qstrs (and its size). lookup
> would also check and revalidate the dentries if necessary (usually the
> same operation as looking up a name for the remote filesystems).

I beg to differ. Revalidation is not the same as looking up: the locking rules are _very_ different.

> lookup will check if it reaches symbolic link or mountpoint and will
> stop resolving any further. __link_path_walk will use the name to fill
> an array of qstrs (we can choose some sane size of the array, like 8
> or 16), then call (directly or indirectly) ->lookup (nd->flags will
> reflect the flags for the last element in the array), check if the
> inode of the last dentry is symlink, and do what it currently does for
> symlinks.
>
> Does that make sense? Am I missing anything?

Again: locking. How do you keep the dcache sane while the filesystem is doing a jumble of revalidation and new lookups?
Trond
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
> > While it could do that, I'd be interested to see how you'd construct
> > the handle such that it's immune to a malicious user tampering with it,
> > or saving it across a reboot, or constructing one from scratch.
>
> If the server has to have processed a real "open" request, say within
> the preceding 30s, then it would have a handle for openfh() to match
> against. If the server reboots, or a client tries to construct a new
> handle from scratch, or even tries to use the handle after the file is
> closed, then the handle would be invalid.
>
> It isn't just an encoding for "open-by-inum", but rather a handle that
> references some just-created open file handle on the server. That the
> handle might contain the UID/GID is mostly irrelevant - either the
> process + network is trusted to pass the handle around without snooping,
> or a malicious client which intercepts the handle can spoof the UID/GID
> just as easily. Make the handle sufficiently large to avoid guessing
> and it is "secure enough" until the whole filesystem is using kerberos
> to avoid any number of other client/user spoofing attacks.

That would be fine as long as the file handle were a kernel-level concept. The issue here is that they intend to make the whole file handle visible to userspace, for example to pass it around via MPI. As soon as an untrusted user can tamper with the file handle we're in trouble.
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 08:11:00PM +1100, David Chinner wrote:
> They are defined but unused in 2.6.19, right? I can't see anywhere
> in the 2.6.19 ext2/3/4/reiser trees that actually use those flags,
> including setting and retrieving them from disk. JFS I can see
> sets, clears and retrieves them, but not the filesystems you
> mention. Though I might just be blind. ;)
>
> If all we need to add to XFS is support for those flags, then XFS
> support would be trivial to add.
>
> Oh, damn. I take that back. We're almost out of flag space in the on
> disk inode - these two flags would use the last 2 flag bits so this
> may require an on disk inode format change in XFS. This will be
> a little more complex than I first thought, but not impossible
> as we already support two on-disk inode format versions.

Hrm. I was toying around with the idea of using a flag to mark inodes as whiteouts (similar to what BSD does) for Unionfs. I remember that Jan Blunck tried a similar thing in his implementation of VFS union mounts.

I am not entirely convinced that a whiteout inode flag is the right way to do things, but I'm raising this now as I wouldn't want to wait for a new on-disk format for XFS to say that Unionfs supports XFS. (Assuming that it is the right approach.) ;-)

Josef "Jeff" Sipek.
--
I already backed up the box once, I can do it again!
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 07:56:19PM -0500, Josef Sipek wrote:
> On Wed, Dec 06, 2006 at 08:11:00PM +1100, David Chinner wrote:
> > They are defined but unused in 2.6.19, right? I can't see anywhere
> > in the 2.6.19 ext2/3/4/reiser trees that actually use those flags,
> > including setting and retrieving them from disk. JFS I can see
> > sets, clears and retrieves them, but not the filesystems you
> > mention. Though I might just be blind. ;)
> >
> > If all we need to add to XFS is support for those flags, then XFS
> > support would be trivial to add.
> >
> > Oh, damn. I take that back. We're almost out of flag space in the on
> > disk inode - these two flags would use the last 2 flag bits so this
> > may require an on disk inode format change in XFS. This will be
> > a little more complex than I first thought, but not impossible
> > as we already support two on-disk inode format versions.
>
> Hrm. I was toying around with the idea of using a flag to mark inodes as
> whiteouts (similar to what BSD does) for Unionfs. I remember that Jan Blunck
> tried a similar thing in his implementation of VFS union mounts.
>
> I am not entirely convinced that a whiteout inode flag is the right way to do
> things, but I'm just raising this now as I wouldn't want to wait for a new
> on-disk format for XFS to say that Unionfs supports XFS. (Assuming that it is
> the right approach.) ;-)

Maybe we should be using EAs for this sort of thing instead of flags on the inode? If we keep adding inode flags for generic features then we are going to force more than just XFS into inode format changes eventually.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote: > Maybe we should be using EAs for this sort of thing instead of flags > on the inode? If we keep adding inode flags for generic features > then we are going to force more than just XFS into inode format > changes eventually Aren't EAs slow? Maybe not on XFS but on other filesystems... Josef "Jeff" Sipek. -- Linux, n.: Generous programmers from around the world all join forces to help you shoot yourself in the foot for free. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote: > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote: > > Maybe we should be using EAs for this sort of thing instead of flags > > on the inode? If we keep adding inode flags for generic features > > then we are going to force more than just XFS into inode format > > changes eventually > > Aren't EAs slow? Maybe not on XFS but on other filesystems... Only when they don't fit in the inode itself and extra disk seeks are needed to retrieve them. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Thu, 2006-12-07 at 13:49 +1100, David Chinner wrote: > On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote: > > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote: > > > Maybe we should be using EAs for this sort of thing instead of flags > > > on the inode? If we keep adding inode flags for generic features > > > then we are going to force more than just XFS into inode format > > > changes eventually > > > > Aren't EAs slow? Maybe not on XFS but on other filesystems... > > Only when they don't fit in the inode itself and extra > disk seeks are needed to retrieve them. > > Cheers, > > Dave. Also keep in mind that the EA doesn't actually have to have a physical representation on disk (or, rather, it does, but it doesn't need to be the same representation used by EAs in the user namespace). This means that if one of those slow EA filesystems still has room for flags in the inode, it can synthesize the EA on demand. This is even preferable to ioctls for the interface to new filesystem metadata -- if a backup or archive program knows how to store EAs, it will be able to save and restore any new exotic metadata without any extra effort. -- Nicholas Miell <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, 2006-12-06 at 20:11 +1100, David Chinner wrote:
> ...
> If all we need to add to XFS is support for those flags, then XFS
> support would be trivial to add.
>
> Oh, damn. I take that back. We're almost out of flag space in the on
> disk inode - these two flags would use the last 2 flag bits so this
> may require an on disk inode format change in XFS. This will be
> a little more complex than I first thought, ...

It should be OK - you can do it without an inode version revision if you take a second 16 bits for "di_flags2" from here...

xfs_dinode_core {
	...
	__uint8_t	di_pad[8];	/* unused, zeroed space */

It's guaranteed zeroed initially (i.e. all flags unset) and the XFS get/set flags APIs are 32 bits, so you should be OK there.

It may also be possible to reclaim di_onlink at some point (maybe now, since 16 bits would be good here) if mkfs.xfs is changed to always create v2 inodes (dynamic conversion ATM IIRC)... not 100% sure though, needs more code analysis.

cheers.
--
Nathan
Re: [PATCH 26/35] Unionfs: Privileged operations workqueue
On Tue, Dec 05, 2006 at 02:50:13PM -0500, Josef Sipek wrote:
> On Tue, Dec 05, 2006 at 08:27:51PM +0100, Jan Engelhardt wrote:
> >
> > On Dec 4 2006 07:30, Josef 'Jeff' Sipek wrote:
> > >+#include "union.h"
> > >+
> > >+struct workqueue_struct *sioq;
> > >+
> > >+int __init init_sioq(void)
> >
> > Although it's just me, I'd prefer sioq_init(), sioq_exit(),
> > sioq_run(), etc. to go in hand with what most drivers use as naming
> > (i.e. "_" ).
>
> That makes sense.

Hrm. Looking at the code, I noticed that the opposite is true:

	destroy_filldir_cache();
	destroy_inode_cache();
	destroy_dentry_cache();
	unregister_filesystem(&unionfs_fs_type);

The last one in particular...

> > >+void __unionfs_mknod(void *data)
> > >+{
> > >+	struct sioq_args *args = data;
> > >+	struct mknod_args *m = &args->mknod;
> >
> > Care to make that: const struct mknod_args *m = &args->mknod;?
> > (Same for other places)
>
> Right.

If I make the *args = data line const, then gcc (4.1) yells about modifying a const variable 3 lines down:

	args->err = vfs_mknod(m->parent, m->dentry, m->mode, m->dev);

Sure, I could cast, but that seems like adding cruft for no good reason.

Josef "Jeff" Sipek.
--
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former.
 - Albert Einstein
Re: [PATCH 26/35] Unionfs: Privileged operations workqueue
On Dec 6 2006 12:32, Josef Sipek wrote:
>> > >+int __init init_sioq(void)
>> >
>> > Although it's just me, I'd prefer sioq_init(), sioq_exit(),
>> > sioq_run(), etc. to go in hand with what most drivers use as naming
>> > (i.e. "_" ).
>>
>> That makes sense.
>
>Hrm. Looking at the code, I noticed that the opposite is true:
>
>destroy_filldir_cache();
>destroy_inode_cache();
>destroy_dentry_cache();
>unregister_filesystem(&unionfs_fs_type);
>
>The last one in particular...

I smell a big conspiracy! So yet again it's mixed:

fs$ grep __init */*.c | grep -v ' init_'
sysfs/mount.c:int __init sysfs_init(void)
sysv/inode.c:int __init sysv_init_icache(void)
proc/vmcore.c:static int __init vmcore_init(void)
proc/nommu.c:static int __init proc_nommu_init(void)
proc/proc_misc.c:void __init proc_misc_init(void)
proc/proc_tty.c:void __init proc_tty_init(void)
proc/root.c:void __init proc_root_init(void)

>
>> > >+void __unionfs_mknod(void *data)
>> > >+{
>> > >+ struct sioq_args *args = data;
>> > >+ struct mknod_args *m = &args->mknod;
>> >
>> > Care to make that: const struct mknod_args *m = &args->mknod;?
>> > (Same for other places)
>>
>> Right.
>
>If I make the *args = data line const, then gcc (4.1) yells about modifying
>a const variable 3 lines down:
>
>args->err = vfs_mknod(m->parent, m->dentry, m->mode, m->dev);
>
>Sure, I could cast, but that seems like adding cruft for no good reason.

No, I despise casts more than missing consts. Why would gcc throw a warning? Let's take this super simple program:

<<<
struct inode;
struct dentry;

struct mknod_args {
	struct inode *parent;
	struct dentry *dentry;
	int mode;
	int dev;
};

extern int vfs_mknod(struct inode *, struct dentry *, int, int /*dev_t*/);

int main(void)
{
	const struct mknod_args *m;
	vfs_mknod(m->parent, m->dentry, m->mode, m->dev);
	return 0;
}
>>>

As undefined-behavior as it looks, it's got the const and vfs_mknod, as well as an approximation of dev_t.
It throws no warnings when compiled with `gcc -Wall -c test.c`. Did I miss something?

-`J'
--
Re: NFSv4/pNFS possible POSIX I/O API standards
On Dec 06, 2006 09:42 -0800, Ulrich Drepper wrote: > Rob Ross wrote: > >File size is definitely one of the more difficult of the parameters, > >either because (a) it isn't stored in one place but is instead derived, > >or (b) because a lock has to be obtained to guarantee consistency of the > >returned value. > > OK, and looking at the man page again, it is already on the list in the > old proposal and hence optional. I've no problem with that. IMHO, once part of the information is optional, why bother making ANY of it required? Consider "ls -s" on a distributed filesystem that has UID+GID mapping. It doesn't actually NEED to return the UID+GID to ls for each file, since it won't be shown, but if that is part of the "required" fields then the filesystem would have to remap each UID+GID on each file in the directory. Similar arguments can be made for "find" with various options (-atime, -mtime, etc) where any one of the "required" parameters isn't needed. I don't think it is _harmful_ to fill in unrequested values if they are readily available (it might in fact avoid a lot of conditional branches) but why not let the caller request only the minimum information it needs? > >I can't speak for everyone, but "ls" is the #1 consumer as far as I am > >concerned. That is my opinion also. Lustre can do incredibly fast IO, but it isn't very good at "ls" at all because it has to do way more work than you would think (and I've stared at a lot of straces from ls, rm, etc). > I think this is more a user problem. For normal plain old 'ls' you get > by with readdir. For 'ls -F' and 'ls --color' you mostly get by with > readdir+d_type. If you cannot provide d_type info the readdirplus > extension does you no good. For the cases when an additional stat is > needed (for symlinks, for instance, to test whether they are dangling) > readdirplus won't help. 
I used to think this also, but even though Lustre supplies d_type info GNU ls will still do stat operations, because "ls --color" depends on st_mode in order to color executable files differently. Since virtually all distros alias ls to "ls --color" this is pretty much default behaviour. Another popular alias is "ls -F", which also uses st_mode for executables.

$ strace ls --color=yes    # this is on an ext3 filesystem
:
:
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getdents64(3, /* 53 entries */, 4096) = 1840
lstat64("ChangeLog", {st_mode=S_IFREG|0660, st_size=48, ...}) = 0
lstat64("install-sh", {st_mode=S_IFREG|0755, st_size=7122, ...}) = 0
lstat64("config.sub", {st_mode=S_IFREG|0755, st_size=30221, ...}) = 0
lstat64("autogen.sh", {st_mode=S_IFREG|0660, st_size=41, ...}) = 0
lstat64("config.h", {st_mode=S_IFREG|0664, st_size=7177, ...}) = 0
lstat64("COPYING", {st_mode=S_IFREG|0660, st_size=18483, ...}) = 0
:
:

Similarly, GNU rm will stat all of the files (when run as a regular user) to ask the "rm: remove write-protected regular file `foo.orig'?" question, which also depends on st_mode.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [PATCH 10/10] gfs2: nfs lock support for gfs2
Here is a rewrite of gdlm_plock_callback(). We still need to add the lock cancel.

Marc.

int gdlm_plock_callback(struct plock_op *op)
{
	struct file *file;
	struct file_lock *fl;
	int (*notify)(void *, void *, int) = NULL;
	int rv;

	spin_lock(&ops_lock);
	if (!list_empty(&op->list)) {
		printk(KERN_INFO "plock op on list\n");
		list_del(&op->list);
	}
	spin_unlock(&ops_lock);

	rv = op->info.rv;

	/* check if the following 2 are still valid or make a copy */
	file = op->info.file;
	fl = op->info.fl;
	notify = op->info.callback;

	if (!rv) {
		/* got fs lock */
		rv = posix_lock_file(file, fl);
		if (rv) {
			/* did not get posix lock */
			notify(fl, NULL, rv);
			log_error("gdlm_plock: vfs lock error file %p fl %p",
				  file, fl);
			/* XXX: We need to cancel the fs lock here: */
			printk("gfs2 lock posix lock request failed\n");
		} else {
			/* got posix lock */
			if (notify(fl, NULL, 0)) {
				/* XXX: We need to cancel the fs lock here: */
				printk("gfs2 lock granted after lock request failed; dangling lock!\n");
			}
		}
	} else {
		/* did not get fs lock */
		notify(fl, NULL, rv);
	}

	kfree(op);
	return rv;
}

David Teigland wrote:
On Wed, Dec 06, 2006 at 02:57:22PM -0500, J. Bruce Fields wrote:
On Wed, Dec 06, 2006 at 09:49:51AM -0600, David Teigland wrote:

The gfs side looks fine to me. Did you forget to call fl_notify from gdlm_plock_callback() or am I missing something?

Yes, looks like we missed something, thanks. This code's an rfc (not even tested), so don't apply it yet! What we should have there is something like:

	rv = op->info.rv;

	if (fl_notify(fl, NULL, rv)) {
		/* XXX: We need to cancel the lock here: */
		printk("gfs2 lock granted after lock request failed; dangling lock!\n");
	}

	if (!rv) {
		/* check if the following are still valid or make a copy */
		file = op->info.file;
		fl = op->info.fl;

		if (posix_lock_file_wait(file, fl) < 0)
			log_error("gdlm_plock: vfs lock error file %p fl %p",
				  file, fl);
	}

Note there's a race condition--that calls fl_notify before actually getting the lock locally.
I don't *think* that's a problem, as long as it's the filesystem and not the local lock list that's authoritative when it comes to who gets a posix lock. agree The more annoying problem is the need to cancel the GFS lock when fl_notify fails; is that something that it's possible for GFS to do? It can fail because lockd has a timeout--it waits a few seconds for the callback, then gives up and returns a failure to the user. If that happens after your userspace posix lock manager acquires the lock (but before fl_notify is called) then you've got to cancel it somehow. I'd think we could just send an unlock for it at that point. Set up an op with GDLM_PLOCK_OP_UNLOCK and the same fields as the lock you're removing and call send_op(). We probably need to flag this internal-unlock op so that when the result arrives, device_write() delists and frees it itself. (I wouldn't call this "canceling", I think of cancel as trying to force a blocked request to return/fail prematurely.) Dave - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 07:14:58PM -0800, Nicholas Miell wrote:
> On Thu, 2006-12-07 at 13:49 +1100, David Chinner wrote:
> > On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote:
> > > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote:
> > > > Maybe we should be using EAs for this sort of thing instead of flags
> > > > on the inode? If we keep adding inode flags for generic features
> > > > then we are going to force more than just XFS into inode format
> > > > changes eventually
> > >
> > > Aren't EAs slow? Maybe not on XFS but on other filesystems...
> >
> > Only when they don't fit in the inode itself and extra
> > disk seeks are needed to retrieve them.
> >
> > Cheers,
> >
> > Dave.
>
> Also keep in mind that the EA doesn't actually have to have a physical
> representation on disk (or, rather, it does, but it doesn't need to be
> the same representation used by EAs in the user namespace).
>
> This means that if one of those slow EA filesystems still has room for
> flags in the inode, it can synthesize the EA on demand.

Interesting point. Filesystems such as XFS would be unaffected (at least with attr v1; not sure how attr2 does things in XFS wrt EAs).

Another question: suppose filesystem X is one of the slow EA filesystems; exposing this whiteout inode flag as an EA would require fs X to be compiled with EA support. As a matter of fact, it would require all the filesystems to have EA support. Sure, overall it is less work for me, but I am more interested in doing the Right Thing(tm) - which may be something completely different from an EA or inode flag (e.g., what Ted Ts'o suggested at OLS - effectively a metadata-only on-disk format for unionfs).

> This is even preferable to ioctls for the interface to new filesystem
> metadata -- if a backup or archive program knows how to store EAs, it
> will be able to save and restore any new exotic metadata without any
> extra effort.

Agreed.

Josef "Jeff" Sipek.
--
UNIX is user-friendly ...
it's just selective about who its friends are
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Thu, 2006-12-07 at 12:44 +1100, David Chinner wrote: > Maybe we should be using EAs for this sort of thing instead of flags > on the inode? If we keep adding inode flags for generic features > then we are going to force more than just XFS into inode format > changes eventually You do need to be judicious in what you add, but we got up to here (well over 10 years) with 16 bits worth of space - increasing to 32 bits is going to last a fairly long time... I wouldn't switch to a different scheme until that point (which is also the point where the system calls are going to have to change anyway). cheers. -- Nathan - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Relative atime (was Re: What's in ocfs2.git)
On Tue, Dec 05, 2006 at 08:58:02PM -0800, Andrew Morton wrote:
> On Mon, 4 Dec 2006 16:36:20 -0800 Valerie Henson <[EMAIL PROTECTED]> wrote:
> > Add "relatime" (relative atime) support. Relative atime only updates
> > the atime if the previous atime is older than the mtime or ctime.
> > Like noatime, but useful for applications like mutt that need to know
> > when a file has been read since it was last modified.
>
> That seems like a good idea.
>
> I found touch_atime() to be rather putrid, so I hacked it around a bit. The
> end result:

I like that rather better - add my:

Signed-off-by: Valerie Henson <[EMAIL PROTECTED]>

> That's the easy part. How are we going to get mount(8) patched?

Well, the nodiratime documentation got in. (I was going to add that as part of this patch, but lo and behold.)

-VAL
Re: Relative atime (was Re: What's in ocfs2.git)
On Wednesday 06 December 2006 05:58, Andrew Morton wrote:
> > On Mon, 4 Dec 2006 16:36:20 -0800 Valerie Henson <[EMAIL PROTECTED]> wrote:
> > Add "relatime" (relative atime) support. Relative atime only updates
> > the atime if the previous atime is older than the mtime or ctime.
> > Like noatime, but useful for applications like mutt that need to know
> > when a file has been read since it was last modified.
>
> That seems like a good idea.
>
> I found touch_atime() to be rather putrid, so I hacked it around a bit.

I find this function full of tests...

> The end result:
>
> void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
> {
>	struct inode *inode = dentry->d_inode;
>	struct timespec now;
>
>	if (IS_RDONLY(inode))
>		return;

While we are adding new tests, we could try to be smart here, testing both MS_RDONLY and MS_NOATIME:

	if (__IS_FLG(inode, MS_RDONLY | MS_NOATIME))
		return;

>	if (inode->i_flags & S_NOATIME)
>		return;
>	if (inode->i_sb->s_flags & MS_NOATIME)
>		return;

So that this one can be deleted.

>	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
>		return;

	if (__IS_FLG(inode, MS_NODIRATIME) && S_ISDIR(inode->i_mode))
		return;

Eric
Re: Relative atime (was Re: What's in ocfs2.git)
> > if (inode->i_sb->s_flags & MS_NOATIME)
> >	return;
>
> So that this one can be deleted.

Hi,

I would mostly expect the compiler to be relatively smart about this and group a bunch of these tests together... so I'd rather see readable code than hand-optimized code for something the compiler should be doing for us ;)

Greetings,
Arjan van de Ven