Re: openg and path_to_handle

2006-12-14 Thread Rob Ross

Christoph Hellwig wrote:

On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:

While it could do that, I'd be interested to see how you'd construct
the handle such that it's immune to a malicious user tampering with it,
or saving it across a reboot, or constructing one from scratch.

If the server has to have processed a real open request, say within
the preceding 30s, then it would have a handle for openfh() to match
against.  If the server reboots, or a client tries to construct a new
handle from scratch, or even tries to use the handle after the file is
closed then the handle would be invalid.

It isn't just an encoding for open-by-inum, but rather a handle that
references some just-created open file handle on the server.  That the
handle might contain the UID/GID is mostly irrelevant - either the
process + network is trusted to pass the handle around without snooping,
or a malicious client which intercepts the handle can spoof the UID/GID
just as easily.  Make the handle sufficiently large to avoid guessing
and it is secure enough until the whole filesystem is using kerberos
to avoid any number of other client/user spoofing attacks.


That would be fine as long as the file handle were a kernel-level
concept.  The issue here is that they intend to make the whole filehandle
visible to userspace, for example to pass it around via MPI.  As soon as
an untrusted user can tamper with the file descriptor we're in trouble.


I guess it could reference some just-created open file handle on the 
server, if the server tracks that sort of thing. Or it could be a 
capability, as mentioned previously. So it isn't necessary to tie this 
to an open, but I think that would be a reasonable underlying 
implementation for a file system that tracks opens.


If clients can survive a server reboot without a remount, then even this 
implementation should continue to operate if a server were rebooted, 
because the open file context would be reconstructed. If capabilities 
were being employed, we could likewise survive a server reboot.


But this issue of server reboots isn't that critical -- the use case has 
the handle being reused relatively quickly after the initial openg(), 
and clients have a clean fallback in the event that the handle is no 
longer valid -- just use open().


Visibility of the handle to a user does not imply that the user can 
effectively tamper with the handle. A cryptographically secure one-way 
hash of the data, stored in the handle itself, would allow servers to 
verify that the handle wasn't tampered with, or that the client just 
made up a handle from scratch. The server managing the metadata for that 
file would not need to share its nonce with other servers, assuming that 
single servers are responsible for particular files.


Regards,

Rob
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: openg and path_to_handle

2006-12-14 Thread Rob Ross

Matthew Wilcox wrote:

On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
I don't think that I understand what you're saying here. The openg() 
call does not perform file open (not that that is necessarily even a 
first-class FS operation), it simply does the lookup.


When we were naming these calls, from a POSIX consistency perspective it 
seemed best to keep the "open" nomenclature. That seems to be confusing 
to some. Perhaps we should rename the function "lookup" or something 
similar, to help keep from giving the wrong idea?


There is a difference between the openg() and path_to_handle() approach 
in that we do permission checking at openg(), and that does have 
implications on how the handle might be stored and such. That's being 
discussed in a separate thread.


I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:

int openg(const char *path)
{
	char *s;
	do {
		s = tempnam(FSROOT, ".sutoc");
	} while (link(path, s) == -1 && errno == EEXIST);

	mpi_broadcast(s);
	sleep(10);
	unlink(s);
	return 0;
}

and sutoc() becomes simply open().  Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.

I suppose some cluster fs' might not support cross-directory links
(AFS is one, I think), but then, no cluster fs's support openg/sutoc.


Well at least one does :).


If a filesystem's willing to add support for these handles, it shouldn't
be too hard for them to treat files starting with ".sutoc" specially, and as
efficiently as adding the openg/sutoc concept.


Adding atomic reference count updating on file metadata so that we can 
have cross-directory links is not necessarily easier than supporting 
openg/openfh, and supporting cross-directory links precludes certain 
metadata organizations, such as the ones being used in Ceph (as I 
understand it).


This also still forces all clients to read a directory and for N 
permission checking operations to be performed. I don't see what the FS 
could do to eliminate those operations given what you've described. Am I 
missing something?


Also this looks too much like sillyrename, and that's hard to swallow...

Regards,

Rob


statlite()

2006-12-14 Thread Rob Ross
We're going to clean the statlite() call up based on this (and 
subsequent) discussion and post again.


Thanks!

Rob

Ulrich Drepper wrote:

Christoph Hellwig wrote:

Ulrich, this in reply to these API proposals:


I know the documents.  The HECWG was actually supposed to submit an 
actual draft to the OpenGroup-internal working group but I haven't seen 
anything yet.  I'm not opposed to getting real-world experience first.



So other than this lite version of the readdirplus() call, and this 
idea of making the flags indicate validity rather than accuracy, are 
there other comments on the directory-related calls? I understand 
that they might or might not ever make it in, but assuming they did, 
what other changes would you like to see?


I don't think an accuracy flag is useful at all.  Programs don't want to 
use fuzzy information.  If you want a fast 'ls -l' then add a mode which 
doesn't print the fields which are not provided.  Don't provide outdated 
information.  Similarly for other programs.




statlite needs to separate the flag for valid fields from the actual
stat structure and reuse the existing stat(64) structure.  statlite
needs to at least get a better name, or even better be folded into *statat*,
either by having a new AT_VALID_MASK flag that enables a new
unsigned int valid argument or by folding the valid flags into the AT_
flags.


Yes, this is also my pet peeve with this interface.  I don't want to 
have another data structure.  Especially since programs might want to 
store the value in places where normal stat results are returned.


And also yes on 'statat'.  I strongly suggest to define only a statat 
variant.  In the standards group I'll vehemently oppose the introduction 
of yet another superfluous non-*at interface.


As for reusing the existing statat interface and magically adding another 
parameter through ellipsis: no.  We need to become more type-safe.  The 
userlevel interface needs to be a new one.  For the system call there is 
no such restriction.  We can indeed extend the existing syscall.  We 
have appropriate checks for the validity of the flags parameter in place 
which make such calls backward compatible.





I think having a stat lite variant is pretty much consensus, we just need
to fine tune the actual API - and of course get a reference 
implementation.

So if you want to get this going try to implement it based on
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2.
Bonus points for actually making use of the flags in some filesystems.


I don't like that approach.  The flag parameter should be exclusively an 
output parameter.  By default the kernel should fill in all the fields 
it has access to.  If access is not easily possible then set the bit and 
clear the field.  There are of course certain fields which always should 
be added.  In the proposed man page these are already identified (i.e., 
those before the st_litemask member).




At the actual
C prototype level I would rename d_stat_err to d_stat_errno for 
consistency

and maybe drop the readdirplus() entry point in favour of readdirplus_r
only - there is no point in introducing new non-reentrant APIs today.





Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-06 Thread Rob Ross

Matthew Wilcox wrote:

On Tue, Dec 05, 2006 at 10:07:48AM +, Christoph Hellwig wrote:

The filehandle idiocy on the other hand is way off into crackpipe land.


Right, and it needs to be discarded.  Of course, there was a real
problem that it addressed, so we need to come up with an acceptable
alternative.



The scenario is a cluster-wide application doing simultaneous opens of
the same file.  So thousands of nodes all hitting the same DLM locks
(for read) all at once.  The openg() non-solution implies that all
nodes in the cluster share the same filehandle space, so I think a
reasonable solution can be implemented entirely within the clusterfs,
with an extra flag to open(), say O_CLUSTER_WIDE.  When the clusterfs
sees this flag set (in -lookup), it can treat it as a hint that this
pathname component is likely to be opened again on other nodes and
broadcast that fact to the other nodes within the cluster.  Other nodes
on seeing that hint (which could be structured as "The child bin
of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle
ff51a98799931256b555446b2f5675db08de6229") can keep a record of that fact.
When they see their own open, they can populate the path to that file
without asking the server for extra metadata.

There's obviously security issues there (why I say 'hint' rather than
'command'), but there's also security problems with open-by-filehandle.
Note that this solution requires no syscall changes, no application
changes, and also helps a scenario where each node opens a different
file in the same directory.

I've never worked on a clusterfs, so there may be some gotchas (eg, how
do you invalidate the caches of nodes when you do a rename).  But this
has to be preferable to open-by-fh.


The openg() solution has the following advantages to what you propose. 
First, it places the burden of the communication of the file handle on 
the application process, not the file system. That means less work for 
the file system. Second, it does not require that clients respond to 
unexpected network traffic. Third, the network traffic is deterministic 
-- one client interacts with the file system and then explicitly 
performs the broadcast. Fourth, it does not require that the file system 
store additional state on clients.


In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
the flag) would likely cause a storm of network traffic if clients were 
closely synchronized (which they are likely to be). We could work around 
this by having one application open early, then barrier, then have 
everyone else open, but then we might as well have just sent the handle 
as the barrier operation, and we've made the use of the O_CLUSTER_WIDE 
open() significantly more complicated for the application.


However, the application change issue is actually moot; we will make 
whatever changes inside our MPI-IO implementation, and many users will 
get the benefits for free.


The readdirplus(), readx()/writex(), and openg()/openfh() were all 
designed to allow our applications to explain exactly what they wanted 
and to allow for explicit communication. I understand that there is a 
tendency toward solutions where the FS guesses what the app is going to 
do or is passed a hint (e.g. fadvise) about what is going to happen, 
because these things don't require interface changes. But these 
solutions just aren't as effective as actually spelling out what the 
application wants.


Regards,

Rob


Re: openg

2006-12-06 Thread Rob Ross

Christoph Hellwig wrote:

On Tue, Dec 05, 2006 at 03:44:31PM -0600, Rob Ross wrote:
The openg() really just does the lookup (and permission checking). The 
openfh() creates the file descriptor and starts that context if the 
particular FS tracks that sort of thing.


...

Well you've caught me. I don't want to cache the values, because I 
fundamentally believe that sharing state between clients and servers is 
braindead (to use Christoph's phrase) in systems of this scale 
(thousands to tens of thousands of clients). So I don't want locks, so I 
can't keep the cache consistent, ... So someone else will have to run 
the tests you propose :)...


Besides the whole ugliness, you unfortunately miss a few points about the
fundamental architecture of the unix filesystem permission model.

Say you want to lookup a path /foo/bar/baz, then the access permission
is based on the following things:

 - the credentials of the user.  let's only take traditional uid/gid
   for this example although credentials are much more complex these
   days
 - the kind of operation you want to perform
 - the access permission of the actual object the path points to (inode)
 - the lookup permission (x bit) for every object on the way to your object

In your proposal sutoc is a simple conversion operation; that means
openg needs to perform all these access checks and encode them in the
fh_t.


This is exactly right and is the intention of the call.


That means an fh_t must fundamentally be an object that is kept
in the kernel aka a capability as defined by Henry Levy.  This does imply
you _do_ need to keep state.


The fh_t is indeed a type of capability. fh_t, properly protected, could 
be passed into user space and validated by the file system when 
presented back to the file system.


There is state here, clearly. I feel ok about that because we allow 
servers to forget that they handed out these fh_ts if they feel like it; 
there is no guaranteed lifetime in the current proposal. This allows 
servers to come and go without needing to persistently store these. 
Likewise, clients can forget them with no real penalty.


This approach is ok because of the use case. Because we expect the fh_t 
to be used relatively soon after its creation, servers will not need to 
hold onto these long before the openfh() is performed and we're back 
into a normal "everyone has a valid fd" use case.


 And because it needs kernel support you

fh_t is more or less equivalent to a file descriptor with sutoc equivalent
to a dup variant that really duplicates the backing object instead of just
the userspace index into it.


Well, a FD has some additional state associated with it (position, 
etc.), but yes there are definitely similarities to dup().



Note that somewhat similar open-by-filehandle APIs, like open by inode number
as used by Lustre or the XFS *_by_handle APIs, are privileged operations
because of exactly this problem.


I'm not sure why a properly protected fh_t couldn't be passed back into 
user space and handed around, but I'm not a security expert. What am I 
missing?



What is, according to your mail, the most important bit in this proposal is
that you think the filehandles should be easily shared with other systems
in a cluster.  That fact is not mentioned in the actual proposal at all,
and is in fact the hardest part because of the inherent statefulness of
the API.


The documentation of the calls is complicated by the way POSIX calls are 
described. We need to have a second document describing use cases also 
available, so that we can avoid misunderstandings as best we can and get 
straight to the real issues. Sorry that document wasn't available.


I think I've addressed the statefulness of the API above?

What's the etiquette on changing subject lines here? It might be useful 
to separate the openg() etc. discussion from the readdirplus() etc. 
discussion.


Changing subject lines is fine.


Thanks.

Rob


openg and path_to_handle

2006-12-06 Thread Rob Ross

David Chinner wrote:

On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:

On 12/5/06, Rob Ross [EMAIL PROTECTED] wrote:

Hi,

I agree that it is not feasible to add new system calls every time
somebody has a problem, and we don't take adding system calls lightly.
However, in this case we're talking about an entire *community* of
people (high-end computing), not just one or two people. Of course it
may still be the case that that community is not important enough to
justify the addition of system calls; that's obviously not my call to make!

I have the feeling that the openg stuff is rushed without looking into all
solutions that don't require changes to the current interface.


I also get the feeling that interfaces that already do this
open-by-handle stuff haven't been explored either.

Does anyone here know about the XFS libhandle API? This has been
around for years and it does _exactly_ what these proposed syscalls
are supposed to do (and more).

See:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle

For the libhandle man page. Basically:

openg == path_to_handle
sutoc == open_by_handle

And here for the userspace code:

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/

Cheers,

Dave.


Thanks for pointing these out Dave. These are indeed along the same 
lines as the openg()/openfh() approach.


One difference is that they appear to perform permission checking on the 
open_by_handle(), which means that the entire path needs to be encoded 
in the handle, and makes it difficult to eliminate the path traversal 
overhead on N open_by_handle() operations.


Regards,

Rob


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-06 Thread Rob Ross

Matthew Wilcox wrote:

On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
The openg() solution has the following advantages to what you propose. 
First, it places the burden of the communication of the file handle on 
the application process, not the file system. That means less work for 
the file system. Second, it does not require that clients respond to 
unexpected network traffic. Third, the network traffic is deterministic 
-- one client interacts with the file system and then explicitly 
performs the broadcast. Fourth, it does not require that the file system 
store additional state on clients.


You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:


I coincidentally just wrote about some of this in another email. Wasn't 
trying to avoid you...



: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number.  The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.


Exactly.


: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed).  For example, how long is the
: fh_t intended to be valid for?  Forever?  Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file?  What happens if we revoke() the original fd?


The fh_t would be validated either (a) when the openfh() is called, or 
on accesses using the associated capability. As Christoph pointed out, 
this really is a capability and encapsulates everything necessary for a 
particular user to access a particular file. It can be handed to others, 
and in fact that is a critical feature for our use case.


After the openfh(), the access model is identical to a previously 
open()ed file. So the question is what happens between the openg() and 
the openfh().


Our intention was to allow servers to forget these fh_ts at will. So a 
revoke between openg() and openfh() would kill the fh_t, and the 
subsequent openfh() would fail, or subsequent accesses would fail 
(depending on when the FS chose to validate).


Does this help?


: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it?  It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.

 :

: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)

Christoph has also touched on some of these points, and added some I
missed.


We could use advice on this point. Certainly it's possible to encode 
information about the FS from which the fh_t originated, but we haven't 
tried to spell out exactly how that would happen. Your approach 
described here sounds good to me.


In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
the flag) would likely cause a storm of network traffic if clients were 
closely synchronized (which they are likely to be).


I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There's several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out
a broadcast.


Yes, naive application. You're right that the file system could adapt to 
this, but on the other hand if we were explicitly passing the fh_t in 
user space, we could just use MPI_Bcast and be done with it, with an 
algorithm that is well-matched to the system, etc.


However, the application change issue is actually moot; we will make 
whatever changes inside our MPI-IO implementation, and many users will 
get the benefits for free.


That's good.


Absolutely. Same goes for readx()/writex() also, BTW, at least for 
MPI-IO users. We will build the input parameters inside MPI-IO using 
existing information from users, rather than applying data sieving or 
using multiple POSIX calls.


The readdirplus(), readx()/writex(), and openg()/openfh() were all 
designed to allow our applications to explain exactly what they wanted 
and to allow for explicit communication. I understand that there is a 
tendency toward solutions where the FS guesses what the app is going to 
do or is passed a hint (e.g. fadvise) about what is going to happen, 
because these things don't require interface changes. But these 
solutions just aren't as effective as actually spelling out what the 
application wants.


Sure, but I think you're emphasising these interfaces let

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-06 Thread Rob Ross

Ulrich Drepper wrote:

Andreas Dilger wrote:
Does this mean you are against the statlite() API entirely, or only 
against

the document's use of the flag as a vague accuracy value instead of a
hard valid value?


I'm against fuzzy values.  I've no problems with a bitmap specifying 
that certain members are not wanted or wanted (probably the later, zero 
meaning the optional fields are not wanted).


Thanks for clarifying.


IMHO, if the application doesn't need a particular field (e.g. ls -i
doesn't need size, ls -s doesn't need the inode number, etc) why should
these be filled in if they are not easily accessible?  As for what is
easily accessible, that needs to be determined by the filesystem itself.


Is the size not easily accessible?  It would surprise me.  If yes, then, 
by all means add it to the list.  I'm not against extending the list of 
members which are optional if it makes sense.  But certain information 
is certainly always easily accessible.


File size is definitely one of the more difficult of the parameters, 
either because (a) it isn't stored in one place but is instead derived, 
or (b) because a lock has to be obtained to guarantee consistency of the 
returned value.



That was previously suggested by me already.  IMHO, there should ONLY be
a statlite variant of readdirplus(), and I think most people agree with
that part of it (though there is contention on whether readdirplus() is
needed at all).


Indeed.  Given there is statlite and we have d_type information, in most 
situations we won't need more complete stat information.  Outside of 
programs like ls that is.


Part of why I wished the lab guys had submitted the draft to the 
OpenGroup first is that this way they would have to be more detailed on 
why each and every interface they propose for adding is really needed. 
Maybe they can do it now and here.  What programs really require 
readdirplus?


I can't speak for everyone, but ls is the #1 consumer as far as I am 
concerned.


Regards,

Rob



Re: openg and path_to_handle

2006-12-06 Thread Rob Ross

David Chinner wrote:

On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote:

David Chinner wrote:

On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:

On 12/5/06, Rob Ross [EMAIL PROTECTED] wrote:

Hi,

I agree that it is not feasible to add new system calls every time
somebody has a problem, and we don't take adding system calls lightly.
However, in this case we're talking about an entire *community* of people
(high-end computing), not just one or two people. Of course it may still
be the case that that community is not important enough to justify the
addition of system calls; that's obviously not my call to make!

I have the feeling that the openg stuff is rushed without looking into all
solutions that don't require changes to the current interface.

I also get the feeling that interfaces that already do this open-by-handle
stuff haven't been explored either.

Does anyone here know about the XFS libhandle API? This has been around for
years and it does _exactly_ what these proposed syscalls are supposed to do
(and more).

See:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle

For the libhandle man page. Basically:

openg == path_to_handle
sutoc == open_by_handle

And here for the userspace code:

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/

Cheers,

Dave.

Thanks for pointing these out Dave. These are indeed along the same lines as
the openg()/openfh() approach.

One difference is that they appear to perform permission checking on the
open_by_handle(), which means that the entire path needs to be encoded in
the handle, and makes it difficult to eliminate the path traversal overhead
on N open_by_handle() operations.


open_by_handle() is checking the inode flags for things like
immutability and whether the inode is writable to determine if the
open mode is valid given these flags. It's not actually checking
permissions. IOWs, open_by_handle() has the same overhead as NFS
filehandle to inode translation; i.e. no path traversal on open.

Permission checks are done on the path_to_handle(), so in reality
only root or CAP_SYS_ADMIN users can currently use the
open_by_handle interface because of this lack of checking. Given
that our current users of this interface need root permissions to do
other things (data migration), this has never been an issue.

This is an implementation detail - it is possible that file handle,
being opaque, could encode a UID/GID of the user that constructed
the handle and then allow any process with the same UID/GID to use
open_by_handle() on that handle. (I think hch has already pointed
this out.)

Cheers,

Dave.


Thanks for the clarification Dave. So I take it that you would be 
interested in this type of functionality then?


Regards,

Rob


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-05 Thread Rob Ross

Matthew Wilcox wrote:

On Tue, Dec 05, 2006 at 06:09:03PM +0100, Latchesar Ionkov wrote:

It could be wasteful, but it could (most likely) also be useful. Name
resolution is not that expensive on either side of the network. The
latency introduced by the single-name lookups is :)


*is* latency the problem here?  Last I heard, it was the intolerable
load placed on the DLM by having clients bounce the read locks for each
directory element all over the cluster.


I think you're both right: it's either the time spent on all the actual 
lookups or the time involved in all the lock traffic, depending on FS 
and network of course.


Rob


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-05 Thread Rob Ross

Trond Myklebust wrote:

On Tue, 2006-12-05 at 10:07 +, Christoph Hellwig wrote:

...and we have pointed out how nicely this ignores the realities of
current caching models. There is no need for a  readdirplus() system
call. There may be a need for a caching barrier, but AFAICS that is all.

I think Andreas mentioned that it is useful for clustered filesystems
that can avoid additional roundtrips this way.  That alone might not
be enough reason for API additions, though.  Then again statlite and
readdirplus really are the most sane bits of these proposals as they
fit nicely into the existing set of APIs.  The filehandle idiocy on
the other hand is way off into crackpipe land.


They provide no benefits whatsoever for the two most commonly used
networked filesystems NFS and CIFS. As far as they are concerned, the
only new thing added by readdirplus() is the caching barrier semantics.
I don't see why you would want to add that into a generic syscall like
readdir() though: it is

a) networked filesystem specific. The mask stuff etc. adds no
value whatsoever to actual posix filesystems. In fact it is
telling the kernel that it can violate posix semantics.


It isn't violating POSIX semantics if we get the calls passed as an 
extension to POSIX :).



b) quite unnatural to impose caching semantics on all the
directory _entries_ using a syscall that refers to the directory
itself (see the explanations by both myself and Peter Staubach
of the synchronisation difficulties). Consider in particular
that it is quite possible for directory contents to change in
between readdirplus calls.


I want to make sure that I understand this correctly. NFS semantics 
dictate that if someone stat()s a file then all changes from that client 
need to be propagated to the server? And this call complicates that 
semantic because now there's an operation on a different object (the 
directory) that would cause this flush on the files?


Of course directory contents can change in between readdirplus() calls, 
just as they can between readdir() calls. That's expected, and we do not 
attempt to create consistency between calls.



i.e. the 'strict posix caching model' is pretty much impossible
to implement on something like NFS or CIFS using these
semantics. Why then even bother to have masks to tell you when
it is OK to violate said strict model.


We're trying to obtain improved performance for distributed file systems 
with stronger consistency guarantees than these two.



c) Says nothing about what should happen to non-stat() metadata
such as ACL information and other extended attributes (for
example future selinux context info). You would think that the
'ls -l' application would care about this.


Honestly, we hadn't thought about other non-stat() metadata because we 
didn't think it was part of the use case, and we were trying to stay 
close to the flavor of POSIX. If you have ideas here, we'd like to hear 
them.


Thanks for the comments,

Rob


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-04 Thread Rob Ross

Hi,

I agree that it is not feasible to add new system calls every time 
somebody has a problem, and we don't take adding system calls lightly. 
However, in this case we're talking about an entire *community* of 
people (high-end computing), not just one or two people. Of course it 
may still be the case that that community is not important enough to 
justify the addition of system calls; that's obviously not my call to make!


I'm sure that you meant more than just to rename openg() to lookup(), 
but I don't understand what you are proposing. We still need a second 
call to take the results of the lookup (by whatever name) and convert 
that into a file descriptor. That's all the openfh() (previously named 
sutoc()) is for.


I think the subject line might be a little misleading; we're not just 
talking about NFS here. There are a number of different file systems 
that might benefit from these enhancements (e.g. GPFS, Lustre, PVFS, 
PanFS, etc.).


Finally, your comment on making filesystem developers miserable is 
something of a point of philosophical debate for me. I personally find 
myself miserable trying to extract performance given the very small 
amount of information passing through the existing POSIX calls. The 
additional information passing through these new calls will make it much 
easier to obtain performance without having to correctly guess what the 
user might actually be up to. While they do mean more work in the short 
term, they should also mean a more straightforward path to performance 
for cluster/parallel file systems.


Thanks for the input. Does this help explain why we don't think we can 
just work under the existing calls?


Rob

Latchesar Ionkov wrote:

Hi,

One general remark: I don't think it is feasible to add new system
calls every time somebody has a problem. Usually there are solutions
(maybe not that good) that don't require big changes and work well
enough. "Let's change the interface and make the life of many
filesystem developers miserable, because they have to worry about
3-4-5 more operations" is not the easiest solution in the long run.

On 12/1/06, Rob Ross [EMAIL PROTECTED] wrote:

Hi all,

The use model for openg() and openfh() (renamed sutoc()) is n processes
spread across a large cluster simultaneously opening a file. The
challenge is to avoid to the greatest extent possible incurring O(n) FS
interactions. To do that we need to allow actions of one process to be
reused by other processes on other OS instances.

The openg() call allows one process to perform name resolution, which is
often the most expensive part of this use model. Because permission


If the name resolution is the most expensive part, why not implement
just the name lookup part and call it lookup instead of openg. Or
even better, make NFS resolve multiple names with a single request.
If the NFS server caches the last few name lookups, the responses from
the other nodes will be fast, and you will get your file descriptor
with two requests instead of the proposed one. The performance could be
just good enough without introducing any new functions and file
handles.

Thanks,
   Lucho




Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-04 Thread Rob Ross

Hi all,

I don't think that the group intended that there be an opendirplus(); 
rather readdirplus() would simply be called instead of the usual 
readdir(). We should clarify that.


Regarding Peter Staubach's comments about no one ever using the 
readdirplus() call; well, if people weren't performing this workload in 
the first place, we wouldn't *need* this sort of call! This call is 
specifically targeted at improving ls -l performance on large 
directories, and Sage has pointed out quite nicely how that might work.


In our case (PVFS), we would essentially perform three phases of 
communication with the file system for a readdirplus that was obtaining 
full statistics: first grabbing the directory entries, then obtaining 
metadata from servers on all objects in bulk, then gathering file sizes 
in bulk. The reduction in control message traffic is enormous, and the 
concurrency is much greater than in a readdir()+stat()s workload. We'd 
never perform this sort of optimization optimistically, as the cost of 
guessing wrong is just too high. We would want to see the call as a 
proper VFS operation that we could act upon.


The entire readdirplus() operation wasn't intended to be atomic, and in 
fact the returned structure has space for an error associated with the 
stat() on a particular entry, to allow for implementations that stat() 
subsequently and get an error because the object was removed between 
when the entry was read out of the directory and when the stat was 
performed. I think this fits well with what Andreas and others are 
thinking. We should clarify the description appropriately.


I don't think that we have a readdirpluslite() variation documented yet? 
Gary? It would make a lot of sense. Except that it should probably have 
a better name...


Regarding Andreas's note that he would prefer the statlite() flags to 
mean valid, that makes good sense to me (and would obviously apply to 
the so-far even more hypothetical readdirpluslite()). I don't think 
there's a lot of value in returning possibly-inaccurate values?


Thanks everyone,

Rob

Trond Myklebust wrote:

On Mon, 2006-12-04 at 00:32 -0700, Andreas Dilger wrote:
I'm wondering if a corresponding opendirplus() (or similar) would also be 
appropriate to inform the kernel/filesystem that readdirplus() will 
follow, and stat information should be gathered/buffered.  Or do most 
implementations wait for the first readdir() before doing any actual work 
anyway?

I'm not sure what some filesystems might do here.  I suppose NFS has weak
enough cache semantics that it _might_ return stale cached data from the
client in order to fill the readdirplus() data, but it is just as likely
that it ships the whole thing to the server and returns everything in
one shot.  That would imply everything would be at least as up-to-date
as the opendir().


Whether or not the posix committee decides on readdirplus, I propose
that we implement this sort of thing in the kernel via a readdir
equivalent to posix_fadvise(). That can give exactly the barrier
semantics that they are asking for, and only costs 1 extra syscall as
opposed to 2 (opendirplus() and readdirplus()).

Cheers
  Trond




Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Rob Ross

Hi all,

The use model for openg() and openfh() (renamed sutoc()) is n processes 
spread across a large cluster simultaneously opening a file. The 
challenge is to avoid to the greatest extent possible incurring O(n) FS 
interactions. To do that we need to allow actions of one process to be 
reused by other processes on other OS instances.


The openg() call allows one process to perform name resolution, which is 
often the most expensive part of this use model. Because permission 
checking is also performed as part of the openg(), some file systems do 
not require additional communication between OS and FS at openfh(). 
External communication channels are used to pass the handle resulting 
from the openg() call out to processes on other nodes (e.g. MPI_Bcast).


dup(), openat(), and UNIX sockets are not viable options in this model, 
because there are many OS instances, not just one.


All the calls that are being discussed as part of the HEC extensions are 
being discussed in this context of multiple OS instances and cluster 
file systems.


Regarding the lifetime of the handle, there has been quite a bit of 
discussion about this. I believe that we most recently were thinking 
that there was an undefined lifetime for this, allowing servers to 
forget these values (as in the case where a server is restarted). 
Clients would need to perform the openg() again if they were to try to 
use an outdated handle, or simply fall back to a regular open(). This is 
not a problem in our use model.


I've attached a graph showing the time to use individual open() calls 
vs. the openg()/MPI_Bcast()/openfh() combination; it's a clear win for 
any significant number of processes. These results are from our 
colleagues at Sandia (Ruth Klundt et al.) with PVFS underneath, but I 
expect the trend to be similar for many cluster file systems.


Regarding trying to "force APIs using standardization" on you 
(Christoph's 11/29/2006 message), you've got us all wrong. The 
standardization process is going to take some time, so we're starting on 
it at the same time that we're working with prototypes, so that we don't 
have to wait any longer than necessary to have these things be part of 
POSIX. The whole reason we're presenting this on this list is to try to 
describe why we think these calls are important and get feedback on how 
we can make these calls work well in the context of Linux. I'm glad to 
see so many people taking interest.


I look forward to further constructive discussion. Thanks,

Rob
---
Rob Ross
Mathematics and Computer Science Division
Argonne National Laboratory

Christoph Hellwig wrote:

On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote:

Is this for people who don't know about dup(), or do they need
independent file offsets?  If the latter, I think an xdup() would be
preferable (would there be a security issue for OSes with revoke()?)
Either that, or make the key be useful for something else.


Not sharing the file offset means we need a separate file struct, at
which point the only thing saved is doing a lookup at the time of
opening the file.  While a full pathname traversal can be quite costly
an open is not something you do all that often anyway.  And if you really
need to open/close files very often you can speed it up nicely by keeping
a file descriptor on the parent directory open and use openat().

Anyway, enough of talking here.  We really need a very good description
of the use case people want this for, and the specific performance problems
they see, to find a solution.  And the solution definitely does not involve
a second half-assed file handle type with unspecified lifetime rules :-)


openg-compare.pdf
Description: Adobe PDF document