Re: openg and path_to_handle

2006-12-14 Thread Matthew Wilcox
On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
 I don't think that I understand what you're saying here. The openg() 
 call does not actually open the file (not that that is necessarily even 
 a first-class FS operation); it simply does the lookup.
 
 When we were naming these calls, from a POSIX consistency perspective it 
 seemed best to keep the "open" nomenclature. That seems to be confusing 
 to some. Perhaps we should rename the function "lookup" or something 
 similar, to help keep from giving the wrong idea?
 
 There is a difference between the openg() and path_to_handle() approach 
 in that we do permission checking at openg(), and that does have 
 implications on how the handle might be stored and such. That's being 
 discussed in a separate thread.

I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:

int openg(const char *path)
{
	char *s;
	int err;

	do {
		s = tempnam(FSROOT, ".sutoc");
		err = link(path, s);
	} while (err == -1 && errno == EEXIST);

	mpi_broadcast(s);
	sleep(10);
	unlink(s);
	return 0;
}

and sutoc() becomes simply open().  Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.
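For completeness, the client half of this sketch is almost nothing. A minimal illustration (sutoc() and the O_RDONLY flag are my assumptions; a real interface would carry the open flags separately):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical client-side counterpart to the sketch above: the
 * "handle" is just the broadcast temporary name, so opening it is a
 * plain open() of a name the client can resolve cheaply. */
int sutoc(const char *handle_name)
{
	return open(handle_name, O_RDONLY);
}
```

The point being that no new syscall is needed on the consuming side at all.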

I suppose some cluster filesystems might not support cross-directory links
(AFS is one, I think), but then, no cluster filesystem supports openg/sutoc.
If a filesystem's willing to add support for these handles, it shouldn't
be too hard for it to treat files starting with .sutoc specially, just as
efficiently as adding the openg/sutoc concept.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: openg and path_to_handle

2006-12-14 Thread Rob Ross

Christoph Hellwig wrote:

On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:

While it could do that, I'd be interested to see how you'd construct
the handle such that it's immune to a malicious user tampering with it,
or saving it across a reboot, or constructing one from scratch.

If the server has to have processed a real open request, say within
the preceding 30s, then it would have a handle for openfh() to match
against.  If the server reboots, or a client tries to construct a new
handle from scratch, or even tries to use the handle after the file is
closed then the handle would be invalid.

It isn't just an encoding for open-by-inum, but rather a handle that
references some just-created open file handle on the server.  That the
handle might contain the UID/GID is mostly irrelevant - either the
process + network is trusted to pass the handle around without snooping,
or a malicious client which intercepts the handle can spoof the UID/GID
just as easily.  Make the handle sufficiently large to avoid guessing
and it is secure enough until the whole filesystem is using kerberos
to avoid any number of other client/user spoofing attacks.
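A toy model of the scheme Andreas describes (all names, the table layout, and the 30-second TTL are illustrative assumptions, not anyone's actual implementation):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define HANDLE_TTL   30          /* seconds a handle stays valid */
#define MAX_HANDLES  1024

struct fs_handle {
	uint64_t cookie;         /* large random value, hard to guess */
	time_t   created;        /* when the real open was processed */
	int      in_use;
};

static struct fs_handle table[MAX_HANDLES];

/* Record a handle at openg() time; returns 0 on success, -1 if full. */
int handle_insert(uint64_t cookie, time_t now)
{
	for (int i = 0; i < MAX_HANDLES; i++) {
		if (!table[i].in_use) {
			table[i].cookie = cookie;
			table[i].created = now;
			table[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* openfh() lookup: only cookies the server itself issued within the
 * TTL match.  A cookie constructed from scratch, or replayed after a
 * server reboot (the table lives in volatile memory), simply fails. */
int handle_lookup(uint64_t cookie, time_t now)
{
	for (int i = 0; i < MAX_HANDLES; i++) {
		if (table[i].in_use && table[i].cookie == cookie) {
			if (now - table[i].created > HANDLE_TTL) {
				table[i].in_use = 0;  /* expire */
				return -1;
			}
			return 0;
		}
	}
	return -1;
}
```

The expiry check is what makes guessing or replay useless even without any cryptography.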


That would be fine as long as the file handle would be a kernel-level
concept.  The issue here is that they intend to make the whole filehandle
visible to userspace, for example to pass it around via MPI.  As soon as
an untrusted user can tamper with the file handle we're in trouble.


I guess it could reference some just-created open file handle on the 
server, if the server tracks that sort of thing. Or it could be a 
capability, as mentioned previously. So it isn't necessary to tie this 
to an open, but I think that would be a reasonable underlying 
implementation for a file system that tracks opens.


If clients can survive a server reboot without a remount, then even this 
implementation should continue to operate if a server were rebooted, 
because the open file context would be reconstructed. If capabilities 
were being employed, we could likewise survive a server reboot.


But this issue of server reboots isn't that critical -- the use case has 
the handle being reused relatively quickly after the initial openg(), 
and clients have a clean fallback in the event that the handle is no 
longer valid -- just use open().


Visibility of the handle to a user does not imply that the user can 
effectively tamper with the handle. A cryptographically secure one-way 
hash of the data, stored in the handle itself, would allow servers to 
verify that the handle wasn't tampered with, or that the client just 
made up a handle from scratch. The server managing the metadata for that 
file would not need to share its nonce with other servers, assuming that 
single servers are responsible for particular files.
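As a sketch of that verification idea (illustrative only: a keyed 64-bit FNV-1a mix stands in for a real cryptographic MAC such as HMAC-SHA256, and the handle contents are invented):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The server keeps `key` (its nonce) private; clients see the handle
 * bytes plus the tag, but cannot forge a matching tag without the key.
 * NOTE: FNV-1a is NOT cryptographically secure; it is used here only
 * to keep the sketch self-contained. */
static uint64_t handle_tag(const unsigned char *data, size_t len, uint64_t key)
{
	uint64_t h = 14695981039346656037ULL ^ key;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Server-side check at openfh() time: recompute and compare. */
int handle_verify(const unsigned char *data, size_t len,
		  uint64_t tag, uint64_t key)
{
	return handle_tag(data, len, key) == tag ? 0 : -1;
}
```

Only the server that issued the handle needs the key, which matches the assumption that a single server is responsible for a given file.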


Regards,

Rob


Re: [PATCH 1/10] lockd: add new export operation for nfsv4/lockd locking

2006-12-14 Thread J. Bruce Fields
By the way, one other issue I think we'll need to resolve:

On Wed, Dec 06, 2006 at 12:34:11AM -0500, J. Bruce Fields wrote:
 +/**
 + * vfs_cancel_lock - file byte range unblock lock
 + * @filp: The file to apply the unblock to
 + * @fl: The lock to be unblocked
 + *
 + * FL_CANCELED is used to cancel blocked requests
 + */
 +int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
 +{
 + int status;
 + struct super_block *sb;
 +
 + fl->fl_flags |= FL_CANCEL;
 + sb = filp->f_dentry->d_inode->i_sb;
 + if (sb->s_export_op && sb->s_export_op->lock)
 + status = sb->s_export_op->lock(filp, F_SETLK, fl);
 + else
 + status = posix_unblock_lock(filp, fl);
 + fl->fl_flags &= ~FL_CANCEL;
 + return status;
 +}

So we're passing cancel requests to the filesystem by setting an
FL_CANCEL flag in fl_flags and then calling the lock operation.  I think
Trond has said he'd rather keep fl_flags for permanent characteristics
of the lock in question, rather than as a channel for passing arguments
to lock operations.  Also, the GFS patch isn't checking FL_CANCEL, so
that's a bug.

--b.


Re: openg and path_to_handle

2006-12-14 Thread Rob Ross

Matthew Wilcox wrote:

On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
I don't think that I understand what you're saying here. The openg() 
call does not actually open the file (not that that is necessarily even 
a first-class FS operation); it simply does the lookup.


When we were naming these calls, from a POSIX consistency perspective it 
seemed best to keep the "open" nomenclature. That seems to be confusing 
to some. Perhaps we should rename the function "lookup" or something 
similar, to help keep from giving the wrong idea?


There is a difference between the openg() and path_to_handle() approach 
in that we do permission checking at openg(), and that does have 
implications on how the handle might be stored and such. That's being 
discussed in a separate thread.


I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:

int openg(const char *path)
{
	char *s;
	int err;

	do {
		s = tempnam(FSROOT, ".sutoc");
		err = link(path, s);
	} while (err == -1 && errno == EEXIST);

	mpi_broadcast(s);
	sleep(10);
	unlink(s);
	return 0;
}

and sutoc() becomes simply open().  Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.

I suppose some cluster filesystems might not support cross-directory links
(AFS is one, I think), but then, no cluster filesystem supports openg/sutoc.


Well, at least one does :).


If a filesystem's willing to add support for these handles, it shouldn't
be too hard for it to treat files starting with .sutoc specially, just as
efficiently as adding the openg/sutoc concept.


Adding atomic reference count updating on file metadata so that we can 
have cross-directory links is not necessarily easier than supporting 
openg/openfh, and supporting cross-directory links precludes certain 
metadata organizations, such as the ones being used in Ceph (as I 
understand it).


This also still forces all clients to read a directory and for N 
permission checking operations to be performed. I don't see what the FS 
could do to eliminate those operations given what you've described. Am I 
missing something?


Also this looks too much like sillyrename, and that's hard to swallow...

Regards,

Rob


Re: [NFS] asynchronous locks for cluster exports

2006-12-14 Thread J. Bruce Fields
On Thu, Dec 07, 2006 at 04:51:08PM +, Christoph Hellwig wrote:
 On Wed, Dec 06, 2006 at 12:34:10AM -0500, J. Bruce Fields wrote:
  We'd like an asynchronous posix locking interface so that we can provide NFS
  - We added a new ->lock() export operation, figuring this was a feature
that only lockd and nfsd care about for now, and that we'd rather not
muck about with common locking code.  But the export operation is
pretty much identical to the file ->lock() operation; would it make
more sense to use that?
 
 This definitely needs to be merged back into the ->lock file operation

So the interesting question is whether we can merge the semantics in a
reasonable way.  The export operation implemented by the current version
of these patches returns more or less immediately with success, -EAGAIN,
or -EINPROGRESS; in the latter case the filesystem later calls fl_notify
to communicate the results.  The existing file lock operation normally
blocks until the lock is actually granted.

The one file lock operation could do both, and just switch between the
two cases depending on whether fl_notify is defined.  Would the
semantics be clear enough?
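A standalone model of that dispatch, with stand-in types rather than the kernel's real struct file_lock (the names and the notify convention here are assumptions for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define EINPROGRESS_SKETCH 115   /* stand-in for -EINPROGRESS */

struct sketch_lock {
	int granted;
	int notified;
	/* Non-NULL means the caller wants asynchronous completion:
	 * the fs returns -EINPROGRESS now and calls fl_notify later. */
	void (*fl_notify)(struct sketch_lock *fl);
};

/* Sample callback an async caller (e.g. lockd) might install. */
static void example_notify(struct sketch_lock *fl)
{
	fl->notified = 1;
}

/* The filesystem eventually grants the lock. */
static void grant_later(struct sketch_lock *fl)
{
	fl->granted = 1;
	if (fl->fl_notify)
		fl->fl_notify(fl);
}

/* A single ->lock()-style entry point serving both styles: async
 * callers get -EINPROGRESS immediately and hear back via fl_notify;
 * sync callers "block" (modelled here by granting inline). */
int sketch_lock_op(struct sketch_lock *fl)
{
	if (fl->fl_notify)
		return -EINPROGRESS_SKETCH;
	grant_later(fl);        /* stand-in for blocking until granted */
	return 0;
}
```

The switch is driven purely by whether fl_notify is defined, which is exactly the ambiguity being asked about: a sync caller that accidentally leaves the pointer set would silently get async behaviour.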

I find the existing use of ->lock() a little odd as it is; stuff like this,
from fcntl_getlk():

	if (filp->f_op && filp->f_op->lock) {
		error = filp->f_op->lock(filp, F_GETLK, &file_lock);
		if (file_lock.fl_ops && file_lock.fl_ops->fl_release_private)
			file_lock.fl_ops->fl_release_private(&file_lock);
		if (error < 0)
			goto out;
		else
			fl = (file_lock.fl_type == F_UNLCK ? NULL : &file_lock);
	} else {
		fl = (posix_test_lock(filp, &file_lock, &cfl) ? &cfl : NULL);
	}

--b.


statlite()

2006-12-14 Thread Rob Ross
We're going to clean the statlite() call up based on this (and 
subsequent) discussion and post again.


Thanks!

Rob

Ulrich Drepper wrote:

Christoph Hellwig wrote:

Ulrich, this in reply to these API proposals:


I know the documents.  The HECWG was actually supposed to submit an 
actual draft to the OpenGroup-internal working group but I haven't seen 
anything yet.  I'm not opposed to getting real-world experience first.



So other than this lite version of the readdirplus() call, and this 
idea of making the flags indicate validity rather than accuracy, are 
there other comments on the directory-related calls? I understand 
that they might or might not ever make it in, but assuming they did, 
what other changes would you like to see?


I don't think an accuracy flag is useful at all.  Programs don't want to 
use fuzzy information.  If you want a fast 'ls -l' then add a mode which 
doesn't print the fields which are not provided.  Don't provide outdated 
information.  Similarly for other programs.




statlite needs to separate the flag for valid fields from the actual
stat structure and reuse the existing stat(64) structure.  stat lite
needs to at least get a better name, even better be folded into *statat*,
either by having a new AT_VALID_MASK flag that enables a new
unsigned int valid argument or by folding the valid flags into the AT_
flags.


Yes, this is also my pet peeve with this interface.  I don't want to 
have another data structure.  Especially since programs might want to 
store the value in places where normal stat results are returned.


And also yes on 'statat'.  I strongly suggest to define only a statat 
variant.  In the standards group I'll vehemently oppose the introduction 
of yet another superfluous non-*at interface.


As for reusing the existing statat interface and magically add another 
parameter through ellipsis: no.  We need to become more type-safe.  The 
userlevel interface needs to be a new one.  For the system call there is 
no such restriction.  We can indeed extend the existing syscall.  We 
have appropriate checks for the validity of the flags parameter in place 
which make such calls backward compatible.





I think having a stat lite variant is pretty much consensus, we just need
to fine tune the actual API - and of course get a reference 
implementation.

So if you want to get this going try to implement it based on
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2.
Bonus points for actually making use of the flags in some filesystems.


I don't like that approach.  The flag parameter should be exclusively an 
output parameter.  By default the kernel should fill in all the fields 
it has access to.  If access is not easily possible then set the bit and 
clear the field.  There are of course certain fields which always should 
be added.  In the proposed man page these are already identified (i.e., 
those before the st_litemask member).
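To make the position above concrete, here is a userspace illustration of an output-only valid mask (not a proposed ABI; every name, including the SL_* bits and the cached values, is invented for this sketch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Output-style valid mask: the callee fills every field it can get
 * cheaply, sets the bit for each field it could NOT provide, and
 * zeroes that field.  The caller passes no input mask at all. */
#define SL_NO_SIZE  (1u << 0)    /* st_size not available cheaply */
#define SL_NO_MTIME (1u << 1)    /* st_mtime not available cheaply */

struct stat_sketch {             /* stand-in for reusing struct stat64 */
	uint64_t st_size;
	uint64_t st_mtime;
};

/* Hypothetical fast-path stat: have_size/have_mtime model whether a
 * distributed fs can answer without contacting other servers. */
unsigned int statlite_sketch(struct stat_sketch *st,
			     int have_size, int have_mtime)
{
	unsigned int missing = 0;

	memset(st, 0, sizeof(*st));
	if (have_size)
		st->st_size = 4096;           /* pretend cached value */
	else
		missing |= SL_NO_SIZE;
	if (have_mtime)
		st->st_mtime = 1166000000;    /* pretend cached value */
	else
		missing |= SL_NO_MTIME;
	return missing;
}
```

A caller like 'ls -l' would then simply skip printing columns whose bit is set, rather than ever seeing a stale value.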




At the actual
C prototype level I would rename d_stat_err to d_stat_errno for 
consistency

and maybe drop the readdirplus() entry point in favour of readdirplus_r
only - there is no point in introducing new non-re-entrant APIs today.





Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov
 Nikolai Joukov wrote:
  We have designed a new stackable file system that we called RAIF:
  Redundant Array of Independent Filesystems.

 Great!

  We have performed some benchmarking on a 3GHz PC with 2GB of RAM and U320
  SCSI disks.  Compared to the Linux RAID driver, RAIF has overheads of
  about 20-25% under the Postmark v1.5 benchmark in case of striping and
  replication.  In case of RAID4 and RAID5-like configurations, RAIF
  performed about two times *better* than software RAID and even better than
  an Adaptec 2120S RAID5 controller.

 I am not surprised.  RAID 4/5/6 performance is highly sensitive to the
 underlying hw, and thus needs a fair amount of fine tuning.

Nevertheless, performance is not the biggest advantage of RAIF.  For
read-biased workloads RAID is always slightly faster than RAIF.  The
biggest advantages of RAIF are flexible configurations (e.g., can combine
NFS and local file systems), per-file-type storage policies, and the fact
that files are stored as files on the lower file systems (which is
convenient).

  This is because RAIF is located above
  file system caches and can cache parity as normal data when needed.  We
  have more performance details in a technical report, if anyone is
  interested.

 Definitely interested.  Can you give a link?

The main focus of the paper is on a general OS profiling method and not
on RAIF.  However, it has some details about the RAIF benchmarking with
Postmark in Chapter 9:

  http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf

Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
operation under the same Postmark workload.

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
-


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov
  We started the project in April 2004.  Right now I am using it as my
  /home/kolya file system at home.  We believe that at this stage RAIF is
  mature enough for others to try it out.  The code is available at:
 
  ftp://ftp.fsl.cs.sunysb.edu/pub/raif/
 
  The code requires no kernel patches and compiles for a wide range of
  kernels as a module.  The latest kernel we used it for is 2.6.13 and we
  are in the process of porting it to 2.6.19.
 
  We will be happy to hear back from you.

 When removing a file from the underlying branch, the oops below happens.
 Wouldn't it be possible to just fail the branch instead of oopsing?

This is a known problem of all Linux stackable file systems.  Users are
not supposed to change the file systems below mounted stackable file
systems (but they can read them).  One of the ways to enforce it is to use
overlay mounts.  For example, mount the lower file systems at
/raif/b0 ... /raif/bN and then mount RAIF at /raif.  Stackable file
systems recently started getting into the kernel and we hope that there
will be a better solution for this problem in the future.  Having said
that, you are right: failing the branch would be the right thing to do.

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
-


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov
 Well, Congratulations, Doctor!!  [Must be nice to be exiled to Stony
 Brook!!  Oh, well, not I]

Long Island is a very nice place with lots of wineries and perfect sand
beaches - don't be envious :-)

 Here's hoping that source exists, and that it is available for us.

I guess you are subscribed to the linux-raid list only.  Unfortunately, I
didn't CC my post to that list and one of the replies was CC'd there
without the link.  The original post is available here:

  http://marc.theaimsgroup.com/?l=linux-fsdevel&m=116603282106036&w=2

And the link to the sources is:

  ftp://ftp.fsl.cs.sunysb.edu/pub/raif/

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
-


Re: [PATCH 1/10] lockd: add new export operation for nfsv4/lockd locking

2006-12-14 Thread Marc Eshel
Let's see if we use export operations or VFS calls. If we use export operations, we can 
even add another call just for cancel, or maybe we can add a new VFS call.
Marc. 

J. Bruce Fields [EMAIL PROTECTED] wrote on 12/14/2006 03:04:42 PM:

 By the way, one other issue I think we'll need to resolve:
 
 On Wed, Dec 06, 2006 at 12:34:11AM -0500, J. Bruce Fields wrote:
  +/**
  + * vfs_cancel_lock - file byte range unblock lock
  + * @filp: The file to apply the unblock to
  + * @fl: The lock to be unblocked
  + *
  + * FL_CANCELED is used to cancel blocked requests
  + */
  +int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
  +{
  +   int status;
  +   struct super_block *sb;
  +
   +   fl->fl_flags |= FL_CANCEL;
   +   sb = filp->f_dentry->d_inode->i_sb;
   +   if (sb->s_export_op && sb->s_export_op->lock)
   +  status = sb->s_export_op->lock(filp, F_SETLK, fl);
   +   else
   +  status = posix_unblock_lock(filp, fl);
   +   fl->fl_flags &= ~FL_CANCEL;
  +   return status;
  +}
 
 So we're passing cancel requests to the filesystem by setting an
 FL_CANCEL flag in fl_flags and then calling the lock operation.  I think
 Trond has said he'd rather keep fl_flags for permanent characteristics
 of the lock in question, rather than as a channel for passing arguments
 to lock operations.  Also, the GFS patch isn't checking FL_CANCEL, so
 that's a bug.
 
 --b.
