Re: openg and path_to_handle
On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
> I don't think that I understand what you're saying here. The openg() call
> does not perform a file open (not that that is necessarily even a
> first-class FS operation); it simply does the lookup. When we were naming
> these calls, from a POSIX consistency perspective it seemed best to keep
> the open nomenclature. That seems to be confusing to some. Perhaps we
> should rename the function to lookup() or something similar, to help keep
> from giving the wrong idea?
>
> There is a difference between the openg() and path_to_handle() approach
> in that we do permission checking at openg(), and that does have
> implications for how the handle might be stored and such. That's being
> discussed in a separate thread.

I was just thinking about how one might implement this, when it struck me
... how much more efficient is a kernel implementation compared to:

	int openg(const char *path)
	{
		char *s;
		do {
			s = tempnam(FSROOT, ".sutoc");
			link(path, s);
		} while (errno == EEXIST);
		mpi_broadcast(s);
		sleep(10);
		unlink(s);
	}

and sutoc() becomes simply open(). Now you have a name that's quick to open
(if a client has the filesystem mounted, it has a handle for the root
already), has a defined lifespan, has minimal permission checking, and
doesn't require standardisation.

I suppose some cluster fs's might not support cross-directory links (AFS is
one, I think), but then, no cluster fs's support openg/sutoc. If a
filesystem's willing to add support for these handles, it shouldn't be too
hard for it to treat files starting with .sutoc specially, and as
efficiently as adding the openg/sutoc concept.
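[For illustration only, a minimal sketch of what the matching sutoc() side
reduces to under this scheme. FSROOT is assumed to be the same mount-point
macro as in the openg() sketch above, and the "handle" is just the
broadcast temporary name; none of this is an existing interface.]

	#include <fcntl.h>
	#include <limits.h>
	#include <stdio.h>

	/* Assumption carried over from the openg() sketch above; not a
	 * real interface. */
	#ifndef FSROOT
	#define FSROOT "/mnt/clusterfs"
	#endif

	/* Hypothetical sutoc(): the handle is just the temporary name
	 * that openg() linked into place and broadcast, so opening it is
	 * an ordinary open() with a single-component lookup under the
	 * already-known filesystem root. */
	int sutoc(const char *handle_name, int flags)
	{
		char path[PATH_MAX];

		snprintf(path, sizeof(path), "%s/%s", FSROOT, handle_name);
		return open(path, flags);
	}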
Re: openg and path_to_handle
Christoph Hellwig wrote:
> On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
>> While it could do that, I'd be interested to see how you'd construct the
>> handle such that it's immune to a malicious user tampering with it, or
>> saving it across a reboot, or constructing one from scratch.
>>
>> If the server has to have processed a real open request, say within the
>> preceding 30s, then it would have a handle for openfh() to match
>> against. If the server reboots, or a client tries to construct a new
>> handle from scratch, or even tries to use the handle after the file is
>> closed, then the handle would be invalid. It isn't just an encoding for
>> open-by-inum, but rather a handle that references some just-created open
>> file handle on the server.
>>
>> That the handle might contain the UID/GID is mostly irrelevant - either
>> the process + network is trusted to pass the handle around without
>> snooping, or a malicious client which intercepts the handle can spoof
>> the UID/GID just as easily. Make the handle sufficiently large to avoid
>> guessing and it is secure enough until the whole filesystem is using
>> kerberos to avoid any number of other client/user spoofing attacks.
>
> That would be fine as long as the file handle were a kernel-level
> concept. The issue here is that they intend to make the whole file handle
> visible to userspace, for example to pass it around via MPI. As soon as
> an untrusted user can tamper with the file descriptor we're in trouble.

I guess it could reference some just-created open file handle on the
server, if the server tracks that sort of thing. Or it could be a
capability, as mentioned previously. So it isn't necessary to tie this to
an open, but I think that would be a reasonable underlying implementation
for a file system that tracks opens.

If clients can survive a server reboot without a remount, then even this
implementation should continue to operate if a server were rebooted,
because the open file context would be reconstructed. If capabilities were
being employed, we could likewise survive a server reboot. But this issue
of server reboots isn't that critical -- the use case has the handle being
reused relatively quickly after the initial openg(), and clients have a
clean fallback in the event that the handle is no longer valid -- just use
open().

Visibility of the handle to a user does not imply that the user can
effectively tamper with the handle. A cryptographically secure one-way hash
of the data, stored in the handle itself, would allow servers to verify
that the handle wasn't tampered with, or that the client just made up a
handle from scratch. The server managing the metadata for that file would
not need to share its nonce with other servers, assuming that single
servers are responsible for particular files.

Regards,

Rob
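[As an illustration only, not part of the proposal: one way a server could
make a userspace-visible handle tamper-evident is to append a keyed MAC
computed with a per-server secret, so a handle a client invents or modifies
fails verification. The struct layout and field names below are invented
for the example; the MAC here uses OpenSSL's HMAC() purely for
concreteness.]

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>
	#include <openssl/evp.h>
	#include <openssl/hmac.h>

	#define HANDLE_MAC_LEN 32	/* HMAC-SHA256 output size */

	/* Hypothetical wire format for a userspace-visible handle; the
	 * fields are made up for this example. */
	struct fs_handle {
		uint64_t server_id;	/* which server issued it */
		uint64_t open_id;	/* server-side open instance */
		uint64_t expiry;	/* end of the handle's lifespan */
		uint8_t  mac[HANDLE_MAC_LEN]; /* MAC over the fields above */
	};

	/* Server-side check: recompute the MAC with the per-server secret
	 * and compare. A forged or modified handle fails here without any
	 * shared state beyond the secret itself. */
	int handle_verify(const struct fs_handle *h,
			  const void *secret, int secretlen, uint64_t now)
	{
		uint8_t mac[HANDLE_MAC_LEN];
		unsigned int maclen = sizeof(mac);

		HMAC(EVP_sha256(), secret, secretlen,
		     (const unsigned char *)h,
		     offsetof(struct fs_handle, mac), mac, &maclen);
		if (memcmp(mac, h->mac, HANDLE_MAC_LEN) != 0)
			return 0;	/* tampered with or made up */
		return now <= h->expiry; /* also enforce the lifespan */
	}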
Re: [PATCH 1/10] lockd: add new export operation for nfsv4/lockd locking
By the way, one other issue I think we'll need to resolve:

On Wed, Dec 06, 2006 at 12:34:11AM -0500, J. Bruce Fields wrote:
> +/**
> + * vfs_cancel_lock - file byte range unblock lock
> + * @filp: The file to apply the unblock to
> + * @fl: The lock to be unblocked
> + *
> + * FL_CANCELED is used to cancel blocked requests
> + */
> +int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
> +{
> +	int status;
> +	struct super_block *sb;
> +
> +	fl->fl_flags |= FL_CANCEL;
> +	sb = filp->f_dentry->d_inode->i_sb;
> +	if (sb->s_export_op && sb->s_export_op->lock)
> +		status = sb->s_export_op->lock(filp, F_SETLK, fl);
> +	else
> +		status = posix_unblock_lock(filp, fl);
> +	fl->fl_flags &= ~FL_CANCEL;
> +	return status;
> +}

So we're passing cancel requests to the filesystem by setting an FL_CANCEL
flag in fl_flags and then calling the lock operation. I think Trond has
said he'd rather keep fl_flags for permanent characteristics of the lock in
question, rather than as a channel for passing arguments to lock
operations.

Also, the GFS patch isn't checking FL_CANCEL, so that's a bug.

--b.
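[Purely to illustrate the alternative hinted at above, and not code from
the posted patches: the cancel could be expressed as its own command value
passed to the same lock operation, so nothing transient has to live in
fl_flags. F_CANCELLK here is an invented command number.]

	/* Sketch only: dispatch cancel as a distinct command instead of
	 * a transient flag in fl_flags. F_CANCELLK is a made-up value. */
	#define F_CANCELLK	(F_SETLK + 100)

	int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
	{
		struct super_block *sb = filp->f_dentry->d_inode->i_sb;

		if (sb->s_export_op && sb->s_export_op->lock)
			return sb->s_export_op->lock(filp, F_CANCELLK, fl);
		return posix_unblock_lock(filp, fl);
	}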
Re: openg and path_to_handle
Matthew Wilcox wrote:
> On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
>> I don't think that I understand what you're saying here. The openg()
>> call does not perform a file open (not that that is necessarily even a
>> first-class FS operation); it simply does the lookup. When we were
>> naming these calls, from a POSIX consistency perspective it seemed best
>> to keep the open nomenclature. That seems to be confusing to some.
>> Perhaps we should rename the function to lookup() or something similar,
>> to help keep from giving the wrong idea?
>>
>> There is a difference between the openg() and path_to_handle() approach
>> in that we do permission checking at openg(), and that does have
>> implications for how the handle might be stored and such. That's being
>> discussed in a separate thread.
>
> I was just thinking about how one might implement this, when it struck me
> ... how much more efficient is a kernel implementation compared to:
>
> 	int openg(const char *path)
> 	{
> 		char *s;
> 		do {
> 			s = tempnam(FSROOT, ".sutoc");
> 			link(path, s);
> 		} while (errno == EEXIST);
> 		mpi_broadcast(s);
> 		sleep(10);
> 		unlink(s);
> 	}
>
> and sutoc() becomes simply open(). Now you have a name that's quick to
> open (if a client has the filesystem mounted, it has a handle for the
> root already), has a defined lifespan, has minimal permission checking,
> and doesn't require standardisation.
>
> I suppose some cluster fs's might not support cross-directory links (AFS
> is one, I think), but then, no cluster fs's support openg/sutoc.

Well, at least one does :).

> If a filesystem's willing to add support for these handles, it shouldn't
> be too hard for it to treat files starting with .sutoc specially, and as
> efficiently as adding the openg/sutoc concept.

Adding atomic reference count updating on file metadata so that we can have
cross-directory links is not necessarily easier than supporting
openg/openfh, and supporting cross-directory links precludes certain
metadata organizations, such as the ones being used in Ceph (as I
understand it).

This also still forces all clients to read a directory and for N permission
checking operations to be performed. I don't see what the FS could do to
eliminate those operations given what you've described. Am I missing
something?

Also, this looks too much like sillyrename, and that's hard to swallow...

Regards,

Rob
Re: [NFS] asynchronous locks for cluster exports
On Thu, Dec 07, 2006 at 04:51:08PM, Christoph Hellwig wrote:
> On Wed, Dec 06, 2006 at 12:34:10AM -0500, J. Bruce Fields wrote:
>> We'd like an asynchronous posix locking interface so that we can provide
>> NFS
>>
>> - We added a new ->lock() export operation, figuring this was a feature
>>   that only lockd and nfsd care about for now, and that we'd rather not
>>   muck about with common locking code. But the export operation is
>>   pretty much identical to the file ->lock() operation; would it make
>>   more sense to use that?
>
> This definitely needs to be merged back into the ->lock file operation.

So the interesting question is whether we can merge the semantics in a
reasonable way. The export operation implemented by the current version of
these patches returns more or less immediately with success, -EAGAIN, or
-EINPROGRESS; in the latter case the filesystem later calls fl_notify to
communicate the results. The existing file lock operation normally blocks
until the lock is actually granted.

The one file lock operation could do both, and just switch between the two
cases depending on whether fl_notify is defined. Would the semantics be
clear enough?

I find the existing use of ->lock() a little odd as it is; stuff like this,
from fcntl_getlk():

	if (filp->f_op && filp->f_op->lock) {
		error = filp->f_op->lock(filp, F_GETLK, &file_lock);
		if (file_lock.fl_ops && file_lock.fl_ops->fl_release_private)
			file_lock.fl_ops->fl_release_private(&file_lock);
		if (error < 0)
			goto out;
		else
			fl = (file_lock.fl_type == F_UNLCK ? NULL : &file_lock);
	} else {
		fl = (posix_test_lock(filp, &file_lock, &cfl) ? &cfl : NULL);
	}

--b.
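[To make the "switch on fl_notify" idea concrete, here is a rough sketch of
how a single ->lock() file operation might dispatch. The helper names are
invented, and the assumption that fl_notify is reached via fl->fl_lmops
follows lockd's lock manager operations; this is not code from the posted
patches.]

	/* Sketch only: one ->lock() operation serving both the blocking
	 * POSIX path and the asynchronous lockd/nfsd path.
	 * example_lock_async() and example_lock_blocking() are invented
	 * helper names. */
	static int example_lock(struct file *filp, int cmd,
				struct file_lock *fl)
	{
		if (fl->fl_lmops && fl->fl_lmops->fl_notify)
			/* Caller can handle a deferred answer: may return
			 * -EINPROGRESS now and report the result through
			 * fl_notify later. */
			return example_lock_async(filp, cmd, fl);

		/* No callback provided: behave like today's file ->lock()
		 * and block until the lock is granted or fails. */
		return example_lock_blocking(filp, cmd, fl);
	}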
statlite()
We're going to clean the statlite() call up based on this (and subsequent)
discussion and post again. Thanks!

Rob

Ulrich Drepper wrote:

Christoph Hellwig wrote:
Ulrich, this is in reply to these API proposals:

I know the documents. The HECWG was actually supposed to submit an actual
draft to the OpenGroup-internal working group, but I haven't seen anything
yet. I'm not opposed to getting real-world experience first.

So other than this lite version of the readdirplus() call, and this idea of
making the flags indicate validity rather than accuracy, are there other
comments on the directory-related calls? I understand that they might or
might not ever make it in, but assuming they did, what other changes would
you like to see?

I don't think an accuracy flag is useful at all. Programs don't want to use
fuzzy information. If you want a fast 'ls -l' then add a mode which doesn't
print the fields which are not provided. Don't provide outdated
information. Similarly for other programs.

statlite needs to separate the flag for valid fields from the actual stat
structure and reuse the existing stat(64) structure. statlite needs to at
least get a better name, or even better be folded into *statat*, either by
having a new AT_VALID_MASK flag that enables a new unsigned int valid
argument or by folding the valid flags into the AT_ flags.

Yes, this is also my pet peeve with this interface. I don't want to have
another data structure, especially since programs might want to store the
value in places where normal stat results are returned. And also yes on
'statat'. I strongly suggest to define only a statat variant. In the
standards group I'll vehemently oppose the introduction of yet another
superfluous non-*at interface.

As for reusing the existing statat interface and magically adding another
parameter through ellipsis: no. We need to become more type-safe. The
userlevel interface needs to be a new one. For the system call there is no
such restriction; we can indeed extend the existing syscall. We have
appropriate checks for the validity of the flags parameter in place which
make such calls backward compatible.

I think having a stat lite variant is pretty much consensus; we just need
to fine-tune the actual API - and of course get a reference implementation.
So if you want to get this going, try to implement it based on
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2. Bonus
points for actually making use of the flags in some filesystems.

I don't like that approach. The flag parameter should be exclusively an
output parameter. By default the kernel should fill in all the fields it
has access to. If access is not easily possible then set the bit and clear
the field. There are of course certain fields which should always be
provided. In the proposed man page these are already identified (i.e.,
those before the st_litemask member).

At the actual C prototype level I would rename d_stat_err to d_stat_errno
for consistency, and maybe drop the readdirplus() entry point in favour of
readdirplus_r() only - there is no point in introducing new non-reentrant
APIs today.
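[To illustrate the shape of the interface being argued over: a statat-style
call that reuses struct stat and reports unfilled fields through an
output-only mask, following the suggestion above. The names, flag bits, and
the fallback implementation are invented for the example and are not the
HECWG draft or a real syscall.]

	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/stat.h>

	/* Illustrative bit names only. A set bit in *omitted marks a
	 * field the filesystem could not fill cheaply and therefore
	 * cleared. */
	#define STLITE_SIZE	0x0001u
	#define STLITE_BLOCKS	0x0002u
	#define STLITE_MTIME	0x0004u

	/* Hypothetical statat-style lightweight stat. This userspace
	 * stand-in just calls fstatat() and reports every field as
	 * filled; a real implementation would be a new syscall. */
	static int statat_lite(int dirfd, const char *path,
			       struct stat *buf, int at_flags,
			       unsigned int *omitted)
	{
		*omitted = 0;
		return fstatat(dirfd, path, buf, at_flags);
	}

	int main(void)
	{
		struct stat st;
		unsigned int omitted;

		if (statat_lite(AT_FDCWD, "somefile", &st, 0, &omitted) == 0) {
			if (!(omitted & STLITE_SIZE))
				printf("size: %lld\n", (long long)st.st_size);
			else
				printf("size not cheaply available\n");
		}
		return 0;
	}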
Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
> Nikolai Joukov wrote:
>> We have designed a new stackable file system that we called RAIF:
>> Redundant Array of Independent Filesystems.
>
> Great!
>
>> We have performed some benchmarking on a 3GHz PC with 2GB of RAM and
>> U320 SCSI disks. Compared to the Linux RAID driver, RAIF has overheads
>> of about 20-25% under the Postmark v1.5 benchmark in case of striping
>> and replication. In case of RAID4 and RAID5-like configurations, RAIF
>> performed about two times *better* than software RAID and even better
>> than an Adaptec 2120S RAID5 controller.
>
> I am not surprised. RAID 4/5/6 performance is highly sensitive to the
> underlying hw, and thus needs a fair amount of fine tuning.

Nevertheless, performance is not the biggest advantage of RAIF. For
read-biased workloads RAID is always slightly faster than RAIF. The biggest
advantages of RAIF are flexible configurations (e.g., can combine NFS and
local file systems), per-file-type storage policies, and the fact that
files are stored as files on the lower file systems (which is convenient).

This is because RAIF is located above file system caches and can cache
parity as normal data when needed.

>> We have more performance details in a technical report, if anyone is
>> interested.
>
> Definitely interested. Can you give a link?

The main focus of the paper is on a general OS profiling method and not on
RAIF. However, it has some details about the RAIF benchmarking with
Postmark in Chapter 9:

http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf

Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
operation under the same Postmark workload.

Nikolai.
--
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
>> We started the project in April 2004. Right now I am using it as my
>> /home/kolya file system at home. We believe that at this stage RAIF is
>> mature enough for others to try it out. The code is available at:
>>
>> ftp://ftp.fsl.cs.sunysb.edu/pub/raif/
>>
>> The code requires no kernel patches and compiles for a wide range of
>> kernels as a module. The latest kernel we used it for is 2.6.13, and we
>> are in the process of porting it to 2.6.19. We will be happy to hear
>> back from you.
>
> When removing a file from the underlying branch, the oops below happens.
> Wouldn't it be possible to just fail the branch instead of oopsing?

This is a known problem of all Linux stackable file systems. Users are not
supposed to change the file systems below mounted stackable file systems
(but they can read them). One of the ways to enforce this is to use overlay
mounts. For example, mount the lower file systems at /raif/b0 ... /raif/bN
and then mount RAIF at /raif. Stackable file systems recently started
getting into the kernel, and we hope that there will be a better solution
for this problem in the future.

Having said that, you are right: failing the branch would be the right
thing to do.

Nikolai.
--
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
> Well, Congratulations, Doctor!! [Must be nice to be exiled to Stony
> Brook!! Oh, well, not I]

Long Island is a very nice place with lots of wineries and perfect sand
beaches - don't envy :-)

> Here's hoping that source exists, and that it is available for us.

I guess you are subscribed to the linux-raid list only. Unfortunately, I
didn't CC my post to that list, and one of the replies was CC'd there
without the link. The original post is available here:

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=116603282106036&w=2

And the link to the sources is:

ftp://ftp.fsl.cs.sunysb.edu/pub/raif/

Nikolai.
--
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
Re: [PATCH 1/10] lockd: add new export operation for nfsv4/lockd locking
Let's see whether we use export operations or VFS calls. If we do exports,
we can even add another call just for cancel, or maybe we can add a new
VFS call.

Marc.

J. Bruce Fields [EMAIL PROTECTED] wrote on 12/14/2006 03:04:42 PM:

> By the way, one other issue I think we'll need to resolve:
>
> On Wed, Dec 06, 2006 at 12:34:11AM -0500, J. Bruce Fields wrote:
>> +/**
>> + * vfs_cancel_lock - file byte range unblock lock
>> + * @filp: The file to apply the unblock to
>> + * @fl: The lock to be unblocked
>> + *
>> + * FL_CANCELED is used to cancel blocked requests
>> + */
>> +int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
>> +{
>> +	int status;
>> +	struct super_block *sb;
>> +
>> +	fl->fl_flags |= FL_CANCEL;
>> +	sb = filp->f_dentry->d_inode->i_sb;
>> +	if (sb->s_export_op && sb->s_export_op->lock)
>> +		status = sb->s_export_op->lock(filp, F_SETLK, fl);
>> +	else
>> +		status = posix_unblock_lock(filp, fl);
>> +	fl->fl_flags &= ~FL_CANCEL;
>> +	return status;
>> +}
>
> So we're passing cancel requests to the filesystem by setting an
> FL_CANCEL flag in fl_flags and then calling the lock operation. I think
> Trond has said he'd rather keep fl_flags for permanent characteristics of
> the lock in question, rather than as a channel for passing arguments to
> lock operations.
>
> Also, the GFS patch isn't checking FL_CANCEL, so that's a bug.
>
> --b.