Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote: > I suppose it might be a bit late in the game to add a "goal" > parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make > the API more suitable for XFS? The goal could be a single __u64, or > a struct with e.g. __u64 byte offset (possibly also __u32 lun like > in FIEMAP). I guess the one potential limitation here is the > number of function parameters on some architectures. This isn't really about "more suitable for XFS" but more about more suitable for sophisticated layout decisions. But I'm still not confident this should be shohorned into this syscall. In fact I'm already rather unhappy about the feature churn in the current patch series. The more I think about it the more I'd prefer we would just put a simple syscall in that implements nothing but the posix_fallocate(3) semantics as defined in SuS, and then go on to brainstorm about advanced preallocation / layout hint semantics. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Jun 14, 2007 22:04 +1000, David Chinner wrote: > On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote: > > > B FA_DEALLOCATE > > > removes the underlying disk space with the given range. The disk space > > > shall be removed regardless of it's contents so both allocated space > > > from > > > B FA_ALLOCATE > > > and > > > B FA_PREALLOCATE > > > as well as from > > > B write(3) > > > will be removed. > > > B FA_DEALLOCATE > > > shall never remove disk blocks outside the range specified. > > > > So this is essentially the same as "punch". > > Depends on your definition of "punch". > > > There doesn't seem to be > > a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the > > end. > > ftruncate() No, that will delete written data also. What I'm thinking is in cases where an application does fallocate() to reserve a lot of space, and when the application is finished it wants to unreserve any unused space. > > > B FA_DEALLOCATE > > > shall never change the file size. If changing the file size > > > is required when deallocating blocks from an offset to end > > > of file (or beyond end of file) is required, > > > B ftuncate64(3) > > > should be used. > > > > This also seems to be a bit of a wart, since it isn't a natural converse > > of either of the above functions. How about having two modes, > > similar to FA_ALLOCATE and FA_PREALLOCATE? > > > > whatever. > > > Say, FA_PUNCH (which > > would be as you describe here - deletes all data in the specified > > range changing the file size if it overlaps EOF, > > Punch means different things to different people. To me (and probably > most XFS aware ppl) punch implies no change to the file size. If "punch" does not change the file size, how is it possible to determine the end of the actual written data? Say you have a file with records in it, and these records are cancelled as they are processed (e.g. a journal of sorts). One usage model for punch() that we had in the past is to punch out each record after it finishes processing, so that it will not be re-processed after a crash. If the file size doesn't change with punch then there is no way to know when the last record is hit and the rest of the file needs to be scanned. > i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching > holes to leave the file size unchanged. This is the behaviour I > described for FA_DEALLOCATE. > > > and FA_DEALLOCATE, > > which only deallocates unused FA_{PRE,}ALLOCATE space? > > That's an "unwritten-to-hole" extent conversion. Is that really > useful for anything? That's easily implemented with FIEMAP > and FA_DEALLOCATE. But why force the application to do this instead of making the fallocate API sensible and allowing it to be done directly? > Anyway, because we can't agree on a single pair of flags: > > FA_ALLOCATE== posix_fallocate() > FA_DEALLOCATE == unwritten-to-hole ??? I'd think this makes sense, being natural opposites of each other. FA_ALLOCATE doesn't overwrite existing data with zeros, so FA_DEALLOCATE shouldn't erase existing data. If FA_ALLOCATE extends the file size, then FA_DEALLOCATE should shrink it if there is no data at the end. > FA_RESV_SPACE == XFS_IOC_RESVSP64 > FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64 > > We might also consider making @mode be a mask instead of an enumeration: > > > > FA_FL_DEALLOC 0x01 (default allocate) > > FA_FL_KEEP_SIZE 0x02 (default extend/shrink size) > > FA_FL_DEL_DATA 0x04 (default keep written data on DEALLOC) > > #define FA_ALLOCATE 0 > #define FA_DEALLOCATE FA_FL_DEALLOC > #define FA_RESV_SPACE FA_FL_KEEP_SIZE > #define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA OK, this makes the semantics of XFS_IOC_RESVSP64 and XFS_IOC_UNRESVSP64 clear at least. The benefit is that it would also be possible (I'm not necessarily advocating this as a flag, just an example) to have semantics that are like XFS_IOC_ALLOCSP64 (zeroing written data while preallocating) with: #define FA_ZERO_SPACEFA_DEL_DATA or whatever semantics the caller actually wants, instead of restricting them to the subset of combinations given by FA_ALLOCATE and FA_DEALLOCATE (whatever it is we decide on in the end). > > > B ENOSPC > > > There is not enough space left on the device containing the file > > > referred to by > > > IR fd. > > > > Should probably say whether space is removed on failure or not. In > > Right. I'd say on error you need to FA_DEALLOCATE to ensure any space > allocated was freed back up. That way the error handling in the allocate > functions is much simpler (i.e. no need to undo there). Hmm, another flag? FA_FL_FREE_ENOSPC? I can imagine applications like PVRs to want to preallocate, say, an estimated 30 min of space for a show but if they only get 25 min of space returned they know some cleanup is in order (which can be done asynchronously while the show is filling the first 25 min of pre
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote: > On Jun 14, 2007 09:52 +1000, David Chinner wrote: > > B FA_PREALLOCATE > > provides the same functionality as > > B FA_ALLOCATE > > except it does not ever change the file size. This allows allocation > > of zero blocks beyond the end of file and is useful for optimising > > append workloads. > > TP > > B FA_DEALLOCATE > > removes the underlying disk space with the given range. The disk space > > shall be removed regardless of it's contents so both allocated space > > from > > B FA_ALLOCATE > > and > > B FA_PREALLOCATE > > as well as from > > B write(3) > > will be removed. > > B FA_DEALLOCATE > > shall never remove disk blocks outside the range specified. > > So this is essentially the same as "punch". Depends on your definition of "punch". > There doesn't seem to be > a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the > end. ftruncate() > > B FA_DEALLOCATE > > shall never change the file size. If changing the file size > > is required when deallocating blocks from an offset to end > > of file (or beyond end of file) is required, > > B ftuncate64(3) > > should be used. > > This also seems to be a bit of a wart, since it isn't a natural converse > of either of the above functions. How about having two modes, > similar to FA_ALLOCATE and FA_PREALLOCATE? whatever. > Say, FA_PUNCH (which > would be as you describe here - deletes all data in the specified > range changing the file size if it overlaps EOF, Punch means different things to different people. To me (and probably most XFS aware ppl) punch implies no change to the file size. i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching holes to leave the file size unchanged. This is the behaviour I described for FA_DEALLOCATE. > and FA_DEALLOCATE, > which only deallocates unused FA_{PRE,}ALLOCATE space? That's an "unwritten-to-hole" extent conversion. Is that really useful for anything? That's easily implemented with FIEMAP and FA_DEALLOCATE. Anyway, because we can't agree on a single pair of flags: FA_ALLOCATE== posix_fallocate() FA_DEALLOCATE == unwritten-to-hole ??? FA_RESV_SPACE == XFS_IOC_RESVSP64 FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64 > We might also consider making @mode be a mask instead of an enumeration: > > FA_FL_DEALLOC 0x01 (default allocate) > FA_FL_KEEP_SIZE 0x02 (default extend/shrink size) > FA_FL_DEL_DATA0x04 (default keep written data on DEALLOC) i.e: #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA > I suppose it might be a bit late in the game to add a "goal" > parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make > the API more suitable for XFS? It would suffice for the simpler operations, I think, but we'll rapidly run out of flags and we'll still need another interface for doing complex stuff. > The goal could be a single __u64, or > a struct with e.g. __u64 byte offset (possibly also __u32 lun like > in FIEMAP). I guess the one potential limitation here is the > number of function parameters on some architectures. To be useful it needs to __u64. > > B ENOSPC > > There is not enough space left on the device containing the file > > referred to by > > IR fd. > > Should probably say whether space is removed on failure or not. In Right. I'd say on error you need to FA_DEALLOCATE to ensure any space allocated was freed back up. That way the error handling in the allocate functions is much simpler (i.e. no need to undo there). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Jun 14, 2007 09:52 +1000, David Chinner wrote: > B FA_PREALLOCATE > provides the same functionality as > B FA_ALLOCATE > except it does not ever change the file size. This allows allocation > of zero blocks beyond the end of file and is useful for optimising > append workloads. > TP > B FA_DEALLOCATE > removes the underlying disk space with the given range. The disk space > shall be removed regardless of it's contents so both allocated space > from > B FA_ALLOCATE > and > B FA_PREALLOCATE > as well as from > B write(3) > will be removed. > B FA_DEALLOCATE > shall never remove disk blocks outside the range specified. So this is essentially the same as "punch". There doesn't seem to be a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the end. > B FA_DEALLOCATE > shall never change the file size. If changing the file size > is required when deallocating blocks from an offset to end > of file (or beyond end of file) is required, > B ftuncate64(3) > should be used. This also seems to be a bit of a wart, since it isn't a natural converse of either of the above functions. How about having two modes, similar to FA_ALLOCATE and FA_PREALLOCATE? Say, FA_PUNCH (which would be as you describe here - deletes all data in the specified range changing the file size if it overlaps EOF, and FA_DEALLOCATE, which only deallocates unused FA_{PRE,}ALLOCATE space? We might also consider making @mode be a mask instead of an enumeration: FA_FL_DEALLOC 0x01 (default allocate) FA_FL_KEEP_SIZE 0x02 (default extend/shrink size) FA_FL_DEL_DATA 0x04 (default keep written data on DEALLOC) We might then build FA_ALLOCATE and FA_DEALLOCATE out of these flags without making the interface sub-optimal. I suppose it might be a bit late in the game to add a "goal" parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make the API more suitable for XFS? The goal could be a single __u64, or a struct with e.g. __u64 byte offset (possibly also __u32 lun like in FIEMAP). I guess the one potential limitation here is the number of function parameters on some architectures. > B ENOSPC > There is not enough space left on the device containing the file > referred to by > IR fd. Should probably say whether space is removed on failure or not. In some (primitive) implementations it might no longer be possible to distinguish between unwritten extents and zero-filled blocks, though at this point DEALLOC of zero-filled blocks might not be harmful either. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote: > Did you get time to write the above man page ? It will help to push > further patches in time (eg. for FA_PREALLOCATE mode). First pass is attached. `nroff -man fallocate.2 | less` to view. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group .TH fallocate 2 .SH NAME fallocate \- allocate or remove file space .SH SYNOPSIS .nf .B #include .PP .BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len); .Op .SH DESCRIPTION The .BR fallocate syscall allows a user to directly manipulate the allocated disk space for the file referred to by .I fd for the byte range starting at .IR offset and continuing for .IR len bytes. The .I mode parameter determines the operation to be performed on the given range. Currently there are three modes: .TP .B FA_ALLOCATE allocates and initialises to zero the disk space within the given range. After a successful call, subsequent writes are guaranteed not to fail because of lack of disk space. If the size of the file is less than IR offset + len , then the file is increased to this size; otherwise the file size is left unchanged. B FA_ALLOCATE closely resembles B posix_fallocate(3) and is intended as a method of optimally implementing this function. B FA_ALLOCATE may allocate a larger range that was specified. TP B FA_PREALLOCATE provides the same functionality as B FA_ALLOCATE except it does not ever change the file size. This allows allocation of zero blocks beyond the end of file and is useful for optimising append workloads. TP B FA_DEALLOCATE removes the underlying disk space with the given range. The disk space shall be removed regardless of it's contents so both allocated space from B FA_ALLOCATE and B FA_PREALLOCATE as well as from B write(3) will be removed. B FA_DEALLOCATE shall never remove disk blocks outside the range specified. B FA_DEALLOCATE shall never change the file size. If changing the file size is required when deallocating blocks from an offset to end of file (or beyond end of file) is required, B ftuncate64(3) should be used. SH "RETURN VALUE" BR fallocate() returns zero on success, or an error number on failure. Note that IR errno is not set. SH "ERRORS" TP B EBADF I fd is not a valid file descriptor, or is not opened for writing. TP B EFBIG I offset+len exceeds the maximum file size. TP B EINVAL I offset or I len was less than 0. TP B ENODEV I fd does not refer to a regular file or a directory. TP B ENOSPC There is not enough space left on the device containing the file referred to by IR fd. TP B ESPIPE I fd refers to a pipe of file descriptor. B ENOSYS The filesystem underlying the file descriptor does not support this operation. SH AVAILABILITY The BR fallocate () system call is available since 2.6.XX SH "SEE ALSO" BR syscall (2), BR posix_fadvise (3) BR ftruncate (3)
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote: > On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote: > > Minimal definition to replace what applicaitons use on XFS and to > > support poasix_fallocate are the thre that have been mentioned so > > far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them > > all in a man page... > > Hi Dave, > > Did you get time to write the above man page ? It will help to push > further patches in time (eg. for FA_PREALLOCATE mode). No, I didn't. Instead of working on new preallocation stuff, I've been spending all my time fixing bugs found by new and interesting (ab)uses of preallocation and hole punching. > The idea I had was to push the patch with bare minimum functionality > (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other > new mode(s) based on the man page you planned to provide. Push them. I'll just make XFS work with whatever is provided. Is there a test harness for the syscall yet? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote: > On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote: > > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > > > All I'm really interested in right now is that the fallocate > > > _interface_ can be used as a *complete replacement* for the > > > pre-existing XFS-specific ioctls that are already used by > > > applications. What ext4 can or can't do right now is irrelevant to > > > this discussion - the interface definition needs to take priority > > > over implementation > > > > Would you like to write up an interface definition description (likely > > man page) and post it for review, possibly with a mention of apps using > > it today ? > > Yeah, I started doing that yesterday as i figured it was the only way > to cut the discussion short > > > One reason for introducing the mode parameter was to allow the interface to > > evolve incrementally as more options / semantic questions are proposed, so > > that we don't have to make all the decisions right now. > > So it would be good to start with a *minimal* definition, even just one > > mode. > > The rest could follow as subsequent patches, each being reviewed and debated > > separately. Otherwise this discussion can drag on for a long time. > > Minimal definition to replace what applicaitons use on XFS and to > support poasix_fallocate are the thre that have been mentioned so > far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them > all in a man page... Hi Dave, Did you get time to write the above man page ? It will help to push further patches in time (eg. for FA_PREALLOCATE mode). The idea I had was to push the patch with bare minimum functionality (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other new mode(s) based on the man page you planned to provide. Thanks! -- Regards, Amit Arora > > Cheers, > > Dave. > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote: > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > > All I'm really interested in right now is that the fallocate > > _interface_ can be used as a *complete replacement* for the > > pre-existing XFS-specific ioctls that are already used by > > applications. What ext4 can or can't do right now is irrelevant to > > this discussion - the interface definition needs to take priority > > over implementation > > Would you like to write up an interface definition description (likely > man page) and post it for review, possibly with a mention of apps using > it today ? Yeah, I started doing that yesterday as i figured it was the only way to cut the discussion short > One reason for introducing the mode parameter was to allow the interface to > evolve incrementally as more options / semantic questions are proposed, so > that we don't have to make all the decisions right now. > So it would be good to start with a *minimal* definition, even just one mode. > The rest could follow as subsequent patches, each being reviewed and debated > separately. Otherwise this discussion can drag on for a long time. Minimal definition to replace what applicaitons use on XFS and to support poasix_fallocate are the thre that have been mentioned so far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all in a man page... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > > I have the updated patches ready which take care of Andrew's comments. > > > > Will run some tests and post them soon. > > > > > > > > But, before submitting these patches, I think it will be better to > > > > finalize on certain things which might be worth some discussion here: > > > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > > in this case. I also tend to agree with them. Does anyone has an > > > > argument in favor of changing the filesize ? If not, I will remove the > > > > code which changes the filesize, before I resubmit the concerned ext4 > > > > patch. > > > > > > I think there needs to be both. If we don't have a mechanism to atomically > > > change the file size with the preallocation, then applications that use > > > stat() to work out if they need to preallocate more space will end up > > > racing. > > > > By "both" above, do you mean we should give user the flexibility if it wants > > the filesize changed or not ? It can be done by having *two* modes for > > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > > will change the filesize if required (i.e. when allocation is beyond EOF) > > and also update [cm]time. This way, the application can decide what it > > wants. > > Yes, that's right. > > > This will be helpfull for the partial allocation scenario also. Think of the > > case when we do not change the filesize in fallocate() and expect > > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > > Now if fallocate() results in a partial allocation with -ENOSPC error > > returned, applications/posix_fallocate() will not know for what length > > ftruncate() has to be called. :( > > Well, posix_fallocate() either gets all the space or it fails. If > you truncate to extend the file size after an ENOSPC, then that is > a buggy implementation. > > The same could be said for any application, or even the fallocate() > call itself if it changes the filesize without having completely > preallocated the space asked > > > Hence it may be a good idea to give user the flexibility if it wants to > > atomically change the file size with preallocation or not. But, with more > > flexibility there comes inconsistency in behavior, which is worth > > considering. > > We've got different modes to specify different behaviour. That's > what the mode field was put there for in the first place - the > interface is *designed* to support different preallocation > behaviours > > > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > > normal (non-preallocated) blocks (blocks allocated via regular > > > > write/truncate operations) also (i.e. work as punch()) ? > > > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > > i did for FA_UNALLOCATE as well. > > > > Ok. But, some people may not expect/like this. I think, we can keep it on > > the backburner for a while, till other issues are sorted out. > > How can it be a "backburner" issue when it defines the > implementation? I've already implemented some thing in XFS that > sort of does what I think that the interface is supposed to do, but > I need that interface to be nailed down before proceeding any > further. > > All I'm really interested in right now is that the fallocate > _interface_ can be used as a *complete replacement* for the > pre-existing XFS-specific ioctls that are already used by > applications. What ext4 can or can't do right now is irrelevant to > this discussion - the interface definition needs to take priority > over implementation Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. Regards Suparna > > Cheers, > > Dave, > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > I have the updated patches ready which take care of Andrew's comments. > > > Will run some tests and post them soon. > > > > > > But, before submitting these patches, I think it will be better to > > > finalize on certain things which might be worth some discussion here: > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > in this case. I also tend to agree with them. Does anyone has an > > > argument in favor of changing the filesize ? If not, I will remove the > > > code which changes the filesize, before I resubmit the concerned ext4 > > > patch. > > > > I think there needs to be both. If we don't have a mechanism to atomically > > change the file size with the preallocation, then applications that use > > stat() to work out if they need to preallocate more space will end up > > racing. > > By "both" above, do you mean we should give user the flexibility if it wants > the filesize changed or not ? It can be done by having *two* modes for > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > will change the filesize if required (i.e. when allocation is beyond EOF) > and also update [cm]time. This way, the application can decide what it > wants. Yes, that's right. > This will be helpfull for the partial allocation scenario also. Think of the > case when we do not change the filesize in fallocate() and expect > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > Now if fallocate() results in a partial allocation with -ENOSPC error > returned, applications/posix_fallocate() will not know for what length > ftruncate() has to be called. :( Well, posix_fallocate() either gets all the space or it fails. If you truncate to extend the file size after an ENOSPC, then that is a buggy implementation. The same could be said for any application, or even the fallocate() call itself if it changes the filesize without having completely preallocated the space asked > Hence it may be a good idea to give user the flexibility if it wants to > atomically change the file size with preallocation or not. But, with more > flexibility there comes inconsistency in behavior, which is worth > considering. We've got different modes to specify different behaviour. That's what the mode field was put there for in the first place - the interface is *designed* to support different preallocation behaviours > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > normal (non-preallocated) blocks (blocks allocated via regular > > > write/truncate operations) also (i.e. work as punch()) ? > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > i did for FA_UNALLOCATE as well. > > Ok. But, some people may not expect/like this. I think, we can keep it on > the backburner for a while, till other issues are sorted out. How can it be a "backburner" issue when it defines the implementation? I've already implemented some thing in XFS that sort of does what I think that the interface is supposed to do, but I need that interface to be nailed down before proceeding any further. All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation Cheers, Dave, -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > I have the updated patches ready which take care of Andrew's comments. > > Will run some tests and post them soon. > > > > But, before submitting these patches, I think it will be better to finalize > > on certain things which might be worth some discussion here: > > > > 1) Should the file size change when preallocation is done beyond EOF ? > >- Andreas and Chris Wedgwood are in favor of not changing the > > file size in this case. I also tend to agree with them. Does anyone > > has an argument in favor of changing the filesize ? > > If not, I will remove the code which changes the filesize, before I > > resubmit the concerned ext4 patch. > > I think there needs to be both. If we don't have a mechanism to > atomically change the file size with the preallocation, then > applications that use stat() to work out if they need to preallocate > more space will end up racing. By "both" above, do you mean we should give user the flexibility if it wants the filesize changed or not ? It can be done by having *two* modes for preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() will change the filesize if required (i.e. when allocation is beyond EOF) and also update [cm]time. This way, the application can decide what it wants. This will be helpfull for the partial allocation scenario also. Think of the case when we do not change the filesize in fallocate() and expect applications/posix_fallocate() to do ftruncate() after fallocate() for this. Now if fallocate() results in a partial allocation with -ENOSPC error returned, applications/posix_fallocate() will not know for what length ftruncate() has to be called. :( Hence it may be a good idea to give user the flexibility if it wants to atomically change the file size with preallocation or not. But, with more flexibility there comes inconsistency in behavior, which is worth considering. > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation > >of normal (non-preallocated) blocks (blocks allocated via > >regular write/truncate operations) also (i.e. work as punch()) ? > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and > what i did for FA_UNALLOCATE as well. Ok. But, some people may not expect/like this. I think, we can keep it on the backburner for a while, till other issues are sorted out. > >- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > > we need to finalize on the convention here as a general guideline > > to all the filesystems that implement fallocate. > > > > 3) If above is true, the file size will need to be changed > >for "unallocation" when block holding the EOF gets unallocated. > > No - we punch a hole. If you want the filesize to change, then > you use ftruncate() to remove the blocks at EOF and change the > file size atomically. Ok. > > > 4) Should we update mtime & ctime on a successfull allocation/ > >unallocation ? > >- David Chinner raised this question in following post: > > http://lkml.org/lkml/2007/4/29/407 > > I think it makes sense to update the [mc]time for a successfull > > preallocation/unallocation. Does anyone feel otherwise ? > > It will be interesting to know how XFS behaves currently. Does XFS > > update [mc]time for preallocation ? > > No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size > changes. If the filesize changes, it behaves exactly the same way that > ftruncate() behaves. Having additional mode (of FA_PREALLOCATE) might help here too. Please see above. -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > I have the updated patches ready which take care of Andrew's comments. > Will run some tests and post them soon. > > But, before submitting these patches, I think it will be better to finalize > on certain things which might be worth some discussion here: > > 1) Should the file size change when preallocation is done beyond EOF ? >- Andreas and Chris Wedgwood are in favor of not changing the > file size in this case. I also tend to agree with them. Does anyone > has an argument in favor of changing the filesize ? > If not, I will remove the code which changes the filesize, before I > resubmit the concerned ext4 patch. I think there needs to be both. If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate more space will end up racing. > 2) For FA_UNALLOCATE mode, should the file system allow unallocation >of normal (non-preallocated) blocks (blocks allocated via >regular write/truncate operations) also (i.e. work as punch()) ? Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what i did for FA_UNALLOCATE as well. >- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > we need to finalize on the convention here as a general guideline > to all the filesystems that implement fallocate. > > 3) If above is true, the file size will need to be changed >for "unallocation" when block holding the EOF gets unallocated. No - we punch a hole. If you want the filesize to change, then you use ftruncate() to remove the blocks at EOF and change the file size atomically. > 4) Should we update mtime & ctime on a successfull allocation/ >unallocation ? >- David Chinner raised this question in following post: > http://lkml.org/lkml/2007/4/29/407 > I think it makes sense to update the [mc]time for a successfull > preallocation/unallocation. Does anyone feel otherwise ? > It will be interesting to know how XFS behaves currently. Does XFS > update [mc]time for preallocation ? No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size changes. If the filesize changes, it behaves exactly the same way that ftruncate() behaves. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote: > I have the updated patches ready which take care of Andrew's comments. > Will run some tests and post them soon. > > But, before submitting these patches, I think it will be better to finalize > on certain things which might be worth some discussion here: > > 1) Should the file size change when preallocation is done beyond EOF ? >- Andreas and Chris Wedgwood are in favor of not changing the > file size in this case. I also tend to agree with them. Does anyone > has an argument in favor of changing the filesize ? > If not, I will remove the code which changes the filesize, before I > resubmit the concerned ext4 patch. > If we chose not to update the file size beyong EOF, then for filesystem without fallocate() support (ext2,3 currently), posix_fallocate() will follow the hard way(zero-out) to do preallocation. Then we will get different behavior on filesystems w/o fallocate() support. It make sense to be consistent, IMO. My point of view, preallocation is just a efficient way to allocating blocks for files without zero-out, other than this, the new behavior should be consistent with the old way: file size update,mtime/ctime, ENOSPC etc. Mingming - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On May 09, 2007 21:31 +0530, Amit K. Arora wrote: > 2) For FA_UNALLOCATE mode, should the file system allow unallocation >of normal (non-preallocated) blocks (blocks allocated via >regular write/truncate operations) also (i.e. work as punch()) ? >- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > we need to finalize on the convention here as a general guideline > to all the filesystems that implement fallocate. I would only allow this on FA_ALLOCATE extents. That means it won't be possible to do this for filesystems that don't understand unwritten extents unless there are blocks allocated beyond EOF. > 3) If above is true, the file size will need to be changed >for "unallocation" when block holding the EOF gets unallocated. >- If we do not "unallocate" normal (non-preallocated) blocks and we > do not change the file size on preallocation, then this is a > non-issue. Not necessarily. That will just make the file sparse. If FA_ALLOCATE does not change the file size, why should FA_UNALLOCATE. > 4) Should we update mtime & ctime on a successfull allocation/ >unallocation ? I would say yes. If glibc does the fallback fallocate via write() the mtime/ctime will be updated, so it makes sense to be consistent for both methods. Also, it just makes sense from the "this file was modified" point of view. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. 3) If above is true, the file size will need to be changed for "unallocation" when block holding the EOF gets unallocated. - If we do not "unallocate" normal (non-preallocated) blocks and we do not change the file size on preallocation, then this is a non-issue. 4) Should we update mtime & ctime on a successfull allocation/ unallocation ? - David Chinner raised this question in following post: http://lkml.org/lkml/2007/4/29/407 I think it makes sense to update the [mc]time for a successfull preallocation/unallocation. Does anyone feel otherwise ? It will be interesting to know how XFS behaves currently. Does XFS update [mc]time for preallocation ? -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > Of course the interface used by an application program would have the > > > fd first. Glibc can do the translation. > > > > I think that was understood. > > OK, then what does it matter what the glibc/kernel interface is, as > long as it works? > > It's only a minor point; the order of arguments can vary between > architectures if necessary, but it's nicer if they don't have to. > 32-bit powerpc will need to have the two int arguments adjacent in > order to avoid using more than 6 argument registers at the user/kernel > boundary, and s390 will need to avoid having a 64-bit argument last > (if I understand it correctly). You are right to say that. But, it may not be _that_ a minor point, especially for the arch which is getting affected. It has other implications like what Heiko noticed in his post below: http://lkml.org/lkml/2007/4/27/377 - implications like modifying glibc and *trace utilities for a particular arch. -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On 5/9/07, Paul Mackerras <[EMAIL PROTECTED]> wrote: Suparna Bhattacharya writes: > > Of course the interface used by an application program would have the > > fd first. Glibc can do the translation. > > I think that was understood. OK, then what does it matter what the glibc/kernel interface is, as long as it works? It's only a minor point; the order of arguments can vary between architectures if necessary, but it's nicer if they don't have to. 32-bit powerpc will need to have the two int arguments adjacent in order to avoid using more than 6 argument registers at the user/kernel boundary, and s390 will need to avoid having a 64-bit argument last (if I understand it correctly). Ah, almost but not quite the point. But I admit it is hard to understand.. The trouble started with the futex call which has been the first system call with 6 arguments. s390 supported only 5 arguments up to that point (%r2 - %r6). For futex we added a wrapper to the glibc that loaded the 6th argument to %r7. In entry.S we set up things so that %r7 gets stored to the kernel stack where normal C code expects the first overflow argument. This enabled us to use the standard futex system call with 6 arguments. fallocate now has an additional problem: the last argument is a 64 bit integers AND registers %r2-%r5 are already used. In this case the 64 bit number would have to be split into the high part in %r6 and the low part on the stack so that the glibc wrapper can load the low part to %r7. But the C compiler will skip %r6 and store the 64 bit number on the stack. If the order of the arguments if modified so that %r6 is assigned to a 32-bit argument, then the entry.S magic with %r7 would work. -- blue skies, Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Suparna Bhattacharya writes: > > Of course the interface used by an application program would have the > > fd first. Glibc can do the translation. > > I think that was understood. OK, then what does it matter what the glibc/kernel interface is, as long as it works? It's only a minor point; the order of arguments can vary between architectures if necessary, but it's nicer if they don't have to. 32-bit powerpc will need to have the two int arguments adjacent in order to avoid using more than 6 argument registers at the user/kernel boundary, and s390 will need to avoid having a 64-bit argument last (if I understand it correctly). Paul. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > This looks like it will have the same problem on s390 as > > > sys_sync_file_range. Maybe the prototype should be: > > > > > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) > > > > Yes, but the trouble is that there was a contrary viewpoint preferring that > > fd > > first be maintained as a convention like other syscalls (see the following > > posts) > > Of course the interface used by an application program would have the > fd first. Glibc can do the translation. I think that was understood. Regards Suparna > > Paul. -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Suparna Bhattacharya writes: > > This looks like it will have the same problem on s390 as > > sys_sync_file_range. Maybe the prototype should be: > > > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) > > Yes, but the trouble is that there was a contrary viewpoint preferring that fd > first be maintained as a convention like other syscalls (see the following > posts) Of course the interface used by an application program would have the fd first. Glibc can do the translation. Paul. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote: > Andrew Morton writes: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> > > wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t > > > len) > > > > Please add a comment over this function which specifies its behaviour. > > Really it should be enough material from which a full manpage can be > > written. > > This looks like it will have the same problem on s390 as > sys_sync_file_range. Maybe the prototype should be: > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy) So we are kind of deadlocked, aren't we ? The debates on the proposed solution for s390 http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2 http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2 http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2 Are there any better ideas ? Regards Suparna > > Paul. > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Jakub Jelinek wrote: is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is "shall fail". But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and "required storage for regular file data starting at offset and continuing for len bytes" doesn't make sense for negative size. This wording has already been cleaned up. The current draft for the next revision reads: [EINVAL] The len argument is less than or equal to zero, or the offset argument is less than zero, or the underlying file system does not support this operation. I still don't like it since len==0 shouldn't create an error (it's inconsistent) but len<0 is already outlawed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the > above > check was relaxed? I think we may relax the check here and let the individual file system decide if they support preallocation for directories or not. What do you think ? One thing to be thought in this case is the error code which should be returned by the file system implementation, incase it doesn't support preallocation for directories. Should it be -ENODEV (to match with what posix says) , or something else (which might make more sense in this case) ? -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Andrew, Thanks for the review comments! On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. > > If that's all too much, this material should at least be spelled out in the > changelog. Because there's no way in which this change can be fully > reviewed unless someone (ie: you) tells us what it is setting out to > achieve. > > If we 100% implement some standard then a URL for what we claim to > implement would suffice. Given that we're at least using different types from > posix I doubt if such a thing would be sufficient. > > And given the complexity and potential variability within the filesystem > implementations of this, I'd expect that _something_ additional needs to be > said? Ok. I will add a detailed comment here. > > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I think we should go ahead with current glibc implementation (which Jakub poited at) of not allowing a negative 'len', since posix also doesn't explicitly say anything about allowing negative 'len'. > > > + ret = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + if (!(file->f_mode & FMODE_WRITE)) > > + goto out_fput; > > + > > + inode = file->f_path.dentry->d_inode; > > + > > + ret = -ESPIPE; > > + if (S_ISFIFO(inode->i_mode)) > > + goto out_fput; > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. True. > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment > here would settle the reader's mind. Ok. I will add a check here for wrap though zero. > > + if (inode->i_op && inode->i_op->fallocate) > > + ret = inode->i_op->fallocate(inode, mode, offset, len); > > + else > > + ret = -ENOSYS; > > If we _are_ going to support negative `len', as posix suggests, I think we > should perform the appropriate sanity conversions to `offset' and `len' > right here, rather than expecting each filesystem to do it. > > If we're not going to handle negative `len' then we should check for it. Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of the function). > > +out_fput: > > + fput(file); > > +out: > > + return ret; > > +} > > +EXPORT_SYMBOL(sys_fallocate); > > I don't believe this needs to be exported to modules? Ok. Will remove it. > > +/* > > + * fallocate() modes > > + */ > > +#define FA_ALLOCATE0x1 > > +#define FA_DEALLOCATE 0x2 > > Now those aren't in posix. They should be documented, along with their > expected semantics. Will add a comment describing the role of these modes. > > #ifdef __KERNEL__ > > > > #include > > @@ -1125,6 +1131,7 @@ struct inode_operations { > > ssize_t (*listxattr) (struct dentry *, char *, size_t); > > int (*removexattr) (struct dentry *, const char *); > > void (*truncate_range)(struct inode *, loff_t, loff_t); > > + long (*fallocate)(struct inode *, int, loff_t, loff_t); > > I really do think it's better to put the variable names in definitions such > as this. Especially when we have two identically-typed variables next to > each other like that. Quick: which one is the offset and which is the > length? Ok. Will add the variable names here. -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[EMAIL PROTECTED]> wrote: > > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> > > > wrote: > > > > > > > This patch implements the fallocate() system call and adds support for > > > > i386, x86_64 and powerpc. > > > > > > > > ... > > > > +{ > > > > + struct file *file; > > > > + struct inode *inode; > > > > + long ret = -EINVAL; > > > > + > > > > + if (len == 0 || offset < 0) > > > > + goto out; > > > > > > The posix spec implies that negative `len' is permitted - presumably > > > "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. > > > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. I don't think we should care. If we provide a syscall with the semantics of "allocate from offset to offset+len" then glibc's implementation can turn negative length into two separate fallocate syscalls > > > > + ret = -ENODEV; > > > > + if (!S_ISREG(inode->i_mode)) > > > > + goto out_fput; > > > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > > seems a bit silly of them. > > > > H - I thought that the intention of sys_fallocate() was to > > be generic enough to eventually allow preallocation on directories. > > If that is the case, then this check will prevent that > > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the > above > check was relaxed? No - right now empty blocks are pruned from the directory immediately so I don't think we really have a concept of empty blocks in the btree structure. dir2 is bloody complex, so adding preallocation is probably not going to be simple to do. It's not high on my list to add, either, because we can typically avoid the worst case directory fragmentation by using larger directory block sizes (e.g. 16k instead of the default 4k on a 4k block size fs). IIRC directory preallocation has been talked about more for ext3/4 Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > > > The posix spec implies that negative `len' is permitted - presumably > > > "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. That describes the current glibc implementation. > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. > > However it would be neat if someone could test $OTHER_OS and, perhaps more > importantly, the present glibc emulation (which I assume your manpage is > referring to, so this would be a manpage test ;)). int posix_fallocate (int fd, __off_t offset, __off_t len) { struct stat64 st; struct statfs f; /* `off_t' is a signed type. Therefore we can determine whether OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; if (offset + len < 0) return EFBIG; /* First thing we have to make sure is that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) return ESPIPE; if (! S_ISREG (st.st_mode)) return ENODEV; if (len == 0) { if (st.st_size < offset) { int ret = __ftruncate (fd, offset); if (ret != 0) ret = errno; return ret; } return 0; } ... is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is "shall fail". But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and "required storage for regular file data starting at offset and continuing for len bytes" doesn't make sense for negative size. And given the general "Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring." I believe returning EINVAL for len < 0 is not a POSIX violation. That doesn't mean the standard shouldn't be clarified, whether by saying EINVAL must be returned for non-positive len or saying that using negative len has undefined or implementation defined behavior. > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. No, see above. Jakub - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[EMAIL PROTECTED]> wrote: > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> > > wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > +{ > > > + struct file *file; > > > + struct inode *inode; > > > + long ret = -EINVAL; > > > + > > > + if (len == 0 || offset < 0) > > > + goto out; > > > > The posix spec implies that negative `len' is permitted - presumably > > "allocate > > ahead of `offset'". How peculiar. > > I just checked the man page for posix_fallocate() and it says: > > EINVAL offset or len was less than zero. > > We should probably follow this lead. Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. However it would be neat if someone could test $OTHER_OS and, perhaps more importantly, the present glibc emulation (which I assume your manpage is referring to, so this would be a manpage test ;)). > > > + > > > + ret = -ENODEV; > > > + if (!S_ISREG(inode->i_mode)) > > > + goto out_fput; > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > seems a bit silly of them. > > H - I thought that the intention of sys_fallocate() was to > be generic enough to eventually allow preallocation on directories. > If that is the case, then this check will prevent that The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. H - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Andrew Morton writes: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Paul. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton <[EMAIL PROTECTED]> wrote: > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. But it doesn't handle offset+len wrapping through zero. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > This patch implements the fallocate() system call and adds support for > i386, x86_64 and powerpc. > > ... > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. If that's all too much, this material should at least be spelled out in the changelog. Because there's no way in which this change can be fully reviewed unless someone (ie: you) tells us what it is setting out to achieve. If we 100% implement some standard then a URL for what we claim to implement would suffice. Given that we're at least using different types from posix I doubt if such a thing would be sufficient. And given the complexity and potential variability within the filesystem implementations of this, I'd expect that _something_ additional needs to be said? > +{ > + struct file *file; > + struct inode *inode; > + long ret = -EINVAL; > + > + if (len == 0 || offset < 0) > + goto out; The posix spec implies that negative `len' is permitted - presumably "allocate ahead of `offset'". How peculiar. > + ret = -EBADF; > + file = fget(fd); > + if (!file) > + goto out; > + if (!(file->f_mode & FMODE_WRITE)) > + goto out_fput; > + > + inode = file->f_path.dentry->d_inode; > + > + ret = -ESPIPE; > + if (S_ISFIFO(inode->i_mode)) > + goto out_fput; > + > + ret = -ENODEV; > + if (!S_ISREG(inode->i_mode)) > + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. > + ret = -EFBIG; > + if (offset + len > inode->i_sb->s_maxbytes) > + goto out_fput; This code does handle offset+len going negative, but only by accident, I suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment here would settle the reader's mind. > + if (inode->i_op && inode->i_op->fallocate) > + ret = inode->i_op->fallocate(inode, mode, offset, len); > + else > + ret = -ENOSYS; If we _are_ going to support negative `len', as posix suggests, I think we should perform the appropriate sanity conversions to `offset' and `len' right here, rather than expecting each filesystem to do it. If we're not going to handle negative `len' then we should check for it. > +out_fput: > + fput(file); > +out: > + return ret; > +} > +EXPORT_SYMBOL(sys_fallocate); I don't believe this needs to be exported to modules? > +/* > + * fallocate() modes > + */ > +#define FA_ALLOCATE 0x1 > +#define FA_DEALLOCATE0x2 Now those aren't in posix. They should be documented, along with their expected semantics. > #ifdef __KERNEL__ > > #include > @@ -1125,6 +1131,7 @@ struct inode_operations { > ssize_t (*listxattr) (struct dentry *, char *, size_t); > int (*removexattr) (struct dentry *, const char *); > void (*truncate_range)(struct inode *, loff_t, loff_t); > + long (*fallocate)(struct inode *, int, loff_t, loff_t); I really do think it's better to put the variable names in definitions such as this. Especially when we have two identically-typed variables next to each other like that. Quick: which one is the offset and which is the length? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. NOTE: It is based on 2.6.21 kernel version. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- arch/i386/kernel/syscall_table.S |1 arch/powerpc/kernel/sys_ppc32.c |7 ++ arch/x86_64/kernel/functionlist |1 fs/open.c| 41 +++ include/asm-i386/unistd.h|3 +- include/asm-powerpc/systbl.h |1 include/asm-powerpc/unistd.h |3 +- include/asm-x86_64/unistd.h |4 ++- include/linux/fs.h |7 ++ include/linux/syscalls.h |1 10 files changed, 66 insertions(+), 3 deletions(-) Index: linux-2.6.21/arch/i386/kernel/syscall_table.S === --- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.21/arch/i386/kernel/syscall_table.S @@ -319,3 +319,4 @@ ENTRY(sys_call_table) .long sys_move_pages .long sys_getcpu .long sys_epoll_pwait + .long sys_fallocate /* 320 */ Index: linux-2.6.21/arch/x86_64/kernel/functionlist === --- linux-2.6.21.orig/arch/x86_64/kernel/functionlist +++ linux-2.6.21/arch/x86_64/kernel/functionlist @@ -931,6 +931,7 @@ *(.text.sys_getitimer) *(.text.sys_getgroups) *(.text.sys_ftruncate) +*(.text.sys_fallocate) *(.text.sysfs_lookup) *(.text.sys_exit_group) *(.text.stub_fork) Index: linux-2.6.21/fs/open.c === --- linux-2.6.21.orig/fs/open.c +++ linux-2.6.21/fs/open.c @@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned } #endif +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset < 0) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + if (!S_ISREG(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + if (offset + len > inode->i_sb->s_maxbytes) + goto out_fput; + + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; +out_fput: + fput(file); +out: + return ret; +} +EXPORT_SYMBOL(sys_fallocate); + /* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and Index: linux-2.6.21/include/asm-i386/unistd.h === --- linux-2.6.21.orig/include/asm-i386/unistd.h +++ linux-2.6.21/include/asm-i386/unistd.h @@ -325,10 +325,11 @@ #define __NR_move_pages317 #define __NR_getcpu318 #define __NR_epoll_pwait 319 +#define __NR_fallocate 320 #ifdef __KERNEL__ -#define NR_syscalls 320 +#define NR_syscalls 321 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/asm-powerpc/systbl.h === --- linux-2.6.21.orig/include/asm-powerpc/systbl.h +++ linux-2.6.21/include/asm-powerpc/systbl.h @@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list) COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) +COMPAT_SYS(fallocate) Index: linux-2.6.21/include/asm-powerpc/unistd.h === --- linux-2.6.21.orig/include/asm-powerpc/unistd.h +++ linux-2.6.21/include/asm-powerpc/unistd.h @@ -326,10 +326,11 @@ #define __NR_move_pages301 #define __NR_getcpu302 #define __NR_epoll_pwait 303 +#define __NR_fallocate 304 #ifdef __KERNEL__ -#define __NR_syscalls 304 +#define __NR_syscalls 305 #define __NR__exit __NR_exit #define NR_syscalls__NR_syscalls Index: linux-2.6.21/include/asm-x86_64/unistd.h === --- linux-2.6.21.orig/include/asm-x86_64/unistd.h +++ linux-2.6.21/include/asm-x86_64/unistd.h @@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_fallocate 280 +__SYSCALL(__NR_fallocate, sys_fallocate) -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_fallocate #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/l