Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote: On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote: Well, if you see the modes proposed using above flags : #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence prealloction will never delete data. This mode is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS aware applications/usage-scenarios. Sorry, but this doesn't make any sense. There is no need to put every feature in the XFS ioctls in the syscalls. The XFS ioctls will need to be supported forever anyway - as I suggested before they really should be moved to generic code. What needs to be supported is what makes sense as an interface. A punch a hole interface does make sense, but trying to hack this into a preallocation system call is just madness. We're not IRIX or windows that fit things into random subcall just because there was some space left to squeeze them in. FA_FL_NO_MTIME0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. HSMs needs this basically for every system call, which screams for an open flag like O_INVISIBLE anyway. Adding this in a generic way is a good idea, but hacking bits and pieces that won't fit into the global design is completely wrong. Why don't we just merge the interface for preallocation (essentially enough to satisfy posix_fallocate() and the simple XFS requirement for space reservation without changing file size), which there is clear agreement on (I hope :)). After all, this was all that we set out to do when we started. And leave all the dealloc/punch/hsm type features for separate future patches/ debates, those really shouldn't hold up the basic fallocate interface. I agree with Christoph that we are just diverging too much in trying to club those decisions here. Dave, Andreas, Ted ? Regards Suparna - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote: On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote: On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote: Well, if you see the modes proposed using above flags : #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence prealloction will never delete data. This mode is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS aware applications/usage-scenarios. Sorry, but this doesn't make any sense. There is no need to put every feature in the XFS ioctls in the syscalls. The XFS ioctls will need to be supported forever anyway - as I suggested before they really should be moved to generic code. What needs to be supported is what makes sense as an interface. A punch a hole interface does make sense, but trying to hack this into a preallocation system call is just madness. We're not IRIX or windows that fit things into random subcall just because there was some space left to squeeze them in. FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. HSMs needs this basically for every system call, which screams for an open flag like O_INVISIBLE anyway. Adding this in a generic way is a good idea, but hacking bits and pieces that won't fit into the global design is completely wrong. Why don't we just merge the interface for preallocation (essentially enough to satisfy posix_fallocate() and the simple XFS requirement for space reservation without changing file size), which there is clear agreement on (I hope :)). After all, this was all that we set out to do when we started. As you suggest, let us just have two modes for the time being: #define FALLOC_ALLOCATE 0x1 #define FALLOC_ALLOCATE_KEEP_SIZE 0x2 As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it will result in file size not being changed even if the preallocation is beyond EOF. And leave all the dealloc/punch/hsm type features for separate future patches/ debates, those really shouldn't hold up the basic fallocate interface. I agree. I agree with Christoph that we are just diverging too much in trying to club those decisions here. Dave, Andreas, Ted ? Regards Suparna -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote: Why don't we just merge the interface for preallocation (essentially enough to satisfy posix_fallocate() and the simple XFS requirement for space reservation without changing file size), which there is clear agreement on (I hope :)). After all, this was all that we set out to do when we started. And leave all the dealloc/punch/hsm type features for separate future patches/ debates, those really shouldn't hold up the basic fallocate interface. I agree with Christoph that we are just diverging too much in trying to club those decisions here. Dave, Andreas, Ted ? Sure. I'll just make XFS work with whatever it is that gets merged. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jul 12, 2007 at 11:13:34PM +1000, David Chinner wrote: On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote: Why don't we just merge the interface for preallocation (essentially enough to satisfy posix_fallocate() and the simple XFS requirement for space reservation without changing file size), which there is clear agreement on (I hope :)). After all, this was all that we set out to do when we started. And leave all the dealloc/punch/hsm type features for separate future patches/ debates, those really shouldn't hold up the basic fallocate interface. I agree with Christoph that we are just diverging too much in trying to club those decisions here. Dave, Andreas, Ted ? Sure. I'll just make XFS work with whatever it is that gets merged. Great. I will post the new patches soon. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Jul 12, 2007 13:56 +0530, Amit K. Arora wrote: As you suggest, let us just have two modes for the time being: #define FALLOC_ALLOCATE 0x1 #define FALLOC_ALLOCATE_KEEP_SIZE 0x2 As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it will result in file size not being changed even if the preallocation is beyond EOF. What does FALLOC_ALLOCATE mean vs. not passing this flag? I have no objection to this as long as the code remains with these as flags instead of modes... Essentially just dropping the FALLOC_FL_DEALLOCATE and FALLOC_FL_DEL_DATA from the interface. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote: Well, if you see the modes proposed using above flags : #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence prealloction will never delete data. This mode is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS aware applications/usage-scenarios. Sorry, but this doesn't make any sense. There is no need to put every feature in the XFS ioctls in the syscalls. The XFS ioctls will need to be supported forever anyway - as I suggested before they really should be moved to generic code. What needs to be supported is what makes sense as an interface. A punch a hole interface does make sense, but trying to hack this into a preallocation system call is just madness. We're not IRIX or windows that fit things into random subcall just because there was some space left to squeeze them in. FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. HSMs needs this basically for every system call, which screams for an open flag like O_INVISIBLE anyway. Adding this in a generic way is a good idea, but hacking bits and pieces that won't fit into the global design is completely wrong. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote: Given the current behaviour for posix_fallocate() in glibc, I think that retaining the same error semantic and punting the cleanup to userspace (where the app will fail with ENOSPC anyway) is the only sane thing we can do here. Trying to undo this in the kernel leads to lots of extra rarely used code in error handling paths... Agreed, looks like we should stay with the user has to clean up behaviour. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote: We use this capability in XFS at the moment. I think this is mainly for DMF (HSM) but is done via the xfs handle interface (xfs_open_by_handle) AFAICT. You're not :) You're using an O_INVIBLE equivalent (as described below), which would be a useful thing to have at the VFS level, but adding hacks to some system calls only wouldn't help any HSM system. It's just useless API clutter. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Sat, Jun 30, 2007 at 12:52:46PM -0400, Andreas Dilger wrote: The @mode flags that are currently under consideration are (AFAIK): FA_FL_DEALLOC 0x01 /* deallocate unwritten extent (default allocate) */ FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */ FA_FL_DEL_DATA0x04 /* delete existing data in alloc range (default keep) */ We now have two sets of flags - 1) the above three with which I think no one has any issues with, and 2) the ones below, for which we need some discussions before finalizing on them. I will prefer fallocate going in mainline with the above three modes, and rest of the modes can be debated upon and discussed parallely. And, each new mode/flag can be pushed as a separate patch. This will not hold fallocate feature indefinitely... Please confirm if you find this approach ok. Otherwise, please object. Thanks! FA_FL_ERR_FREE0x08 /* free preallocation on error (default keep prealloc) */ FA_FL_NO_MTIME0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME0x20 /* keep same ctime (default change on size, data change) */ -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote: FA_FL_DEALLOC 0x01 /* deallocate unwritten extent (default allocate) */ FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */ FA_FL_DEL_DATA 0x04 /* delete existing data in alloc range (default keep) */ We now have two sets of flags - 1) the above three with which I think no one has any issues with, and Yes, I do. FA_FL_DEL_DATA is plain stupid, a preallocation call should never delete data. FA_FL_DEALLOC should probably be a separate syscall because it's very different functionality. While we're at it I also dislike the FA_ prefix becuase it doesn't say anything and is far too generic. FALLOC_ is much better. FA_FL_ERR_FREE 0x08 /* free preallocation on error (default keep prealloc) */ NACK on this one. We should have just one behaviour, and from the thread that not freeing the allocation on error. FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 11:31:07AM +0100, Christoph Hellwig wrote: On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote: FA_FL_DEALLOC 0x01 /* deallocate unwritten extent (default allocate) */ FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */ FA_FL_DEL_DATA0x04 /* delete existing data in alloc range (default keep) */ We now have two sets of flags - 1) the above three with which I think no one has any issues with, and Yes, I do. FA_FL_DEL_DATA is plain stupid, a preallocation call should never delete data. FA_FL_DEALLOC should probably be a separate syscall because it's very different functionality. Well, if you see the modes proposed using above flags : #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence prealloction will never delete data. This mode is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS aware applications/usage-scenarios. And, regarding FA_FL_DEALLOC being a separate syscall - I think then the very purpose of @mode argument is not justified. We have this mode so that we can provide more features like this. That said, I don't say that we should make things very complicated; but, atleast we should provide some basic features which we expect most of the applications wanting preallocation to use. To start with, we need to cater to already existing applications/user base who use XFS preallocation feature. And further advanced features, like goal based preallocation, can be implemented as a separate syscall. While we're at it I also dislike the FA_ prefix becuase it doesn't say anything and is far too generic. FALLOC_ is much better. Ok. This can be changed in the next take. FA_FL_ERR_FREE0x08 /* free preallocation on error (default keep prealloc) */ NACK on this one. We should have just one behaviour, and from the thread that not freeing the allocation on error. I agree on this one. FA_FL_NO_MTIME0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
Amit K. Arora wrote: FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. We use this capability in XFS at the moment. I think this is mainly for DMF (HSM) but is done via the xfs handle interface (xfs_open_by_handle) AFAICT. This sets up a set of invisible operations (xfs_invis_file_operations). xfs_file_ioctl_invis goes on to set IO_INVIS which goes on to set ATTR_DMI which is then tested in xfs_change_file_space() (which handles XFS_IOC_RESVSP friends) for whether xfs_ichgtime(ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG) is called or not. --Tim - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote: On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote: On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote: Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. I can't find anything in the specification of posix_fallocate (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html) that tells what should happen to allocate blocks on error. Yeah, and AFAICT glibc leaves them behind ATM. Yes, it does. But common sense would be to not leak disk space on failure of this syscall, and this definitively should not be left up to the filesystem, either we always leak it or always free it, and I'd strongly favour the latter variant. I would not call it a leak, since the blocks which got allocated as part of the partial success of the fallocate syscall can be strictly accounted for (i.e. they are assigned to a particular inode). And these can be freed by the application, using a suitable @mode of fallocate. We can't simply walk the range an remove unwritten extents, as some of them may have been present before the fallocate() call. That makes it extremely difficult to undo a failed call and not remove more pre-existing pre-allocations. Same is true for ext4 too. It is very difficult to keep track of which uninitialized (unwritten) extents got allocated as part of the current syscall. This is because, as David mentions, some of them might be already present; and also because some of the older ones may have got merged with the *new* uninitialized/unwritten extents as part of the current syscall. Given the current behaviour for posix_fallocate() in glibc, I think that retaining the same error semantic and punting the cleanup to userspace (where the app will fail with ENOSPC anyway) is the only sane thing we can do here. Trying to undo this in the kernel leads to lots of extra rarely used code in error handling paths... Right. This gives applications the free hand if they really want to use the partially preallocated space, OR they want to free it; without introducing additional complexity in the kernel. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Jun 30, 2007 11:21 +0100, Christoph Hellwig wrote: On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote: Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. I can't find anything in the specification of posix_fallocate (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html) that tells what should happen to allocate blocks on error. But common sense would be to not leak disk space on failure of this syscall, and this definitively should not be left up to the filesystem, either we always leak it or always free it, and I'd strongly favour the latter variant. I definitely agree that the behaviour should be specified part of the interface. The current behaviour of both ext4 and XFS is that the successful part of the unallocated extent is left in place when returning ENOSPC so we considered this the consistent behaviour. This is the same as e.g. sys_write() which does not remove the part of the write that was successful if ENOSPC is hit. I think this also makes sense for some usa cases, because application like PVR may want to preallocate approximately 30min of space, but if it gets only 25min worth then it can at least start using this while it also begins looking for and/or freeing old files. If the space is always freed on ENOSPC, then there may be a significant amount of work done and undone while the application is iterating over possible sizes until one works. It is easy for the application to use fstat() to see the blocks/size actually preallocated on failure, and explicitly request unallocation of this space if the outcome is undesirable. If you think that applications have a strong preference for both kinds of behaviour (e.g. database which requires the full allocation to succeed, unlike PVR application above) then this could be encoded into a @mode flag. For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we don't want to expose uninitialized disk blocks to userspace. I'm not sure if this makes sense at all. This is the xfs unwritten extent behaviour. But anyway, the important bit is uninitialized blocks should never ever leak to userspace, so there is not need for the flag. I agree that we shouldn't need FA_ZERO_SPACE. If an application wants explicit zeros written to disk it can just do this with O_DIRECT writes or similar. The more I think about it the more I'd prefer we would just put a simple syscall in that implements nothing but the posix_fallocate(3) semantics as defined in SuS, and then go on to brainstorm about advanced preallocation / layout hint semantics. I don't think the current @mode flags introduce any significant complexity in the implementation, and in fact one of the reasons these came up in the first place was because David pointed out the XFS behaviour did NOT match with posix_fallocate() and we started getting strange semantics enforced by monolithic modes. IMHO, coding for and understanding the semantics of the monolithic modes is much more complex and less useful than the explicit flags. The @mode flags that are currently under consideration are (AFAIK): FA_FL_DEALLOC 0x01 /* deallocate unwritten extent (default allocate) */ FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */ FA_FL_DEL_DATA 0x04 /* delete existing data in alloc range (default keep) */ Your concern about leaking space would imply: FA_FL_ERR_FREE 0x08 /* free preallocation on error (default keep prealloc) */ The other possible flags that were proposed, to avoid confusing backup and HSM applications when preallocated space is added or removed from a file (you don't want a backup app to re-backup a file that was migrated via HSM): FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote: On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote: Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. I can't find anything in the specification of posix_fallocate (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html) that tells what should happen to allocate blocks on error. Yeah, and AFAICT glibc leaves them behind ATM. But common sense would be to not leak disk space on failure of this syscall, and this definitively should not be left up to the filesystem, either we always leak it or always free it, and I'd strongly favour the latter variant. We can't simply walk the range an remove unwritten extents, as some of them may have been present before the fallocate() call. That makes it extremely difficult to undo a failed call and not remove more pre-existing pre-allocations. Given the current behaviour for posix_fallocate() in glibc, I think that retaining the same error semantic and punting the cleanup to userspace (where the app will fail with ENOSPC anyway) is the only sane thing we can do here. Trying to undo this in the kernel leads to lots of extra rarely used code in error handling paths... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote: Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. I can't find anything in the specification of posix_fallocate (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html) that tells what should happen to allocate blocks on error. But common sense would be to not leak disk space on failure of this syscall, and this definitively should not be left up to the filesystem, either we always leak it or always free it, and I'd strongly favour the latter variant. For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we don't want to expose uninitialized disk blocks to userspace. I'm not sure if this makes sense at all. I don't think we need to make it default - atleast for filesystems which have a mechanism to distinguish preallocated blocks from regular ones. In ext4, for example, we will have a way to mark uninitialized extents. All the preallocated blocks will be part of these uninitialized extents. And any read on these extents will treat them as a hole, returning zeroes to user land. Thus any existing data on uninitialized blocks will not be exposed to the userspace. This is the xfs unwritten extent behaviour. But anyway, the important bit is uninitialized blocks should never ever leak to userspace, so there is not need for the flag. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote: On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote: On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. Since I believe the XFS allocation ioctls do it the opposite way (free preallocated space on error) this should be encoded into the flags. Having it filesystem dependent just means that nobody will be happy. No, XFs does not free preallocated space on error. it is up to the application to clean up. Since XFS also does not free preallocated space on error and this behavior is inline with dd, posix_fallocate() and the current ext4 implementation, do we still need FA_FL_FREE_ENOSPC flag ? What I mean is that any data read from the file should have the appearance of being zeroed (whether zeroes are actually written to disk or not). What I _think_ David is proposing is to allow fallocate() to return without marking the blocks even uninitialized and subsequent reads would return the old data from the disk. Correct, but for swap files that's not an issue - no user should be able too read them, and FA_MKSWAP would really need root privileges to execute. Will the FA_MKSWAP mode still be required with your suggested change of teaching do_mpage_readpage() about unwritten extents being in place ? Or, will you still like to have FA_MKSWAP mode ? -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, 2007-06-28 at 23:49 +0530, Amit K. Arora wrote: Correct, but for swap files that's not an issue - no user should be able too read them, and FA_MKSWAP would really need root privileges to execute. Will the FA_MKSWAP mode still be required with your suggested change of teaching do_mpage_readpage() about unwritten extents being in place ? Or, will you still like to have FA_MKSWAP mode ? There's no need for a MKSWAP flag. cheers. -- Nathan - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jun 28, 2007 at 11:49:13PM +0530, Amit K. Arora wrote: On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote: On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote: On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. Since I believe the XFS allocation ioctls do it the opposite way (free preallocated space on error) this should be encoded into the flags. Having it filesystem dependent just means that nobody will be happy. No, XFs does not free preallocated space on error. it is up to the application to clean up. Since XFS also does not free preallocated space on error and this behavior is inline with dd, posix_fallocate() and the current ext4 implementation, do we still need FA_FL_FREE_ENOSPC flag ? Not at the moment. What I mean is that any data read from the file should have the appearance of being zeroed (whether zeroes are actually written to disk or not). What I _think_ David is proposing is to allow fallocate() to return without marking the blocks even uninitialized and subsequent reads would return the old data from the disk. Correct, but for swap files that's not an issue - no user should be able too read them, and FA_MKSWAP would really need root privileges to execute. Will the FA_MKSWAP mode still be required with your suggested change of teaching do_mpage_readpage() about unwritten extents being in place ? Or, will you still like to have FA_MKSWAP mode ? budgie:/mnt/test # xfs_io -f -c resvsp 0 1048576 -c truncate 1048576 swap_file budgie:/mnt/test # mkswap swap_file Setting up swapspace version 1, size = 1032 kB budgie:/mnt/test # swapon -v swap_file swapon on swap_file budgie:/mnt/test # swapon -s FilenameTypeSizeUsedPriority /dev/sda2 partition 9437152 0 -1 /mnt/test/swap_file file992 0 -2 budgie:/mnt/test # xfs_bmap -vvp swap_file swap_file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSETTOTAL FLAGS 0: [0..31]: 96..127 0 (96..127) 32 1: [32..2047]: 128..2143 0 (128..2143) 2016 1 FLAG Values: 01 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 10 Doesn't begin on stripe width 01 Doesn't end on stripe width Looks like the changes work, so FA_MKSWAP is not necessary for XFS. We can drop that for the moment unless anyone else sees a need for it. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:49:15PM -0400, Andreas Dilger wrote: On Jun 27, 2007 09:14 +1000, David Chinner wrote: Someone on the XFs list had an interesting request - preallocated swap files. You can't use unwritten extents for this because of sys_swapon()s use of bmap() (XFS returns holes for reading unwritten extents), so we need a method of preallocating that does not zero or mark the extent unread. i.e. FA_MKSWAP. Is there a reason why unwritten extents return 0 to bmap()? It's a fallout of xfs_get_blocks not mapping unwritten extents on read because we want do_mpage_readpage() to treat them as a hole. i.e. zero fill them instead of doing I/O. This is the way XFS was shoehorned into the generic read path :/ This would seem to be the only impediment from using fallocated files for swap files. Maybe if FIEMAP was used by mkswap to get an UNWRITTEN flag back instead of HOLE it wouldn't be a problem. Probably. If we taught do_mpage_readpage() about unwritten mappings, then would could map them on read if and then sys_swapon can remain blissfully unaware of unwritten extents. I think this is pretty much all I need to do to acheive that is (untested): --- Teach do_mpage_readpage() about unwritten extents so we can always map them in get_blocks rather than they are are holes on read. Allows setup_swap_extents() to use preallocated files on XFS filesystems for swap files without ever needing to convert them. Signed-Off-By: Dave Chinner [EMAIL PROTECTED] --- fs/mpage.c |5 +++-- fs/xfs/linux-2.6/xfs_aops.c | 13 +++-- 2 files changed, 6 insertions(+), 12 deletions(-) Index: 2.6.x-xfs-new/fs/mpage.c === --- 2.6.x-xfs-new.orig/fs/mpage.c 2007-05-29 16:17:59.0 +1000 +++ 2.6.x-xfs-new/fs/mpage.c2007-06-27 22:39:35.568852270 +1000 @@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc * Map blocks using the result from the previous get_blocks call first. */ nblocks = map_bh-b_size blkbits; - if (buffer_mapped(map_bh) block_in_file *first_logical_block + if (buffer_mapped(map_bh) !buffer_unwritten(map_bh) + block_in_file *first_logical_block block_in_file (*first_logical_block + nblocks)) { unsigned map_offset = block_in_file - *first_logical_block; unsigned last = nblocks - map_offset; @@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc *first_logical_block = block_in_file; } - if (!buffer_mapped(map_bh)) { + if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) { fully_mapped = 0; if (first_hole == blocks_per_page) first_hole = page_block; Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-06-05 22:14:39.0 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-06-27 22:39:29.545636749 +1000 @@ -1340,16 +1340,9 @@ __xfs_get_blocks( return 0; if (iomap.iomap_bn != IOMAP_DADDR_NULL) { - /* -* For unwritten extents do not report a disk address on -* the read case (treat as if we're reading into a hole). -*/ - if (create || !(iomap.iomap_flags IOMAP_UNWRITTEN)) { - xfs_map_buffer(bh_result, iomap, offset, - inode-i_blkbits); - } - if (create (iomap.iomap_flags IOMAP_UNWRITTEN)) { - if (direct) + xfs_map_buffer(bh_result, iomap, offset, inode-i_blkbits); + if (iomap.iomap_flags IOMAP_UNWRITTEN) { + if (create direct) bh_result-b_private = inode; set_buffer_unwritten(bh_result); } Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, 2007-06-27 at 23:36 +1000, David Chinner wrote: Allows setup_swap_extents() to use preallocated files on XFS filesystems for swap files without ever needing to convert them. Using unwritten extents (as opposed to the MKSWAP flag mentioned earlier) has the unfortunate down side of requiring transactions, possibly additional IO, and memory allocation during swap. (but, this patch should probably go in regardless, as teaching generic code about unwritten extents is not a bad idea). cheers. -- Nathan - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jun 28, 2007 at 09:28:36AM +1000, Nathan Scott wrote: On Wed, 2007-06-27 at 23:36 +1000, David Chinner wrote: Allows setup_swap_extents() to use preallocated files on XFS filesystems for swap files without ever needing to convert them. Using unwritten extents (as opposed to the MKSWAP flag mentioned earlier) has the unfortunate down side of requiring transactions, possibly additional IO, and memory allocation during swap. (but, this patch should probably go in regardless, as teaching generic code about unwritten extents is not a bad idea). I don't think it does - swapfile I/O looks like it goes direct to bio without passing through the filesystem. When the swapfile is mapped, it scans and records the extent map of the entire swapfile in a separate structure and AFAICT the swap code uses that built map without touching the filesystem at all. If that is true then the written/unwritten state of the extents is irrelevant; all we need is allocated disk space for the file and swapping should work. And it's not like anyone should be reading the contents of that swapfile through the filesystem, either. ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, 2007-06-28 at 10:39 +1000, David Chinner wrote: I don't think it does - swapfile I/O looks like it goes direct to bio without passing through the filesystem. When the swapfile is mapped, it scans and records the extent map of the entire swapfile in a separate structure and AFAICT the swap code uses that built map without touching the filesystem at all. If that is true then the written/unwritten state of the extents is irrelevant; all we need is allocated disk space for the file and swapping should work. And it's not like anyone should be reading the contents of that swapfile through the filesystem, either. ;) Ah, yes, good point - thats true. Unwritten extents are ideal for this then, as attempts to read swap via the regular interfaces will return zeros instead of random swapped out memory contents. cheers. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote: I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post. If it is decided that these flags are also needed, I will update this patch. Thanks! Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we don't want to expose uninitialized disk blocks to userspace. I'm not sure if this makes sense at all. I don't think we need to make it default - atleast for filesystems which have a mechanism to distinguish preallocated blocks from regular ones. In ext4, for example, we will have a way to mark uninitialized extents. All the preallocated blocks will be part of these uninitialized extents. And any read on these extents will treat them as a hole, returning zeroes to user land. Thus any existing data on uninitialized blocks will not be exposed to the userspace. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote: On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote: +#define FA_FL_DEALLOC 0x01 /* default is allocate */ +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */ +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */ In XFS one of the (many) ALLOC modes is to zero existing data on allocate. For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on each extent. For some workloads this would be much faster than truncate and reallocate of all the blocks in a file. In ext4, we already mark each extent having preallocated blocks as uninitialized. This is done as part of following code (which is part of patch 5/7) in ext4_ext_get_blocks() : @@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle /* try to insert new extent into found leaf and return */ ext4_ext_store_pblock(newex, newblock); newex.ee_len = cpu_to_le16(allocated); + if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */ + ext4_ext_mark_uninitialized(newex); err = ext4_ext_insert_extent(handle, inode, path, newex); if (err) { /* free data blocks we just allocated */ In that light, please change the comment to /* default is keep existing data */ so that it doesn't imply this is only for DEALLOC. Ok. Will update the comment. Thanks! -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote: On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote: In XFS one of the (many) ALLOC modes is to zero existing data on allocate. For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on each extent. For some workloads this would be much faster than truncate and reallocate of all the blocks in a file. In ext4, we already mark each extent having preallocated blocks as uninitialized. This is done as part of following code (which is part of patch 5/7) in ext4_ext_get_blocks() : What I meant is that with XFS_IOC_ALLOCSP the previously-written data is ZEROED OUT, unlike with fallocate() which leaves previously-written data alone and only allocates in holes. So, if you had a sparse file with some data in it: A BB fallocate() would allocate the holes: 0A0BB XFS_IOC_ALLOCSP would overwrite everything: 0 In order to specify this for allocation, FA_FL_DEL_DATA would need to make sense for allocations (as well as the deallocation). This is farily easy to do - just mark all of the existing extents as unallocated, and their data disappears. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote: On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote: On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. Since I believe the XFS allocation ioctls do it the opposite way (free preallocated space on error) this should be encoded into the flags. Having it filesystem dependent just means that nobody will be happy. Ok, got your point. Maybe we can have a flag for this, as you suggested. But, default behavior IMHO should be _not_ to undo partial allocation (thus the file system will have the option of supporting this flag or not and it will be inline with posix_fallocate; XFS will obviously like to support this flag, inline with its existing behavior). For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we don't want to expose uninitialized disk blocks to userspace. I'm not sure if this makes sense at all. I don't think we need to make it default - atleast for filesystems which have a mechanism to distinguish preallocated blocks from regular ones. What I mean is that any data read from the file should have the appearance of being zeroed (whether zeroes are actually written to disk or not). What I _think_ David is proposing is to allow fallocate() to return without marking the blocks even uninitialized and subsequent reads would return the old data from the disk. I can't think of a good reason for this (i.e. returning stale data from preallocated blocks). It is infact a security issue to me. Anyhow, this may though be beneficial for file systems which have noticable overhead in marking the blocks uninitialized/preallocated. Can you or David please throw some light on how this option might really be helpful ? Thanks! -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote: On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote: On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote: In XFS one of the (many) ALLOC modes is to zero existing data on allocate. For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on each extent. For some workloads this would be much faster than truncate and reallocate of all the blocks in a file. In ext4, we already mark each extent having preallocated blocks as uninitialized. This is done as part of following code (which is part of patch 5/7) in ext4_ext_get_blocks() : What I meant is that with XFS_IOC_ALLOCSP the previously-written data is ZEROED OUT, unlike with fallocate() which leaves previously-written data alone and only allocates in holes. In order to specify this for allocation, FA_FL_DEL_DATA would need to make sense for allocations (as well as the deallocation). This is farily easy to do - just mark all of the existing extents as unallocated, and their data disappears. Ok, agreed. Will add the FA_ZERO_SPACE mode too. Thanks! -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote: I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post. If it is decided that these flags are also needed, I will update this patch. Thanks! Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we don't want to expose uninitialized disk blocks to userspace. I'm not sure if this makes sense at all. Someone on the XFs list had an interesting request - preallocated swap files. You can't use unwritten extents for this because of sys_swapon()s use of bmap() (XFS returns holes for reading unwritten extents), so we need a method of preallocating that does not zero or mark the extent unread. i.e. FA_MKSWAP. I thinkthis would be: #define FA_FL_NO_ZERO_SPACE 0x08/* default is to zero space */ #define FA_MKSWAP (FA_ALLOCATE | FA_FL_NO_ZERO_SPACE) That way we can allocate large swap files that don't need zeroing in a single, fast operation, and hence potentially bring new swap space online without needed very much memory at all (i.e. should succeed in most near-OOM conditions). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote: On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote: On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and posix_fallocate, which also do not free the partially allocated space. Since I believe the XFS allocation ioctls do it the opposite way (free preallocated space on error) this should be encoded into the flags. Having it filesystem dependent just means that nobody will be happy. No, XFs does not free preallocated space on error. it is up to the application to clean up. What I mean is that any data read from the file should have the appearance of being zeroed (whether zeroes are actually written to disk or not). What I _think_ David is proposing is to allow fallocate() to return without marking the blocks even uninitialized and subsequent reads would return the old data from the disk. Correct, but for swap files that's not an issue - no user should be able too read them, and FA_MKSWAP would really need root privileges to execute. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote: On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote: +#define FA_FL_DEALLOC 0x01 /* default is allocate */ +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */ +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */ In XFS one of the (many) ALLOC modes is to zero existing data on allocate. No, none of the XFS allocation modes do that. XFS_IOC_ALLOCSP, which does write zeros to disk, only allocates and writes zeros in the range between the old file size and the new file size. XFS_IOC_RESVSP, which alocates unwritten extents, only allocates where extents do not currently exist. It does not zero existing extents. IOWs, you can't overwrite existing data with XFS preallocation. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote: On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote: On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote: In XFS one of the (many) ALLOC modes is to zero existing data on allocate. For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on each extent. For some workloads this would be much faster than truncate and reallocate of all the blocks in a file. In ext4, we already mark each extent having preallocated blocks as uninitialized. This is done as part of following code (which is part of patch 5/7) in ext4_ext_get_blocks() : What I meant is that with XFS_IOC_ALLOCSP the previously-written data is ZEROED OUT, unlike with fallocate() which leaves previously-written data alone and only allocates in holes. So, if you had a sparse file with some data in it: A BB fallocate() would allocate the holes: 0A0BB XFS_IOC_ALLOCSP would overwrite everything: 0 No, it wouldn't. XFS_IOC_ALLOCSP would give you: A BB because it only allocates the space between the old EOF and the new EOF. Graphic demonstration - write 4k @ 4k, 4k @ 16k, allocsp out to 32k: budgie:~ # xfs_io -f \ -c pwrite 4096 4096 \ -c pwrite 16384 4096 \ -c bmap -vvp \ -c allocsp 32768 0 \ -c bmap -vvp \ /mnt/test/alfred wrote 4096/4096 bytes at offset 4096 4 KiB, 1 ops; 0. sec (108.507 MiB/sec and 2.7778 ops/sec) wrote 4096/4096 bytes at offset 16384 4 KiB, 1 ops; 0. sec (260.417 MiB/sec and 6.6667 ops/sec) /mnt/test/alfred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..7]: hole 8 1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8 2: [16..31]:hole 16 3: [32..39]:5226888..5226895 4 (1022184..1022191) 8 /mnt/test/alfred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..7]: hole 8 1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8 2: [16..31]:hole 16 3: [32..63]:5226888..5226919 4 (1022184..1022215)32 budgie:~ # Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Jun 27, 2007 09:14 +1000, David Chinner wrote: Someone on the XFs list had an interesting request - preallocated swap files. You can't use unwritten extents for this because of sys_swapon()s use of bmap() (XFS returns holes for reading unwritten extents), so we need a method of preallocating that does not zero or mark the extent unread. i.e. FA_MKSWAP. Is there a reason why unwritten extents return 0 to bmap()? This would seem to be the only impediment from using fallocated files for swap files. Maybe if FIEMAP was used by mkswap to get an UNWRITTEN flag back instead of HOLE it wouldn't be a problem. That way we can allocate large swap files that don't need zeroing in a single, fast operation, and hence potentially bring new swap space online without needed very much memory at all (i.e. should succeed in most near-OOM conditions). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/7][TAKE5] support new modes in fallocate
Implement new flags and values for mode argument. This patch implements the new flags and values for the mode argument of the fallocate system call. It is based on the discussion between Andreas Dilger and David Chinner on the man page proposed (by the later) on fallocate. Signed-off-by: Amit Arora [EMAIL PROTECTED] Index: linux-2.6.22-rc4/include/linux/fs.h === --- linux-2.6.22-rc4.orig/include/linux/fs.h +++ linux-2.6.22-rc4/include/linux/fs.h @@ -267,15 +267,16 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WAIT_AFTER 4 /* - * sys_fallocate modes - * Currently sys_fallocate supports two modes: - * FA_ALLOCATE : This is the preallocate mode, using which an application/user - * may request (pre)allocation of blocks. - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free - * the preallocated blocks. + * sys_fallocate mode flags and values */ -#define FA_ALLOCATE0x1 -#define FA_DEALLOCATE 0x2 +#define FA_FL_DEALLOC 0x01 /* default is allocate */ +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */ +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */ + +#define FA_ALLOCATE0 +#define FA_DEALLOCATE FA_FL_DEALLOC +#define FA_RESV_SPACE FA_FL_KEEP_SIZE +#define FA_UNRESV_SPACE(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) #ifdef __KERNEL__ Index: linux-2.6.22-rc4/fs/open.c === --- linux-2.6.22-rc4.orig/fs/open.c +++ linux-2.6.22-rc4/fs/open.c @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned * sys_fallocate - preallocate blocks or free preallocated blocks * @fd: the file descriptor * @mode: mode specifies if fallocate should preallocate blocks OR free - * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and - * FA_DEALLOCATE modes are supported. + * (unallocate) preallocated blocks. * @offset: The offset within file, from where (un)allocation is being * requested. It should not have a negative value. * @len: The amount (in bytes) of space to be (un)allocated, from the offset. * * This system call, depending on the mode, preallocates or unallocates blocks * for a file. The range of blocks depends on the value of offset and len - * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * arguments provided by the user/application. For FA_ALLOCATE and + * FA_RESV_SPACE modes, if the sys_fallocate() * system call succeeds, subsequent writes to the file in the given range * (specified by offset len) should not fail - even if the file system * later becomes full. Hence the preallocation done is persistent (valid - * even after reopen of the file and remount/reboot). + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode + * is passed, the file size will not be changed even if the preallocation + * is beyond EOF. * * It is expected that the -fallocate() inode operation implemented by the * individual file systems will update the file size and/or ctime/mtime - * depending on the mode and also on the success of the operation. + * depending on the mode (change is visible to user or not - say file size) + * and obviously, on the success of the operation. * * Note: Incase the file system does not support preallocation, * posix_fallocate() should fall back to the library implementation (i.e. @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in /* Return error if mode is not supported */ ret = -EOPNOTSUPP; - if (mode != FA_ALLOCATE mode != FA_DEALLOCATE) + if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE || + mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE)) goto out; ret = -EBADF; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote: I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post. If it is decided that these flags are also needed, I will update this patch. Thanks! Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we don't want to expose uninitialized disk blocks to userspace. I'm not sure if this makes sense at all. On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote: Implement new flags and values for mode argument. This patch implements the new flags and values for the mode argument of the fallocate system call. It is based on the discussion between Andreas Dilger and David Chinner on the man page proposed (by the later) on fallocate. Signed-off-by: Amit Arora [EMAIL PROTECTED] Index: linux-2.6.22-rc4/include/linux/fs.h === --- linux-2.6.22-rc4.orig/include/linux/fs.h +++ linux-2.6.22-rc4/include/linux/fs.h @@ -267,15 +267,16 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WAIT_AFTER 4 /* - * sys_fallocate modes - * Currently sys_fallocate supports two modes: - * FA_ALLOCATE : This is the preallocate mode, using which an application/user - * may request (pre)allocation of blocks. - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free - * the preallocated blocks. + * sys_fallocate mode flags and values */ -#define FA_ALLOCATE0x1 -#define FA_DEALLOCATE 0x2 +#define FA_FL_DEALLOC 0x01 /* default is allocate */ +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */ +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */ + +#define FA_ALLOCATE0 +#define FA_DEALLOCATE FA_FL_DEALLOC +#define FA_RESV_SPACE FA_FL_KEEP_SIZE +#define FA_UNRESV_SPACE(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) #ifdef __KERNEL__ Index: linux-2.6.22-rc4/fs/open.c === --- linux-2.6.22-rc4.orig/fs/open.c +++ linux-2.6.22-rc4/fs/open.c @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned * sys_fallocate - preallocate blocks or free preallocated blocks * @fd: the file descriptor * @mode: mode specifies if fallocate should preallocate blocks OR free - * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and - * FA_DEALLOCATE modes are supported. + * (unallocate) preallocated blocks. * @offset: The offset within file, from where (un)allocation is being * requested. It should not have a negative value. * @len: The amount (in bytes) of space to be (un)allocated, from the offset. * * This system call, depending on the mode, preallocates or unallocates blocks * for a file. The range of blocks depends on the value of offset and len - * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * arguments provided by the user/application. For FA_ALLOCATE and + * FA_RESV_SPACE modes, if the sys_fallocate() * system call succeeds, subsequent writes to the file in the given range * (specified by offset len) should not fail - even if the file system * later becomes full. Hence the preallocation done is persistent (valid - * even after reopen of the file and remount/reboot). + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode + * is passed, the file size will not be changed even if the preallocation + * is beyond EOF. * * It is expected that the -fallocate() inode operation implemented by the * individual file systems will update the file size and/or ctime/mtime - * depending on the mode and also on the success of the operation. + * depending on the mode (change is visible to user or not - say file size) + * and obviously, on the success of the operation. * * Note: Incase the file system does not support preallocation, * posix_fallocate() should fall back to the library implementation (i.e. @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in /* Return error if mode is not supported */ ret = -EOPNOTSUPP; - if (mode != FA_ALLOCATE mode != FA_DEALLOCATE) + if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE || + mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE)) goto out; ret = -EBADF; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems,
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote: +#define FA_FL_DEALLOC0x01 /* default is allocate */ +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */ +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */ In XFS one of the (many) ALLOC modes is to zero existing data on allocate. For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on each extent. For some workloads this would be much faster than truncate and reallocate of all the blocks in a file. In that light, please change the comment to /* default is keep existing data */ so that it doesn't imply this is only for DEALLOC. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html