Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Suparna Bhattacharya
On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
 On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
  Well, if you see the modes proposed using above flags :
  
  #define FA_ALLOCATE 0
  #define FA_DEALLOCATE   FA_FL_DEALLOC
  #define FA_RESV_SPACE   FA_FL_KEEP_SIZE
  #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
  FA_FL_DEL_DATA)
  
  FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
  for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
  flag. Hence prealloction will never delete data.
  This mode is required only for FA_UNRESV_SPACE, which is a deallocation
  mode, to support any existing XFS aware applications/usage-scenarios.
 
 Sorry, but this doesn't make any sense.  There is no need to put every
 feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
 be supported forever anyway - as I suggested before they really should
 be moved to generic code.
 
 What needs to be supported is what makes sense as an interface.
 A punch a hole interface does make sense, but trying to hack this into
 a preallocation system call is just madness.  We're not IRIX or windows
 that fit things into random subcall just because there was some space
 left to squeeze them in.
 
 FA_FL_NO_MTIME0x10 /* keep same mtime (default change on 
 size, data change) */
 FA_FL_NO_CTIME0x20 /* keep same ctime (default change on 
 size, data change) */
   
   NACK to these aswell.  If i_size changes c/mtime need updates, if the size
   doesn't chamge they don't.  No need to add more flags for this.
  
  This requirement was from the point of view of HSM applications. Hope
  you saw Andreas previous post and are keeping that in mind.
 
 HSMs needs this basically for every system call, which screams for an
 open flag like O_INVISIBLE anyway.  Adding this in a generic way is
 a good idea, but hacking bits and pieces that won't fit into the global
 design is completely wrong.


Why don't we just merge the interface for preallocation (essentially
enough to satisfy posix_fallocate() and the simple XFS requirement for 
space reservation without changing file size), which there is clear agreement
on (I hope :)).  After all, this was all that we set out to do when we
started.

And leave all the dealloc/punch/hsm type features for separate future patches/
debates, those really shouldn't hold up the basic fallocate interface.
I agree with Christoph that we are just diverging too much in trying to
club those decisions here.

Dave, Andreas, Ted ?

Regards
Suparna

 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Amit K. Arora
On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
 On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
  On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
   Well, if you see the modes proposed using above flags :
   
   #define FA_ALLOCATE   0
   #define FA_DEALLOCATE FA_FL_DEALLOC
   #define FA_RESV_SPACE FA_FL_KEEP_SIZE
   #define FA_UNRESV_SPACE   (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
   FA_FL_DEL_DATA)
   
   FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
   for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
   flag. Hence prealloction will never delete data.
   This mode is required only for FA_UNRESV_SPACE, which is a deallocation
   mode, to support any existing XFS aware applications/usage-scenarios.
  
  Sorry, but this doesn't make any sense.  There is no need to put every
  feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
  be supported forever anyway - as I suggested before they really should
  be moved to generic code.
  
  What needs to be supported is what makes sense as an interface.
  A punch a hole interface does make sense, but trying to hack this into
  a preallocation system call is just madness.  We're not IRIX or windows
  that fit things into random subcall just because there was some space
  left to squeeze them in.
  
  FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on 
  size, data change) */
  FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on 
  size, data change) */

NACK to these aswell.  If i_size changes c/mtime need updates, if the 
size
doesn't chamge they don't.  No need to add more flags for this.
   
   This requirement was from the point of view of HSM applications. Hope
   you saw Andreas previous post and are keeping that in mind.
  
  HSMs needs this basically for every system call, which screams for an
  open flag like O_INVISIBLE anyway.  Adding this in a generic way is
  a good idea, but hacking bits and pieces that won't fit into the global
  design is completely wrong.
 
 Why don't we just merge the interface for preallocation (essentially
 enough to satisfy posix_fallocate() and the simple XFS requirement for 
 space reservation without changing file size), which there is clear agreement
 on (I hope :)).  After all, this was all that we set out to do when we
 started.

As you suggest, let us just have two modes for the time being:

#define FALLOC_ALLOCATE 0x1
#define FALLOC_ALLOCATE_KEEP_SIZE   0x2

As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
will result in file size not being changed even if the preallocation is
beyond EOF.

 And leave all the dealloc/punch/hsm type features for separate future patches/
 debates, those really shouldn't hold up the basic fallocate interface.

I agree.

 I agree with Christoph that we are just diverging too much in trying to
 club those decisions here.
 
 Dave, Andreas, Ted ?
 
 Regards
 Suparna

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread David Chinner
On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
 
 Why don't we just merge the interface for preallocation (essentially
 enough to satisfy posix_fallocate() and the simple XFS requirement for 
 space reservation without changing file size), which there is clear agreement
 on (I hope :)).  After all, this was all that we set out to do when we
 started.
 
 And leave all the dealloc/punch/hsm type features for separate future patches/
 debates, those really shouldn't hold up the basic fallocate interface.
 I agree with Christoph that we are just diverging too much in trying to
 club those decisions here.
 
 Dave, Andreas, Ted ?

Sure. I'll just make XFS work with whatever it is that gets merged.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Amit K. Arora
On Thu, Jul 12, 2007 at 11:13:34PM +1000, David Chinner wrote:
 On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
  
  Why don't we just merge the interface for preallocation (essentially
  enough to satisfy posix_fallocate() and the simple XFS requirement for 
  space reservation without changing file size), which there is clear 
  agreement
  on (I hope :)).  After all, this was all that we set out to do when we
  started.
  
  And leave all the dealloc/punch/hsm type features for separate future 
  patches/
  debates, those really shouldn't hold up the basic fallocate interface.
  I agree with Christoph that we are just diverging too much in trying to
  club those decisions here.
  
  Dave, Andreas, Ted ?
 
 Sure. I'll just make XFS work with whatever it is that gets merged.

Great. I will post the new patches soon.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Andreas Dilger
On Jul 12, 2007  13:56 +0530, Amit K. Arora wrote:
 As you suggest, let us just have two modes for the time being:
 
 #define FALLOC_ALLOCATE   0x1
 #define FALLOC_ALLOCATE_KEEP_SIZE 0x2
 
 As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
 will result in file size not being changed even if the preallocation is
 beyond EOF.

What does FALLOC_ALLOCATE mean vs. not passing this flag?  I have no
objection to this as long as the code remains with these as flags
instead of modes...  Essentially just dropping the FALLOC_FL_DEALLOCATE
and FALLOC_FL_DEL_DATA from the interface.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
 Well, if you see the modes proposed using above flags :
 
 #define FA_ALLOCATE   0
 #define FA_DEALLOCATE FA_FL_DEALLOC
 #define FA_RESV_SPACE FA_FL_KEEP_SIZE
 #define FA_UNRESV_SPACE   (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
 FA_FL_DEL_DATA)
 
 FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
 for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
 flag. Hence prealloction will never delete data.
 This mode is required only for FA_UNRESV_SPACE, which is a deallocation
 mode, to support any existing XFS aware applications/usage-scenarios.

Sorry, but this doesn't make any sense.  There is no need to put every
feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
be supported forever anyway - as I suggested before they really should
be moved to generic code.

What needs to be supported is what makes sense as an interface.
A punch a hole interface does make sense, but trying to hack this into
a preallocation system call is just madness.  We're not IRIX or windows
that fit things into random subcall just because there was some space
left to squeeze them in.

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data 
change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data 
change) */
  
  NACK to these aswell.  If i_size changes c/mtime need updates, if the size
  doesn't chamge they don't.  No need to add more flags for this.
 
 This requirement was from the point of view of HSM applications. Hope
 you saw Andreas previous post and are keeping that in mind.

HSMs needs this basically for every system call, which screams for an
open flag like O_INVISIBLE anyway.  Adding this in a generic way is
a good idea, but hacking bits and pieces that won't fit into the global
design is completely wrong.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
 Given the current behaviour for posix_fallocate() in glibc, I think
 that retaining the same error semantic and punting the cleanup to
 userspace (where the app will fail with ENOSPC anyway) is the only
 sane thing we can do here. Trying to undo this in the kernel leads
 to lots of extra rarely used code in error handling paths...

Agreed, looks like we should stay with the user has to clean up behaviour.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote:
 We use this capability in XFS at the moment.
 I think this is mainly for DMF (HSM) but is done via the xfs handle 
 interface
 (xfs_open_by_handle) AFAICT.
 

You're not :)  You're using an O_INVIBLE equivalent (as described below),
which would be a useful thing to have at the VFS level, but adding hacks
to some system calls only wouldn't help any HSM system.  It's just useless
API clutter.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Amit K. Arora
On Sat, Jun 30, 2007 at 12:52:46PM -0400, Andreas Dilger wrote:
 The @mode flags that are currently under consideration are (AFAIK):
 
 FA_FL_DEALLOC 0x01 /* deallocate unwritten extent (default allocate) 
 */
 FA_FL_KEEP_SIZE   0x02 /* keep size for EOF {pre,de}alloc (default change 
 size) */
 FA_FL_DEL_DATA0x04 /* delete existing data in alloc range (default 
 keep) */

We now have two sets of flags - 
1) the above three with which I think no one has any issues with, and
2) the ones below, for which we need some discussions before finalizing
on them.

I will prefer fallocate going in mainline with the above three modes, and
rest of the modes can be debated upon and discussed parallely. And, each
new mode/flag can be pushed as a separate patch. This will not hold
fallocate feature indefinitely...

Please confirm if you find this approach ok. Otherwise, please object.
Thanks!

 FA_FL_ERR_FREE0x08 /* free preallocation on error (default keep 
 prealloc) */
 FA_FL_NO_MTIME0x10 /* keep same mtime (default change on size, data 
 change) */
 FA_FL_NO_CTIME0x20 /* keep same ctime (default change on size, data 
 change) */

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Christoph Hellwig
On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
  FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default 
  allocate) */
  FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change 
  size) */
  FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default 
  keep) */
 
 We now have two sets of flags - 
 1) the above three with which I think no one has any issues with, and

Yes, I do.  FA_FL_DEL_DATA is plain stupid, a preallocation call should
never delete data.  FA_FL_DEALLOC should probably be a separate syscall
because it's very different functionality.

While we're at it I also dislike the FA_ prefix becuase it doesn't say
anything and is far too generic.  FALLOC_ is much better.

  FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep 
  prealloc) */

NACK on this one.  We should have just one behaviour, and from the thread
that not freeing the allocation on error.

  FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data 
  change) */
  FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data 
  change) */

NACK to these aswell.  If i_size changes c/mtime need updates, if the size
doesn't chamge they don't.  No need to add more flags for this.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Amit K. Arora
On Tue, Jul 03, 2007 at 11:31:07AM +0100, Christoph Hellwig wrote:
 On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
   FA_FL_DEALLOC 0x01 /* deallocate unwritten extent (default 
   allocate) */
   FA_FL_KEEP_SIZE   0x02 /* keep size for EOF {pre,de}alloc (default change 
   size) */
   FA_FL_DEL_DATA0x04 /* delete existing data in alloc range (default 
   keep) */
  
  We now have two sets of flags - 
  1) the above three with which I think no one has any issues with, and
 
 Yes, I do.  FA_FL_DEL_DATA is plain stupid, a preallocation call should
 never delete data.  FA_FL_DEALLOC should probably be a separate syscall
 because it's very different functionality.

Well, if you see the modes proposed using above flags :

#define FA_ALLOCATE 0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
flag. Hence prealloction will never delete data.
This mode is required only for FA_UNRESV_SPACE, which is a deallocation
mode, to support any existing XFS aware applications/usage-scenarios.

And, regarding FA_FL_DEALLOC being a separate syscall - I think then the
very purpose of @mode argument is not justified. We have this mode so
that we can provide more features like this. That said, I don't say that
we should make things very complicated; but, atleast we should provide
some basic features which we expect most of the applications wanting
preallocation to use. To start with, we need to cater to already
existing applications/user base who use XFS preallocation feature.

And further advanced features, like goal based preallocation, can be
implemented as a separate syscall.

 While we're at it I also dislike the FA_ prefix becuase it doesn't say
 anything and is far too generic.  FALLOC_ is much better.

Ok. This can be changed in the next take.
 
   FA_FL_ERR_FREE0x08 /* free preallocation on error (default keep 
   prealloc) */
 
 NACK on this one.  We should have just one behaviour, and from the thread
 that not freeing the allocation on error.

I agree on this one. 
 
   FA_FL_NO_MTIME0x10 /* keep same mtime (default change on size, data 
   change) */
   FA_FL_NO_CTIME0x20 /* keep same ctime (default change on size, data 
   change) */
 
 NACK to these aswell.  If i_size changes c/mtime need updates, if the size
 doesn't chamge they don't.  No need to add more flags for this.

This requirement was from the point of view of HSM applications. Hope
you saw Andreas previous post and are keeping that in mind.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Timothy Shimmin

Amit K. Arora wrote:

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

NACK to these aswell.  If i_size changes c/mtime need updates, if the size
doesn't chamge they don't.  No need to add more flags for this.


This requirement was from the point of view of HSM applications. Hope
you saw Andreas previous post and are keeping that in mind.


We use this capability in XFS at the moment.
I think this is mainly for DMF (HSM) but is done via the xfs handle interface
(xfs_open_by_handle) AFAICT.

This sets up a set of invisible operations (xfs_invis_file_operations).
xfs_file_ioctl_invis goes on to set IO_INVIS which goes on to set ATTR_DMI
which is then tested in xfs_change_file_space() (which handles XFS_IOC_RESVSP  
friends)
for whether xfs_ichgtime(ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG)
is called or not.

--Tim
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-02 Thread Amit K. Arora
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
 On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote:
  On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
Can you clarify - what is the current behaviour when ENOSPC (or some 
other
error) is hit?  Does it keep the current fallocate() or does it free it?
   
   Currently it is left on the file system implementation. In ext4, we do
   not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
   end up with partial (pre)allocation. This is inline with dd and
   posix_fallocate, which also do not free the partially allocated space.
  
  I can't find anything in the specification of posix_fallocate
  (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
  that tells what should happen to allocate blocks on error.
 
 Yeah, and AFAICT glibc leaves them behind ATM.

Yes, it does.
 
  But common sense would be to not leak disk space on failure of this
  syscall, and this definitively should not be left up to the filesystem,
  either we always leak it or always free it, and I'd strongly favour
  the latter variant.

I would not call it a leak, since the blocks which got allocated as
part of the partial success of the fallocate syscall can be strictly
accounted for (i.e. they are assigned to a particular inode). And these
can be freed by the application, using a suitable @mode of fallocate.
 
 We can't simply walk the range an remove unwritten extents, as some
 of them may have been present before the fallocate() call. That
 makes it extremely difficult to undo a failed call and not remove
 more pre-existing pre-allocations.

Same is true for ext4 too. It is very difficult to keep track of which
uninitialized (unwritten) extents got allocated as part of the current
syscall. This is because, as David mentions, some of them might be
already present; and also because some of the older ones may have got
merged with the *new* uninitialized/unwritten extents as part of the
current syscall. 
 
 Given the current behaviour for posix_fallocate() in glibc, I think
 that retaining the same error semantic and punting the cleanup to
 userspace (where the app will fail with ENOSPC anyway) is the only
 sane thing we can do here. Trying to undo this in the kernel leads
 to lots of extra rarely used code in error handling paths...

Right. This gives applications the free hand if they really want to use
the partially preallocated space, OR they want to free it; without
introducing additional complexity in the kernel.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-01 Thread Andreas Dilger
On Jun 30, 2007  11:21 +0100, Christoph Hellwig wrote:
 On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
  Currently it is left on the file system implementation. In ext4, we do
  not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
  end up with partial (pre)allocation. This is inline with dd and
  posix_fallocate, which also do not free the partially allocated space.
 
 I can't find anything in the specification of posix_fallocate
 (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
 that tells what should happen to allocate blocks on error.
 
 But common sense would be to not leak disk space on failure of this
 syscall, and this definitively should not be left up to the filesystem,
 either we always leak it or always free it, and I'd strongly favour
 the latter variant.

I definitely agree that the behaviour should be specified part of
the interface.  The current behaviour of both ext4 and XFS is that the
successful part of the unallocated extent is left in place when returning
ENOSPC so we considered this the consistent behaviour.  This is the same
as e.g. sys_write() which does not remove the part of the write that was
successful if ENOSPC is hit.  I think this also makes sense for some usa
cases, because application like PVR may want to preallocate approximately
30min of space, but if it gets only 25min worth then it can at least start
using this while it also begins looking for and/or freeing old files.

If the space is always freed on ENOSPC, then there may be a significant
amount of work done and undone while the application is iterating over
possible sizes until one works.   It is easy for the application to
use fstat() to see the blocks/size actually preallocated on failure, and
explicitly request unallocation of this space if the outcome is undesirable.

If you think that applications have a strong preference for both kinds
of behaviour (e.g. database which requires the full allocation to succeed,
unlike PVR application above) then this could be encoded into a @mode flag.

   For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
   don't want to expose uninitialized disk blocks to userspace.  I'm not
   sure if this makes sense at all.
 
 This is the xfs unwritten extent behaviour.  But anyway, the important bit
 is uninitialized blocks should never ever leak to userspace, so there is
 not need for the flag.

I agree that we shouldn't need FA_ZERO_SPACE.  If an application wants
explicit zeros written to disk it can just do this with O_DIRECT writes
or similar.

 The more I think about it the more I'd prefer we would just put a simple
 syscall in that implements nothing but the posix_fallocate(3) semantics
 as defined in SuS, and then go on to brainstorm about advanced
 preallocation / layout hint semantics.

I don't think the current @mode flags introduce any significant complexity
in the implementation, and in fact one of the reasons these came up in the
first place was because David pointed out the XFS behaviour did NOT match
with posix_fallocate() and we started getting strange semantics enforced
by monolithic modes.  IMHO, coding for and understanding the semantics of
the monolithic modes is much more complex and less useful than the explicit
flags.

The @mode flags that are currently under consideration are (AFAIK):

FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */

Your concern about leaking space would imply:

FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */

The other possible flags that were proposed, to avoid confusing backup and
HSM applications when preallocated space is added or removed from a file
(you don't want a backup app to re-backup a file that was migrated via HSM):

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-01 Thread David Chinner
On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote:
 On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
   Can you clarify - what is the current behaviour when ENOSPC (or some other
   error) is hit?  Does it keep the current fallocate() or does it free it?
  
  Currently it is left on the file system implementation. In ext4, we do
  not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
  end up with partial (pre)allocation. This is inline with dd and
  posix_fallocate, which also do not free the partially allocated space.
 
 I can't find anything in the specification of posix_fallocate
 (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
 that tells what should happen to allocate blocks on error.

Yeah, and AFAICT glibc leaves them behind ATM.

 But common sense would be to not leak disk space on failure of this
 syscall, and this definitively should not be left up to the filesystem,
 either we always leak it or always free it, and I'd strongly favour
 the latter variant.

We can't simply walk the range an remove unwritten extents, as some
of them may have been present before the fallocate() call. That
makes it extremely difficult to undo a failed call and not remove
more pre-existing pre-allocations.

Given the current behaviour for posix_fallocate() in glibc, I think
that retaining the same error semantic and punting the cleanup to
userspace (where the app will fail with ENOSPC anyway) is the only
sane thing we can do here. Trying to undo this in the kernel leads
to lots of extra rarely used code in error handling paths...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-30 Thread Christoph Hellwig
On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
  Can you clarify - what is the current behaviour when ENOSPC (or some other
  error) is hit?  Does it keep the current fallocate() or does it free it?
 
 Currently it is left on the file system implementation. In ext4, we do
 not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
 end up with partial (pre)allocation. This is inline with dd and
 posix_fallocate, which also do not free the partially allocated space.

I can't find anything in the specification of posix_fallocate
(http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
that tells what should happen to allocate blocks on error.

But common sense would be to not leak disk space on failure of this
syscall, and this definitively should not be left up to the filesystem,
either we always leak it or always free it, and I'd strongly favour
the latter variant.

  For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
  don't want to expose uninitialized disk blocks to userspace.  I'm not
  sure if this makes sense at all.
 
 I don't think we need to make it default - atleast for filesystems which
 have a mechanism to distinguish preallocated blocks from regular ones.
 In ext4, for example, we will have a way to mark uninitialized extents.
 All the preallocated blocks will be part of these uninitialized extents.
 And any read on these extents will treat them as a hole, returning
 zeroes to user land. Thus any existing data on uninitialized blocks will
 not be exposed to the userspace.

This is the xfs unwritten extent behaviour.  But anyway, the important bit
is uninitialized blocks should never ever leak to userspace, so there is
not need for the flag.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-28 Thread Amit K. Arora
On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
 On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
  On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
   On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
Can you clarify - what is the current behaviour when ENOSPC (or some 
other
error) is hit?  Does it keep the current fallocate() or does it free it?
   
   Currently it is left on the file system implementation. In ext4, we do
   not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
   end up with partial (pre)allocation. This is inline with dd and
   posix_fallocate, which also do not free the partially allocated space.
  
  Since I believe the XFS allocation ioctls do it the opposite way (free
  preallocated space on error) this should be encoded into the flags.
  Having it filesystem dependent just means that nobody will be happy.
 
 No, XFs does not free preallocated space on error. it is up to the
 application to clean up.

Since XFS also does not free preallocated space on error and this
behavior is inline with dd, posix_fallocate() and the current ext4
implementation, do we still need FA_FL_FREE_ENOSPC flag ?
 
  What I mean is that any data read from the file should have the appearance
  of being zeroed (whether zeroes are actually written to disk or not).  What
  I _think_ David is proposing is to allow fallocate() to return without
  marking the blocks even uninitialized and subsequent reads would return
  the old data from the disk.
 
 Correct, but for swap files that's not an issue - no user should be able
 too read them, and FA_MKSWAP would really need root privileges to execute.

Will the FA_MKSWAP mode still be required with your suggested change of
teaching do_mpage_readpage() about unwritten extents being in place ?
Or, will you still like to have FA_MKSWAP mode ?

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-28 Thread Nathan Scott
On Thu, 2007-06-28 at 23:49 +0530, Amit K. Arora wrote:
 
  Correct, but for swap files that's not an issue - no user should be
 able
  too read them, and FA_MKSWAP would really need root privileges to
 execute.
 
 Will the FA_MKSWAP mode still be required with your suggested change
 of
 teaching do_mpage_readpage() about unwritten extents being in place ?
 Or, will you still like to have FA_MKSWAP mode ? 

There's no need for a MKSWAP flag.

cheers.

--
Nathan

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-28 Thread David Chinner
On Thu, Jun 28, 2007 at 11:49:13PM +0530, Amit K. Arora wrote:
 On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
  On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
   On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
 Can you clarify - what is the current behaviour when ENOSPC (or some 
 other
 error) is hit?  Does it keep the current fallocate() or does it free 
 it?

Currently it is left on the file system implementation. In ext4, we do
not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
end up with partial (pre)allocation. This is inline with dd and
posix_fallocate, which also do not free the partially allocated space.
   
   Since I believe the XFS allocation ioctls do it the opposite way (free
   preallocated space on error) this should be encoded into the flags.
   Having it filesystem dependent just means that nobody will be happy.
  
  No, XFs does not free preallocated space on error. it is up to the
  application to clean up.
 
 Since XFS also does not free preallocated space on error and this
 behavior is inline with dd, posix_fallocate() and the current ext4
 implementation, do we still need FA_FL_FREE_ENOSPC flag ?

Not at the moment.

   What I mean is that any data read from the file should have the 
   appearance
   of being zeroed (whether zeroes are actually written to disk or not).  
   What
   I _think_ David is proposing is to allow fallocate() to return without
   marking the blocks even uninitialized and subsequent reads would return
   the old data from the disk.
  
  Correct, but for swap files that's not an issue - no user should be able
  too read them, and FA_MKSWAP would really need root privileges to execute.
 
 Will the FA_MKSWAP mode still be required with your suggested change of
 teaching do_mpage_readpage() about unwritten extents being in place ?
 Or, will you still like to have FA_MKSWAP mode ?

budgie:/mnt/test # xfs_io -f -c resvsp 0 1048576 -c truncate 1048576 
swap_file
budgie:/mnt/test # mkswap swap_file
Setting up swapspace version 1, size = 1032 kB
budgie:/mnt/test # swapon -v swap_file
swapon on swap_file
budgie:/mnt/test # swapon -s
FilenameTypeSizeUsedPriority
/dev/sda2   partition   9437152 0   -1
/mnt/test/swap_file file992 0   -2
budgie:/mnt/test # xfs_bmap -vvp swap_file
swap_file:
 EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSETTOTAL FLAGS
   0: [0..31]: 96..127   0 (96..127)   32
   1: [32..2047]:  128..2143 0 (128..2143)   2016 1
 FLAG Values:
01 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end   on stripe unit
10 Doesn't begin on stripe width
01 Doesn't end   on stripe width

Looks like the changes work, so FA_MKSWAP is not necessary for XFS.
We can drop that for the moment unless anyone else sees a need for it.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-27 Thread David Chinner
On Tue, Jun 26, 2007 at 11:49:15PM -0400, Andreas Dilger wrote:
 On Jun 27, 2007  09:14 +1000, David Chinner wrote:
  Someone on the XFs list had an interesting request - preallocated
  swap files. You can't use unwritten extents for this because
  of sys_swapon()s use of bmap() (XFS returns holes for reading
  unwritten extents), so we need a method of preallocating that does
  not zero or mark the extent unread. i.e. FA_MKSWAP.
 
 Is there a reason why unwritten extents return 0 to bmap()?

It's a fallout of xfs_get_blocks not mapping unwritten extents
on read because we want do_mpage_readpage() to treat them
as a hole. i.e. zero fill them instead of doing I/O. This is
the way XFS was shoehorned into the generic read path :/

 This
 would seem to be the only impediment from using fallocated files
 for swap files.  Maybe if FIEMAP was used by mkswap to get an
 UNWRITTEN flag back instead of HOLE it wouldn't be a problem.

Probably. If we taught do_mpage_readpage() about unwritten mappings,
then would could map them on read if and then sys_swapon can remain
blissfully unaware of unwritten extents.

I think this is pretty much all I need to do to acheive that is
(untested):

---

Teach do_mpage_readpage() about unwritten extents so we can
always map them in get_blocks rather than they are are holes on
read. Allows setup_swap_extents() to use preallocated files on XFS
filesystems for swap files without ever needing to convert them.

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 fs/mpage.c  |5 +++--
 fs/xfs/linux-2.6/xfs_aops.c |   13 +++--
 2 files changed, 6 insertions(+), 12 deletions(-)

Index: 2.6.x-xfs-new/fs/mpage.c
===
--- 2.6.x-xfs-new.orig/fs/mpage.c   2007-05-29 16:17:59.0 +1000
+++ 2.6.x-xfs-new/fs/mpage.c2007-06-27 22:39:35.568852270 +1000
@@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc
 * Map blocks using the result from the previous get_blocks call first.
 */
nblocks = map_bh-b_size  blkbits;
-   if (buffer_mapped(map_bh)  block_in_file  *first_logical_block 
+   if (buffer_mapped(map_bh)  !buffer_unwritten(map_bh) 
+   block_in_file  *first_logical_block 
block_in_file  (*first_logical_block + nblocks)) {
unsigned map_offset = block_in_file - *first_logical_block;
unsigned last = nblocks - map_offset;
@@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc
*first_logical_block = block_in_file;
}
 
-   if (!buffer_mapped(map_bh)) {
+   if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) {
fully_mapped = 0;
if (first_hole == blocks_per_page)
first_hole = page_block;
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c  2007-06-05 
22:14:39.0 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c   2007-06-27 22:39:29.545636749 
+1000
@@ -1340,16 +1340,9 @@ __xfs_get_blocks(
return 0;
 
if (iomap.iomap_bn != IOMAP_DADDR_NULL) {
-   /*
-* For unwritten extents do not report a disk address on
-* the read case (treat as if we're reading into a hole).
-*/
-   if (create || !(iomap.iomap_flags  IOMAP_UNWRITTEN)) {
-   xfs_map_buffer(bh_result, iomap, offset,
-  inode-i_blkbits);
-   }
-   if (create  (iomap.iomap_flags  IOMAP_UNWRITTEN)) {
-   if (direct)
+   xfs_map_buffer(bh_result, iomap, offset, inode-i_blkbits);
+   if (iomap.iomap_flags  IOMAP_UNWRITTEN) {
+   if (create  direct)
bh_result-b_private = inode;
set_buffer_unwritten(bh_result);
}


Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-27 Thread Nathan Scott
On Wed, 2007-06-27 at 23:36 +1000, David Chinner wrote:
  Allows setup_swap_extents() to use preallocated files on XFS
 filesystems for swap files without ever needing to convert them.

Using unwritten extents (as opposed to the MKSWAP flag mentioned
earlier) has the unfortunate down side of requiring transactions,
possibly additional IO, and memory allocation during swap.  (but,
this patch should probably go in regardless, as teaching generic
code about unwritten extents is not a bad idea).

cheers.

--
Nathan

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-27 Thread David Chinner
On Thu, Jun 28, 2007 at 09:28:36AM +1000, Nathan Scott wrote:
 On Wed, 2007-06-27 at 23:36 +1000, David Chinner wrote:
   Allows setup_swap_extents() to use preallocated files on XFS
  filesystems for swap files without ever needing to convert them.
 
 Using unwritten extents (as opposed to the MKSWAP flag mentioned
 earlier) has the unfortunate down side of requiring transactions,
 possibly additional IO, and memory allocation during swap.  (but,
 this patch should probably go in regardless, as teaching generic
 code about unwritten extents is not a bad idea).

I don't think it does - swapfile I/O looks like it goes direct to
bio without passing through the filesystem.  When the swapfile is
mapped, it scans and records the extent map of the entire swapfile
in a separate structure and AFAICT the swap code uses that built map
without touching the filesystem at all.

If that is true then the written/unwritten state of the extents is
irrelevant; all we need is allocated disk space for the file and
swapping should work. And it's not like anyone should be reading
the contents of that swapfile through the filesystem, either. ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-27 Thread Nathan Scott
On Thu, 2007-06-28 at 10:39 +1000, David Chinner wrote:
 
 
 I don't think it does - swapfile I/O looks like it goes direct to
 bio without passing through the filesystem.  When the swapfile is
 mapped, it scans and records the extent map of the entire swapfile
 in a separate structure and AFAICT the swap code uses that built map
 without touching the filesystem at all.
 
 If that is true then the written/unwritten state of the extents is
 irrelevant; all we need is allocated disk space for the file and
 swapping should work. And it's not like anyone should be reading
 the contents of that swapfile through the filesystem, either. ;)

Ah, yes, good point - thats true.  Unwritten extents are ideal for
this then, as attempts to read swap via the regular interfaces will
return zeros instead of random swapped out memory contents.

cheers.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Amit K. Arora
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
 On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
  I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
  *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
  If it is decided that these flags are also needed, I will update this
  patch. Thanks!
 
 Can you clarify - what is the current behaviour when ENOSPC (or some other
 error) is hit?  Does it keep the current fallocate() or does it free it?

Currently it is left on the file system implementation. In ext4, we do
not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
end up with partial (pre)allocation. This is inline with dd and
posix_fallocate, which also do not free the partially allocated space.
 
 For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
 don't want to expose uninitialized disk blocks to userspace.  I'm not
 sure if this makes sense at all.

I don't think we need to make it default - atleast for filesystems which
have a mechanism to distinguish preallocated blocks from regular ones.
In ext4, for example, we will have a way to mark uninitialized extents.
All the preallocated blocks will be part of these uninitialized extents.
And any read on these extents will treat them as a hole, returning
zeroes to user land. Thus any existing data on uninitialized blocks will
not be exposed to the userspace.

--
Regards,
Amit Arora 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Amit K. Arora
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
 On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
  +#define FA_FL_DEALLOC  0x01 /* default is allocate */
  +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */
  +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC 
  */
 
 In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
 For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
 each extent.  For some workloads this would be much faster than truncate
 and reallocate of all the blocks in a file.

In ext4, we already mark each extent having preallocated blocks as
uninitialized. This is done as part of following code (which is part of
patch 5/7) in ext4_ext_get_blocks() :  

@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
+   if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+   ext4_ext_mark_uninitialized(newex);
err = ext4_ext_insert_extent(handle, inode, path, newex);
if (err) {
/* free data blocks we just allocated */


 In that light, please change the comment to /* default is keep existing data 
 */
 so that it doesn't imply this is only for DEALLOC.

Ok. Will update the comment.

Thanks!
--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Andreas Dilger
On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
 On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
  In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
  For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
  each extent.  For some workloads this would be much faster than truncate
  and reallocate of all the blocks in a file.
 
 In ext4, we already mark each extent having preallocated blocks as
 uninitialized. This is done as part of following code (which is part of
 patch 5/7) in ext4_ext_get_blocks() :  

What I meant is that with XFS_IOC_ALLOCSP the previously-written data
is ZEROED OUT, unlike with fallocate() which leaves previously-written
data alone and only allocates in holes.

So, if you had a sparse file with some data in it:

 A BB

fallocate() would allocate the holes:

0A0BB

XFS_IOC_ALLOCSP would overwrite everything:

0

In order to specify this for allocation, FA_FL_DEL_DATA would need to make
sense for allocations (as well as the deallocation).  This is farily easy
to do - just mark all of the existing extents as unallocated, and their
data disappears.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Amit K. Arora
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
 On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
  On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
   Can you clarify - what is the current behaviour when ENOSPC (or some other
   error) is hit?  Does it keep the current fallocate() or does it free it?
  
  Currently it is left on the file system implementation. In ext4, we do
  not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
  end up with partial (pre)allocation. This is inline with dd and
  posix_fallocate, which also do not free the partially allocated space.
 
 Since I believe the XFS allocation ioctls do it the opposite way (free
 preallocated space on error) this should be encoded into the flags.
 Having it filesystem dependent just means that nobody will be happy.

Ok, got your point. Maybe we can have a flag for this, as you suggested.
But, default behavior IMHO should be _not_ to undo partial allocation
(thus the file system will have the option of supporting this flag or
not and it will be inline with posix_fallocate; XFS will obviously
like to support this flag, inline with its existing behavior).

   For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
   don't want to expose uninitialized disk blocks to userspace.  I'm not
   sure if this makes sense at all.
  
  I don't think we need to make it default - atleast for filesystems which
  have a mechanism to distinguish preallocated blocks from regular ones.
 
 What I mean is that any data read from the file should have the appearance
 of being zeroed (whether zeroes are actually written to disk or not).  What
 I _think_ David is proposing is to allow fallocate() to return without
 marking the blocks even uninitialized and subsequent reads would return
 the old data from the disk.

I can't think of a good reason for this (i.e. returning stale data from
preallocated blocks). It is infact a security issue to me.
Anyhow, this may though be beneficial for file systems which have
noticable overhead in marking the blocks uninitialized/preallocated.
Can you or David please throw some light on how this option might really
be helpful ? Thanks!

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Amit K. Arora
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
 On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
  On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
   In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
   For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
   each extent.  For some workloads this would be much faster than truncate
   and reallocate of all the blocks in a file.
  
  In ext4, we already mark each extent having preallocated blocks as
  uninitialized. This is done as part of following code (which is part of
  patch 5/7) in ext4_ext_get_blocks() :  
 
 What I meant is that with XFS_IOC_ALLOCSP the previously-written data
 is ZEROED OUT, unlike with fallocate() which leaves previously-written
 data alone and only allocates in holes.
 
 In order to specify this for allocation, FA_FL_DEL_DATA would need to make
 sense for allocations (as well as the deallocation).  This is farily easy
 to do - just mark all of the existing extents as unallocated, and their
 data disappears.

Ok, agreed. Will add the FA_ZERO_SPACE mode too.
Thanks!

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread David Chinner
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
 On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
  I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
  *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
  If it is decided that these flags are also needed, I will update this
  patch. Thanks!
 
 Can you clarify - what is the current behaviour when ENOSPC (or some other
 error) is hit?  Does it keep the current fallocate() or does it free it?
 
 For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
 don't want to expose uninitialized disk blocks to userspace.  I'm not
 sure if this makes sense at all.

Someone on the XFs list had an interesting request - preallocated
swap files. You can't use unwritten extents for this because
of sys_swapon()s use of bmap() (XFS returns holes for reading
unwritten extents), so we need a method of preallocating that does
not zero or mark the extent unread. i.e. FA_MKSWAP.

I thinkthis would be:

#define FA_FL_NO_ZERO_SPACE 0x08/* default is to zero space */

#define FA_MKSWAP   (FA_ALLOCATE | FA_FL_NO_ZERO_SPACE)

That way we can allocate large swap files that don't need zeroing
in a single, fast operation, and hence potentially bring new
swap space online without needed very much memory at all (i.e.
should succeed in most near-OOM conditions).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread David Chinner
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
 On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
  On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
   Can you clarify - what is the current behaviour when ENOSPC (or some other
   error) is hit?  Does it keep the current fallocate() or does it free it?
  
  Currently it is left on the file system implementation. In ext4, we do
  not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
  end up with partial (pre)allocation. This is inline with dd and
  posix_fallocate, which also do not free the partially allocated space.
 
 Since I believe the XFS allocation ioctls do it the opposite way (free
 preallocated space on error) this should be encoded into the flags.
 Having it filesystem dependent just means that nobody will be happy.

No, XFs does not free preallocated space on error. it is up to the
application to clean up.

 What I mean is that any data read from the file should have the appearance
 of being zeroed (whether zeroes are actually written to disk or not).  What
 I _think_ David is proposing is to allow fallocate() to return without
 marking the blocks even uninitialized and subsequent reads would return
 the old data from the disk.

Correct, but for swap files that's not an issue - no user should be able
too read them, and FA_MKSWAP would really need root privileges to execute.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread David Chinner
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
 On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
  +#define FA_FL_DEALLOC  0x01 /* default is allocate */
  +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */
  +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC 
  */
 
 In XFS one of the (many) ALLOC modes is to zero existing data on allocate.

No, none of the XFS allocation modes do that.

XFS_IOC_ALLOCSP, which does write zeros to disk, only allocates and
writes zeros in the range between the old file size and the new file size.
XFS_IOC_RESVSP, which alocates unwritten extents, only allocates
where extents do not currently exist. It does not zero existing
extents.

IOWs, you can't overwrite existing data with XFS preallocation.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread David Chinner
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
 On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
  On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
   In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
   For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
   each extent.  For some workloads this would be much faster than truncate
   and reallocate of all the blocks in a file.
  
  In ext4, we already mark each extent having preallocated blocks as
  uninitialized. This is done as part of following code (which is part of
  patch 5/7) in ext4_ext_get_blocks() :  
 
 What I meant is that with XFS_IOC_ALLOCSP the previously-written data
 is ZEROED OUT, unlike with fallocate() which leaves previously-written
 data alone and only allocates in holes.
 
 So, if you had a sparse file with some data in it:
 
  A BB
 
 fallocate() would allocate the holes:
 
 0A0BB
 
 XFS_IOC_ALLOCSP would overwrite everything:
 
 0

No, it wouldn't. XFS_IOC_ALLOCSP would give you:


  A BB

because it only allocates the space between the old EOF and the new
EOF. Graphic demonstration - write 4k @ 4k, 4k @ 16k, allocsp out to 32k:

budgie:~ # xfs_io -f \
 -c pwrite 4096 4096 \
 -c pwrite 16384 4096 \
 -c bmap -vvp \
 -c allocsp 32768 0 \
 -c bmap -vvp \
 /mnt/test/alfred
wrote 4096/4096 bytes at offset 4096
4 KiB, 1 ops; 0. sec (108.507 MiB/sec and 2.7778 ops/sec)
wrote 4096/4096 bytes at offset 16384
4 KiB, 1 ops; 0. sec (260.417 MiB/sec and 6.6667 ops/sec)
/mnt/test/alfred:
 EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSET  TOTAL
   0: [0..7]:  hole   8
   1: [8..15]: 5226864..5226871  4 (1022160..1022167) 8
   2: [16..31]:hole  16
   3: [32..39]:5226888..5226895  4 (1022184..1022191) 8
/mnt/test/alfred:
 EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSET  TOTAL
   0: [0..7]:  hole   8
   1: [8..15]: 5226864..5226871  4 (1022160..1022167) 8
   2: [16..31]:hole  16
   3: [32..63]:5226888..5226919  4 (1022184..1022215)32
budgie:~ #

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Andreas Dilger
On Jun 27, 2007  09:14 +1000, David Chinner wrote:
 Someone on the XFs list had an interesting request - preallocated
 swap files. You can't use unwritten extents for this because
 of sys_swapon()s use of bmap() (XFS returns holes for reading
 unwritten extents), so we need a method of preallocating that does
 not zero or mark the extent unread. i.e. FA_MKSWAP.

Is there a reason why unwritten extents return 0 to bmap()?  This
would seem to be the only impediment from using fallocated files
for swap files.  Maybe if FIEMAP was used by mkswap to get an
UNWRITTEN flag back instead of HOLE it wouldn't be a problem.

 That way we can allocate large swap files that don't need zeroing
 in a single, fast operation, and hence potentially bring new
 swap space online without needed very much memory at all (i.e.
 should succeed in most near-OOM conditions).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-25 Thread Amit K. Arora
Implement new flags and values for mode argument.

This patch implements the new flags and values for the mode argument
of the fallocate system call. It is based on the discussion between
Andreas Dilger and David Chinner on the man page proposed (by the later)
on fallocate.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22-rc4/include/linux/fs.h
===
--- linux-2.6.22-rc4.orig/include/linux/fs.h
+++ linux-2.6.22-rc4/include/linux/fs.h
@@ -267,15 +267,16 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WAIT_AFTER 4
 
 /*
- * sys_fallocate modes
- * Currently sys_fallocate supports two modes:
- * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
- *   may request (pre)allocation of blocks.
- * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
- *   the preallocated blocks.
+ * sys_fallocate mode flags and values
  */
-#define FA_ALLOCATE0x1
-#define FA_DEALLOCATE  0x2
+#define FA_FL_DEALLOC  0x01 /* default is allocate */
+#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */
+#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
+
+#define FA_ALLOCATE0
+#define FA_DEALLOCATE  FA_FL_DEALLOC
+#define FA_RESV_SPACE  FA_FL_KEEP_SIZE
+#define FA_UNRESV_SPACE(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
FA_FL_DEL_DATA)
 
 #ifdef __KERNEL__
 
Index: linux-2.6.22-rc4/fs/open.c
===
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
  * sys_fallocate - preallocate blocks or free preallocated blocks
  * @fd: the file descriptor
  * @mode: mode specifies if fallocate should preallocate blocks OR free
- *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
- *   FA_DEALLOCATE modes are supported.
+ *   (unallocate) preallocated blocks.
  * @offset: The offset within file, from where (un)allocation is being
  * requested. It should not have a negative value.
  * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
  *
  * This system call, depending on the mode, preallocates or unallocates blocks
  * for a file. The range of blocks depends on the value of offset and len
- * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * arguments provided by the user/application. For FA_ALLOCATE and
+ * FA_RESV_SPACE modes, if the sys_fallocate()
  * system call succeeds, subsequent writes to the file in the given range
  * (specified by offset  len) should not fail - even if the file system
  * later becomes full. Hence the preallocation done is persistent (valid
- * even after reopen of the file and remount/reboot).
+ * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
+ * is passed, the file size will not be changed even if the preallocation
+ * is beyond EOF.
  *
  * It is expected that the -fallocate() inode operation implemented by the
  * individual file systems will update the file size and/or ctime/mtime
- * depending on the mode and also on the success of the operation.
+ * depending on the mode (change is visible to user or not - say file size)
+ * and obviously, on the success of the operation.
  *
  * Note: Incase the file system does not support preallocation,
  * posix_fallocate() should fall back to the library implementation (i.e.
@@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
 
/* Return error if mode is not supported */
ret = -EOPNOTSUPP;
-   if (mode != FA_ALLOCATE  mode != FA_DEALLOCATE)
+   if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
+   mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
goto out;
 
ret = -EBADF;
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-25 Thread Andreas Dilger
On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
 I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
 *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
 If it is decided that these flags are also needed, I will update this
 patch. Thanks!

Can you clarify - what is the current behaviour when ENOSPC (or some other
error) is hit?  Does it keep the current fallocate() or does it free it?

For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
don't want to expose uninitialized disk blocks to userspace.  I'm not
sure if this makes sense at all.

 On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote:
  Implement new flags and values for mode argument.
  
  This patch implements the new flags and values for the mode argument
  of the fallocate system call. It is based on the discussion between
  Andreas Dilger and David Chinner on the man page proposed (by the later)
  on fallocate.
  
  Signed-off-by: Amit Arora [EMAIL PROTECTED]
  
  Index: linux-2.6.22-rc4/include/linux/fs.h
  ===
  --- linux-2.6.22-rc4.orig/include/linux/fs.h
  +++ linux-2.6.22-rc4/include/linux/fs.h
  @@ -267,15 +267,16 @@ extern int dir_notify_enable;
   #define SYNC_FILE_RANGE_WAIT_AFTER 4
  
   /*
  - * sys_fallocate modes
  - * Currently sys_fallocate supports two modes:
  - * FA_ALLOCATE  : This is the preallocate mode, using which an 
  application/user
  - *   may request (pre)allocation of blocks.
  - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
  - *   the preallocated blocks.
  + * sys_fallocate mode flags and values
*/
  -#define FA_ALLOCATE0x1
  -#define FA_DEALLOCATE  0x2
  +#define FA_FL_DEALLOC  0x01 /* default is allocate */
  +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */
  +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC 
  */
  +
  +#define FA_ALLOCATE0
  +#define FA_DEALLOCATE  FA_FL_DEALLOC
  +#define FA_RESV_SPACE  FA_FL_KEEP_SIZE
  +#define FA_UNRESV_SPACE(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
  FA_FL_DEL_DATA)
  
   #ifdef __KERNEL__
  
  Index: linux-2.6.22-rc4/fs/open.c
  ===
  --- linux-2.6.22-rc4.orig/fs/open.c
  +++ linux-2.6.22-rc4/fs/open.c
  @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
* sys_fallocate - preallocate blocks or free preallocated blocks
* @fd: the file descriptor
* @mode: mode specifies if fallocate should preallocate blocks OR free
  - *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
  - *   FA_DEALLOCATE modes are supported.
  + *   (unallocate) preallocated blocks.
* @offset: The offset within file, from where (un)allocation is being
* requested. It should not have a negative value.
* @len: The amount (in bytes) of space to be (un)allocated, from the 
  offset.
*
* This system call, depending on the mode, preallocates or unallocates 
  blocks
* for a file. The range of blocks depends on the value of offset and len
  - * arguments provided by the user/application. For FA_ALLOCATE mode, if 
  this
  + * arguments provided by the user/application. For FA_ALLOCATE and
  + * FA_RESV_SPACE modes, if the sys_fallocate()
* system call succeeds, subsequent writes to the file in the given range
* (specified by offset  len) should not fail - even if the file system
* later becomes full. Hence the preallocation done is persistent (valid
  - * even after reopen of the file and remount/reboot).
  + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
  + * is passed, the file size will not be changed even if the preallocation
  + * is beyond EOF.
*
* It is expected that the -fallocate() inode operation implemented by the
* individual file systems will update the file size and/or ctime/mtime
  - * depending on the mode and also on the success of the operation.
  + * depending on the mode (change is visible to user or not - say file size)
  + * and obviously, on the success of the operation.
*
* Note: Incase the file system does not support preallocation,
* posix_fallocate() should fall back to the library implementation (i.e.
  @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
  
  /* Return error if mode is not supported */
  ret = -EOPNOTSUPP;
  -   if (mode != FA_ALLOCATE  mode != FA_DEALLOCATE)
  +   if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
  +   mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
  goto out;
  
  ret = -EBADF;
  -
  To unsubscribe from this list: send the line unsubscribe linux-ext4 in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, 

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-25 Thread Andreas Dilger
On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
 +#define FA_FL_DEALLOC0x01 /* default is allocate */
 +#define FA_FL_KEEP_SIZE  0x02 /* default is extend/shrink size */
 +#define FA_FL_DEL_DATA   0x04 /* default is keep written data on DEALLOC 
 */

In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
each extent.  For some workloads this would be much faster than truncate
and reallocate of all the blocks in a file.

In that light, please change the comment to /* default is keep existing data */
so that it doesn't imply this is only for DEALLOC.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html