Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-26 Thread Hugh Dickins
On Wed, 26 Feb 2014, Dave Chinner wrote:
> On Tue, Feb 25, 2014 at 09:25:40PM -0800, Hugh Dickins wrote:
> 
> > But I wasn't really thinking of the offset > i_size case, just the
> > offset + len >= i_size case: which would end with i_size at offset,
> > and the areas you're worried about still beyond EOF - or am I confused?
> 
> Right, offset beyond EOF is just plain daft. But you're not thinking
> of the entire picture. What happens when a system crashes half way
> through a collapse operation? On tmpfs you don't care - everything
> is lost, but on real filesystems we have to care about. 
> 
> offset + len beyond EOF is just truncate(offset).
> 
> From the point of view of an application offloading a data movement
> operation via collapse range, any range that overlaps EOF is wrong -
> data beyond EOF is not accessible and is not available for the
> application to move. Hence EINVAL - it's an invalid set of
> parameters.
> 
> If we do allow it and implement it by block shifting (which,
> technically, is the correct interpretation of the collapse range
> behaviour because it preserves preallocation beyond
> the collapsed region beyond EOF), then we have
> thr problem of moving data blocks below EOF by extent shifting
> before we change the EOF. That exposes blocks of undefined content
> to the user if we crash and recover up to that point of the
> operation. It's just plain dangerous, and if we allow this
> possibility via the API, someone is going to make that mistake in a
> filesystem because it's difficult to test and hard to get right.

Again, I would have thought that this is a problem you are already
having to solve in the case when offset + len is below EOF, with
blocks of undefined content preallocated beyond EOF.

But I don't know xfs, you do: so I accept there may be subtle reasons
why the offset + len below EOF case is easier for you to handle - and
please don't spend your time trying to hammer those into my head!

I think I've been somewhat unreasonable: I insisted in the other
thread that "Collapse is significantly more challenging than either
hole-punch or truncation", so I should give you a break, not demand
that you provide a perfect smooth implementation in all circumstances.

None of our filesystems were designed with this operation in mind,
each may have its own sound reasons to reject those peculiare cases
which would pose more trouble and risk than they are worth.

Whether that should be enforced at the VFS level is anther matter:
if it turns out that the xfs and ext4 limitations match up, okay.

I think we have different preferences, for whether to return error
or success, when there is nothing to be done; but I notice now that
fallocate fails on len 0, so you are being consistent with that.

Reject "offset + len >= i_size" or "offset + len > i_size"?

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-26 Thread Dave Chinner
On Tue, Feb 25, 2014 at 09:25:40PM -0800, Hugh Dickins wrote:
> On Wed, 26 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 03:41:20PM -0800, Hugh Dickins wrote:
> > > On Mon, 24 Feb 2014, Dave Chinner wrote:
> > > > On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
> > > > > On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
> > > > > > +   /*
> > > > > > +* There is no need to overlap collapse range with EOF, in 
> > > > > > which case
> > > > > > +* it is effectively a truncate operation
> > > > > > +*/
> > > > > > +   if ((mode & FALLOC_FL_COLLAPSE_RANGE) &&
> > > > > > +   (offset + len >= i_size_read(inode)))
> > > > > > +   return -EINVAL;
> > > > > > +
> > > > > 
> > > > > I wonder if we should just translate a collapse range that is
> > > > > equivalent to a truncate operation to, in fact, be a truncate
> > > > > operation?
> > > > 
> > > > Trying to collapse a range that extends beyond EOF, IMO, is likely
> > > > to only happen if the DVR/NLE application is buggy. Hence I think
> > > > that telling the application it is doing something that is likely to
> > > > be wrong is better than silently truncating the file
> > > 
> > > I do agree with Ted on this point.  This is not an xfs ioctl added
> > > for one DVR/NLE application, it's a mode of a Linux system call.
> > > 
> > > We do not usually reject with an error when one system call happens
> > > to ask for something which can already be accomplished another way;
> > > nor nanny our callers.
> > > 
> > > It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
> > > unless that adds significantly to implementation difficulties?
> > 
> > Yes, it does add to the implementation complexity significantly - it
> > adds data security issues that don't exist with the current API.
> > 
> > That is, Filesystems can have uninitialised blocks beyond EOF so
> > if we allow COLLAPSE_RANGE to shift them down within EOF, we now
> > have to ensure they are properly zeroed or marked as unwritten.
> > 
> > It also makes implementations more difficult. For example, XFS can
> > also have in-memory delayed allocation extents beyond EOF, and they
> > can't be brought into the range < EOF without either:
> > 
> > a) inserting zeroed pages with appropriately set up
> > and mapped bufferheads into the page cache for the range
> > that sits within EOF; or
> > b) truncating the delalloc extents beyond EOF before the
> > move
> > 
> > So, really, the moment you go beyond EOF filesystems have to do
> > quite a bit more validation and IO in the context of the system
> > call. It no longer becomes a pure extent manipulation offload - it
> > becomes a data security problem.
> 
> Those sound like problems you would already have solved for a
> simple extending truncate.

Yes, they have because of what truncate defines - that the region
between the old EOF and the new EOF must contain zeroes. truncate
does not move blocks around, it merely changes the EOF, and so the
solution is simple.


> But I wasn't really thinking of the offset > i_size case, just the
> offset + len >= i_size case: which would end with i_size at offset,
> and the areas you're worried about still beyond EOF - or am I confused?

Right, offset beyond EOF is just plain daft. But you're not thinking
of the entire picture. What happens when a system crashes half way
through a collapse operation? On tmpfs you don't care - everything
is lost, but on real filesystems we have to care about. 

offset + len beyond EOF is just truncate(offset).

>From the point of view of an application offloading a data movement
operation via collapse range, any range that overlaps EOF is wrong -
data beyond EOF is not accessible and is not available for the
application to move. Hence EINVAL - it's an invalid set of
parameters.

If we do allow it and implement it by block shifting (which,
technically, is the correct interpretation of the collapse range
behaviour because it preserves preallocation beyond
the collapsed region beyond EOF), then we have
thr problem of moving data blocks below EOF by extent shifting
before we change the EOF. That exposes blocks of undefined content
to the user if we crash and recover up to that point of the
operation. It's just plain dangerous, and if we allow this
possibility via the API, someone is going to make that mistake in a
filesystem because it's difficult to test and hard to get right.

> > And, indeed, the specification that we are working to is that the
> > applications that want to collapse the range of a file are using
> > this function instead of read/memcpy/write/truncate, which by
> > definition means they cannot shift ranges of the file beyond EOF
> > into the new file.
> > 
> > So IMO the API defines the functionality as required by the
> > applications that require it and *no more*. If you need some
> > different behaviour - we can add it via additional flags in future
> > when you have an 

Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-26 Thread Dave Chinner
On Tue, Feb 25, 2014 at 09:25:40PM -0800, Hugh Dickins wrote:
 On Wed, 26 Feb 2014, Dave Chinner wrote:
  On Tue, Feb 25, 2014 at 03:41:20PM -0800, Hugh Dickins wrote:
   On Mon, 24 Feb 2014, Dave Chinner wrote:
On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
 On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
  +   /*
  +* There is no need to overlap collapse range with EOF, in 
  which case
  +* it is effectively a truncate operation
  +*/
  +   if ((mode  FALLOC_FL_COLLAPSE_RANGE) 
  +   (offset + len = i_size_read(inode)))
  +   return -EINVAL;
  +
 
 I wonder if we should just translate a collapse range that is
 equivalent to a truncate operation to, in fact, be a truncate
 operation?

Trying to collapse a range that extends beyond EOF, IMO, is likely
to only happen if the DVR/NLE application is buggy. Hence I think
that telling the application it is doing something that is likely to
be wrong is better than silently truncating the file
   
   I do agree with Ted on this point.  This is not an xfs ioctl added
   for one DVR/NLE application, it's a mode of a Linux system call.
   
   We do not usually reject with an error when one system call happens
   to ask for something which can already be accomplished another way;
   nor nanny our callers.
   
   It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
   unless that adds significantly to implementation difficulties?
  
  Yes, it does add to the implementation complexity significantly - it
  adds data security issues that don't exist with the current API.
  
  That is, Filesystems can have uninitialised blocks beyond EOF so
  if we allow COLLAPSE_RANGE to shift them down within EOF, we now
  have to ensure they are properly zeroed or marked as unwritten.
  
  It also makes implementations more difficult. For example, XFS can
  also have in-memory delayed allocation extents beyond EOF, and they
  can't be brought into the range  EOF without either:
  
  a) inserting zeroed pages with appropriately set up
  and mapped bufferheads into the page cache for the range
  that sits within EOF; or
  b) truncating the delalloc extents beyond EOF before the
  move
  
  So, really, the moment you go beyond EOF filesystems have to do
  quite a bit more validation and IO in the context of the system
  call. It no longer becomes a pure extent manipulation offload - it
  becomes a data security problem.
 
 Those sound like problems you would already have solved for a
 simple extending truncate.

Yes, they have because of what truncate defines - that the region
between the old EOF and the new EOF must contain zeroes. truncate
does not move blocks around, it merely changes the EOF, and so the
solution is simple.


 But I wasn't really thinking of the offset  i_size case, just the
 offset + len = i_size case: which would end with i_size at offset,
 and the areas you're worried about still beyond EOF - or am I confused?

Right, offset beyond EOF is just plain daft. But you're not thinking
of the entire picture. What happens when a system crashes half way
through a collapse operation? On tmpfs you don't care - everything
is lost, but on real filesystems we have to care about. 

offset + len beyond EOF is just truncate(offset).

From the point of view of an application offloading a data movement
operation via collapse range, any range that overlaps EOF is wrong -
data beyond EOF is not accessible and is not available for the
application to move. Hence EINVAL - it's an invalid set of
parameters.

If we do allow it and implement it by block shifting (which,
technically, is the correct interpretation of the collapse range
behaviour because it preserves preallocation beyond
the collapsed region beyond EOF), then we have
thr problem of moving data blocks below EOF by extent shifting
before we change the EOF. That exposes blocks of undefined content
to the user if we crash and recover up to that point of the
operation. It's just plain dangerous, and if we allow this
possibility via the API, someone is going to make that mistake in a
filesystem because it's difficult to test and hard to get right.

  And, indeed, the specification that we are working to is that the
  applications that want to collapse the range of a file are using
  this function instead of read/memcpy/write/truncate, which by
  definition means they cannot shift ranges of the file beyond EOF
  into the new file.
  
  So IMO the API defines the functionality as required by the
  applications that require it and *no more*. If you need some
  different behaviour - we can add it via additional flags in future
  when you have an application that requires it. 
 
 You still seem to be thinking in terms of xfs ioctl hacks,
 rather than fully scoped Linux system calls.

Low blow, referee! :/

To set the record straight, this fallocate interface was 

Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-26 Thread Hugh Dickins
On Wed, 26 Feb 2014, Dave Chinner wrote:
 On Tue, Feb 25, 2014 at 09:25:40PM -0800, Hugh Dickins wrote:
 
  But I wasn't really thinking of the offset  i_size case, just the
  offset + len = i_size case: which would end with i_size at offset,
  and the areas you're worried about still beyond EOF - or am I confused?
 
 Right, offset beyond EOF is just plain daft. But you're not thinking
 of the entire picture. What happens when a system crashes half way
 through a collapse operation? On tmpfs you don't care - everything
 is lost, but on real filesystems we have to care about. 
 
 offset + len beyond EOF is just truncate(offset).
 
 From the point of view of an application offloading a data movement
 operation via collapse range, any range that overlaps EOF is wrong -
 data beyond EOF is not accessible and is not available for the
 application to move. Hence EINVAL - it's an invalid set of
 parameters.
 
 If we do allow it and implement it by block shifting (which,
 technically, is the correct interpretation of the collapse range
 behaviour because it preserves preallocation beyond
 the collapsed region beyond EOF), then we have
 thr problem of moving data blocks below EOF by extent shifting
 before we change the EOF. That exposes blocks of undefined content
 to the user if we crash and recover up to that point of the
 operation. It's just plain dangerous, and if we allow this
 possibility via the API, someone is going to make that mistake in a
 filesystem because it's difficult to test and hard to get right.

Again, I would have thought that this is a problem you are already
having to solve in the case when offset + len is below EOF, with
blocks of undefined content preallocated beyond EOF.

But I don't know xfs, you do: so I accept there may be subtle reasons
why the offset + len below EOF case is easier for you to handle - and
please don't spend your time trying to hammer those into my head!

I think I've been somewhat unreasonable: I insisted in the other
thread that Collapse is significantly more challenging than either
hole-punch or truncation, so I should give you a break, not demand
that you provide a perfect smooth implementation in all circumstances.

None of our filesystems were designed with this operation in mind,
each may have its own sound reasons to reject those peculiare cases
which would pose more trouble and risk than they are worth.

Whether that should be enforced at the VFS level is anther matter:
if it turns out that the xfs and ext4 limitations match up, okay.

I think we have different preferences, for whether to return error
or success, when there is nothing to be done; but I notice now that
fallocate fails on len 0, so you are being consistent with that.

Reject offset + len = i_size or offset + len  i_size?

Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-25 Thread Hugh Dickins
On Wed, 26 Feb 2014, Dave Chinner wrote:
> On Tue, Feb 25, 2014 at 03:41:20PM -0800, Hugh Dickins wrote:
> > On Mon, 24 Feb 2014, Dave Chinner wrote:
> > > On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
> > > > On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
> > > > > + /*
> > > > > +  * There is no need to overlap collapse range with EOF, in 
> > > > > which case
> > > > > +  * it is effectively a truncate operation
> > > > > +  */
> > > > > + if ((mode & FALLOC_FL_COLLAPSE_RANGE) &&
> > > > > + (offset + len >= i_size_read(inode)))
> > > > > + return -EINVAL;
> > > > > +
> > > > 
> > > > I wonder if we should just translate a collapse range that is
> > > > equivalent to a truncate operation to, in fact, be a truncate
> > > > operation?
> > > 
> > > Trying to collapse a range that extends beyond EOF, IMO, is likely
> > > to only happen if the DVR/NLE application is buggy. Hence I think
> > > that telling the application it is doing something that is likely to
> > > be wrong is better than silently truncating the file
> > 
> > I do agree with Ted on this point.  This is not an xfs ioctl added
> > for one DVR/NLE application, it's a mode of a Linux system call.
> > 
> > We do not usually reject with an error when one system call happens
> > to ask for something which can already be accomplished another way;
> > nor nanny our callers.
> > 
> > It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
> > unless that adds significantly to implementation difficulties?
> 
> Yes, it does add to the implementation complexity significantly - it
> adds data security issues that don't exist with the current API.
> 
> That is, Filesystems can have uninitialised blocks beyond EOF so
> if we allow COLLAPSE_RANGE to shift them down within EOF, we now
> have to ensure they are properly zeroed or marked as unwritten.
> 
> It also makes implementations more difficult. For example, XFS can
> also have in-memory delayed allocation extents beyond EOF, and they
> can't be brought into the range < EOF without either:
> 
>   a) inserting zeroed pages with appropriately set up
>   and mapped bufferheads into the page cache for the range
>   that sits within EOF; or
>   b) truncating the delalloc extents beyond EOF before the
>   move
> 
> So, really, the moment you go beyond EOF filesystems have to do
> quite a bit more validation and IO in the context of the system
> call. It no longer becomes a pure extent manipulation offload - it
> becomes a data security problem.

Those sound like problems you would already have solved for a
simple extending truncate.

But I wasn't really thinking of the offset > i_size case, just the
offset + len >= i_size case: which would end with i_size at offset,
and the areas you're worried about still beyond EOF - or am I confused?

> 
> And, indeed, the specification that we are working to is that the
> applications that want to collapse the range of a file are using
> this function instead of read/memcpy/write/truncate, which by
> definition means they cannot shift ranges of the file beyond EOF
> into the new file.
> 
> So IMO the API defines the functionality as required by the
> applications that require it and *no more*. If you need some
> different behaviour - we can add it via additional flags in future
> when you have an application that requires it. 

You still seem to be thinking in terms of xfs ioctl hacks,
rather than fully scoped Linux system calls.

But it probably doesn't matter too much, if we start with an error,
and later correct that to a full implementation - an xfstest or LTP
test which expected failure will fail once it sees success, but no
users harmed in the making of this change.

> 
> > Actually, is it even correct to fail at EOF?  What if fallocation
> > with FALLOC_FL_KEEP_SIZE was used earlier, to allocate beyond EOF:
> > shouldn't it be possible to shift that allocation down, along with
> > the EOF, rather than leave it behind as a stranded island?
> 
> It does get shifted down - it just remains beyond EOF, just like it
> was before the operation. And that is part of the specification of
> COLLAPSE_RANGE - it was done so that preallocation (physical or
> speculative delayed allocation) beyond EOF to avoid fragmentation as
> the DVR continues to write is not screwed up by chopping out earlier
> parts of the file.

Yes, I was confused when I pictured a stranded island there.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-25 Thread Dave Chinner
On Tue, Feb 25, 2014 at 03:41:20PM -0800, Hugh Dickins wrote:
> On Mon, 24 Feb 2014, Dave Chinner wrote:
> > On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
> > > On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
> > > > +   /*
> > > > +* There is no need to overlap collapse range with EOF, in 
> > > > which case
> > > > +* it is effectively a truncate operation
> > > > +*/
> > > > +   if ((mode & FALLOC_FL_COLLAPSE_RANGE) &&
> > > > +   (offset + len >= i_size_read(inode)))
> > > > +   return -EINVAL;
> > > > +
> > > 
> > > I wonder if we should just translate a collapse range that is
> > > equivalent to a truncate operation to, in fact, be a truncate
> > > operation?
> > 
> > Trying to collapse a range that extends beyond EOF, IMO, is likely
> > to only happen if the DVR/NLE application is buggy. Hence I think
> > that telling the application it is doing something that is likely to
> > be wrong is better than silently truncating the file
> 
> I do agree with Ted on this point.  This is not an xfs ioctl added
> for one DVR/NLE application, it's a mode of a Linux system call.
> 
> We do not usually reject with an error when one system call happens
> to ask for something which can already be accomplished another way;
> nor nanny our callers.
> 
> It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
> unless that adds significantly to implementation difficulties?

Yes, it does add to the implementation complexity significantly - it
adds data security issues that don't exist with the current API.

That is, Filesystems can have uninitialised blocks beyond EOF so
if we allow COLLAPSE_RANGE to shift them down within EOF, we now
have to ensure they are properly zeroed or marked as unwritten.

It also makes implementations more difficult. For example, XFS can
also have in-memory delayed allocation extents beyond EOF, and they
can't be brought into the range < EOF without either:

a) inserting zeroed pages with appropriately set up
and mapped bufferheads into the page cache for the range
that sits within EOF; or
b) truncating the delalloc extents beyond EOF before the
move

So, really, the moment you go beyond EOF filesystems have to do
quite a bit more validation and IO in the context of the system
call. It no longer becomes a pure extent manipulation offload - it
becomes a data security problem.

And, indeed, the specification that we are working to is that the
applications that want to collapse the range of a file are using
this function instead of read/memcpy/write/truncate, which by
definition means they cannot shift ranges of the file beyond EOF
into the new file.

So IMO the API defines the functionality as required by the
applications that require it and *no more*. If you need some
different behaviour - we can add it via additional flags in future
when you have an application that requires it. 

> Actually, is it even correct to fail at EOF?  What if fallocation
> with FALLOC_FL_KEEP_SIZE was used earlier, to allocate beyond EOF:
> shouldn't it be possible to shift that allocation down, along with
> the EOF, rather than leave it behind as a stranded island?

It does get shifted down - it just remains beyond EOF, just like it
was before the operation. And that is part of the specification of
COLLAPSE_RANGE - it was done so that preallocation (physical or
speculative delayed allocation) beyond EOF to avoid fragmentation as
the DVR continues to write is not screwed up by chopping out earlier
parts of the file.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-25 Thread Hugh Dickins
On Mon, 24 Feb 2014, Dave Chinner wrote:
> On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
> > On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
> > > + /*
> > > +  * There is no need to overlap collapse range with EOF, in which case
> > > +  * it is effectively a truncate operation
> > > +  */
> > > + if ((mode & FALLOC_FL_COLLAPSE_RANGE) &&
> > > + (offset + len >= i_size_read(inode)))
> > > + return -EINVAL;
> > > +
> > 
> > I wonder if we should just translate a collapse range that is
> > equivalent to a truncate operation to, in fact, be a truncate
> > operation?
> 
> Trying to collapse a range that extends beyond EOF, IMO, is likely
> to only happen if the DVR/NLE application is buggy. Hence I think
> that telling the application it is doing something that is likely to
> be wrong is better than silently truncating the file

I do agree with Ted on this point.  This is not an xfs ioctl added
for one DVR/NLE application, it's a mode of a Linux system call.

We do not usually reject with an error when one system call happens
to ask for something which can already be accomplished another way;
nor nanny our callers.

It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
unless that adds significantly to implementation difficulties?

Actually, is it even correct to fail at EOF?  What if fallocation
with FALLOC_FL_KEEP_SIZE was used earlier, to allocate beyond EOF:
shouldn't it be possible to shift that allocation down, along with
the EOF, rather than leave it behind as a stranded island?

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-25 Thread Hugh Dickins
On Mon, 24 Feb 2014, Dave Chinner wrote:
 On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
  On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
   + /*
   +  * There is no need to overlap collapse range with EOF, in which case
   +  * it is effectively a truncate operation
   +  */
   + if ((mode  FALLOC_FL_COLLAPSE_RANGE) 
   + (offset + len = i_size_read(inode)))
   + return -EINVAL;
   +
  
  I wonder if we should just translate a collapse range that is
  equivalent to a truncate operation to, in fact, be a truncate
  operation?
 
 Trying to collapse a range that extends beyond EOF, IMO, is likely
 to only happen if the DVR/NLE application is buggy. Hence I think
 that telling the application it is doing something that is likely to
 be wrong is better than silently truncating the file

I do agree with Ted on this point.  This is not an xfs ioctl added
for one DVR/NLE application, it's a mode of a Linux system call.

We do not usually reject with an error when one system call happens
to ask for something which can already be accomplished another way;
nor nanny our callers.

It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
unless that adds significantly to implementation difficulties?

Actually, is it even correct to fail at EOF?  What if fallocation
with FALLOC_FL_KEEP_SIZE was used earlier, to allocate beyond EOF:
shouldn't it be possible to shift that allocation down, along with
the EOF, rather than leave it behind as a stranded island?

Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-25 Thread Dave Chinner
On Tue, Feb 25, 2014 at 03:41:20PM -0800, Hugh Dickins wrote:
 On Mon, 24 Feb 2014, Dave Chinner wrote:
  On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
   On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
+   /*
+* There is no need to overlap collapse range with EOF, in 
which case
+* it is effectively a truncate operation
+*/
+   if ((mode  FALLOC_FL_COLLAPSE_RANGE) 
+   (offset + len = i_size_read(inode)))
+   return -EINVAL;
+
   
   I wonder if we should just translate a collapse range that is
   equivalent to a truncate operation to, in fact, be a truncate
   operation?
  
  Trying to collapse a range that extends beyond EOF, IMO, is likely
  to only happen if the DVR/NLE application is buggy. Hence I think
  that telling the application it is doing something that is likely to
  be wrong is better than silently truncating the file
 
 I do agree with Ted on this point.  This is not an xfs ioctl added
 for one DVR/NLE application, it's a mode of a Linux system call.
 
 We do not usually reject with an error when one system call happens
 to ask for something which can already be accomplished another way;
 nor nanny our callers.
 
 It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
 unless that adds significantly to implementation difficulties?

Yes, it does add to the implementation complexity significantly - it
adds data security issues that don't exist with the current API.

That is, Filesystems can have uninitialised blocks beyond EOF so
if we allow COLLAPSE_RANGE to shift them down within EOF, we now
have to ensure they are properly zeroed or marked as unwritten.

It also makes implementations more difficult. For example, XFS can
also have in-memory delayed allocation extents beyond EOF, and they
can't be brought into the range  EOF without either:

a) inserting zeroed pages with appropriately set up
and mapped bufferheads into the page cache for the range
that sits within EOF; or
b) truncating the delalloc extents beyond EOF before the
move

So, really, the moment you go beyond EOF filesystems have to do
quite a bit more validation and IO in the context of the system
call. It no longer becomes a pure extent manipulation offload - it
becomes a data security problem.

And, indeed, the specification that we are working to is that the
applications that want to collapse the range of a file are using
this function instead of read/memcpy/write/truncate, which by
definition means they cannot shift ranges of the file beyond EOF
into the new file.

So IMO the API defines the functionality as required by the
applications that require it and *no more*. If you need some
different behaviour - we can add it via additional flags in future
when you have an application that requires it. 

 Actually, is it even correct to fail at EOF?  What if fallocation
 with FALLOC_FL_KEEP_SIZE was used earlier, to allocate beyond EOF:
 shouldn't it be possible to shift that allocation down, along with
 the EOF, rather than leave it behind as a stranded island?

It does get shifted down - it just remains beyond EOF, just like it
was before the operation. And that is part of the specification of
COLLAPSE_RANGE - it was done so that preallocation (physical or
speculative delayed allocation) beyond EOF to avoid fragmentation as
the DVR continues to write is not screwed up by chopping out earlier
parts of the file.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-25 Thread Hugh Dickins
On Wed, 26 Feb 2014, Dave Chinner wrote:
 On Tue, Feb 25, 2014 at 03:41:20PM -0800, Hugh Dickins wrote:
  On Mon, 24 Feb 2014, Dave Chinner wrote:
   On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
 + /*
 +  * There is no need to overlap collapse range with EOF, in 
 which case
 +  * it is effectively a truncate operation
 +  */
 + if ((mode  FALLOC_FL_COLLAPSE_RANGE) 
 + (offset + len = i_size_read(inode)))
 + return -EINVAL;
 +

I wonder if we should just translate a collapse range that is
equivalent to a truncate operation to, in fact, be a truncate
operation?
   
   Trying to collapse a range that extends beyond EOF, IMO, is likely
   to only happen if the DVR/NLE application is buggy. Hence I think
   that telling the application it is doing something that is likely to
   be wrong is better than silently truncating the file
  
  I do agree with Ted on this point.  This is not an xfs ioctl added
  for one DVR/NLE application, it's a mode of a Linux system call.
  
  We do not usually reject with an error when one system call happens
  to ask for something which can already be accomplished another way;
  nor nanny our callers.
  
  It seems natural to me that COLLAPSE_RANGE should support beyond EOF;
  unless that adds significantly to implementation difficulties?
 
 Yes, it does add to the implementation complexity significantly - it
 adds data security issues that don't exist with the current API.
 
 That is, Filesystems can have uninitialised blocks beyond EOF so
 if we allow COLLAPSE_RANGE to shift them down within EOF, we now
 have to ensure they are properly zeroed or marked as unwritten.
 
 It also makes implementations more difficult. For example, XFS can
 also have in-memory delayed allocation extents beyond EOF, and they
 can't be brought into the range  EOF without either:
 
   a) inserting zeroed pages with appropriately set up
   and mapped bufferheads into the page cache for the range
   that sits within EOF; or
   b) truncating the delalloc extents beyond EOF before the
   move
 
 So, really, the moment you go beyond EOF filesystems have to do
 quite a bit more validation and IO in the context of the system
 call. It no longer becomes a pure extent manipulation offload - it
 becomes a data security problem.

Those sound like problems you would already have solved for a
simple extending truncate.

But I wasn't really thinking of the offset  i_size case, just the
offset + len = i_size case: which would end with i_size at offset,
and the areas you're worried about still beyond EOF - or am I confused?

 
 And, indeed, the specification that we are working to is that the
 applications that want to collapse the range of a file are using
 this function instead of read/memcpy/write/truncate, which by
 definition means they cannot shift ranges of the file beyond EOF
 into the new file.
 
 So IMO the API defines the functionality as required by the
 applications that require it and *no more*. If you need some
 different behaviour - we can add it via additional flags in future
 when you have an application that requires it. 

You still seem to be thinking in terms of xfs ioctl hacks,
rather than fully scoped Linux system calls.

But it probably doesn't matter too much, if we start with an error,
and later correct that to a full implementation - an xfstest or LTP
test which expected failure will fail once it sees success, but no
users harmed in the making of this change.

 
  Actually, is it even correct to fail at EOF?  What if fallocation
  with FALLOC_FL_KEEP_SIZE was used earlier, to allocate beyond EOF:
  shouldn't it be possible to shift that allocation down, along with
  the EOF, rather than leave it behind as a stranded island?
 
 It does get shifted down - it just remains beyond EOF, just like it
 was before the operation. And that is part of the specification of
 COLLAPSE_RANGE - it was done so that preallocation (physical or
 speculative delayed allocation) beyond EOF to avoid fragmentation as
 the DVR continues to write is not screwed up by chopping out earlier
 parts of the file.

Yes, I was confused when I pictured a stranded island there.

Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-23 Thread Dave Chinner
On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
> On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
> > +   /*
> > +* There is no need to overlap collapse range with EOF, in which case
> > +* it is effectively a truncate operation
> > +*/
> > +   if ((mode & FALLOC_FL_COLLAPSE_RANGE) &&
> > +   (offset + len >= i_size_read(inode)))
> > +   return -EINVAL;
> > +
> 
> I wonder if we should just translate a collapse range that is
> equivalent to a truncate operation to, in fact, be a truncate
> operation?

Trying to collapse a range that extends beyond EOF, IMO, is likely
to only happen if the DVR/NLE application is buggy. Hence I think
that telling the application it is doing something that is likely to
be wrong is better than silently truncating the file

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-23 Thread Dave Chinner
On Sat, Feb 22, 2014 at 09:06:25AM -0500, Theodore Ts'o wrote:
 On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
  +   /*
  +* There is no need to overlap collapse range with EOF, in which case
  +* it is effectively a truncate operation
  +*/
  +   if ((mode  FALLOC_FL_COLLAPSE_RANGE) 
  +   (offset + len = i_size_read(inode)))
  +   return -EINVAL;
  +
 
 I wonder if we should just translate a collapse range that is
 equivalent to a truncate operation to, in fact, be a truncate
 operation?

Trying to collapse a range that extends beyond EOF, IMO, is likely
to only happen if the DVR/NLE application is buggy. Hence I think
that telling the application it is doing something that is likely to
be wrong is better than silently truncating the file

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-22 Thread Theodore Ts'o
On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
> + /*
> +  * There is no need to overlap collapse range with EOF, in which case
> +  * it is effectively a truncate operation
> +  */
> + if ((mode & FALLOC_FL_COLLAPSE_RANGE) &&
> + (offset + len >= i_size_read(inode)))
> + return -EINVAL;
> +

I wonder if we should just translate a collapse range that is
equivalent to a truncate operation to, in fact, be a truncate
operation?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 1/10] fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate

2014-02-22 Thread Theodore Ts'o
On Wed, Feb 19, 2014 at 01:37:43AM +0900, Namjae Jeon wrote:
 + /*
 +  * There is no need to overlap collapse range with EOF, in which case
 +  * it is effectively a truncate operation
 +  */
 + if ((mode  FALLOC_FL_COLLAPSE_RANGE) 
 + (offset + len = i_size_read(inode)))
 + return -EINVAL;
 +

I wonder if we should just translate a collapse range that is
equivalent to a truncate operation to, in fact, be a truncate
operation?

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/