Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-30 Thread Christoph Hellwig
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
 I suppose it might be a bit late in the game to add a goal
 parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
 the API more suitable for XFS?  The goal could be a single __u64, or
 a struct with e.g. __u64 byte offset (possibly also __u32 lun like
 in FIEMAP).  I guess the one potential limitation here is the
 number of function parameters on some architectures.

This isn't really about more suitable for XFS but more about more
suitable for sophisticated layout decisions.

But I'm still not confident this should be shohorned into this
syscall.  In fact I'm already rather unhappy about the feature churn in
the current patch series.

The more I think about it the more I'd prefer we would just put a simple
syscall in that implements nothing but the posix_fallocate(3) semantics
as defined in SuS, and then go on to brainstorm about advanced
preallocation / layout hint semantics.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread David Chinner
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
 On Jun 14, 2007  09:52 +1000, David Chinner wrote:
  B FA_PREALLOCATE
  provides the same functionality as
  B FA_ALLOCATE
  except it does not ever change the file size. This allows allocation
  of zero blocks beyond the end of file and is useful for optimising
  append workloads.
  TP
  B FA_DEALLOCATE
  removes the underlying disk space with the given range. The disk space
  shall be removed regardless of it's contents so both allocated space
  from
  B FA_ALLOCATE
  and
  B FA_PREALLOCATE
  as well as from
  B write(3)
  will be removed.
  B FA_DEALLOCATE
  shall never remove disk blocks outside the range specified.
 
 So this is essentially the same as punch.

Depends on your definition of punch.

 There doesn't seem to be
 a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
 end.

ftruncate()

  B FA_DEALLOCATE
  shall never change the file size. If changing the file size
  is required when deallocating blocks from an offset to end
  of file (or beyond end of file) is required,
  B ftuncate64(3)
  should be used.
 
 This also seems to be a bit of a wart, since it isn't a natural converse
 of either of the above functions.  How about having two modes,
 similar to FA_ALLOCATE and FA_PREALLOCATE?

shrug

whatever.

 Say, FA_PUNCH (which
 would be as you describe here - deletes all data in the specified
 range changing the file size if it overlaps EOF,

Punch means different things to different people. To me (and probably
most XFS aware ppl) punch implies no change to the file size.

i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
holes to leave the file size unchanged. This is the behaviour I
described for FA_DEALLOCATE.

 and FA_DEALLOCATE,
 which only deallocates unused FA_{PRE,}ALLOCATE space?

That's an unwritten-to-hole extent conversion. Is that really
useful for anything? That's easily implemented with FIEMAP
and FA_DEALLOCATE.

Anyway, because we can't agree on a single pair of flags:

FA_ALLOCATE== posix_fallocate()
FA_DEALLOCATE  == unwritten-to-hole ???
FA_RESV_SPACE  == XFS_IOC_RESVSP64
FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64

 We might also consider making @mode be a mask instead of an enumeration:
 
 FA_FL_DEALLOC 0x01 (default allocate)
 FA_FL_KEEP_SIZE   0x02 (default extend/shrink size)
 FA_FL_DEL_DATA0x04 (default keep written data on DEALLOC)

i.e:

#define FA_ALLOCATE 0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA

 I suppose it might be a bit late in the game to add a goal
 parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
 the API more suitable for XFS?

It would suffice for the simpler operations, I think, but we'll
rapidly run out of flags and we'll still need another interface
for doing complex stuff.

 The goal could be a single __u64, or
 a struct with e.g. __u64 byte offset (possibly also __u32 lun like
 in FIEMAP).  I guess the one potential limitation here is the
 number of function parameters on some architectures.

To be useful it needs to __u64.

  B ENOSPC
  There is not enough space left on the device containing the file
  referred to by
  IR fd.
 
 Should probably say whether space is removed on failure or not.  In

Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
allocated was freed back up. That way the error handling in the allocate
functions is much simpler (i.e. no need to undo there).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-13 Thread David Chinner
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
 Did you get time to write the above man page ? It will help to push
 further patches in time (eg. for FA_PREALLOCATE mode).

First pass is attached.

`nroff -man fallocate.2 | less` to view.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include sys/syscall.h
.PP
.BI int syscall(int, int fd, int mode, loff_t offset, loff_t len);
.Op
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are three modes:
.TP
.B FA_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
B FA_ALLOCATE
closely resembles
B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
B FA_ALLOCATE
may allocate a larger range that was specified.
TP
B FA_PREALLOCATE
provides the same functionality as
B FA_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
TP
B FA_DEALLOCATE
removes the underlying disk space with the given range. The disk space
shall be removed regardless of it's contents so both allocated space
from
B FA_ALLOCATE
and
B FA_PREALLOCATE
as well as from
B write(3)
will be removed.
B FA_DEALLOCATE
shall never remove disk blocks outside the range specified.
B FA_DEALLOCATE
shall never change the file size. If changing the file size
is required when deallocating blocks from an offset to end
of file (or beyond end of file) is required,
B ftuncate64(3)
should be used.

SH RETURN VALUE
BR fallocate()
returns zero on success, or an error number on failure.
Note that
IR errno
is not set.
SH ERRORS
TP
B EBADF
I fd
is not a valid file descriptor, or is not opened for writing.
TP
B EFBIG
I offset+len
exceeds the maximum file size.
TP
B EINVAL
I offset
or
I len
was less than 0.
TP
B ENODEV
I fd
does not refer to a regular file or a directory.
TP
B ENOSPC
There is not enough space left on the device containing the file
referred to by
IR fd.
TP
B ESPIPE
I fd
refers to a pipe of file descriptor.
B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
SH AVAILABILITY
The
BR fallocate ()
system call is available since 2.6.XX
SH SEE ALSO
BR syscall (2),
BR posix_fadvise (3)
BR ftruncate (3)


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-12 Thread Amit K. Arora
On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
 On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
  On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
   All I'm really interested in right now is that the fallocate
   _interface_ can be used as a *complete replacement* for the
   pre-existing XFS-specific ioctls that are already used by
   applications.  What ext4 can or can't do right now is irrelevant to
   this discussion - the interface definition needs to take priority
   over implementation
  
  Would you like to write up an interface definition description (likely
  man page) and post it for review, possibly with a mention of apps using
  it today ?
 
 Yeah, I started doing that yesterday as i figured it was the only way
 to cut the discussion short
 
  One reason for introducing the mode parameter was to allow the interface to
  evolve incrementally as more options / semantic questions are proposed, so
  that we don't have to make all the decisions right now. 
  So it would be good to start with a *minimal* definition, even just one 
  mode.
  The rest could follow as subsequent patches, each being reviewed and debated
  separately. Otherwise this discussion can drag on for a long time.
 
 Minimal definition to replace what applicaitons use on XFS and to
 support poasix_fallocate are the thre that have been mentioned so
 far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
 all in a man page...

Hi Dave,

Did you get time to write the above man page ? It will help to push
further patches in time (eg. for FA_PREALLOCATE mode).

The idea I had was to push the patch with bare minimum functionality
(FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
new mode(s) based on the man page you planned to provide.

Thanks!
--
Regards,
Amit Arora

 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-12 Thread David Chinner
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
 On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
  Minimal definition to replace what applicaitons use on XFS and to
  support poasix_fallocate are the thre that have been mentioned so
  far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
  all in a man page...
 
 Hi Dave,
 
 Did you get time to write the above man page ? It will help to push
 further patches in time (eg. for FA_PREALLOCATE mode).

No, I didn't. Instead of working on new preallocation stuff, I've
been spending all my time fixing bugs found by new and interesting
(ab)uses of preallocation and hole punching.

 The idea I had was to push the patch with bare minimum functionality
 (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
 new mode(s) based on the man page you planned to provide.

Push them. I'll just make XFS work with whatever is provided.
Is there a test harness for the syscall yet?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-12 Thread David Chinner
On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
 On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
  All I'm really interested in right now is that the fallocate
  _interface_ can be used as a *complete replacement* for the
  pre-existing XFS-specific ioctls that are already used by
  applications.  What ext4 can or can't do right now is irrelevant to
  this discussion - the interface definition needs to take priority
  over implementation
 
 Would you like to write up an interface definition description (likely
 man page) and post it for review, possibly with a mention of apps using
 it today ?

Yeah, I started doing that yesterday as i figured it was the only way
to cut the discussion short

 One reason for introducing the mode parameter was to allow the interface to
 evolve incrementally as more options / semantic questions are proposed, so
 that we don't have to make all the decisions right now. 
 So it would be good to start with a *minimal* definition, even just one mode.
 The rest could follow as subsequent patches, each being reviewed and debated
 separately. Otherwise this discussion can drag on for a long time.

Minimal definition to replace what applicaitons use on XFS and to
support poasix_fallocate are the thre that have been mentioned so
far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
all in a man page...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-11 Thread Suparna Bhattacharya
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
 On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
  On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
   On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to
finalize on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the file size
in this case. I also tend to agree with them. Does anyone has an
argument in favor of changing the filesize ?  If not, I will remove the
code which changes the filesize, before I resubmit the concerned ext4
patch.
   
   I think there needs to be both. If we don't have a mechanism to atomically
   change the file size with the preallocation, then applications that use
   stat() to work out if they need to preallocate more space will end up
   racing.
  
  By both above, do you mean we should give user the flexibility if it wants
  the filesize changed or not ? It can be done by having *two* modes for
  preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
  use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
  change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
  will change the filesize if required (i.e.  when allocation is beyond EOF)
  and also update [cm]time.  This way, the application can decide what it
  wants.
 
 Yes, that's right.
 
  This will be helpfull for the partial allocation scenario also. Think of the
  case when we do not change the filesize in fallocate() and expect
  applications/posix_fallocate() to do ftruncate() after fallocate() for this.
  Now if fallocate() results in a partial allocation with -ENOSPC error
  returned, applications/posix_fallocate() will not know for what length
  ftruncate() has to be called.  :(
 
 Well, posix_fallocate() either gets all the space or it fails. If
 you truncate to extend the file size after an ENOSPC, then that is
 a buggy implementation.
 
 The same could be said for any application, or even the fallocate()
 call itself if it changes the filesize without having completely
 preallocated the space asked
 
  Hence it may be a good idea to give user the flexibility if it wants to
  atomically change the file size with preallocation or not. But, with more
  flexibility there comes inconsistency in behavior, which is worth
  considering.
 
 We've got different modes to specify different behaviour. That's
 what the mode field was put there for in the first place - the
 interface is *designed* to support different preallocation
 behaviours
 
2) For FA_UNALLOCATE mode, should the file system allow unallocation of
normal (non-preallocated) blocks (blocks allocated via regular
write/truncate operations) also (i.e. work as punch()) ?
   
   Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
   i did for FA_UNALLOCATE as well.
  
  Ok. But, some people may not expect/like this. I think, we can keep it on
  the backburner for a while, till other issues are sorted out.
 
 How can it be a backburner issue when it defines the
 implementation?  I've already implemented some thing in XFS that
 sort of does what I think that the interface is supposed to do, but
 I need that interface to be nailed down before proceeding any
 further.
 
 All I'm really interested in right now is that the fallocate
 _interface_ can be used as a *complete replacement* for the
 pre-existing XFS-specific ioctls that are already used by
 applications.  What ext4 can or can't do right now is irrelevant to
 this discussion - the interface definition needs to take priority
 over implementation

Would you like to write up an interface definition description (likely
man page) and post it for review, possibly with a mention of apps using
it today ?

One reason for introducing the mode parameter was to allow the interface to
evolve incrementally as more options / semantic questions are proposed, so
that we don't have to make all the decisions right now. 
So it would be good to start with a *minimal* definition, even just one mode.
The rest could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

 
 Cheers,
 
 Dave,
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the 

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-10 Thread Amit K. Arora
On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
 On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
  I have the updated patches ready which take care of Andrew's comments.
  Will run some tests and post them soon.
  
  But, before submitting these patches, I think it will be better to finalize
  on certain things which might be worth some discussion here:
  
  1) Should the file size change when preallocation is done beyond EOF ?
 - Andreas and Chris Wedgwood are in favor of not changing the
   file size in this case. I also tend to agree with them. Does anyone
   has an argument in favor of changing the filesize ?
   If not, I will remove the code which changes the filesize, before I
   resubmit the concerned ext4 patch.
 
 I think there needs to be both. If we don't have a mechanism to
 atomically change the file size with the preallocation, then
 applications that use stat() to work out if they need to preallocate
 more space will end up racing.

By both above, do you mean we should give user the flexibility if it
wants the filesize changed or not ? It can be done by having *two* modes
for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE
mode is used, fallocate() will change the filesize if required (i.e.
when allocation is beyond EOF) and also update [cm]time.
This way, the application can decide what it wants.

This will be helpfull for the partial allocation scenario also. Think of
the case when we do not change the filesize in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for
this. Now if fallocate() results in a partial allocation with -ENOSPC
error returned, applications/posix_fallocate() will not know for what
length ftruncate() has to be called.  :(

Hence it may be a good idea to give user the flexibility if it wants to
atomically change the file size with preallocation or not. But, with
more flexibility there comes inconsistency in behavior, which is worth
considering.

 
  2) For FA_UNALLOCATE mode, should the file system allow unallocation
 of normal (non-preallocated) blocks (blocks allocated via
 regular write/truncate operations) also (i.e. work as punch()) ?
 
 Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
 what i did for FA_UNALLOCATE as well.

Ok. But, some people may not expect/like this. I think, we can keep it
on the backburner for a while, till other issues are sorted out.
 
 - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
   we need to finalize on the convention here as a general guideline
   to all the filesystems that implement fallocate.
  
  3) If above is true, the file size will need to be changed
 for unallocation when block holding the EOF gets unallocated.
 
 No - we punch a hole. If you want the filesize to change, then
 you use ftruncate() to remove the blocks at EOF and change the
 file size atomically.

Ok.
 
  4) Should we update mtime  ctime on a successfull allocation/
 unallocation ?
 - David Chinner raised this question in following post:
   http://lkml.org/lkml/2007/4/29/407
   I think it makes sense to update the [mc]time for a successfull
   preallocation/unallocation. Does anyone feel otherwise ?
   It will be interesting to know how XFS behaves currently. Does XFS
   update [mc]time for preallocation ?
 
 No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
 changes. If the filesize changes, it behaves exactly the same way that
 ftruncate() behaves.

Having additional mode (of FA_PREALLOCATE) might help here too. Please
see above.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
 Andrew Morton writes:
 
  On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
  wrote:
  
   This patch implements the fallocate() system call and adds support for
   i386, x86_64 and powerpc.
   
   ...
  
   +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t 
   len)
  
  Please add a comment over this function which specifies its behaviour. 
  Really it should be enough material from which a full manpage can be
  written.
 
 This looks like it will have the same problem on s390 as
 sys_sync_file_range.  Maybe the prototype should be:
 
 asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that fd
first be maintained as a convention like other syscalls (see the following
posts)

http://marc.info/?l=linux-fsdevelm=117585330016809w=2 (Andreas)
http://marc.info/?l=linux-fsdevelm=117690157917378w=2  (Andreas)

http://marc.info/?l=linux-fsdevelm=117578821827323w=2 (Randy)

So we are kind of deadlocked, aren't we ?

The debates on the proposed solution for s390

http://marc.info/?l=linux-fsdevelm=117760995610639w=2  
http://marc.info/?l=linux-fsdevelm=117708124913098w=2 
http://marc.info/?l=linux-fsdevelm=117767607229807w=2

Are there any better ideas ?

Regards
Suparna

 
 Paul.
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Paul Mackerras
Suparna Bhattacharya writes:

  This looks like it will have the same problem on s390 as
  sys_sync_file_range.  Maybe the prototype should be:
  
  asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
 
 Yes, but the trouble is that there was a contrary viewpoint preferring that fd
 first be maintained as a convention like other syscalls (see the following
 posts)

Of course the interface used by an application program would have the
fd first.  Glibc can do the translation.

Paul.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
 Suparna Bhattacharya writes:
 
   This looks like it will have the same problem on s390 as
   sys_sync_file_range.  Maybe the prototype should be:
   
   asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
  
  Yes, but the trouble is that there was a contrary viewpoint preferring that 
  fd
  first be maintained as a convention like other syscalls (see the following
  posts)
 
 Of course the interface used by an application program would have the
 fd first.  Glibc can do the translation.

I think that was understood.

Regards
Suparna

 
 Paul.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Amit K. Arora
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote:
 Suparna Bhattacharya writes:
 
   Of course the interface used by an application program would have the
   fd first.  Glibc can do the translation.
  
  I think that was understood.
 
 OK, then what does it matter what the glibc/kernel interface is, as
 long as it works?
 
 It's only a minor point; the order of arguments can vary between
 architectures if necessary, but it's nicer if they don't have to.
 32-bit powerpc will need to have the two int arguments adjacent in
 order to avoid using more than 6 argument registers at the user/kernel
 boundary, and s390 will need to avoid having a 64-bit argument last
 (if I understand it correctly).

You are right to say that. But, it may not be _that_ a minor point,
especially for the arch which is getting affected. It has
other implications like what Heiko noticed in his post below:
http://lkml.org/lkml/2007/4/27/377
 - implications like modifying glibc and *trace utilities for a particular
arch.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Amit K. Arora
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to finalize
on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the
 file size in this case. I also tend to agree with them. Does anyone
 has an argument in favor of changing the filesize ?
 If not, I will remove the code which changes the filesize, before I
 resubmit the concerned ext4 patch.

2) For FA_UNALLOCATE mode, should the file system allow unallocation
   of normal (non-preallocated) blocks (blocks allocated via
   regular write/truncate operations) also (i.e. work as punch()) ?
   - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
 we need to finalize on the convention here as a general guideline
 to all the filesystems that implement fallocate.

3) If above is true, the file size will need to be changed
   for unallocation when block holding the EOF gets unallocated.
   - If we do not unallocate normal (non-preallocated) blocks and we
 do not change the file size on preallocation, then this is a
 non-issue.

4) Should we update mtime  ctime on a successfull allocation/
   unallocation ?
   - David Chinner raised this question in following post:
 http://lkml.org/lkml/2007/4/29/407
 I think it makes sense to update the [mc]time for a successfull
 preallocation/unallocation. Does anyone feel otherwise ?
 It will be interesting to know how XFS behaves currently. Does XFS
 update [mc]time for preallocation ?


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Andreas Dilger
On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
 2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?
- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
  we need to finalize on the convention here as a general guideline
  to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents.  That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

 3) If above is true, the file size will need to be changed
for unallocation when block holding the EOF gets unallocated.
- If we do not unallocate normal (non-preallocated) blocks and we
  do not change the file size on preallocation, then this is a
  non-issue.

Not necessarily.  That will just make the file sparse.  If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE.

 4) Should we update mtime  ctime on a successfull allocation/
unallocation ?

I would say yes.  If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods.  Also, it just makes sense from the this file was modified
point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Mingming Cao
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote:
 I have the updated patches ready which take care of Andrew's comments.
 Will run some tests and post them soon.
 
 But, before submitting these patches, I think it will be better to finalize
 on certain things which might be worth some discussion here:
 
 1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the
  file size in this case. I also tend to agree with them. Does anyone
  has an argument in favor of changing the filesize ?
  If not, I will remove the code which changes the filesize, before I
  resubmit the concerned ext4 patch.
 

If we chose not to update the file size beyong EOF, then for filesystem
without fallocate() support (ext2,3 currently), posix_fallocate() will
follow the hard way(zero-out) to do preallocation. Then we will get
different behavior on filesystems w/o fallocate() support. It make sense
to be consistent, IMO.

My point of view, preallocation is just a efficient way to allocating
blocks for files without zero-out, other than this, the new behavior
should be consistent with the old way: file size update,mtime/ctime,
ENOSPC etc.

Mingming


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread David Chinner
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
 I have the updated patches ready which take care of Andrew's comments.
 Will run some tests and post them soon.
 
 But, before submitting these patches, I think it will be better to finalize
 on certain things which might be worth some discussion here:
 
 1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the
  file size in this case. I also tend to agree with them. Does anyone
  has an argument in favor of changing the filesize ?
  If not, I will remove the code which changes the filesize, before I
  resubmit the concerned ext4 patch.

I think there needs to be both. If we don't have a mechanism to
atomically change the file size with the preallocation, then
applications that use stat() to work out if they need to preallocate
more space will end up racing.

 2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?

Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
what i did for FA_UNALLOCATE as well.

- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
  we need to finalize on the convention here as a general guideline
  to all the filesystems that implement fallocate.
 
 3) If above is true, the file size will need to be changed
for unallocation when block holding the EOF gets unallocated.

No - we punch a hole. If you want the filesize to change, then
you use ftruncate() to remove the blocks at EOF and change the
file size atomically.

 4) Should we update mtime  ctime on a successfull allocation/
unallocation ?
- David Chinner raised this question in following post:
  http://lkml.org/lkml/2007/4/29/407
  I think it makes sense to update the [mc]time for a successfull
  preallocation/unallocation. Does anyone feel otherwise ?
  It will be interesting to know how XFS behaves currently. Does XFS
  update [mc]time for preallocation ?

No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
changes. If the filesize changes, it behaves exactly the same way that
ftruncate() behaves.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
 
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 Please add a comment over this function which specifies its behaviour. 
 Really it should be enough material from which a full manpage can be
 written.
 
 If that's all too much, this material should at least be spelled out in the
 changelog.  Because there's no way in which this change can be fully
 reviewed unless someone (ie: you) tells us what it is setting out to
 achieve.
 
 If we 100% implement some standard then a URL for what we claim to
 implement would suffice.  Given that we're at least using different types from
 posix I doubt if such a thing would be sufficient.
 
 And given the complexity and potential variability within the filesystem
 implementations of this, I'd expect that _something_ additional needs to be
 said?

Ok. I will add a detailed comment here.

 
  +{
  +   struct file *file;
  +   struct inode *inode;
  +   long ret = -EINVAL;
  +
  +   if (len == 0 || offset  0)
  +   goto out;
 
 The posix spec implies that negative `len' is permitted - presumably allocate
 ahead of `offset'.  How peculiar.

I think we should go ahead with current glibc implementation (which
Jakub poited at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

 
  +   ret = -EBADF;
  +   file = fget(fd);
  +   if (!file)
  +   goto out;
  +   if (!(file-f_mode  FMODE_WRITE))
  +   goto out_fput;
  +
  +   inode = file-f_path.dentry-d_inode;
  +
  +   ret = -ESPIPE;
  +   if (S_ISFIFO(inode-i_mode))
  +   goto out_fput;
  +
  +   ret = -ENODEV;
  +   if (!S_ISREG(inode-i_mode))
  +   goto out_fput;
 
 So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
 seems a bit silly of them.

True. 
 
  +   ret = -EFBIG;
  +   if (offset + len  inode-i_sb-s_maxbytes)
  +   goto out_fput;
 
 This code does handle offset+len going negative, but only by accident, I
 suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
 here would settle the reader's mind.

Ok. I will add a check here for wrap though zero.
 
  +   if (inode-i_op  inode-i_op-fallocate)
  +   ret = inode-i_op-fallocate(inode, mode, offset, len);
  +   else
  +   ret = -ENOSYS;
 
 If we _are_ going to support negative `len', as posix suggests, I think we
 should perform the appropriate sanity conversions to `offset' and `len'
 right here, rather than expecting each filesystem to do it.
 
 If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
 
  +out_fput:
  +   fput(file);
  +out:
  +   return ret;
  +}
  +EXPORT_SYMBOL(sys_fallocate);
 
 I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
  +/*
  + * fallocate() modes
  + */
  +#define FA_ALLOCATE0x1
  +#define FA_DEALLOCATE  0x2
 
 Now those aren't in posix.  They should be documented, along with their
 expected semantics.

Will add a comment describing the role of these modes.
 
   #ifdef __KERNEL__
   
   #include linux/linkage.h
  @@ -1125,6 +1131,7 @@ struct inode_operations {
  ssize_t (*listxattr) (struct dentry *, char *, size_t);
  int (*removexattr) (struct dentry *, const char *);
  void (*truncate_range)(struct inode *, loff_t, loff_t);
  +   long (*fallocate)(struct inode *, int, loff_t, loff_t);
 
 I really do think it's better to put the variable names in definitions such
 as this.  Especially when we have two identically-typed variables next to
 each other like that.  Quick: which one is the offset and which is the
 length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
 The above opengroup page only permits S_ISREG.  Preallocating directories
 sounds quite useful to me, although it's something which would be pretty
 hard to emulate if the FS doesn't support it.  And there's a decent case to
 be made for emulating it - run-anywhere reasons.  Does glibc emulation support
 directories?  Quite unlikely.
 
 But yes, sounds like a desirable thing.  Would XFS support it easily if the 
 above
 check was relaxed?

I think we may relax the check here and let the individual file system
decide if they support preallocation for directories or not. What do you
think ?

One thing to be thought in this case is the error code which should be
returned by the file system implementation, incase it doesn't support
preallocation for directories. Should it be -ENODEV (to match with what
posix says) , or something else (which might make more sense in this
case) ?

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread David Chinner
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
  +{
  +   struct file *file;
  +   struct inode *inode;
  +   long ret = -EINVAL;
  +
  +   if (len == 0 || offset  0)
  +   goto out;
 
 The posix spec implies that negative `len' is permitted - presumably allocate
 ahead of `offset'.  How peculiar.

I just checked the man page for posix_fallocate() and it says:

  EINVAL  offset or len was less than zero.

We should probably follow this lead.

  +
  +   ret = -ENODEV;
  +   if (!S_ISREG(inode-i_mode))
  +   goto out_fput;
 
 So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
 seems a bit silly of them.

H - I thought that the intention of sys_fallocate() was to
be generic enough to eventually allow preallocation on directories.
If that is the case, then this check will prevent that

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 16:07:31 +1000 David Chinner [EMAIL PROTECTED] wrote:

 On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
  On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
  wrote:
  
   This patch implements the fallocate() system call and adds support for
   i386, x86_64 and powerpc.
   
   ...
   +{
   + struct file *file;
   + struct inode *inode;
   + long ret = -EINVAL;
   +
   + if (len == 0 || offset  0)
   + goto out;
  
  The posix spec implies that negative `len' is permitted - presumably 
  allocate
  ahead of `offset'.  How peculiar.
 
 I just checked the man page for posix_fallocate() and it says:
 
   EINVAL  offset or len was less than zero.
 
 We should probably follow this lead.

Yes, I think so.  I'm suspecting that
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
is just buggy.  Or I can't read.

I mean, if we're going to support negative `len' then is the byte at
`offset' inside or outside the segment?  Head spins.

However it would be neat if someone could test $OTHER_OS and, perhaps more
importantly, the present glibc emulation (which I assume your manpage is
referring to, so this would be a manpage test ;)).

   +
   + ret = -ENODEV;
   + if (!S_ISREG(inode-i_mode))
   + goto out_fput;
  
  So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
  seems a bit silly of them.
 
 H - I thought that the intention of sys_fallocate() was to
 be generic enough to eventually allow preallocation on directories.
 If that is the case, then this check will prevent that

The above opengroup page only permits S_ISREG.  Preallocating directories
sounds quite useful to me, although it's something which would be pretty
hard to emulate if the FS doesn't support it.  And there's a decent case to
be made for emulating it - run-anywhere reasons.  Does glibc emulation support
directories?  Quite unlikely.

But yes, sounds like a desirable thing.  Would XFS support it easily if the 
above
check was relaxed?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread Jakub Jelinek
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
   The posix spec implies that negative `len' is permitted - presumably 
   allocate
   ahead of `offset'.  How peculiar.
  
  I just checked the man page for posix_fallocate() and it says:
  
EINVAL  offset or len was less than zero.

That describes the current glibc implementation.


  We should probably follow this lead.
 
 Yes, I think so.  I'm suspecting that
 http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
 is just buggy.  Or I can't read.
 
 I mean, if we're going to support negative `len' then is the byte at
 `offset' inside or outside the segment?  Head spins.
 
 However it would be neat if someone could test $OTHER_OS and, perhaps more
 importantly, the present glibc emulation (which I assume your manpage is
 referring to, so this would be a manpage test ;)).

int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
  struct stat64 st;
  struct statfs f;

  /* `off_t' is a signed type.  Therefore we can determine whether
 OFFSET + LEN is too large if it is a negative value.  */
  if (offset  0 || len  0)
return EINVAL;
  if (offset + len  0)
return EFBIG;

  /* First thing we have to make sure is that this is really a regular
 file.  */
  if (__fxstat64 (_STAT_VER, fd, st) != 0)
return EBADF;
  if (S_ISFIFO (st.st_mode))
return ESPIPE;
  if (! S_ISREG (st.st_mode))
return ENODEV;

  if (len == 0)
{
  if (st.st_size  offset)
{
  int ret = __ftruncate (fd, offset);

  if (ret != 0)
ret = errno;
  return ret;
}
  return 0;
}
...

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is shall fail.  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
required storage for regular file data starting at offset and continuing for 
len bytes
doesn't make sense for negative size.  
And given the general
Implementations may support additional errors not included in this list,
may generate errors included in this list under circumstances other than
those described here, or may contain extensions or limitations that prevent
some errors from occurring.
I believe returning EINVAL for len  0 is not a POSIX violation.
That doesn't mean the standard shouldn't be clarified, whether by saying
EINVAL must be returned for non-positive len or saying that using negative
len has undefined or implementation defined behavior.

 The above opengroup page only permits S_ISREG.  Preallocating directories
 sounds quite useful to me, although it's something which would be pretty
 hard to emulate if the FS doesn't support it.  And there's a decent case to
 be made for emulating it - run-anywhere reasons.  Does glibc emulation support
 directories?  Quite unlikely.

No, see above.

Jakub
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread David Chinner
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
 On Fri, 4 May 2007 16:07:31 +1000 David Chinner [EMAIL PROTECTED] wrote:
  On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
   On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
   wrote:
   
This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.

...
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (len == 0 || offset  0)
+   goto out;
   
   The posix spec implies that negative `len' is permitted - presumably 
   allocate
   ahead of `offset'.  How peculiar.
  
  I just checked the man page for posix_fallocate() and it says:
  
EINVAL  offset or len was less than zero.
  
  We should probably follow this lead.
 
 Yes, I think so.  I'm suspecting that
 http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
 is just buggy.  Or I can't read.
 
 I mean, if we're going to support negative `len' then is the byte at
 `offset' inside or outside the segment?  Head spins.

I don't think we should care. If we provide a syscall with the
semantics of allocate from offset to offset+len then glibc's
implementation can turn negative length into two separate
fallocate syscalls

+   ret = -ENODEV;
+   if (!S_ISREG(inode-i_mode))
+   goto out_fput;
   
   So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
   seems a bit silly of them.
  
  H - I thought that the intention of sys_fallocate() was to
  be generic enough to eventually allow preallocation on directories.
  If that is the case, then this check will prevent that
 
 The above opengroup page only permits S_ISREG.  Preallocating directories
 sounds quite useful to me, although it's something which would be pretty
 hard to emulate if the FS doesn't support it.  And there's a decent case to
 be made for emulating it - run-anywhere reasons.  Does glibc emulation support
 directories?  Quite unlikely.
 
 But yes, sounds like a desirable thing.  Would XFS support it easily if the 
 above
 check was relaxed?

No - right now empty blocks are pruned from the directory immediately so I
don't think we really have a concept of empty blocks in the btree structure.
dir2 is bloody complex, so adding preallocation is probably not going to
be simple to do.

It's not high on my list to add, either, because we can typically avoid the
worst case directory fragmentation by using larger directory block sizes
(e.g. 16k instead of the default 4k on a 4k block size fs).

IIRC directory preallocation has been talked about more for ext3/4

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:

 This patch implements the fallocate() system call and adds support for
 i386, x86_64 and powerpc.
 
 ...

 +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Please add a comment over this function which specifies its behaviour. 
Really it should be enough material from which a full manpage can be
written.

If that's all too much, this material should at least be spelled out in the
changelog.  Because there's no way in which this change can be fully
reviewed unless someone (ie: you) tells us what it is setting out to
achieve.

If we 100% implement some standard then a URL for what we claim to
implement would suffice.  Given that we're at least using different types from
posix I doubt if such a thing would be sufficient.

And given the complexity and potential variability within the filesystem
implementations of this, I'd expect that _something_ additional needs to be
said?

 +{
 + struct file *file;
 + struct inode *inode;
 + long ret = -EINVAL;
 +
 + if (len == 0 || offset  0)
 + goto out;

The posix spec implies that negative `len' is permitted - presumably allocate
ahead of `offset'.  How peculiar.

 + ret = -EBADF;
 + file = fget(fd);
 + if (!file)
 + goto out;
 + if (!(file-f_mode  FMODE_WRITE))
 + goto out_fput;
 +
 + inode = file-f_path.dentry-d_inode;
 +
 + ret = -ESPIPE;
 + if (S_ISFIFO(inode-i_mode))
 + goto out_fput;
 +
 + ret = -ENODEV;
 + if (!S_ISREG(inode-i_mode))
 + goto out_fput;

So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
seems a bit silly of them.

 + ret = -EFBIG;
 + if (offset + len  inode-i_sb-s_maxbytes)
 + goto out_fput;

This code does handle offset+len going negative, but only by accident, I
suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
here would settle the reader's mind.

 + if (inode-i_op  inode-i_op-fallocate)
 + ret = inode-i_op-fallocate(inode, mode, offset, len);
 + else
 + ret = -ENOSYS;

If we _are_ going to support negative `len', as posix suggests, I think we
should perform the appropriate sanity conversions to `offset' and `len'
right here, rather than expecting each filesystem to do it.

If we're not going to handle negative `len' then we should check for it.

 +out_fput:
 + fput(file);
 +out:
 + return ret;
 +}
 +EXPORT_SYMBOL(sys_fallocate);

I don't believe this needs to be exported to modules?

 +/*
 + * fallocate() modes
 + */
 +#define FA_ALLOCATE  0x1
 +#define FA_DEALLOCATE0x2

Now those aren't in posix.  They should be documented, along with their
expected semantics.

  #ifdef __KERNEL__
  
  #include linux/linkage.h
 @@ -1125,6 +1131,7 @@ struct inode_operations {
   ssize_t (*listxattr) (struct dentry *, char *, size_t);
   int (*removexattr) (struct dentry *, const char *);
   void (*truncate_range)(struct inode *, loff_t, loff_t);
 + long (*fallocate)(struct inode *, int, loff_t, loff_t);

I really do think it's better to put the variable names in definitions such
as this.  Especially when we have two identically-typed variables next to
each other like that.  Quick: which one is the offset and which is the
length?


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton [EMAIL PROTECTED] wrote:

  +   ret = -EFBIG;
  +   if (offset + len  inode-i_sb-s_maxbytes)
  +   goto out_fput;
 
 This code does handle offset+len going negative, but only by accident, I
 suspect.

But it doesn't handle offset+len wrapping through zero.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Paul Mackerras
Andrew Morton writes:

 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
 
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 Please add a comment over this function which specifies its behaviour. 
 Really it should be enough material from which a full manpage can be
 written.

This looks like it will have the same problem on s390 as
sys_sync_file_range.  Maybe the prototype should be:

asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Paul.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-04-26 Thread Amit K. Arora
This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.

NOTE: It is based on 2.6.21 kernel version.

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |1 
 arch/powerpc/kernel/sys_ppc32.c  |7 ++
 arch/x86_64/kernel/functionlist  |1 
 fs/open.c|   41 +++
 include/asm-i386/unistd.h|3 +-
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 +-
 include/asm-x86_64/unistd.h  |4 ++-
 include/linux/fs.h   |7 ++
 include/linux/syscalls.h |1 
 10 files changed, 66 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (len == 0 || offset  0)
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file-f_mode  FMODE_WRITE))
+   goto out_fput;
+
+   inode = file-f_path.dentry-d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode-i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   if (!S_ISREG(inode-i_mode))
+   goto out_fput;
+
+   ret = -EFBIG;
+   if (offset + len  inode-i_sb-s_maxbytes)
+   goto out_fput;
+
+   if (inode-i_op  inode-i_op-fallocate)
+   ret = inode-i_op-fallocate(inode, mode, offset, len);
+   else
+   ret = -ENOSYS;
+out_fput:
+   fput(file);
+out:
+   return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.21/include/asm-i386/unistd.h
===
--- linux-2.6.21.orig/include/asm-i386/unistd.h
+++ linux-2.6.21/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_fallocate 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/asm-powerpc/systbl.h
===
--- linux-2.6.21.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.21/include/asm-powerpc/systbl.h
@@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
+COMPAT_SYS(fallocate)
Index: linux-2.6.21/include/asm-powerpc/unistd.h
===
--- linux-2.6.21.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.21/include/asm-powerpc/unistd.h
@@ -326,10 +326,11 @@
 #define __NR_move_pages301
 #define __NR_getcpu302
 #define __NR_epoll_pwait   303
+#define __NR_fallocate 304
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls  304
+#define __NR_syscalls  305
 
 #define __NR__exit __NR_exit
 #define NR_syscalls__NR_syscalls
Index: linux-2.6.21/include/asm-x86_64/unistd.h
===
--- linux-2.6.21.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.21/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate 280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/linux/fs.h