Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> > Well, if you see the modes proposed using the above flags :
> >
> > #define FA_ALLOCATE	0
> > #define FA_DEALLOCATE	FA_FL_DEALLOC
> > #define FA_RESV_SPACE	FA_FL_KEEP_SIZE
> > #define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> >
> > FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> > for preallocation, FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> > flag. Hence preallocation will never delete data. This mode is required
> > only for FA_UNRESV_SPACE, which is a deallocation mode, to support any
> > existing XFS-aware applications/usage scenarios.
>
> Sorry, but this doesn't make any sense. There is no need to put every
> feature in the XFS ioctls in the syscalls. The XFS ioctls will need to
> be supported forever anyway - as I suggested before, they really should
> be moved to generic code.
>
> What needs to be supported is what makes sense as an interface.
> A punch-a-hole interface does make sense, but trying to hack this into
> a preallocation system call is just madness. We're not IRIX or Windows,
> which fit things into random subcalls just because there was some space
> left to squeeze them in.
>
> > > > > FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> > > > > FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
> > > >
> > > > NACK to these as well. If i_size changes, c/mtime need updates; if the
> > > > size doesn't change, they don't. No need to add more flags for this.
> > >
> > > This requirement was from the point of view of HSM applications. Hope
> > > you saw Andreas' previous post and are keeping that in mind.
>
> HSMs need this basically for every system call, which screams for an
> open flag like O_INVISIBLE anyway. Adding this in a generic way is
> a good idea, but hacking bits and pieces that won't fit into the global
> design is completely wrong.

Why don't we just merge the interface for preallocation (essentially enough
to satisfy posix_fallocate() and the simple XFS requirement for space
reservation without changing file size), which there is clear agreement on
(I hope :)). After all, this was all that we set out to do when we started.

And leave all the dealloc/punch/HSM type features for separate future
patches/debates; those really shouldn't hold up the basic fallocate
interface. I agree with Christoph that we are just diverging too much in
trying to club those decisions here.

Dave, Andreas, Ted ?

Regards
Suparna

> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] dio: remove bogus refcounting BUG_ON
On Wed, Jul 04, 2007 at 07:25:10PM -0700, Badari Pulavarty wrote:
> On Tue, 2007-07-03 at 15:28 -0700, Zach Brown wrote:
> > Linus, Andrew, please apply the bug fix patch at the end of this reply
> > for .22.
> >
> > > > One of our perf. team ran into this while doing some runs.
> > > > I didn't see anything obvious - it looks like we converted
> > > > async IO to a synchronous one. I didn't spend much time digging
> > > > around.
> >
> > OK, I think this BUG_ON() is just broken. I wasn't able to find any
> > obvious bugs from reading the code which would cause the BUG_ON() to
> > fire. If it's reproducible, I'd love to hear what the recipe is.
> >
> > I did notice that this BUG_ON() is evaluating dio after having dropped
> > its ref :/. So it's not completely absurd to fear that it's a race
> > with the dio's memory being reused, but that'd be a pretty tight race.
> >
> > Let's remove this stupid BUG_ON and see if that test box still has
> > trouble. It might just hit the valid BUG_ON a few lines down, but this
> > unsafe BUG_ON needs to go.
>
> I went through the code multiple times; I can't find how we can trigger
> the BUG_ON(). But unfortunately, our perf. team is able to reproduce the
> problem. Debug indicated that ret2 == 1 :(
>
> Not sure how that can happen. Ideas ?

Does it trigger even if you avoid referencing dio in the BUG_ON(), i.e.
with something like ...

--- direct-io.c	2007-07-02 01:24:24.0 +0530
+++ direct-io-debug.c	2007-07-05 09:18:56.0 +0530
@@ -1104,9 +1104,10 @@ direct_io_worker(int rw, struct kiocb *i
 	 * decide to wake the submission path atomically.
 	 */
 	spin_lock_irqsave(&dio->bio_lock, flags);
+	is_async = dio->is_async;
 	ret2 = --dio->refcount;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
-	BUG_ON(!dio->is_async && ret2 != 0);
+	BUG_ON(!is_async && ret2 != 0);
 	if (ret2 == 0) {
 		ret = dio_complete(dio, offset, ret);
 		kfree(dio);

> Thanks,
> Badari
>
> > ---
> >
> > dio: remove bogus refcounting BUG_ON
> >
> > Badari Pulavarty reported a case of this BUG_ON triggering during
> > testing. It's completely bogus and should be removed.
> >
> > It's trying to notice if we left references to the dio hanging around
> > in the sync case. They should have been dropped as IO completed while
> > this path was in dio_await_completion(). This condition will also be
> > checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
> > lines lower. So to start, this BUG_ON() is redundant.
> >
> > More fatally, it's dereferencing the dio after having dropped its
> > reference. It's only safe to dereference the dio after releasing the
> > lock if the final reference was just dropped. Another CPU might free
> > the dio in bio completion and reuse the memory after this path drops the
> > dio lock but before the BUG_ON() is evaluated.
> >
> > This patch passed aio+dio regression unit tests and aio-stress on ext3.
> >
> > Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
> > Cc: Badari Pulavarty <[EMAIL PROTECTED]>
> >
> > diff -r 509ce354ae1b fs/direct-io.c
> > --- a/fs/direct-io.c	Sun Jul 01 22:00:49 2007 +0000
> > +++ b/fs/direct-io.c	Tue Jul 03 14:56:41 2007 -0700
> > @@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i
> > 	spin_lock_irqsave(&dio->bio_lock, flags);
> > 	ret2 = --dio->refcount;
> > 	spin_unlock_irqrestore(&dio->bio_lock, flags);
> > -	BUG_ON(!dio->is_async && ret2 != 0);
> > +
> > 	if (ret2 == 0) {
> > 		ret = dio_complete(dio, offset, ret);
> > 		kfree(dio);

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 0/2] file capabilities: Introduction
On Mon, May 14, 2007 at 08:00:11PM +0000, Pavel Machek wrote:
> Hi!
>
> > "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:
> > >
> > > Following are two patches which have been sitting for some time in -mm.
> >
> > Where "some time" == "nearly six months".
> >
> > We need help considering, reviewing and testing this code, please.
>
> I did a quick scan, and it looks OK. Plus, it means we can finally start
> using that old capabilities subsystem... so I think we should do it.

FWIW, I looked through it recently as well, and it looked reasonable enough
to me, though I'm not a security expert. I did have a question about testing
corner cases etc., which Serge has tried to address.

Serge, are you planning to post an update without STRICTXATTR ? That should
simplify the second patch.

Regards
Suparna

> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 05/10] Linux Kernel Markers - i386 optimized version
On Fri, May 11, 2007 at 10:27:29AM +0530, Ananth N Mavinakayanahalli wrote:
> On Thu, May 10, 2007 at 12:59:18PM -0400, Mathieu Desnoyers wrote:
> > * Alan Cox ([EMAIL PROTECTED]) wrote:
> > ...
> > > > * Third issue : Scalability. Changing code will stop every CPU on the
> > > > system for a while. Compared to this, the int3-based approach will run
> > > > through the breakpoint handler "if" one of the CPUs happens to execute
> > > > this code at the wrong time. The standard case is just an IPI (to
> > >
> > > If I read the errata right then patching in an int3 will itself trigger
> > > the errata so anything could happen.
> > >
> > > I believe there are other safe sequences for doing code patching - perhaps
> > > one of the Intel folk can advise ?
>
> IIRC, when the first implementation of what exists now as kprobes was
> done (as part of the dprobes framework), this question did come up. I
> think the conclusion was that the errata applies only to multi-byte
> modifications and single-byte changes are guaranteed to be atomic.
> Given that int3 on Intel is just 1 byte, we are safe.
>
> > I'll let the Intel guys confirm this. I don't have the reference nearby
> > (I got this information by talking with the kprobe team members, and
> > they got this information directly from Intel developers), but the
> > int3 is the one special case to which the errata does not apply.
> > Otherwise, kprobes and gdb would have a big, big issue.
>
> Perhaps Richard/Suparna can confirm.

I just tried digging up past discussions on this from Richard, about int3
being safe:

http://sourceware.org/ml/systemtap/2005-q3/msg00208.html
http://lkml.org/lkml/2006/9/20/30

Regards
Suparna

> Ananth

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities
On Thu, May 10, 2007 at 01:01:27PM -0700, Andreas Dilger wrote:
> On May 08, 2007 16:49 -0500, Serge E. Hallyn wrote:
> > Quoting Andreas Dilger ([EMAIL PROTECTED]):
> > > One of the important use cases I can see today is the ability to
> > > split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more
> > > fine-grained attributes.
> >
> > Sounds plausible, though it suffers from both making capabilities far
> > more cumbersome (i.e. finding the right capability for what you wanted
> > to do) and backward compatibility. Perhaps at that point we should
> > introduce security.capabilityv2 xattrs. A binary can then carry
> > security.capability=CAP_SYS_ADMIN=p, and
> > security.capabilityv2=cap_may_clone_mntns=p.
>
> Well, the overhead of each EA is non-trivial (16 bytes/EA) for storing
> 12 bytes worth of data, so it is probably just better to keep extending
> the original capability fields as was in the proposal.
>
> > > What we definitely do NOT want to happen is an application that needs
> > > privileged access (e.g. e2fsck, mount) to stop running because the
> > > new capabilities _would_ have been granted by the new kernel and are
> > > not by the old kernel and STRICTXATTR is used.
> > >
> > > To me it would seem that having extra capabilities on an old kernel
> > > is relatively harmless if the old kernel doesn't know what they are.
> > > It's like having a key to a door that you don't know where it is.
> >
> > If we ditch the STRICTXATTR option do the semantics seem sane to you?
>
> Seems reasonable. It would simplify the code as well, which is good.

This does mean no sanity checking of fcaps; I am not sure if that matters.
I'm guessing it should be similar to the case for other security attributes.

Regards
Suparna

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > > I have the updated patches ready which take care of Andrew's
> > > > comments. Will run some tests and post them soon. But, before
> > > > submitting these patches, I think it will be better to finalize on
> > > > certain things which might be worth some discussion here:
> > > >
> > > > 1) Should the file size change when preallocation is done beyond
> > > > EOF ? Andreas and Chris Wedgwood are in favor of not changing the
> > > > file size in this case. I also tend to agree with them. Does anyone
> > > > have an argument in favor of changing the filesize ? If not, I will
> > > > remove the code which changes the filesize, before I resubmit the
> > > > concerned ext4 patch.
> > >
> > > I think there needs to be both. If we don't have a mechanism to
> > > atomically change the file size with the preallocation, then
> > > applications that use stat() to work out if they need to preallocate
> > > more space will end up racing.
> >
> > By "both" above, do you mean we should give the user the flexibility
> > of whether the filesize is changed or not ? It can be done by having
> > *two* modes for preallocation in the system call - say FA_PREALLOCATE
> > and FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will
> > allocate blocks, but will not change the filesize and [cm]time. If
> > FA_ALLOCATE mode is used, fallocate() will change the filesize if
> > required (i.e. when allocation is beyond EOF) and also update
> > [cm]time. This way, the application can decide what it wants.
>
> Yes, that's right.
>
> > This will be helpful for the partial allocation scenario also. Think
> > of the case when we do not change the filesize in fallocate() and
> > expect applications/posix_fallocate() to do ftruncate() after
> > fallocate() for this. Now if fallocate() results in a partial
> > allocation with -ENOSPC error returned, applications/posix_fallocate()
> > will not know for what length ftruncate() has to be called. :(
>
> Well, posix_fallocate() either gets all the space or it fails. If you
> truncate to extend the file size after an ENOSPC, then that is a buggy
> implementation. The same could be said for any application, or even the
> fallocate() call itself if it changes the filesize without having
> completely preallocated the space asked for.
>
> > Hence it may be a good idea to give the user the flexibility of
> > whether to atomically change the file size with preallocation or not.
> > But, with more flexibility there comes inconsistency in behavior,
> > which is worth considering.
>
> We've got different modes to specify different behaviour. That's what
> the mode field was put there for in the first place - the interface is
> *designed* to support different preallocation behaviours.
>
> > > > 2) For FA_UNALLOCATE mode, should the file system allow
> > > > unallocation of normal (non-preallocated) blocks (blocks allocated
> > > > via regular write/truncate operations) also (i.e. work as
> > > > punch()) ?
> > >
> > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP,
> > > and what I did for FA_UNALLOCATE as well.
> >
> > Ok. But, some people may not expect/like this. I think, we can keep it
> > on the backburner for a while, till other issues are sorted out.
>
> How can it be a backburner issue when it defines the implementation?
> I've already implemented something in XFS that sort of does what I
> think the interface is supposed to do, but I need that interface to be
> nailed down before proceeding any further. All I'm really interested in
> right now is that the fallocate _interface_ can be used as a *complete
> replacement* for the pre-existing XFS-specific ioctls that are already
> used by applications. What ext4 can or can't do right now is irrelevant
> to this discussion - the interface definition needs to take priority
> over implementation.

Would you like to write up an interface definition description (likely a
man page) and post it for review, possibly with a mention of the apps using
it today ?

One reason for introducing the mode parameter was to allow the interface
to evolve incrementally as more options / semantic questions are proposed,
so that we don't have to make all the decisions right now. So it would be
good to start with a *minimal* definition, even just one mode. The rest
could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

> Cheers,
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
>
> > > This looks like it will have the same problem on s390 as
> > > sys_sync_file_range. Maybe the prototype should be:
> > >
> > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> >
> > Yes, but the trouble is that there was a contrary viewpoint preferring
> > that fd first be maintained as a convention like other syscalls (see
> > the following posts)
>
> Of course the interface used by an application program would have the
> fd first. Glibc can do the translation.

I think that was understood.

Regards
Suparna

> Paul.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
> Andrew Morton writes:
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > >
> > > ...
> > >
> > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> >
> > Please add a comment over this function which specifies its behaviour.
> > Really it should be enough material from which a full manpage can be
> > written.
>
> This looks like it will have the same problem on s390 as
> sys_sync_file_range. Maybe the prototype should be:
>
> asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that
fd first be maintained as a convention like other syscalls (see the
following posts)

http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy)

So we are kind of deadlocked, aren't we ?

The debates on the proposed solution for s390
http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2
http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2
http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2

Are there any better ideas ?

Regards
Suparna

> Paul.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: ChunkFS - measuring cross-chunk references
On Wed, Apr 25, 2007 at 05:50:55AM +0530, Karuna sagar K wrote:
> On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote:
> >On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
> .
> >It would also be good to distinguish between directories referencing
> >files in another chunk, and directories referencing subdirectories in
> >another chunk (which would be simpler to handle, given the topological
> >restrictions on directories, as compared to files and hard links).
> >
>
> Modified the tool to distinguish between
> 1. cross references between directories and files
> 2. cross references between directories and sub directories
> 3. cross references within a file (due to huge file size)

One more set of numbers to calculate would be an estimate of cross-references
across chunks of block groups -- 1 (=128MB), 2 (=256MB), 4 (=512MB), 8 (=1GB)
as suggested by Kalpak.

Once we have that, it would be nice if we can get data on results with the
tool from other people, especially with larger filesystem sizes.

Regards
Suparna

> Below is the result from / partition of ext3 file system:
>
> Number of files = 221794
> Number of directories = 24457
> Total size = 8193116 KB
> Total data stored = 7187392 KB
> Size of block groups = 131072 KB
> Number of inodes per block group = 16288
> No. of cross references between directories and sub-directories = 7791
> No. of cross references between directories and file = 657
> Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)
>
> Thanks for the suggestions.
>
> >There may also be special things we will need to do to handle
> >scenarios such as BackupPC, where if it looks like a directory
> >contains a huge number of hard links to a particular chunk, we'll need
> >to make sure that directory is either created in the right chunk
> >(possibly with hints from the application) or migrated to the right
> >chunk (but this might cause the inode number of the directory to
> >change --- maybe we allow this as long as the directory has never been
> >stat'ed, so that the inode number has never been observed).
> >
> >The other thing which we should consider is that chunkfs really
> >requires a 64-bit inode number space, which means either we only allow
> >it on 64-bit systems, or we need to consider a migration so that even
> >on 32-bit platforms, stat() functions like stat64(), insofar that it
> >uses a stat structure which returns a 64-bit ino_t.
> >
> > - Ted
>
> Thanks,
> Karuna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote: > On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote: > > > > This is an initial implementation of ChunkFS technique, briefly discussed > > at: http://lwn.net/Articles/190222 and > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > > > This implementation is done within ext2 driver. Every chunk is an > > independent ext2 file system. The knowledge about chunks is kept within > > ext2 and 'continuation inodes', which are used to allow files and > > directories span across multiple chunks, are managed within ext2. > > > > At mount time, super blocks for all the chunks are created and linked with > > the global super_blocks list maintained by VFS. This allows independent > > behavior or individual chunks and also helps writebacks to happen > > seamlessly. > > > > Apart from this, chunkfs code in ext2 effectively only provides knowledge > > of: > > > > - what inode's which block number to look for, for a given file's logical > > block number > > - in which chunk to allocate next inode / block > > - number of inodes to scan when a directory is being read > > > > To maintain the ext2's inode number uniqueness property, 8 msb bits of > > inode number are used to indicate the chunk number in which it resides. > > > > As said, this is a preliminary implementation and lots of changes are > > expected before this code is even sanely usable. Some known issues and > > obvious optimizations are listed in the TODO file in the chunkfs patch. > > > > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch > > - one big patch > > - applies to 2.6.18 > > > Could you send this out as a patch to ext2 codebase, so we can just look > at the changes for chunkfs ? That might also make it small enough > to inline your patch in email for review. Sorry, I missed the part about ext2-chunkfs-diff below. Regards suparna > > What kind of results are you planning to gather to evaluate/optimize this ? 
> > Regards > Suparna > > > > > Attached - ext2-chunkfs-diff.patch.gz > > - since the code is a spin-off of ext2, this patch explains better what > > has changed from the ext2. > > > > git://cislinux.cis.ksu.edu/chunkfs-tools > > - mkfs, and fsck for chunkfs. > > > > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml > > - config file used; tested mostly on UML with loopback file systems. > > > > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP > > should be "no" for clean compile. > > > > > > Please comment, suggest, criticize. Patches most welcome. > > > > > > Best, > > AG > > -- > > May the source be with you. > > http://www.cis.ksu.edu/~gud > > > > -- > Suparna Bhattacharya ([EMAIL PROTECTED]) > Linux Technology Center > IBM Software Lab, India > -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote: > > This is an initial implementation of ChunkFS technique, briefly discussed > at: http://lwn.net/Articles/190222 and > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > This implementation is done within ext2 driver. Every chunk is an > independent ext2 file system. The knowledge about chunks is kept within > ext2 and 'continuation inodes', which are used to allow files and > directories span across multiple chunks, are managed within ext2. > > At mount time, super blocks for all the chunks are created and linked with > the global super_blocks list maintained by VFS. This allows independent > behavior or individual chunks and also helps writebacks to happen > seamlessly. > > Apart from this, chunkfs code in ext2 effectively only provides knowledge > of: > > - what inode's which block number to look for, for a given file's logical > block number > - in which chunk to allocate next inode / block > - number of inodes to scan when a directory is being read > > To maintain the ext2's inode number uniqueness property, 8 msb bits of > inode number are used to indicate the chunk number in which it resides. > > As said, this is a preliminary implementation and lots of changes are > expected before this code is even sanely usable. Some known issues and > obvious optimizations are listed in the TODO file in the chunkfs patch. > > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch > - one big patch > - applies to 2.6.18 Could you send this out as a patch to ext2 codebase, so we can just look at the changes for chunkfs ? That might also make it small enough to inline your patch in email for review. What kind of results are you planning to gather to evaluate/optimize this ? Regards Suparna > > Attached - ext2-chunkfs-diff.patch.gz > - since the code is a spin-off of ext2, this patch explains better what > has changed from the ext2. 
> > git://cislinux.cis.ksu.edu/chunkfs-tools > - mkfs, and fsck for chunkfs. > > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml > - config file used; tested mostly on UML with loopback file systems. > > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP > should be "no" for clean compile. > > > Please comment, suggest, criticize. Patches most welcome. > > > Best, > AG > -- > May the source be with you. > http://www.cis.ksu.edu/~gud -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Tue, Feb 27, 2007 at 10:42:11AM +0100, Jens Axboe wrote:
> On Tue, Feb 27 2007, Suparna Bhattacharya wrote:
> > On Mon, Feb 26, 2007 at 03:45:48PM +0100, Jens Axboe wrote:
> > > On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> > > > On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> > > > >
> > > > > Some more results, using a larger number of processes and io
> > > > > depths. A repeat of the tests from friday, with added depth 2 for
> > > > > syslet and libaio:
> > > > >
> > > > > Engine          Depth   Processes       Bw (MiB/sec)
> > > > >
> > > > > libaio          1       1               602
> > > > > syslet          1       1               759
> > > > > sync            1       1               776
> > > > > libaio          32      1               832
> > > > > syslet          32      1               898
> > > > > libaio          2       1               581
> > > > > syslet          2       1               609
> > > > >
> > > > > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on
> > > > > ramfs with 100 processes each with a depth of 200, reading a
> > > > > per-process private file of 10mb (need to fit in my ram...) 10
> > > > > times each. IOW, doing 10,000MiB of IO in total:
> > > >
> > > > But, why ramfs ? Don't we want to exercise the case where O_DIRECT
> > > > actually blocks ? Or am I missing something here ?
> > >
> > > Just overhead numbers for that test case, lets try something like your
> > > described job.
> > >
> > > Test case is doing random reads from /dev/sdb, in chunks of 64kb:
> > >
> > > Engine          Depth   Processes       Bw (KiB/sec)
> > >
> > > libaio          200     100             2813
> > > syslet          200     100             3944
> > > libaio          2       1               2793
> > > syslet          2       1               3854
> > > sync (*)        2       1               2866
> > >
> > > deadline was used for IO scheduling, to minimize impact. Not sure why
> > > syslet actually does so much better here, looking at vmstat the rate is
> > > steady and all runs are basically 50/50 idle/wait. One difference is
> > > that the submission itself takes a long time on libaio, since the
> > > io_submit() will block on request allocation. The generated IO pattern
> > > from each process is the same for all runs. The drive is a lousy sata
> > > that doesn't even do queuing, FWIW.
> >
> > I tried the latest fio code with syslet v4, and my results are a little
> > different - have yet to figure out why or what to make of it.
> > I hope I have all the right pieces now.
> >
> > This is an ext2 filesystem, SCSI AIC7xxx.
> >
> > I used an iodepth_batch size of 8 to limit the number of ios in a single
> > io_submit (thanks for adding that parameter to fio !), like we did in
> > aio-stress.
> >
> > Engine          Depth   Batch   Bw (KiB/sec)
> >
> > libaio          64      8       17,226
> > syslet          64      8       17,620
> > libaio          2       8       18,552
> > syslet          2       8       14,935
> >
> > Which is not bad, actually.
>
> It's not bad for such a high depth/batch setting, but I still wonder why
> our results are so different. I'll look around for an x86 box with some
> TCQ/NCQ enabled storage attached for testing. Can you pass me your
> command line or job file (whatever you use) so we are on the same page?

Sure - I used variations of the following job file (e.g. engine=syslet-rw,
iodepth=2). Also the io scheduler on my system is set to Anticipatory by
default. FWIW it is a 4 way SMP (PIII, 700MHz)

; aio-stress -l -O -o3 <1GB file>
[global]
ioengine=libaio
buffered=0
rw=randread
bs=64k
size=1024m
directory=/kdump/suparna

[testfile2]
iodepth=64
iodepth_batch=8

> > If I do not specify the iodepth_batch (i.e. default to depth), then the
> > difference becomes more pronounced at higher depths. However, I doubt
> > whether anyone would be using such high batch sizes in practice ...
> >
> > Engine          Depth   Batch   Bw (KiB/sec)
> >
> > libaio          64      default 17,429
> > syslet          64      default 16,155
> > libaio          2       default 15,494
> > syslet          2       default  7,971
>
> If iodepth_batch isn't set, the syslet queued io will be serialized and
> not take advantage of queueing.

I see, so then this particular setting is not very meaningful.

> How does the job file perform with ioengine=sync?

Just tried it now : 9,027KiB/s

> > Often times it is the application tuning that makes all the difference,
> > so am not really sure how much to read into these results.
> > That's always been the hard part of async io ...
>
> Yes I agree, it's handy to get an overview though.

True, at least some of this helps us gain a little more understanding
about the boundaries and how to tune it to be most effective.

Regards
Suparna

> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Mon, Feb 26, 2007 at 03:45:48PM +0100, Jens Axboe wrote:
> On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> > On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> > >
> > > Some more results, using a larger number of processes and io depths. A
> > > repeat of the tests from friday, with added depth 2 for syslet and
> > > libaio:
> > >
> > > Engine          Depth   Processes       Bw (MiB/sec)
> > >
> > > libaio          1       1               602
> > > syslet          1       1               759
> > > sync            1       1               776
> > > libaio          32      1               832
> > > syslet          32      1               898
> > > libaio          2       1               581
> > > syslet          2       1               609
> > >
> > > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> > > with 100 processes each with a depth of 200, reading a per-process
> > > private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> > > doing 10,000MiB of IO in total:
> >
> > But, why ramfs ? Don't we want to exercise the case where O_DIRECT
> > actually blocks ? Or am I missing something here ?
>
> Just overhead numbers for that test case, lets try something like your
> described job.
>
> Test case is doing random reads from /dev/sdb, in chunks of 64kb:
>
> Engine          Depth   Processes       Bw (KiB/sec)
>
> libaio          200     100             2813
> syslet          200     100             3944
> libaio          2       1               2793
> syslet          2       1               3854
> sync (*)        2       1               2866
>
> deadline was used for IO scheduling, to minimize impact. Not sure why
> syslet actually does so much better here, looking at vmstat the rate is
> steady and all runs are basically 50/50 idle/wait. One difference is
> that the submission itself takes a long time on libaio, since the
> io_submit() will block on request allocation. The generated IO pattern
> from each process is the same for all runs. The drive is a lousy sata
> that doesn't even do queuing, FWIW.

I tried the latest fio code with syslet v4, and my results are a little
different - have yet to figure out why or what to make of it.
I hope I have all the right pieces now.

This is an ext2 filesystem, SCSI AIC7xxx.

I used an iodepth_batch size of 8 to limit the number of ios in a single
io_submit (thanks for adding that parameter to fio !), like we did in
aio-stress.

Engine          Depth   Batch   Bw (KiB/sec)

libaio          64      8       17,226
syslet          64      8       17,620
libaio          2       8       18,552
syslet          2       8       14,935

Which is not bad, actually.

If I do not specify the iodepth_batch (i.e. default to depth), then the
difference becomes more pronounced at higher depths. However, I doubt
whether anyone would be using such high batch sizes in practice ...

Engine          Depth   Batch   Bw (KiB/sec)

libaio          64      default 17,429
syslet          64      default 16,155
libaio          2       default 15,494
syslet          2       default  7,971

Often times it is the application tuning that makes all the difference,
so am not really sure how much to read into these results. That's always
been the hard part of async io ...

Regards
Suparna

> [*] Just for comparison, the depth is obviously really 1 at the kernel
> side since it's sync.
>
> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
>
> Some more results, using a larger number of processes and io depths. A
> repeat of the tests from friday, with added depth 2 for syslet and
> libaio:
>
> Engine   Depth   Processes   Bw (MiB/sec)
> libaio     1         1           602
> syslet     1         1           759
> sync       1         1           776
> libaio    32         1           832
> syslet    32         1           898
> libaio     2         1           581
> syslet     2         1           609
>
> syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> with 100 processes each with a depth of 200, reading a per-process
> private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> doing 10,000MiB of IO in total:

But, why ramfs ? Don't we want to exercise the case where O_DIRECT actually
blocks ? Or am I missing something here ?

Regards
Suparna

>
> Engine   Depth   Processes   Bw (MiB/sec)
> libaio    200       100         1488
> syslet    200       100         1714
>
> Results are stable to within approx +/- 10MiB/sec. The syslet case
> completes a whole second faster than libaio (~6 vs ~7 seconds). Testing
> was done with fio HEAD eb7c8ae27bc301b77490b3586dd5ccab7c95880a, and it
> uses the v4 patch series.
>
> Engine   Depth   Processes   Bw (MiB/sec)
> libaio    200       100         1488
> syslet    200       100         1714
> sync      200       100         1843
>
> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Fri, Feb 23, 2007 at 05:25:08PM +0100, Jens Axboe wrote:
> On Fri, Feb 23 2007, Suparna Bhattacharya wrote:
> > On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
> > >
> > > * Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > As a really crude (and not very realistic) example of the potential
> > > > impact of large numbers of outstanding IOs, I tried some quick direct
> > > > IO comparisons using fio:
> > > >
> > > > [global]
> > > > ioengine=syslet-rw
> > > > buffered=0
> > > > rw=randread
> > > > bs=64k
> > > > size=1024m
> > > > iodepth=64
> > >
> > > could you please try those iodepth=2 tests with the latest
> > > fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> > > plugin for FIO. You'll need the v3 syslet kernel plus:
> > >
> > > git-clone git://git.kernel.dk/data/git/fio.git
> > > cd fio
> > > git-checkout syslet-testing
> > >
> > > my expectation is that it should behave better with iodepth=2
> > > (although i havent tried that yet).
> >
> > I picked up the fio snapshot from 22nd Feb (fio-git-2007012513.tar.gz)
> > and used the v3 syslet patches from your web-site.
> >
> > Do I still need to get something more recent ?
>
> Yes, you need to test the syslet-testing branch that Ingo referenced.
> Your test above is not totally fair right now, since you are doing
> significantly less system calls with libaio. So to compare apples with
> apples, try the syslet-testing branch. If you can't get it because of
> firewall problems, check http://brick.kernel.dk/snaps/ for the latest
> fio snapshot. If it has the syslet-testing branch, then that is
> recent enough.

I have a feeling this is getting to be a little more bleeding edge than
I had anticipated :), so will just hold off for a bit until this
crystallizes a bit.
Regards
Suparna

> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
>
> * Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > As a really crude (and not very realistic) example of the potential
> > impact of large numbers of outstanding IOs, I tried some quick direct
> > IO comparisons using fio:
> >
> > [global]
> > ioengine=syslet-rw
> > buffered=0
> > rw=randread
> > bs=64k
> > size=1024m
> > iodepth=64
>
> could you please try those iodepth=2 tests with the latest
> fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> plugin for FIO. You'll need the v3 syslet kernel plus:
>
> git-clone git://git.kernel.dk/data/git/fio.git
> cd fio
> git-checkout syslet-testing
>
> my expectation is that it should behave better with iodepth=2
> (although i havent tried that yet).

I picked up the fio snapshot from 22nd Feb (fio-git-2007012513.tar.gz)
and used the v3 syslet patches from your web-site.

Do I still need to get something more recent ?

Regards
Suparna

> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
On Thu, Feb 22, 2007 at 03:36:58PM +0100, Ingo Molnar wrote:
>
> * Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > > maybe it will, maybe it wont. Lets try? There is no true difference
> > > between having a 'request structure' that represents the current
> > > state of the HTTP connection plus a statemachine that moves that
> > > request between various queues, and a 'kernel stack' that goes in
> > > and out of runnable state and carries its processing state in its
> > > stack - other than the amount of RAM they take. (the kernel stack is
> > > 4K at a minimum - so with a million outstanding requests they would
> > > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > > - that's acceptable.)
> >
> > At what point are the cachemiss threads destroyed ? In other words how
> > well does this adapt to load variations ? For example, would this 80MB
> > of RAM continue to be locked down even during periods of lighter loads
> > thereafter ?
>
> you can destroy them at will from user-space too - just start a slow
> timer that zaps them if load goes down. I can add a
> sys_async_thread_exit(nr_threads) API to be able to drive this without
> knowing the TIDs of those threads, and/or i can add a kernel-internal
> mechanism to zap inactive threads. It would be rather easy and
> low-overhead - the v2 code already had a max_nr_threads tunable, i can
> reintroduce it. So the size of the pool of contexts does not have to be
> permanent at all.

If you can find a way to do this without additional tunables burden on
the administrator that would certainly help ! IIRC, performance problems
linked to having too many or too few AIO kernel threads have been a
commonly reported issue elsewhere - it would be nice to be able to avoid
repeating the crux of that (mistake) in Linux. To me, any need to
manually tune the number has always seemed to defeat the very benefit of
adaptability to varying loads that AIO intrinsically provides.
Regards
Suparna

> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Fri, Feb 23, 2007 at 01:52:47PM +0100, Jens Axboe wrote:
> On Wed, Feb 21 2007, Ingo Molnar wrote:
> > this is the v3 release of the syslet/threadlet subsystem:
> >
> >   http://redhat.com/~mingo/syslet-patches/
>
> [snip]
>
> Ingo, some testing of the experimental syslet queueing stuff, in the
> syslet-testing branch of fio.
>
> Fio job file:
>
> [global]
> bs=8k
> size=1g
> direct=0
> ioengine=syslet-rw
> iodepth=32
> rw=read
>
> [file]
> filename=/ramfs/testfile
>
> Only changes between runs was changing ioengine and iodepth as indicated
> in the table below.
>
> Results:
>
> Engine   Depth   Bw (MiB/sec)
> libaio     1        441
> syslet     1        574
> sync       1        589
> libaio    32        613
> syslet    32        681
>
> Results are stable to within +/- 1MiB/sec. So you can see that syslet
> are still a bit slower than sync for depth 1, but beats the pants off
> libaio for equal depths. Note that this is buffered IO, I'll be out for
> the weekend, but I'll hack some direct IO testing up next week to
> compare "real" queuing.
>
> Just a quick microbenchmark to gauge current overhead...

This is just ramfs, to gauge pure overheads, is that correct ?

BTW, I'm not surprised at Ingo's initial results of syslet vs libaio
overheads, for aio-stress/fio type streaming io runs, because these
cases do not involve large numbers of outstanding ios. So the overhead
of thread creation with syslets is amortized across the entire run of io
submissions because of the reuse of already created async threads. While
in the libaio case there is a setup and teardown of kiocbs per request.

What I have been concerned about instead in the past when considering
thread based AIO implementations is the resource (memory) consumption
impact on overall system performance and adaptability to varying loads.
It is nice that we can avoid that for the cached cases, but for the
general blocking cases, it is still not clear to me whether we have
addressed this well enough yet.
I used to think that even the kiocb was too heavyweight for its purpose
... especially in terms of scaling to larger loads.

As a really crude (and not very realistic) example of the potential
impact of large numbers of outstanding IOs, I tried some quick direct IO
comparisons using fio:

[global]
ioengine=syslet-rw
buffered=0
rw=randread
bs=64k
size=1024m
iodepth=64

Engine   Depth   Bw (MiB/sec)
libaio    64       17.323
syslet    64       17.524
libaio     2       15.226
syslet     2       11.015

Regards
Suparna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)

At what point are the cachemiss threads destroyed ? In other words how
well does this adapt to load variations ? For example, would this 80MB
of RAM continue to be locked down even during periods of lighter loads
thereafter ?

Regards
Suparna

> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
>
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
[...]
> sys_async_register()/unregister() has been removed as it is not
> needed anymore. sys_async_exec() can be called straight away.
>
> - there is no kernel-side resource used up by async completion rings at
>   all (all the state is in user-space), so an arbitrary number of
>   completion rings are supported.
>
> plus lots of bugs were fixed and a good number of cleanups were done as
> well. The v3 code is ABI-incompatible with v2, due to these fundamental
> changes.
>
> As always, comments, suggestions, reports are welcome.
>
> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
[...]
> + * the first result code given to aio_complete(). Use this only for
> + * errors which must overwrite whatever the return code might have
> + * been. The first non-zero new_err given to this function for a given
> + * iocb will be returned to userspace.
> + */
> +static inline int aio_propogate_error(struct kiocb *iocb, int existing_err,
> +                                      int new_err)
> +{
> +     if (existing_err != -EIOCBQUEUED)
> +             return new_err;
> +     if (!iocb->ki_pending_err)
> +             iocb->ki_pending_err = new_err;
> +     return -EIOCBQUEUED;
> +}
> +
>  /* for sysctl: */
>  extern unsigned long aio_nr;
>  extern unsigned long aio_max_nr;
>
> diff -r 8a740eb579d4 mm/filemap.c
> --- a/mm/filemap.c    Mon Feb 19 13:12:20 2007 -0800
> +++ b/mm/filemap.c    Mon Feb 19 13:16:00 2007 -0800
> @@ -2031,7 +2031,7 @@ generic_file_direct_write(struct kiocb *
>               ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>               int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
>               if (err < 0)
> -                     written = err;
> +                     written = aio_propogate_error(iocb, written, err);
>       }
>       return written;
>  }
> @@ -2396,7 +2396,7 @@ generic_file_direct_IO(int rw, struct ki
>                       int err = invalidate_inode_pages2_range(mapping,
>                                       offset >> PAGE_CACHE_SHIFT, end);
>                       if (err)
> -                             retval = err;
> +                             retval = aio_propogate_error(iocb, retval, err);
>       }
>       return retval;

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH] aio: fix kernel bug when page is temporally busy
On Fri, Feb 09, 2007 at 11:40:27AM +0100, Jiri Kosina wrote:
> On Fri, 9 Feb 2007, Andrew Morton wrote:
>
> > > @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb, const
> > > struct iovec *iov,
> > >           do_generic_file_read(filp,ppos,&desc,file_read_actor);
> > >           retval += desc.written;
> > >           if (desc.error) {
> > > -                 retval = retval ?: desc.error;
> > > +                 retval = desc.error;
> > >                   break;
> > >           }
> >
> > Nope. On error the read() syscall must return the number of bytes which
> > were successfully read.
>
> You are right.
>
> In current mainline this even is not a problem, because no one seems to
> be setting the error values to EIOCBRETRY. But it still stinks a bit, as
> there are tests for EIOCBRETRY.

We'd want retries to act only on the remaining bytes which haven't been
transferred as yet, so even in the EIOCBRETRY case the right thing to do
is to return the number of bytes that were successfully read without
blocking. The high level AIO code (see aio_rw_vect_retry) has the
ability to handle this. So this is still correct.

Is there a real bug that you are seeing here ?

Regards
Suparna

> --
> Jiri Kosina

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH] aio: fix kernel bug when page is temporally busy
On Fri, Feb 09, 2007 at 08:41:41AM +0300, Ananiev, Leonid I wrote:
> > invalidate_inode_pages2() has other callers. I suspect with this change
> > we'll end up leaking EIOCBRETRY back to userspace.
>
> EIOCBRETRY is used and caught already in do_sync_read() and
> do_sync_readv_writev().
>
> Below fixed patch against kernel 2.6.20.

I agree with Andrew, this doesn't look like the right solution. EIOCBRETRY
isn't the same as EAGAIN or "try again later". EIOCBRETRY means "I have
queued up an action to notify (wakeup) the callback to retry the operation
once data is ready".

Regards
Suparna

> >From Leonid Ananiev
>
> Fix kernel bug when IO page is temporally busy:
> invalidate_inode_pages2() returns EIOCBRETRY but not EIO.
>
> Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
> ---
> --- linux-2.6.20/mm/truncate.c	2007-02-04 10:44:54.000000000 -0800
> +++ linux-2.6.20p/mm/truncate.c	2007-02-08 11:38:11.000000000 -0800
> @@ -366,7 +366,7 @@ static int do_launder_page(struct addres
>  * Any pages which are found to be mapped into pagetables are unmapped
>  * prior to invalidation.
>  *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
>  */
> int invalidate_inode_pages2_range(struct address_space *mapping,
> 		pgoff_t start, pgoff_t end)
> @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
> 		}
> 		ret = do_launder_page(mapping, page);
> 		if (ret == 0 && !invalidate_complete_page2(mapping, page))
> -			ret = -EIO;
> +			ret = -EIOCBRETRY;
> 		unlock_page(page);
> 	}
> 	pagevec_release(&pvec);
> @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
>  * Any pages which are found to be mapped into pagetables are unmapped
>  * prior to invalidation.
>  *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
>  */
> int invalidate_inode_pages2(struct address_space *mapping)
> {

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
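The objection here is that EIOCBRETRY carries a promise — a wakeup has been queued — so a helper with many callers cannot simply start returning it in place of -EIO. A minimal sketch of that boundary rule, with toy codes and illustrative names (not the kernel's actual functions):

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel's error codes. */
#define TOY_EIO        5
#define TOY_EIOCBRETRY 529

/* A sample retry callback that succeeds, returning bytes done. */
static long toy_retry_succeeds(void)
{
	return 128;
}

/* EIOCBRETRY means "a wakeup is queued; retry when it fires", so it is
 * only meaningful to callers that actually queued such a wakeup.  A
 * synchronous entry point (cf. do_sync_read) must absorb it; it must
 * never leak to userspace or to callers with no retry pending. */
static long toy_sync_entry(long ret, bool retry_queued,
			   long (*retry_fn)(void))
{
	while (ret == -TOY_EIOCBRETRY) {
		if (!retry_queued)
			return -TOY_EIO;  /* no wakeup pending: plain EIO */
		ret = retry_fn();	  /* wait for wakeup, then retry */
	}
	return ret;
}
```

In this model, changing a shared helper to return the retry code would hand -TOY_EIOCBRETRY to callers for which `retry_queued` is false — exactly the leak Andrew is worried about.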
Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
On Fri, Feb 02, 2007 at 04:56:22PM -0800, Linus Torvalds wrote:
> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> >
> > Well, in my picture, 'only if you block' is a pure thread utilization
> > decision: bounce a piece of work to another thread if this thread cannot
> > complete it. (if the kernel is lucky enough that the user context told
> > it "it's fine to do that".)
>
> Sure, you can do it that way too. But at that point, your argument that we
> shouldn't do it with fibrils is wrong: you'd still need basically the
> exact same setup that Zach does in his fibril stuff, and the exact same
> hook in the scheduler, testing the exact same value ("do we have a pending
> queue of work").
>
> So at that point, you really are arguing about a rather small detail in
> the implementation, I think.
>
> Which is fair enough.
>
> But I actually think the *bigger* argument and problems are elsewhere,
> namely in the interface details. Notably, I think the *real* issues end up
> being how we handle synchronization, and how we handle signalling. Those
> are in many ways (I think) more important than whether we actually can
> schedule these trivial things on multiple CPU's concurrently or not.
>
> For example, I think serialization is potentially a much more expensive
> issue. Could we, for example, allow users to serialize with these things
> *without* having to go through the expense of doing a system call? Again,
> I'm thinking of the case of no IO happening, in which case there also
> won't be any actual threading taking place, in which case it's a total
> waste of time to do a system call at all.
>
> And trying to do that actually has implications for the interfaces (like
> possibly returning a zero cookie for the async() system call if it was
> doable totally synchronously?)

This would be useful - the application wouldn't have to set up state to
remember for handling completions of operations that complete synchronously.
I know the Samba folks would like that.

The laio_syscall implementation (Lazy Asynchronous I/O) seems to have
experimented with such an interface:
http://www.usenix.org/events/usenix04/tech/general/elmeleegy.html

Regards
Suparna

> Signal handling is similar: I actually think that an "async()" system call
> should be interruptible within the context of the caller, since we would
> want to *try* to execute it synchronously. That automatically means that
> we have semantic meaning for fibrils and signal handling.
>
> Finally, can we actually get POSIX aio semantics with this? Can we
> implement the current aio_xyzzy() system calls using this same feature?
> And most importantly - does it perform well enough that we really can do
> that?
>
> THOSE are to me bigger questions than what happens inside the kernel, and
> whether we actually end up using another thread if we end up doing it
> non-synchronously.
>
> 		Linus

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
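The interface idea under discussion — submission returns a zero cookie when the operation completed synchronously, so no completion event needs to be reaped — can be sketched in user-space terms. Everything here (names, types, the `would_block` flag) is hypothetical, modeling the proposal rather than any real syscall:

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t next_cookie = 1;

struct toy_op {
	bool would_block;  /* would this op need async execution? */
	long result;	   /* result if it can complete synchronously */
};

/* Returns 0 if the operation completed synchronously (result valid
 * immediately, no completion event will follow), or a nonzero cookie
 * that the caller later matches against completion events. */
static uint64_t toy_submit(struct toy_op *op, long *result_out)
{
	if (!op->would_block) {
		*result_out = op->result; /* done in the caller's context */
		return 0;		  /* zero cookie: nothing to reap */
	}
	return next_cookie++;		  /* caller waits for completion */
}
```

The saving Suparna points at is visible in the zero-cookie branch: the caller can consume the result on the spot and never allocates per-operation completion-tracking state.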
Re: [patch 0/9] buffered write deadlock fix
On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> >
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
>
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.

I think Ingo made a point earlier about letting the old co-exist with the
new. Fibrils + kevents have great potential for a next generation solution,
but we need to give the whole story some time to play out and prove it in
practice, debate and benchmark the alternative combinations, optimize it
for various workloads etc. It will also take more work on top before we can
get the whole POSIX AIO implementation supported on top of this. I'll be
very happy when that happens ... it is just that it is still too early to
be sure.

Since this is going to be a new interface, not the existing Linux AIO
interface, I do not see any conflict between the two. Samba4 already uses
fsaio, and we now have the ability to do POSIX AIO over kernel AIO (which
depends on fsaio). The more we delay real world usage, the longer we take
to learn about the application patterns that matter. And it is those
patterns that are key.

> How bad is the clash with the readahead patches?
>
> Clashes with git-block are likely, too.
>
> Bugfixes come first, so I will drop readahead and fsaio and git-block to
> get this work completed if needed - please work against mainline.

If you need help with fixing the clashes, please let me know.

Regards
Suparna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
On Thu, Feb 01, 2007 at 11:50:06AM -0800, Trond Myklebust wrote:
> On Thu, 2007-02-01 at 16:43 +0530, Suparna Bhattacharya wrote:
> > Wooo ...hold on ... I think this is swinging out of perspective :)
> >
> > I have said some of this before, but let me try again.
> >
> > As you already discovered when going down the fibril path, there are
> > two kinds of accesses to current-> state, (1) common state
> > for a given call chain (e.g. journal info etc), and (2) for
> > various validations against the caller's process (uid, ulimit etc).
> >
> > (1) is not an issue when it comes to execution in background threads
> > (the VFS already uses background writeback for example).
> >
> > As for (2), such checks need to happen upfront at the time of IO
> > submission, so again are not an issue.
>
> Wrong! These checks can and do occur well after the time of I/O
> submission in the case of remote filesystems with asynchronous writeback
> support.
>
> Consider, for instance, the cases where the server reboots and loses all
> state. Then there is the case of failover and/or migration events, where
> the entire filesystem gets moved from one server to another, and again
> you may have to recover state, etc...
>
> > I don't see any other reason why IO paths should be assuming that they
> > are running in the original caller's context, midway through doing the
> > IO. If that were the case background writeouts and readaheads could be
> > fragile as well (or ptrace). The reason it isn't is because of this
> > conceptual division of responsibility.
>
> The problem with this is that the security context is getting
> progressively more heavy as we add more and more features. In addition
> to the original uid/gid/fsuid/fsgid/groups, we now have stuff like
> keyrings to carry around. Then there is all the context needed to
> support selinux,...

Isn't that kind of information supposed to be captured in nfs_open_context?
Which is associated with the open file instance ...

I know this has been a traditional issue with network filesystems, and I
haven't kept up with the latest code and decisions in that respect, but how
would you do background writeback if there is an assumption of running in
the context of the original submitter?

Regards
Suparna

> In the end, you end up recreating most of struct task_struct...
>
> Cheers
>   Trond

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
On Thu, Feb 01, 2007 at 02:18:55PM -0800, Zach Brown wrote:
> > Wooo ...hold on ... I think this is swinging out of perspective :)
>
> I'm sorry, but I don't. I think using the EIOCBRETRY method in
> complicated code paths requires too much maintenance cost to justify
> its benefits.

We can agree to disagree on that judgement :).

I don't disagree about the limitations of the EIOCBRETRY approach, nor do I
recommend it for all sorts of complicated code paths. It is only good as an
approximation for specific blocking points involving idempotent behaviour,
and I was trying to emphasise that that is *all* it was ever intended for.
I certainly do not see it as a viable path to make all syscalls
asynchronous, or to address all blocking points in filesystem IO. And I do
like the general direction of your approach; we only need to think deeper
about details like how to reduce the stack per IO request so this can scale
better. So we don't disagree as much as you think :)

The point where we seem to disagree is that I think there is goodness in
maintaining conceptual clarity about which parts of the operation assume
that they are executing in the original submitter's context. For the IO
paths, this is what allows things like readahead and writeback to work and
to cluster operations which may end up to/from multiple submitters. This
means that if there is some context that is needed thereafter, it could be
associated with the IO request (as an argument or in some other manner), so
that this division is still maintained.

Regards
Suparna

> - z

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
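The division of responsibility argued for here — validate against the caller and capture what is needed at submission time, then let whoever services the request rely only on state carried in the request itself — can be sketched as follows. All of the structs and functions are illustrative inventions, not kernel code:

```c
#include <stdbool.h>

/* Minimal stand-in for submitter identity (cf. uid/fsuid checks). */
struct toy_cred {
	unsigned uid;
	unsigned gid;
};

struct toy_request {
	struct toy_cred cred;	/* captured at submission time */
	long offset, len;
};

/* Submission: checks against the caller's process happen here, up
 * front, and the needed context is copied into the request. */
static bool toy_submit_io(struct toy_request *req,
			  const struct toy_cred *submitter,
			  long offset, long len)
{
	if (offset < 0 || len < 0)
		return false;		/* validate in caller's context */
	req->cred = *submitter;		/* carry context with the request */
	req->offset = offset;
	req->len = len;
	return true;
}

/* Service: a background thread (cf. writeback/readahead, or NFS's
 * nfs_open_context) acts on the captured identity, never assuming it
 * runs as the original submitter. */
static unsigned toy_service_io(const struct toy_request *req)
{
	return req->cred.uid;
}
```

Trond's counterpoint maps onto this model directly: as `toy_cred` grows to include keyrings, SELinux context and so on, the captured structure approaches a copy of the whole task state.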
Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
> ing to a state machine just to the VFS? If you look no further than some
> hypothetical AIO iscsi/aoe/nbd/whatever target you obviously include
> networking. Probably splice() if you're aggressive :).
>
> Let's be clear. I would be thrilled if AIO was implemented by native
> non-blocking handler implementations. I don't think it will happen.
> Not because we don't think it sounds great on paper, but because it's
> a hugely complex endeavor that would take development and maintenance
> effort away from the task of keeping basic functionality working.
>
> So the hope with fibrils is that we lower the barrier to getting AIO
> syscall support across the board at an acceptable cost.
>
> It doesn't *stop* us from migrating very important paths (storage,
> networking) to wildly optimized AIO implementations. But it also
> doesn't force AIO support to wait for that.
>
> > Your patches don't look that complicated yet but you openly
> > admitted you waved away many of the more tricky issues (like
> > signals etc.) and I bet there are yet-unknown side effects
> > of this too that will need more changes.
>
> To quibble, "waved away" implies that they've been dismissed. That's
> not right. It's a work in progress, so yes, there will be more
> fiddly details discovered and addressed over time. The hope is that
> when it's said and done it'll still be worth merging. If at some
> point it gets to be too much, well, at least we'll have this work to
> reference as a decisive attempt.
>
> > I'm not sure the fibrils thing will be that much faster than
> > a possibly somewhat fast pathed for this case clone+syscall+exit.
>
> I'll try and get some numbers for you sooner rather than later.
>
> Thanks for being diligent, this is exactly the kind of hard look I
> want this work to get.

BTW, I like the way you are approaching this with a cautiously critical
eye, cognizant of lurking details/issues, despite the obvious (and
justified) excitement/eureka feeling. AIO _is_ hard !

Regards
Suparna

> - z

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: 2.6.20-rc6-mm1
On Sun, Jan 28, 2007 at 03:01:33PM -0800, Andrew Morton wrote: > On Sun, 28 Jan 2007 08:56:08 -0800 > "Martin J. Bligh" <[EMAIL PROTECTED]> wrote: > > > > - It seems that people have been busy creating the need for this. I had > > > to > > > apply over sixty patches to this tree to fix post-2.6.20-rc4-mm1 > > > compilation > > > errors. And a number of patches were dropped due to no-compile or to > > > runtime errors. Heaven knows how many runtime bugs were added. > > > > dbench seems to panic on xfs / cfq ? > > OK, I'll dump git-block.patch. That means that the fsaio patches get > temporarily dropped as well as they depend on git-block changes somewhat. Would you like me to rework these back again to not depend on git-block ? > > btw, I didn't actually include the list-aio patches in rc6-mm1 due to > various discouraging-looking review comments - I'll be awaiting version 2 > there. Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
On Wed, Jan 17, 2007 at 05:39:51PM +0300, Evgeniy Polyakov wrote: > On Wed, Jan 17, 2007 at 07:21:42PM +0530, Suparna Bhattacharya ([EMAIL > PROTECTED]) wrote: > > > > Since you are implementing new APIs here, have you considered doing an > > aio_sendfilev to be able to send a header with the data ? > > It is doable, but why people do not like corking? > With Linux less than microsecond syscall overhead it is better and more > flexible solution, doesn't it? That is what I used to think as well. However ... The problem as I understand it now is not about bunching data together, but of ensuring some sort of atomicity between the header and the data, when there can be multiple outstanding aio requests on the same socket - i.e ensuring strict ordering without other data coming in between, when data to be sent is not already in cache, and in the meantime another sendfile or aio write requests comes in for the same socket. Without having to lock the socket when reading data from disk. There are alternate ways to address this, aio_sendfilev is one of the options I have heard people requesting. Regards Suparna > > I'm not saying - 'no, there will not be any *v variants', just getting > more info. > > > Regards > > Suparna > > -- > Evgeniy Polyakov -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Vectored AIO breakage for sockets and pipes ?
The call to aio_advance_iovec() in aio_rw_vect_retry() becomes problematic when it comes to pipe and socket operations which internally modify/advance the iovec themselves. As a result AIO writes to sockets fail to return the correct result. I'm not sure what the best way to fix this is. One option is to always make a copy of the iovec and pass that down. Any other thoughts ? Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
> +	}
> +
> +	u = file->private_data;
> +
> +	memset(&uk, 0, sizeof(struct ukevent));
> +
> +	uk.event = KEVENT_MASK_ALL;
> +	uk.type = KEVENT_AIO;
> +
> +	preempt_disable();
> +	uk.id.raw[0] = smp_processor_id();
> +	cnt = &__get_cpu_var(kaio_req_counter);
> +	uk.id.raw[1] = *cnt;
> +	*cnt = *cnt + 1;
> +	preempt_enable();
> +
> +	uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
> +	uk.ptr = req;
> +
> +	err = kevent_user_add_ukevent(&uk, u);
> +	if (err)
> +		goto err_out_fput;
> +
> +	kevent_user_get(u);
> +
> +	fput_light(file, need_fput);
> +
> +	return 0;
> +
> +err_out_fput:
> +	fput_light(file, need_fput);
> +err_out:
> +	return err;
> +}
> +
> +static void kaio_destructor(struct kaio_req *req)
> +{
> +	struct kaio_private *priv = req->priv;
> +	struct socket *sock = priv->dptr;
> +	struct file *file = priv->sptr;
> +
> +	fput(file);
> +	sockfd_put(sock);
> +
> +	kevent_storage_ready(&priv->kevent_user->st, NULL, KEVENT_MASK_ALL);
> +	kevent_user_put(priv->kevent_user);
> +
> +	kmem_cache_free(kaio_priv_cache, req->priv);
> +}
> +
> +static struct kaio_req *kaio_sendfile(int kevent_fd, int sock_fd, struct file *file, off_t offset, size_t count)
> +{
> +	struct kaio_req *req;
> +	struct socket *sock;
> +	struct kaio_private *priv;
> +	int err;
> +
> +	sock = sockfd_lookup(sock_fd, &err);
> +	if (!sock)
> +		goto err_out_exit;
> +
> +	priv = kmem_cache_alloc(kaio_priv_cache, GFP_KERNEL);
> +	if (!priv)
> +		goto err_out_sput;
> +
> +	priv->sptr = file;
> +	priv->dptr = sock;
> +	priv->offset = offset;
> +	priv->count = min_t(u64, i_size_read(file->f_dentry->d_inode), count);
> +	priv->processed = offset;
> +	priv->limit = 128;
> +
> +	req = kaio_add_call(NULL, &kaio_read_send_pages, -1, GFP_KERNEL);
> +	if (!req)
> +		goto err_out_free;
> +
> +	req->destructor = kaio_destructor;
> +	req->priv = priv;
> +
> +	err = kaio_add_kevent(kevent_fd, req);
> +
> +	dprintk("%s: req: %p, priv: %p, call: %p [%p], offset: %Lu, processed: %Lu, count: %Lu, err: %d.\n",
> +			__func__, req, priv, &kaio_read_send_pages,
> +			kaio_read_send_pages, priv->offset, priv->processed,
> +			priv->count, err);
> +
> +	if (err)
> +		goto err_out_remove;
> +
> +	kaio_schedule_req(req);
> +
> +	return req;
> +
> +err_out_remove:
> +	/* It is safe to just free the object since it is guaranteed that it was not
> +	 * queued for processing.
> +	 */
> +	kmem_cache_free(kaio_req_cache, req);
> +err_out_free:
> +	kmem_cache_free(kaio_priv_cache, priv);
> +err_out_sput:
> +	sockfd_put(sock);
> +err_out_exit:
> +	return NULL;
> +
> +}
> +
> +asmlinkage long sys_aio_sendfile(int kevent_fd, int sock_fd, int in_fd, off_t offset, size_t count)
> +{
> +	struct kaio_req *req;
> +	struct file *file;
> +	int err;
> +
> +	file = fget(in_fd);
> +	if (!file) {
> +		err = -EBADF;
> +		goto err_out_exit;
> +	}
> +
> +	req = kaio_sendfile(kevent_fd, sock_fd, file, offset, count);
> +	if (!req) {
> +		err = -EINVAL;
> +		goto err_out_fput;
> +	}
> +
> +	return (long)req;
> +
> +err_out_fput:
> +	fput(file);
> +err_out_exit:
> +	return err;
> +}
> +
> +asmlinkage long sys_aio_sendfile_path(int kevent_fd, int sock_fd, char __user *filename, off_t offset, size_t count)
> +{
> +	char *tmp = getname(filename);
> +	int fd = PTR_ERR(tmp);
> +	int flags = O_RDONLY, err;
> +	struct nameidata nd;
> +	struct file *file;
> +	struct kaio_req *req;
> +
> +	if (force_o_largefile())
> +		flags = O_LARGEFILE;
> +
> +	if (IS_ERR(tmp)) {
> +		err = fd;
> +		goto err_out_exit;
> +	}
> +
> +	fd = get_unused_fd();
> +	if (fd < 0) {
> +		err = fd;
> +		goto err_out_put_name;
> +	}
> +
> +	if ((flags+1) & O_ACCMODE)
> +		flags++;
> +
> +	err = open_namei(AT_FDCWD, tmp, flags, 0400, &nd);
> +	if (err)
> +		goto err_out_fdput;
> +
> +	file = nameidata_to_filp(&nd, flags);
> +	if (!file)
> +		goto err_out_fdput;
> +
> +	/* One reference will be released in sys_close(),
> +	 * second one through req->destructor()
> +	 */
> +	atomic_inc(&file->f_count);
> +
> +	req = kaio_sendfile(kevent_fd, sock_fd, file, offset, count);
> +	if (!req) {
> +		err = -EINVAL;
> +		goto err_out_fput;
> +	}
> +
> +	fd_install(fd, file);
> +
> +	return fd;
> +
> +err_out_fput:
> +	fput(file);
> +	fput(file);
> +err_out_fdput:
> +	put_unused_fd(fd);
> +err_out_put_name:
> +	putname(tmp);
> +err_out_exit:
> +	return err;
> +}
> +
> +static int kevent_aio_callback(struct kevent *k)
> +{
> +	struct kaio_req *req = k->event.ptr;
> +	struct kaio_private *priv = req->priv;
> +
> +	if (!priv->count) {
> +		__u32 *processed = (__u32 *)&priv->processed;
> +		k->event.ret_data[0] = processed[0];
> +		k->event.ret_data[1] = processed[1];
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +int kevent_aio_enqueue(struct kevent *k)
> +{
> +	int err;
> +	struct kaio_req *req = k->event.ptr;
> +	struct kaio_private *priv = req->priv;
> +
> +	err = kevent_storage_enqueue(&k->user->st, k);
> +	if (err)
> +		goto err_out_exit;
> +
> +	priv->kevent_user = k->user;
> +	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE)
> +		kevent_requeue(k);
> +
> +	return 0;
> +
> +err_out_exit:
> +	return err;
> +}
> +
> +int kevent_aio_dequeue(struct kevent *k)
> +{
> +	kevent_storage_dequeue(k->st, k);
> +
> +	return 0;
> +}
> +
> +static void kaio_thread_stop(struct kaio_thread *th)
> +{
> +	kthread_stop(th->thread);
> +	kfree(th);
> +}
> +
> +static int kaio_thread_start(struct kaio_thread **thp, int num)
> +{
> +	struct kaio_thread *th;
> +
> +	th = kzalloc(sizeof(struct kaio_thread), GFP_KERNEL);
> +	if (!th)
> +		return -ENOMEM;
> +
> +	th->refcnt = 1;
> +	spin_lock_init(&th->req_lock);
> +	INIT_LIST_HEAD(&th->req_list);
> +	init_waitqueue_head(&th->wait);
> +
> +	th->thread = kthread_run(kaio_thread_process, th, "kaio/%d", num);
> +	if (IS_ERR(th->thread)) {
> +		int err = PTR_ERR(th->thread);
> +		kfree(th);
> +		return err;
> +	}
> +
> +	*thp = th;
> +	wmb();
> +
> +	return 0;
> +}
> +
> +static int __init kevent_init_aio(void)
> +{
> +	struct kevent_callbacks sc = {
> +		.callback = &kevent_aio_callback,
> +		.enqueue = &kevent_aio_enqueue,
> +		.dequeue = &kevent_aio_dequeue,
> +		.flags = 0,
> +	};
> +	int err, i;
> +
> +	kaio_req_cache = kmem_cache_create("kaio_req", sizeof(struct kaio_req),
> +			0, SLAB_PANIC, NULL, NULL);
> +	kaio_priv_cache = kmem_cache_create("kaio_priv", sizeof(struct kaio_private),
> +			0, SLAB_PANIC, NULL, NULL);
> +
> +	memset(kaio_threads, 0, sizeof(kaio_threads));
> +
> +	for (i = 0; i < KAIO_THREAD_NUM; ++i) {
> +		err = kaio_thread_start(&kaio_threads[i], i);
> +		if (err)
> +			goto err_out_stop;
> +	}
> +
> +	err = kevent_add_callbacks(&sc, KEVENT_AIO);
> +	if (err)
> +		goto err_out_stop;
> +
> +	return 0;
> +
> +err_out_stop:
> +	while (--i >= 0) {
> +		struct kaio_thread *th = kaio_threads[i];
> +
> +		kaio_threads[i] = NULL;
> +		wmb();
> +
> +		kaio_thread_stop(th);
> +	}
> +	return err;
> +}
> +module_init(kevent_init_aio);
> - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: spurious sparse warnings from linux/aio.h (was: 2.6.20-rc4-mm1)
On Fri, Jan 12, 2007 at 12:55:18PM +0100, Tilman Schmidt wrote: > Andrew Morton schrieb: > > - Merged the "filesystem AIO patches". > > This construct: > > > --- linux-2.6.20-rc4/include/linux/aio.h 2007-01-06 > > 23:34:08.0 -0800 > > +++ devel/include/linux/aio.h 2007-01-11 21:36:16.0 -0800 > > > @@ -237,7 +243,8 @@ do { > > \ > > } \ > > } while (0) > > > > -#define io_wait_to_kiocb(wait) container_of(wait, struct kiocb, ki_wait) > > +#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait, \ > > + struct wait_bit_queue, wait), struct kiocb, ki_wait) > > > > #include <linux/aio_abi.h> > > > > causes a sparse warning: > > > include/linux/sched.h:1313:29: warning: symbol '__mptr' shadows an earlier > > one > > include/linux/sched.h:1313:29: originally declared here > > for every source file referencing <linux/sched.h>. > Could that be avoided please? So ... the nested container_of() is a problem ? I guess changing io_wait_to_kiocb() to be an inline function instead of a macro could help ? Regards Suparna > > Thanks > Tilman > > -- > Tilman Schmidt E-Mail: [EMAIL PROTECTED] > Bonn, Germany > Diese Nachricht besteht zu 100% aus wiederverwerteten Bits. > Ungeöffnet mindestens haltbar bis: (siehe Rückseite) > -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Wed, Jan 10, 2007 at 05:08:29PM -0800, Andrew Morton wrote: > On Wed, 10 Jan 2007 11:14:19 +0530 > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote: > > > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote: > > > On Thu, 4 Jan 2007 10:26:21 +0530 > > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote: > > > > > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote: > > > > > > Patches against next -mm would be appreciated, please. Sorry about that. > > > > I have updated the patchset against 2620-rc3-mm1, incorporated various > > cleanups suggested during last review. Please let me know if I have missed > > anything: > > The s/lock_page_slow/lock_page_blocking/ got lost. I redid it. I thought the lock_page_blocking was an alternative you had suggested to the __lock_page vs lock_page_async discussion which got resolved later. That is why I didn't make the change in this patchset. The call does not block in the async case, hence the choice of the _slow suffix (like in fs/buffer.c). But if lock_page_blocking() sounds more intuitive to you, that's OK. > > For the record, patches-via-http are very painful. Please always always > email them. > > As a result, these patches ended up with titles which are derived from their > filenames, which are cryptic. Sorry about that - I wanted to ask if you'd prefer my resending them to the list, but missed doing so. Some people have found it easier to download the series as a whole when they intend to apply it, so I ended up maintaining it that way all this while. Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote: > On Thu, 4 Jan 2007 10:26:21 +0530 > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote: > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote: > > Patches against next -mm would be appreciated, please. Sorry about that. I have updated the patchset against 2620-rc3-mm1, incorporated various cleanups suggested during last review. Please let me know if I have missed anything: It should show up at www.kernel.org:/pub/linux/kernel/people/suparna/aio/2620-rc3-mm1 Brief changelog: - Reworked against the block layer unplug changes - Switched from defines to inlines for init_wait_bit* etc (per akpm) - Better naming: __lock_page to lock_page_async (per hch, npiggin) - Kill lock_page_slow wrapper and rename __lock_page_slow to lock_page_slow (per hch) - Use a helper function aio_restarted() (per hch) - Replace combined if/assignment (per hch) - fix resetting of current->io_wait after ->retry in aio_run_iocb (per zab) I have run my usual aio-stress variations script (www.kernel.org:/pub/linux/kernel/people/suparna/aio/aio-results.sh) Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: How git affects kernel.org performance
On Sun, Jan 07, 2007 at 01:15:42AM -0800, Andrew Morton wrote: > On Sun, 7 Jan 2007 09:55:26 +0100 > Willy Tarreau <[EMAIL PROTECTED]> wrote: > > > On Sat, Jan 06, 2007 at 09:39:42PM -0800, Linus Torvalds wrote: > > > > > > > > > On Sat, 6 Jan 2007, H. Peter Anvin wrote: > > > > > > > > During extremely high load, it appears that what slows kernel.org down > > > > more > > > > than anything else is the time that each individual getdents() call > > > > takes. > > > > When I've looked this I've observed times from 200 ms to almost 2 > > > > seconds! > > > > Since an unpacked *OR* unpruned git tree adds 256 directories to a > > > > cleanly > > > > packed tree, you can do the math yourself. > > > > > > "getdents()" is totally serialized by the inode semaphore. It's one of the > > > most expensive system calls in Linux, partly because of that, and partly > > > because it has to call all the way down into the filesystem in a way that > > > almost no other common system call has to (99% of all filesystem calls can > > > be handled basically at the VFS layer with generic caches - but not > > > getdents()). > > > > > > So if there are concurrent readdirs on the same directory, they get > > > serialized. If there is any file creation/deletion activity in the > > > directory, it serializes getdents(). > > > > > > To make matters worse, I don't think it has any read-ahead at all when you > > > use hashed directory entries. So if you have cold-cache case, you'll read > > > every single block totally individually, and serialized. One block at a > > > time (I think the non-hashed case is likely also suspect, but that's a > > > separate issue) > > > > > > In other words, I'm not at all surprised it hits on filldir time. > > > Especially on ext3. > > > > At work, we had the same problem on a file server with ext3. 
We use rsync > > to make backups to a local IDE disk, and we noticed that getdents() took > > about the same time as Peter reports (0.2 to 2 seconds), especially in > > maildir directories. We tried many things to fix it with no result, > > including enabling dirindexes. Finally, we made a full backup, and switched > > over to XFS and the problem totally disappeared. So it seems that the > > filesystem matters a lot here when there are lots of entries in a > > directory, and that ext3 is not suitable for usages with thousands > > of entries in directories with millions of files on disk. I'm not > > certain it would be that easy to try other filesystems on kernel.org > > though :-/ > > > > Yeah, slowly-growing directories will get splattered all over the disk. > > Possible short-term fixes would be to just allocate up to (say) eight > blocks when we grow a directory by one block. Or teach the > directory-growth code to use ext3 reservations. > > Longer-term people are talking about things like on-disk reservations. > But I expect directories are being forgotten about in all of that. By on-disk reservations, do you mean persistent file preallocation ? (that is explicit preallocation of blocks to a given file) If so, you are right, we haven't really given any thought to the possibility of directories needing that feature. Regards Suparna > > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Fri, Jan 05, 2007 at 08:02:33AM +0100, Jens Axboe wrote:
> On Fri, Jan 05 2007, Suparna Bhattacharya wrote:
> > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > > On Thu, 4 Jan 2007 10:26:21 +0530
> > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > This patchset implements changes to make filesystem AIO read
> > > > > > and write asynchronous for the non O_DIRECT case.
> > > > >
> > > > > Unfortunately the unplugging changes in Jens' block tree have
> > > > > trashed these patches to a degree that I'm not confident in my
> > > > > repair attempts. So I'll drop the fsaio patches from -mm.
> > > >
> > > > I took a quick look and the conflicts seem pretty minor to me, the
> > > > unplugging changes mostly touch nearby code.
> > >
> > > Well... the conflicts (both mechanical and conceptual) are such that a
> > > round of retesting is needed.
> > >
> > > > Please let me know how you want this fixed up.
> > > >
> > > > From what I can tell the comments in the unplug patches seem to say
> > > > that it needs more work and testing, so perhaps a separate fixup
> > > > patch may be a better idea rather than make the fsaio patchset
> > > > dependent on this.
> > >
> > > Patches against next -mm would be appreciated, please. Sorry about that.
> > >
> > > I _assume_ Jens is targeting 2.6.21?
> >
> > When is the next -mm likely to be out ?
> >
> > I was considering regenerating the blk unplug patches against the
> > fsaio changes instead of the other way around, if Jens were willing to
> > accept that. But if the next -mm is just around the corner then it's
> > not an issue.
>
> I don't really care much, but I work against mainline and anything but
> occasional one-off generations of a patch against a different base is
> not very likely.
>
> The -mm order should just reflect the merge order of the patches, what
> is the fsaio target?

2.6.21 was what I had in mind, to enable the glibc folks to proceed with
conversion to native AIO. Regenerating my patches against the unplug stuff
is not a problem; I only worry about being queued up behind something that
may take longer to stabilize and is likely to change ... If that is not
the case, I don't mind.

Regards
Suparna

> --
> Jens Axboe
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [EMAIL PROTECTED] For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [EMAIL PROTECTED]

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> On Thu, 4 Jan 2007 10:26:21 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > This patchset implements changes to make filesystem AIO read
> > > > and write asynchronous for the non O_DIRECT case.
> > >
> > > Unfortunately the unplugging changes in Jens' block tree have trashed
> > > these patches to a degree that I'm not confident in my repair attempts.
> > > So I'll drop the fsaio patches from -mm.
> >
> > I took a quick look and the conflicts seem pretty minor to me, the
> > unplugging changes mostly touch nearby code.
>
> Well... the conflicts (both mechanical and conceptual) are such that a
> round of retesting is needed.
>
> > Please let me know how you want this fixed up.
> >
> > From what I can tell the comments in the unplug patches seem to say that
> > it needs more work and testing, so perhaps a separate fixup patch may be
> > a better idea rather than make the fsaio patchset dependent on this.
>
> Patches against next -mm would be appreciated, please. Sorry about that.
>
> I _assume_ Jens is targeting 2.6.21?

When is the next -mm likely to be out ?

I was considering regenerating the blk unplug patches against the fsaio
changes instead of the other way around, if Jens were willing to accept
that. But if the next -mm is just around the corner then it's not an issue.

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
[PATCHSET 4][PATCH 1/1] AIO fallback for pipes, sockets and pollable fds
	msg->msg_iovlen = nr_segs;
-	msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+	msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+			? MSG_DONTWAIT : 0;
 	return __sock_recvmsg(iocb, sock, msg, size, msg->msg_flags);
 }
@@ -741,7 +742,8 @@ static ssize_t do_sock_write(struct msgh
 	msg->msg_controllen = 0;
 	msg->msg_iov = (struct iovec *)iov;
 	msg->msg_iovlen = nr_segs;
-	msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+	msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+			? MSG_DONTWAIT : 0;
 	if (sock->type == SOCK_SEQPACKET)
 		msg->msg_flags |= MSG_EOR;
diff -puN fs/pipe.c~aio-fallback-nonblock fs/pipe.c
--- linux-2.6.20-rc1/fs/pipe.c~aio-fallback-nonblock	2007-01-03 19:16:36.0 +0530
+++ linux-2.6.20-rc1-root/fs/pipe.c	2007-01-03 19:16:36.0 +0530
@@ -226,14 +226,16 @@ pipe_read(struct kiocb *iocb, const stru
 	struct pipe_inode_info *pipe;
 	int do_wakeup;
 	ssize_t ret;
-	struct iovec *iov = (struct iovec *)_iov;
+	struct iovec iov_array[nr_segs];
+	struct iovec *iov = iov_array;
 	size_t total_len;
 
-	total_len = iov_length(iov, nr_segs);
+	total_len = iov_length(_iov, nr_segs);
 	/* Null read succeeds. */
 	if (unlikely(total_len == 0))
 		return 0;
 
+	memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
 	do_wakeup = 0;
 	ret = 0;
 	mutex_lock(&inode->i_mutex);
@@ -302,7 +304,8 @@ redo:
 		 */
 		if (ret)
 			break;
-		if (filp->f_flags & O_NONBLOCK) {
+		if (filp->f_flags & O_NONBLOCK ||
+		    !is_sync_kiocb(iocb)) {
 			ret = -EAGAIN;
 			break;
 		}
@@ -339,15 +342,17 @@ pipe_write(struct kiocb *iocb, const str
 	struct pipe_inode_info *pipe;
 	ssize_t ret;
 	int do_wakeup;
-	struct iovec *iov = (struct iovec *)_iov;
+	struct iovec iov_array[nr_segs];
+	struct iovec *iov = iov_array;
 	size_t total_len;
 	ssize_t chars;
 
-	total_len = iov_length(iov, nr_segs);
+	total_len = iov_length(_iov, nr_segs);
 	/* Null write succeeds. */
 	if (unlikely(total_len == 0))
 		return 0;
 
+	memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
 	do_wakeup = 0;
 	ret = 0;
 	mutex_lock(&inode->i_mutex);
@@ -473,7 +478,7 @@ redo2:
 		}
 		if (bufs < PIPE_BUFFERS)
 			continue;
-		if (filp->f_flags & O_NONBLOCK) {
+		if (filp->f_flags & O_NONBLOCK || !is_sync_kiocb(iocb)) {
 			if (!ret)
 				ret = -EAGAIN;
 			break;
_

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
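A note on the semantics the patch above builds on (this example is illustrative and not part of the patch): for pipes, an AIO request (where !is_sync_kiocb() holds) is now handled the same way as O_NONBLOCK - instead of sleeping, the operation fails with -EAGAIN and the AIO core retries it later. The userspace analogue of that return is a read from an empty nonblocking pipe:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a pipe, make the read end nonblocking, and attempt a read while
 * the pipe is empty.  Returns the errno the read produced (EAGAIN on
 * Linux), 0 if the read unexpectedly succeeded, or -1 on setup failure. */
static int empty_pipe_read_errno(void)
{
	int fds[2];
	char c;
	ssize_t r;
	int err;

	if (pipe(fds) < 0)
		return -1;
	fcntl(fds[0], F_SETFL, O_NONBLOCK);
	errno = 0;
	r = read(fds[0], &c, 1);	/* nothing written yet: fails immediately */
	err = (r < 0) ? errno : 0;
	close(fds[0]);
	close(fds[1]);
	return err;
}
```

In the patched kernel path the same -EAGAIN is not returned to the caller; it is intercepted by the AIO retry machinery, which re-drives the iocb when the pipe becomes readable.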
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 05:50:11PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> > On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> >
> > > So long as AIO threads do the same, there would be no problem (plugging
> > > is optional, of course).
> >
> > Yup, the AIO threads run the same code as for regular IO, i.e. in the
> > rare situations where they actually end up submitting IO, so there
> > should be no problem. And you have already added plug/unplug at the
> > appropriate places in those paths, so things should just work.
>
> Yes I think it should.
>
> > > This (is supposed to) give a number of improvements over the
> > > traditional plugging (although some downsides too). Most notably for
> > > me, the VM gets cleaner ;)
> > >
> > > However AIO could be an interesting case to test for explicit plugging
> > > because of the way they interact. What kind of improvements do you see
> > > with samba and do you have any benchmark setups?
> >
> > I think aio-stress would be a good way to test/benchmark this sort of
> > stuff, at least for a start.
> > Samba (if I understand this correctly based on my discussions with
> > Tridge) is less likely to generate the kind of io patterns that could
> > benefit from explicit plugging (because the file server has no way to
> > tell what the next request is going to be, it ends up submitting each
> > independently instead of batching iocbs).
>
> OK, but I think that after IO submission, you do not run sync_page to
> unplug the block device, like the normal IO path would (via lock_page,
> before the explicit plug patches).

In the buffered AIO case, we do run sync_page like normal IO ... just that
we don't block in io_schedule(); everything else is pretty much similar.
In the case of AIO-DIO, the path is just like non-AIO DIO; there is a call
to blk_run_address_space() after submission.

> However, with explicit plugging, AIO requests will be started immediately.
> Maybe this won't be noticeable if the device is always busy, but I would
> like to know there isn't a regression.
>
> > In future there may be optimization possibilities to consider when
> > submitting batches of iocbs, i.e. on the io submission path. Maybe
> > AIO - O_DIRECT would be interesting to play with first in this regard ?
>
> Well I've got some simple per-process batching in there now, each process
> has a list of pending requests. Request merging is done locklessly against
> the last request added; and submission at unplug time is batched under a
> single block device lock.
>
> I'm sure more merging or batching could be done, but also consider that
> most programs will not ever make use of any added complexity.

I guess I didn't express myself well - by batching I meant being able to
surround submission of a batch of iocbs with explicit plug/unplug instead
of explicit plug/unplug for each iocb separately. Of course there is no
easy way to do that, since at the io_submit() level we do not know about
the block device (each iocb could be directed to a different fd and not
just block devices). So it may not be worth thinking about.

> Regarding your patches, I've just had a quick look and have a question --
> what do you do about blocking in page reclaim and dirty balancing? Aren't
> those major points of blocking with buffered IO? Did your test cases
> dirty enough to start writeout or cause a lot of reclaim? (admittedly,
> blocking in reclaim will now be much less common since the dirty mapping
> accounting).

In my earlier versions of patches I actually had converted these waits to
be async retriable, but then I came to the conclusion that the additional
complexity wasn't worth it.

For one, it didn't seem to make a difference compared to the other bigger
cases, and I was looking primarily at handling the gross blocking points
(say, to enable an application to keep device queues busy) and not making
everything asynchronous; for another, we had a long discussion thread way
back about not making AIO submitters exempt from throttling or memory
availability waits.

Regards
Suparna

> --
> SUSE Labs, Novell Inc.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
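To make the batching point above concrete - an illustrative sketch, not from the thread, using the raw Linux AIO syscalls (the helper name, temp-file path, and sizes are arbitrary): a single io_submit() call carries a whole batch of iocbs, which is the level at which any plug/unplug bracketing of a batch would have to happen.

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Submit two buffered reads of a temp file as ONE io_submit() batch, so
 * the kernel sees both iocbs in a single syscall.  Returns the number of
 * completed events (2 on success) or -1 on error. */
static int submit_batch_of_two(void)
{
	aio_context_t ctx = 0;
	struct iocb cb[2], *cbs[2];
	struct io_event ev[2];
	char buf0[8], buf1[8];
	char tmpl[] = "/tmp/fsaio-demo-XXXXXX";
	int fd = mkstemp(tmpl);
	int i;
	long done;

	if (fd < 0)
		return -1;
	unlink(tmpl);
	if (write(fd, "abcdefghijklmnop", 16) != 16)
		return -1;
	if (syscall(SYS_io_setup, 8, &ctx) < 0)
		return -1;

	memset(cb, 0, sizeof(cb));
	for (i = 0; i < 2; i++) {
		cb[i].aio_lio_opcode = IOCB_CMD_PREAD;
		cb[i].aio_fildes = fd;
		cb[i].aio_nbytes = 8;
		cb[i].aio_offset = i * 8;	/* two disjoint 8-byte reads */
		cbs[i] = &cb[i];
	}
	cb[0].aio_buf = (unsigned long)buf0;
	cb[1].aio_buf = (unsigned long)buf1;

	/* one syscall, two iocbs: the whole batch reaches the kernel at once */
	if (syscall(SYS_io_submit, ctx, 2, cbs) != 2)
		return -1;
	done = syscall(SYS_io_getevents, ctx, 2, 2, ev, NULL);
	syscall(SYS_io_destroy, ctx);
	close(fd);
	return (int)done;
}
```

As the mail notes, each iocb in such a batch may target a different fd, so io_submit() itself cannot know which block device (if any) a plug would belong to - which is why the per-iocb plug/unplug was deemed not worth pursuing.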
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> >
> > > Plus Jens's unplugging changes add more reliance upon context inside
> > > *current, for the plugging and unplugging operations. I expect that
> > > the fsaio patches will need to be aware of the protocol which those
> > > proposed changes add.
> >
> > Whatever logic applies to background writeout etc. should also just
> > apply as is to aio worker threads, shouldn't it ? At least at a quick
> > glance I don't see anything special that needs to be done for fsaio,
> > but it's good to be aware of this anyway, thanks !
>
> The submitting process plugs itself, submits all its IO, then unplugs
> itself (ie. so the plug is now on the process, rather than the block
> device).
>
> So long as AIO threads do the same, there would be no problem (plugging
> is optional, of course).

Yup, the AIO threads run the same code as for regular IO, i.e. in the rare
situations where they actually end up submitting IO, so there should be no
problem. And you have already added plug/unplug at the appropriate places
in those paths, so things should just work.

> This (is supposed to) give a number of improvements over the traditional
> plugging (although some downsides too). Most notably for me, the VM gets
> cleaner ;)
>
> However AIO could be an interesting case to test for explicit plugging
> because of the way they interact. What kind of improvements do you see
> with samba and do you have any benchmark setups?

I think aio-stress would be a good way to test/benchmark this sort of
stuff, at least for a start.
Samba (if I understand this correctly based on my discussions with Tridge)
is less likely to generate the kind of io patterns that could benefit from
explicit plugging (because the file server has no way to tell what the
next request is going to be, it ends up submitting each independently
instead of batching iocbs).

In future there may be optimization possibilities to consider when
submitting batches of iocbs, i.e. on the io submission path. Maybe
AIO - O_DIRECT would be interesting to play with first in this regard ?

Regards
Suparna

> Thanks,
> Nick
>
> --
> SUSE Labs, Novell Inc.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 13:53:08 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > This patchset implements changes to make filesystem AIO read
> > and write asynchronous for the non O_DIRECT case.
>
> Unfortunately the unplugging changes in Jens' block tree have trashed
> these patches to a degree that I'm not confident in my repair attempts.
> So I'll drop the fsaio patches from -mm.

I took a quick look and the conflicts seem pretty minor to me, the
unplugging changes mostly touch nearby code. Please let me know how you
want this fixed up.

From what I can tell the comments in the unplug patches seem to say that
it needs more work and testing, so perhaps a separate fixup patch may be
a better idea rather than make the fsaio patchset dependent on this.

> Zach's observations regarding this code's reliance upon things at *current
> sounded pretty serious, so I expect we'd be seeing changes for that anyway?

Not really, at least nothing that I can see needing a change. As I
mentioned, there is no reliance on *current in the code that runs in the
aio threads that we need to worry about. The generic_write_checks etc.
that Zach was referring to all happen in the context of the submitting
process, not in retry context. The model is to perform all validation at
the time of io submission. And of course things like copy_to_user() are
already taken care of by use_mm().

Let's look at it this way - the kernel already has the ability to do
background writeout on behalf of a task from a kernel thread, and likewise
to read(ahead) pages that may be consumed by another task. There is also
the ability to operate on another task's address space (as used by
ptrace). So there is nothing groundbreaking here. In fact on most
occasions, all the IO is initiated in the context of the submitting task,
so the aio threads mainly deal with checking for completion and
transferring completed data to user space.

> Plus Jens's unplugging changes add more reliance upon context inside
> *current, for the plugging and unplugging operations. I expect that the
> fsaio patches will need to be aware of the protocol which those proposed
> changes add.

Whatever logic applies to background writeout etc. should also just apply
as is to aio worker threads, shouldn't it ? At least at a quick glance I
don't see anything special that needs to be done for fsaio, but it's good
to be aware of this anyway, thanks !

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
[PATCHSET 2][PATCH 1/1] Combining epoll and disk file AIO
On Wed, Dec 27, 2006 at 09:08:56PM +0530, Suparna Bhattacharya wrote:
> (2) Most of these other applications need the ability to process both
> network events (epoll) and disk file AIO in the same loop. With POSIX AIO
> they could at least sort of do this using signals (yeah, and all associated
> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner. Tridge has written a test harness to
> try out the Samba4 event library modifications to use this. Jeff Moyer
> has a modified version of pipetest for comparison.

Enable epoll wait to be unified with io_getevents

From: Zach Brown, Jeff Moyer, Suparna Bhattacharya

Previously there have been (complicated and scary) attempts to funnel individual aio events down epoll or vice versa. This instead lets one issue an entire sys_epoll_wait() as an aio op. You'd set up epoll as usual and then issue epoll_wait aio ops which would complete once epoll events had been copied. This will enable a single io_getevents() event loop to process both disk AIO and epoll notifications.

From an application standpoint a typical flow works like this:
- Use epoll_ctl as usual to add/remove epoll registrations
- Instead of issuing sys_epoll_wait, set up an iocb using io_prep_epoll_wait (see examples below), specifying the epoll events buffer to fill up with epoll notifications. Submit the iocb using io_submit
- Now io_getevents can be used to wait for both epoll waits and disk aio completion

If the returned AIO event is of type IO_CMD_EPOLL_WAIT, then the corresponding result value indicates the number of epoll notifications in the iocb's event buffer, which can now be processed just as one would process results from a sys_epoll_wait().

There are a couple of sample applications:
- Andrew Tridgell has implemented a little test harness using an aio events library implementation intended for samba4: http://samba.org/~tridge/etest (The -e aio option uses aio epoll wait and can issue disk aio as well)
- An updated version of pipetest from Jeff Moyer has a --aio-epoll option: http://people.redhat.com/jmoyer/aio/epoll/pipetest.c

There is obviously a little overhead compared to using sys_epoll_wait(), due to the extra step of submitting the epoll wait iocb, most noticeable when there are very few events processed per loop. However, the goal here is not to build an epoll alternative but merely to allow network and disk I/O to be processed in the same event loop, which is where the efficiencies really come from. Picking up more epoll events in each loop can amortize the overhead across many operations to mitigate the impact.

Thanks to Arjan van de Ven for helping figure out how to resolve the lockdep complaints. Both ctx->lock and ep->lock can be held in certain wait queue callback routines, thus being nested inside q->lock. However, this excludes ctx->wait or ep->wq wait queues, which can safely be nested inside ctx->lock or ep->lock respectively. So we teach lockdep to recognize these as distinct classes.
Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Signed-off-by: Jeff Moyer <[EMAIL PROTECTED]>
Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
---
 linux-2.6.20-rc1-root/fs/aio.c                  |   54 +
 linux-2.6.20-rc1-root/fs/eventpoll.c            |   95 +---
 linux-2.6.20-rc1-root/include/linux/aio.h       |    2
 linux-2.6.20-rc1-root/include/linux/aio_abi.h   |    1
 linux-2.6.20-rc1-root/include/linux/eventpoll.h |   31 +++
 linux-2.6.20-rc1-root/include/linux/sched.h     |    2
 linux-2.6.20-rc1-root/kernel/timer.c            |   21 +
 7 files changed, 196 insertions(+), 10 deletions(-)

diff -puN fs/aio.c~aio-epoll-wait fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-epoll-wait	2006-12-28 14:22:52.0 +0530
+++ linux-2.6.20-rc1-root/fs/aio.c	2007-01-03 11:45:40.0 +0530
@@ -30,6 +30,7 @@
 #include <linux/highmem.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
+#include <linux/eventpoll.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -193,6 +194,8 @@ static int aio_setup_ring(struct kioctx
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+static struct lock_class_key ioctx_wait_queue_head_lock_key;
+
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -224,6 +227,8 @@ static struct kioctx *ioctx_alloc(unsign
 	spin_lock_init(&ctx->ctx_lock);
 	spin_lock_init(&ctx->ring_info.ring_lock);
 	init_waitqueue_head(&ctx->wait);
+	/* Teach lockdep to recognize this lock as a different class */
+	lockdep_set_class(&ctx->wait.lock, &ioctx_wait_queue_head_lock_key);
 	INIT_LIST_HEAD(&ctx->active_reqs);
 	INIT_LIST_HEAD(&ctx->run_list);
@@ -1401,6 +1406,42 @@ static ssize_t aio_setup_single_vector(s
 	return 0;
 }
 
+/* Uses
Re: [RFC] Heads up on a series of AIO patchsets
On Tue, Jan 02, 2007 at 03:56:09PM -0800, Zach Brown wrote:
> Sorry for the delay, I'm finally back from the holiday break :)

Welcome back !

> > (1) The filesystem AIO patchset, attempts to address one part of the problem
> > which is to make regular file IO, (without O_DIRECT) asynchronous (mainly
> > the case of reads of uncached or partially cached files, and O_SYNC writes).
>
> One of the properties of the currently implemented EIOCBRETRY aio
> path is that ->mm is the only field in current which matches the
> submitting task_struct while inside the retry path.

Yes, and that, as I guess you know, is to enable the aio worker thread to operate on the caller's address space for copy_from/to_user. The actual io setup and associated checks are expected to have been handled at submission time.

> It looks like a retry-based aio write path would be broken because of
> this. generic_write_checks() could run in the aio thread and get its
> task_struct instead of that of the submitter. The wrong rlimit will
> be tested and SIGXFSZ won't be raised. remove_suid() could check the
> capabilities of the aio thread instead of those of the submitter.

generic_write_checks() are done in the submission path, not repeated during retries, so such types of checks are not intended to run in the aio thread. Did I miss something here ?

> I don't think EIOCBRETRY is the way to go because of this increased
> (and subtle!) complexity. What are the chances that we would have
> ever found those bugs outside code review? How do we make sure that
> current references don't sneak back in after having initially audited
> the paths?

The EIOCBRETRY route is not something that is intended to be used blindly; it is just one alternative to implement an aio operation by splitting up responsibility between the submitter and aio threads, where aio threads can run in the caller's address space.

> Take the io_cmd_epoll_wait patch..
>
> > issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> > modifications from Jeff Moyer and me) addresses this problem for native
> > linux aio in a simple manner.
>
> It's simple looking, sure. This current flipping didn't even occur
> to me while throwing the patch together!
>
> But that patch ends up calling ->poll (and poll_table->qproc) and
> writing to userspace (so potentially calling ->nopage) from the aio

Yes of course, but why is that a problem ? The copy_from/to_user/put_user constructs are designed to handle soft failures, and we are already using the caller's ->mm. Do you see a need for any additional asserts() ? If there is something that is needed by ->nopage etc. which is not abstracted out within the ->mm, then we would need to fix that instead, for correctness anyway, isn't that so ?

Now it is possible that there are minor blocking points in the code, and the effect of these would be to hold up / delay subsequent queued aio operations; which is an efficiency issue, but not a correctness concern.

> threads. Are we sure that none of them will behave surprisingly
> because current changed under them?

My take is that we should fix the problems that we see. It is likely that what manifests relatively more easily with AIO is also a subtle problem in other cases.

> It might be safe now, but that isn't really the point. I'd rather we
> didn't have yet one more subtle invariant to audit and maintain.
>
> At the risk of making myself vulnerable to the charge of mentioning
> vapourware, I will admit that I've been working on a (slightly mad)
> implementation of async syscalls. I've been quiet about it because I
> don't want to whip up complicated discussion without being able to
> show code that works, even if barely. I mention it now only to make
> it clear that I want to be constructive, not just critical :).

That is great and I look forward to it :) I am, however, assuming that whatever implementation you come up with will have a different interface from current linux aio -- i.e. a next generation aio model, that will be easily integratable with kevents etc. Which takes me back to Ingo's point - let's have the new evolve in parallel with the old, if we can, and not hold up the patches for POSIX AIO to start using kernel AIO, or for epoll to integrate with AIO.

OK, I just took a quick look at your blog and I see that you are basically implementing Linus' microthreads scheduling approach - one year since we had that discussion. Glad to see that you found a way to make it workable ... (I'm guessing that you are copying over the part of the stack in use at the time of every switch, is that correct ? At what point do you do the allocation of the saved stacks ? Sorry, I should hold off all these questions till your patch comes out)

Regards
Suparna
Re: [FSAIO][PATCH 7/8] Filesystem AIO read
On Thu, Dec 28, 2006 at 11:57:47AM +, Christoph Hellwig wrote:
> > +	if (in_aio()) {
> > +		/* Avoid repeat readahead */
> > +		if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
> > +			next_index = last_index;
> > +	}
>
> Every place we use kiocbTryRestart in this and the next patch it's in
> this form, so we should add a little helper for it:
>
> int aio_try_restart(void)
> {
> 	struct wait_queue_head_t *wq = current->io_wait;
>
> 	if (!is_sync_wait(wq) && kiocbTryRestart(io_wait_to_kiocb(wq)))
> 		return 1;
> 	return 0;
> }

Yes, we can do that -- how about aio_restarted() as an alternate name ?

> with a big kerneldoc comment explaining this idiom (and possibly a better
> name for the function ;-))
>
> > +
> > +	if ((error = __lock_page(page, current->io_wait))) {
> > +		goto readpage_error;
> > +	}
>
> This should be
>
> 	error = __lock_page(page, current->io_wait);
> 	if (error)
> 		goto readpage_error;
>
> Plus possible naming updates discussed in the last mail. Also do we
> really need to pass current->io_wait here? Isn't the waitqueue in
> the kiocb always guaranteed to be the same? Now that all pagecache

We don't have the kiocb available to this routine. Using current->io_wait avoids the need to pass the iocb down to deeper levels just for the sync vs async checks, also allowing such routines to be shared by other code which does not use iocbs (e.g. generic_file_sendfile->do_generic_file_read->do_generic_mapping_read) without having to set up dummy iocbs. Does that clarify ? We could abstract this away within a lock page wrapper, but I don't know if that makes a difference.

> I/O goes through the ->aio_read/->aio_write routines I'd prefer to
> get rid of the task_struct field cludges and pass all this around in
> the kiocb.
Regards
Suparna
Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page
On Thu, Dec 28, 2006 at 11:55:10AM +, Christoph Hellwig wrote:
> On Thu, Dec 28, 2006 at 02:11:49PM +0530, Suparna Bhattacharya wrote:
> > -extern void FASTCALL(lock_page_slow(struct page *page));
> > +extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait));
> >  extern void FASTCALL(__lock_page_nosync(struct page *page));
> >  extern void FASTCALL(unlock_page(struct page *page));
> >
> >  /*
> >   * lock_page may only be called if we have the page's inode pinned.
> >   */
> > -static inline void lock_page(struct page *page)
> > +static inline int __lock_page(struct page *page, wait_queue_t *wait)
> >  {
> >  	might_sleep();
> >  	if (TestSetPageLocked(page))
> > -		lock_page_slow(page);
> > +		return __lock_page_slow(page, wait);
> > +	return 0;
> >  }
> >
> > +#define lock_page(page)	__lock_page(page, &current->__wait.wait)
> > +#define lock_page_slow(page)	__lock_page_slow(page, &current->__wait.wait)
>
> Can we please simply kill your lock_page_slow wrapper and rename the
> arguments taking __lock_page_slow to lock_page_slow? All too many
> variants of the locking functions aren't all that useful and there's
> very few users.

OK.

> Similarly I don't really think __lock_page is an all that useful name here.
> What about lock_page_wq? or aio_lock_page to denote it has special

I am really bad with names :( I tried using the _wq suffixes earlier and that seemed confusing to some, but if no one else objects I'm happy to use that. I thought aio_lock_page() might be misleading because it is synchronous if a regular wait queue entry is passed in, but again it may not be too bad. What's your preference ? Does anything more intuitive come to mind ?

> meaning in aio context? Then again, because of these special semantics
> we need a bunch of really verbose kerneldoc comments for this function
> family.
Regards
Suparna
Re: [FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine
Sorry, this should have read [PATCH 1/8] instead of [PATCH 1/6]

Regards
Suparna
[FSAIO][PATCH 8/8] AIO O_SYNC filesystem write
AIO support for O_SYNC buffered writes, built over O_SYNC-speedup. It uses the tagged radix tree lookups to write out just the pages pertaining to this request, and retries instead of blocking for writeback to complete on the same range. All the writeout is issued at the time of io submission, and there is a check to make sure that retries skip straight to the wait_on_page_writeback_range.

Limitations: Extending file writes or hole overwrites with O_SYNC may still block because we have yet to convert generic_osync_inode to be asynchronous. For non O_SYNC writes, writeout happens in the background and so typically appears async to the caller, except for memory throttling and non-block-aligned writes involving read-modify-write.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/aio.h                      |    0
 linux-2.6.20-rc1-root/include/linux/fs.h |   13 +-
 linux-2.6.20-rc1-root/mm/filemap.c       |   61 +--
 3 files changed, 54 insertions(+), 20 deletions(-)

diff -puN include/linux/aio.h~aio-fs-write include/linux/aio.h
diff -puN mm/filemap.c~aio-fs-write mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~aio-fs-write	2006-12-21 08:46:21.0 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c	2006-12-21 08:46:21.0 +0530
@@ -239,10 +239,11 @@ EXPORT_SYMBOL(filemap_flush);
  * @end:	ending page index
  *
  * Wait for writeback to complete against pages indexed by start->end
- * inclusive
+ * inclusive. In AIO context, this may queue an async notification
+ * and retry callback and return, instead of blocking the caller.
  */
-int wait_on_page_writeback_range(struct address_space *mapping,
-				pgoff_t start, pgoff_t end)
+int __wait_on_page_writeback_range(struct address_space *mapping,
+				pgoff_t start, pgoff_t end, wait_queue_t *wait)
 {
 	struct pagevec pvec;
 	int nr_pages;
@@ -254,20 +255,20 @@ int wait_on_page_writeback_range(struct
 	pagevec_init(&pvec, 0);
 	index = start;
-	while ((index <= end) &&
+	while (!ret && (index <= end) &&
 			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
 			PAGECACHE_TAG_WRITEBACK,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
 		unsigned i;
 
-		for (i = 0; i < nr_pages; i++) {
+		for (i = 0; !ret && (i < nr_pages); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* until radix tree lookup accepts end_index */
 			if (page->index > end)
 				continue;
 
-			wait_on_page_writeback(page);
+			ret = __wait_on_page_writeback(page, wait);
 			if (PageError(page))
 				ret = -EIO;
 		}
@@ -303,18 +304,27 @@ int sync_page_range(struct inode *inode,
 {
 	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
 	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
+	int ret = 0;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
 		return 0;
+	if (in_aio()) {
+		/* Already issued writeouts for this iocb ? */
+		if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+			goto do_wait;	/* just need to check if done */
+	}
 	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
+
-	if (ret == 0) {
+	if (ret >= 0) {
 		mutex_lock(&inode->i_mutex);
 		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
 		mutex_unlock(&inode->i_mutex);
 	}
-	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
+do_wait:
+	if (ret >= 0) {
+		ret = __wait_on_page_writeback_range(mapping, start, end,
+				current->io_wait);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(sync_page_range);
@@ -335,15 +345,23 @@ int sync_page_range_nolock(struct inode
 {
 	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
 	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
+	int ret = 0;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
 		return 0;
+	if (in_aio()) {
+		/* Already issued writeouts for this iocb ? */
+		if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+			goto do_wait;	/* just need to check if done */
+	}
 	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-	if (ret == 0)
+	if (ret >= 0)
 		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-	if (
[FSAIO][PATCH 7/8] Filesystem AIO read
Converts the wait for page to become uptodate (lock page) after readahead/readpage (in do_generic_mapping_read) to a retry exit, to make buffered filesystem AIO reads actually synchronous. The patch avoids exclusive wakeups with AIO, a problem originally spotted by Chris Mason, though the reasoning for why it is an issue is now much clearer (see explanation in the comment below in aio.c), and the solution is perhaps slightly simpler. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/fs/aio.c| 11 ++- linux-2.6.20-rc1-root/include/linux/aio.h |5 + linux-2.6.20-rc1-root/mm/filemap.c| 19 --- 3 files changed, 31 insertions(+), 4 deletions(-) diff -puN fs/aio.c~aio-fs-read fs/aio.c --- linux-2.6.20-rc1/fs/aio.c~aio-fs-read 2006-12-21 08:46:13.0 +0530 +++ linux-2.6.20-rc1-root/fs/aio.c 2006-12-28 09:26:30.0 +0530 @@ -1529,7 +1529,16 @@ static int aio_wake_function(wait_queue_ list_del_init(>task_list); kick_iocb(iocb); - return 1; + /* +* Avoid exclusive wakeups with retries since an exclusive wakeup +* may involve implicit expectations of waking up the next waiter +* and there is no guarantee that the retry will take a path that +* would do so. For example if a page has become up-to-date, then +* a retried read may end up straightaway performing a copyout +* and not go through a lock_page - unlock_page that would have +* passed the baton to the next waiter. 
+*/ + return 0; } int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, diff -puN mm/filemap.c~aio-fs-read mm/filemap.c --- linux-2.6.20-rc1/mm/filemap.c~aio-fs-read 2006-12-21 08:46:13.0 +0530 +++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:31:48.0 +0530 @@ -909,6 +909,11 @@ void do_generic_mapping_read(struct addr if (!isize) goto out; + if (in_aio()) { + /* Avoid repeat readahead */ + if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait))) + next_index = last_index; + } end_index = (isize - 1) >> PAGE_CACHE_SHIFT; for (;;) { struct page *page; @@ -978,7 +983,10 @@ page_ok: page_not_up_to_date: /* Get exclusive access to the page ... */ - lock_page(page); + + if ((error = __lock_page(page, current->io_wait))) { + goto readpage_error; + } /* Did it get truncated before we got the lock? */ if (!page->mapping) { @@ -1006,7 +1014,8 @@ readpage: } if (!PageUptodate(page)) { - lock_page(page); + if ((error = __lock_page(page, current->io_wait))) + goto readpage_error; if (!PageUptodate(page)) { if (page->mapping == NULL) { /* @@ -1052,7 +1061,11 @@ readpage: goto page_ok; readpage_error: - /* UHHUH! A synchronous read error occurred. Report it */ + /* We don't have uptodate data in the page yet */ + /* Could be due to an error or because we need to +* retry when we get an async i/o notification. +* Report the reason. 
+*/ desc->error = error; page_cache_release(page); goto out; diff -puN include/linux/aio.h~aio-fs-read include/linux/aio.h --- linux-2.6.20-rc1/include/linux/aio.h~aio-fs-read 2006-12-21 08:46:13.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/aio.h 2006-12-28 09:26:27.0 +0530 @@ -34,21 +34,26 @@ struct kioctx; /* #define KIF_LOCKED 0 */ #define KIF_KICKED 1 #define KIF_CANCELLED 2 +#define KIF_RESTARTED 3 #define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags) +#define kiocbTryRestart(iocb) test_and_set_bit(KIF_RESTARTED, &(iocb)->ki_flags) #define kiocbSetLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags) #define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags) +#define kiocbSetRestarted(iocb) set_bit(KIF_RESTARTED, &(iocb)->ki_flags) #define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags) #define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)-
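The kiocbTryRestart() guard introduced above leans entirely on test_and_set_bit() semantics: the first retry pass finds KIF_RESTARTED clear, sets it, and gets back 0 (so readahead is issued exactly once); every later pass gets back 1 and jumps straight to the completion check. A minimal userspace sketch of that idiom (plain C with hypothetical names, not the kernel code — the real kiocbTryRestart uses the atomic bitops in the patch):

```c
#include <assert.h>

/* Hypothetical stand-in for the kiocb flag word (illustration only) */
enum { KIF_RESTARTED = 3 };

struct fake_kiocb {
    unsigned long ki_flags;
};

/* Non-atomic userspace model of test_and_set_bit(): set the bit,
 * return its previous value. */
static int test_and_set_bit_model(int nr, unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    int old = (*addr & mask) != 0;
    *addr |= mask;
    return old;
}

/* Model of kiocbTryRestart(): 0 on the first call (issue readahead),
 * nonzero on every retry (skip readahead, just check completion). */
static int kiocb_try_restart(struct fake_kiocb *iocb)
{
    return test_and_set_bit_model(KIF_RESTARTED, &iocb->ki_flags);
}
```

The "try" naming mirrors kiocbTryLock/kiocbTryKick: the test and the set are one step, so a retry can never race itself into issuing readahead twice.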
[FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page
Define low-level page wait and lock page routines which take a wait queue entry pointer as an additional parameter and return status (which may be non-zero when the wait queue parameter signifies an asynchronous wait, typically during AIO). Synchronous IO waits become a special case where the wait queue parameter is the running task's default io wait context. Asynchronous IO waits happen when the wait queue parameter is the io wait context of a kiocb. Code paths which choose to execute synchronous or asynchronous behaviour depending on the called context specify the current io wait context (which points to sync or async context as the case may be) as the wait parameter. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/pagemap.h | 30 ++--- linux-2.6.20-rc1-root/include/linux/sched.h |1 linux-2.6.20-rc1-root/kernel/sched.c | 14 +++ linux-2.6.20-rc1-root/mm/filemap.c| 31 +++--- 4 files changed, 55 insertions(+), 21 deletions(-) diff -puN include/linux/pagemap.h~aio-wait-page include/linux/pagemap.h --- linux-2.6.20-rc1/include/linux/pagemap.h~aio-wait-page 2006-12-21 08:46:02.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/pagemap.h 2006-12-21 08:46:02.0 +0530 @@ -133,20 +133,24 @@ static inline pgoff_t linear_page_index( return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); } -extern void FASTCALL(lock_page_slow(struct page *page)); +extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait)); extern void FASTCALL(__lock_page_nosync(struct page *page)); extern void FASTCALL(unlock_page(struct page *page)); /* * lock_page may only be called if we have the page's inode pinned. 
*/ -static inline void lock_page(struct page *page) +static inline int __lock_page(struct page *page, wait_queue_t *wait) { might_sleep(); if (TestSetPageLocked(page)) - lock_page_slow(page); + return __lock_page_slow(page, wait); + return 0; } +#define lock_page(page) __lock_page(page, &current->__wait.wait) +#define lock_page_slow(page) __lock_page_slow(page, &current->__wait.wait) + /* * lock_page_nosync should only be used if we can't pin the page's inode. * Doesn't play quite so well with block device plugging. @@ -162,7 +166,8 @@ static inline void lock_page_nosync(stru * This is exported only for wait_on_page_locked/wait_on_page_writeback. * Never use this directly! */ -extern void FASTCALL(wait_on_page_bit(struct page *page, int bit_nr)); +extern int FASTCALL(wait_on_page_bit(struct page *page, int bit_nr, + wait_queue_t *wait)); /* * Wait for a page to be unlocked. @@ -171,21 +176,30 @@ extern void FASTCALL(wait_on_page_bit(st * ie with increased "page->count" so that the page won't * go away during the wait..
*/ -static inline void wait_on_page_locked(struct page *page) +static inline int __wait_on_page_locked(struct page *page, wait_queue_t *wait) { if (PageLocked(page)) - wait_on_page_bit(page, PG_locked); + return wait_on_page_bit(page, PG_locked, wait); + return 0; } +#define wait_on_page_locked(page) \ + __wait_on_page_locked(page, >__wait.wait) + /* * Wait for a page to complete writeback */ -static inline void wait_on_page_writeback(struct page *page) +static inline int __wait_on_page_writeback(struct page *page, + wait_queue_t *wait) { if (PageWriteback(page)) - wait_on_page_bit(page, PG_writeback); + return wait_on_page_bit(page, PG_writeback, wait); + return 0; } +#define wait_on_page_writeback(page) \ + __wait_on_page_writeback(page, >__wait.wait) + extern void end_page_writeback(struct page *page); /* diff -puN include/linux/sched.h~aio-wait-page include/linux/sched.h --- linux-2.6.20-rc1/include/linux/sched.h~aio-wait-page2006-12-21 08:46:02.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:26:27.0 +0530 @@ -216,6 +216,7 @@ extern void show_stack(struct task_struc void io_schedule(void); long io_schedule_timeout(long timeout); +int io_wait_schedule(wait_queue_t *wait); extern void cpu_init (void); extern void trap_init(void); diff -puN kernel/sched.c~aio-wait-page kernel/sched.c --- linux-2.6.20-rc1/kernel/sched.c~aio-wait-page 2006-12-21 08:46:02.0 +0530 +++ linux-2.6.20-rc1-root/kernel/sched.c2006-12-21 08:46:02.0 +0530 @@ -4743,6 +4743,20 @@ long __sched io_schedule_timeout(long ti return ret; } +/* + * Sleep only if the wait context passed is not async, + * otherwise return so that a retry can be issued later. + */ +int __sched io_wait_schedule(wait_queue_t *wait) +{ + if (
[FSAIO][PATCH 5/8] Enable wait bit based filtered wakeups to work for AIO
Enable wait bit based filtered wakeups to work for AIO. Replaces the wait queue entry in the kiocb with a wait bit structure, to allow enough space for the wait bit key. This adds an extra level of indirection in references to the wait queue entry in the iocb. Also, an extra check had to be added in aio_wake_function to allow for other kinds of waiters which do not require wait bit, based on the assumption that the key passed in would be NULL in such cases. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/fs/aio.c| 21 ++--- linux-2.6.20-rc1-root/include/linux/aio.h |7 --- linux-2.6.20-rc1-root/kernel/wait.c | 17 ++--- 3 files changed, 32 insertions(+), 13 deletions(-) diff -puN fs/aio.c~aio-wait-bit fs/aio.c --- linux-2.6.20-rc1/fs/aio.c~aio-wait-bit 2006-12-21 08:45:57.0 +0530 +++ linux-2.6.20-rc1-root/fs/aio.c 2006-12-28 09:32:27.0 +0530 @@ -719,13 +719,13 @@ static ssize_t aio_run_iocb(struct kiocb * cause the iocb to be kicked for continuation (through * the aio_wake_function callback). */ - BUG_ON(current->io_wait != NULL); - current->io_wait = &iocb->ki_wait; + BUG_ON(!is_sync_wait(current->io_wait)); + current->io_wait = &iocb->ki_wait.wait; ret = retry(iocb); current->io_wait = NULL; if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) { - BUG_ON(!list_empty(&iocb->ki_wait.task_list)); + BUG_ON(!list_empty(&iocb->ki_wait.wait.task_list)); aio_complete(iocb, ret, 0); } out: @@ -883,7 +883,7 @@ static void try_queue_kicked_iocb(struct * than retry has happened before we could queue the iocb. This also * means that the retry could have completed and freed our iocb, no * good.
*/ - BUG_ON((!list_empty(&iocb->ki_wait.task_list))); + BUG_ON((!list_empty(&iocb->ki_wait.wait.task_list))); spin_lock_irqsave(&ctx->ctx_lock, flags); /* set this inside the lock so that we can't race with aio_run_iocb() @@ -1519,7 +1519,13 @@ static ssize_t aio_setup_iocb(struct kio static int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key) { - struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait); + struct wait_bit_queue *wait_bit + = container_of(wait, struct wait_bit_queue, wait); + struct kiocb *iocb = container_of(wait_bit, struct kiocb, ki_wait); + + /* Assumes that a non-NULL key implies wait bit filtering */ + if (key && !test_wait_bit_key(wait, key)) + return 0; list_del_init(&wait->task_list); kick_iocb(iocb); @@ -1574,8 +1580,9 @@ int fastcall io_submit_one(struct kioctx req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf; req->ki_left = req->ki_nbytes = iocb->aio_nbytes; req->ki_opcode = iocb->aio_lio_opcode; - init_waitqueue_func_entry(&req->ki_wait, aio_wake_function); - INIT_LIST_HEAD(&req->ki_wait.task_list); + init_waitqueue_func_entry(&req->ki_wait.wait, aio_wake_function); + INIT_LIST_HEAD(&req->ki_wait.wait.task_list); + req->ki_run_list.next = req->ki_run_list.prev = NULL; ret = aio_setup_iocb(req); diff -puN include/linux/aio.h~aio-wait-bit include/linux/aio.h --- linux-2.6.20-rc1/include/linux/aio.h~aio-wait-bit 2006-12-21 08:45:57.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/aio.h 2006-12-28 09:32:27.0 +0530 @@ -102,7 +102,7 @@ struct kiocb { } ki_obj; __u64 ki_user_data; /* user's data for completion */ - wait_queue_t ki_wait; + struct wait_bit_queue ki_wait; loff_t ki_pos; atomic_t ki_bio_count; /* num bio used for this iocb */ @@ -135,7 +135,7 @@ struct kiocb { (x)->ki_dtor = NULL;\ (x)->ki_obj.tsk = tsk; \ (x)->ki_user_data = 0; \ - init_wait((&(x)->ki_wait)); \ + init_wait_bit_task((&(x)->ki_wait), current);\ } while (0) #define AIO_RING_MAGIC 0xa10a10a1 @@ -237,7 +237,8 @@ do { \ } \ } while (0) -#define io_wait_to_kiocb(wait)
container_of(wait, struct kiocb, ki_wait) +#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait, \ + struct wait_bit_queue, wait), struct kiocb, ki_wait) #include diff -puN kernel/wait.c~aio-wait-bit kernel/wait.c --- linux-2.6.20-rc1/kernel/wait.c~aio-wait-bit 2006-12-21 08:45:
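With the kiocb now embedding a wait_bit_queue instead of a bare wait_queue_t, the reworked io_wait_to_kiocb() must hop through two nested container_of() steps: wait_queue_t -> wait_bit_queue -> kiocb. A userspace sketch of that pointer arithmetic using offsetof (cut-down hypothetical struct layouts, illustration only — the real types live in linux/wait.h and linux/aio.h):

```c
#include <assert.h>
#include <stddef.h>

/* container_of() as found in the kernel, expressed with offsetof */
#define container_of_model(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-ins for the real structures */
struct wait_entry { int dummy; };                       /* models wait_queue_t */
struct wait_bit_q { int key; struct wait_entry wait; }; /* models wait_bit_queue */
struct fake_kiocb { long pos; struct wait_bit_q ki_wait; };

/* Model of the new io_wait_to_kiocb(): two hops, innermost member first */
static struct fake_kiocb *io_wait_to_kiocb_model(struct wait_entry *w)
{
    struct wait_bit_q *wb = container_of_model(w, struct wait_bit_q, wait);
    return container_of_model(wb, struct fake_kiocb, ki_wait);
}
```

The double hop is why the patch description calls this "an extra level of indirection": every site that used `&iocb->ki_wait` now needs `&iocb->ki_wait.wait`, and the reverse mapping needs both container_of() steps.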
[FSAIO][PATCH 4/8] Add a default io wait bit field in task struct
Allocates space for the default io wait queue entry (actually a wait bit entry) in the task struct. Doing so simplifies the patches for AIO wait page allowing for cleaner and more efficient implementation, at the cost of 28 additional bytes in task struct vs allocation on demand on-stack. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/sched.h | 11 +++ linux-2.6.20-rc1-root/kernel/fork.c |3 ++- 2 files changed, 9 insertions(+), 5 deletions(-) diff -puN include/linux/sched.h~tsk-default-io-wait include/linux/sched.h --- linux-2.6.20-rc1/include/linux/sched.h~tsk-default-io-wait 2006-12-21 08:45:51.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:32:39.0 +0530 @@ -1006,11 +1006,14 @@ struct task_struct { unsigned long ptrace_message; siginfo_t *last_siginfo; /* For ptrace use. */ + +/* Space for default IO wait bit entry used for synchronous IO waits */ + struct wait_bit_queue __wait; /* - * current io wait handle: wait queue entry to use for io waits - * If this thread is processing aio, this points at the waitqueue - * inside the currently handled kiocb. It may be NULL (i.e. default - * to a stack based synchronous wait) if its doing sync IO. + * Current IO wait handle: wait queue entry to use for IO waits + * If this thread is processing AIO, this points at the waitqueue + * inside the currently handled kiocb. Otherwise, points to the + * default IO wait field (i.e &__wait.wait above). 
*/ wait_queue_t *io_wait; /* i/o counters(bytes read/written, #syscalls */ diff -puN kernel/fork.c~tsk-default-io-wait kernel/fork.c --- linux-2.6.20-rc1/kernel/fork.c~tsk-default-io-wait 2006-12-21 08:45:51.0 +0530 +++ linux-2.6.20-rc1-root/kernel/fork.c 2006-12-21 08:45:51.0 +0530 @@ -1056,7 +1056,8 @@ static struct task_struct *copy_process( do_posix_clock_monotonic_gettime(&p->start_time); p->security = NULL; p->io_context = NULL; - p->io_wait = NULL; + init_wait_bit_task(&p->__wait, p); + p->io_wait = &p->__wait.wait; p->audit_context = NULL; cpuset_fork(p); #ifdef CONFIG_NUMA _ -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[FSAIO][PATCH 3/8] Routines to initialize and test a wait bit key
init_wait_bit_key() initializes the key field in an already allocated wait bit structure, useful for async wait bit support. Also separate out the wait bit test to a common routine which can be used by different kinds of wakeup callbacks. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/wait.h | 24 linux-2.6.20-rc1-root/kernel/wait.c| 11 ++- 2 files changed, 26 insertions(+), 9 deletions(-) diff -puN include/linux/wait.h~init-wait-bit-key include/linux/wait.h --- linux-2.6.20-rc1/include/linux/wait.h~init-wait-bit-key 2006-12-21 08:45:46.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/wait.h 2006-12-21 08:45:46.0 +0530 @@ -108,6 +108,17 @@ static inline int waitqueue_active(wait_ return !list_empty(&q->task_list); } +static inline int test_wait_bit_key(wait_queue_t *wait, + struct wait_bit_key *key) +{ + struct wait_bit_queue *wait_bit + = container_of(wait, struct wait_bit_queue, wait); + + return (wait_bit->key.flags == key->flags && + wait_bit->key.bit_nr == key->bit_nr && + !test_bit(key->bit_nr, key->flags)); +} + /* * Used to distinguish between sync and async io wait context: * sync i/o typically specifies a NULL wait queue entry or a wait @@ -416,6 +427,19 @@ int wake_bit_function(wait_queue_t *wait INIT_LIST_HEAD(&(wait)->task_list); \ } while (0) +#define init_wait_bit_key(waitbit, word, bit) \ + do {\ + (waitbit)->key.flags = word;\ + (waitbit)->key.bit_nr = bit;\ + } while (0) + +#define init_wait_bit_task(waitbit, tsk) \ + do {\ + (waitbit)->wait.private = tsk; \ + (waitbit)->wait.func = wake_bit_function; \ + INIT_LIST_HEAD(&(waitbit)->wait.task_list); \ + } while (0) + /** * wait_on_bit - wait for a bit to be cleared * @word: the word being waited on, a kernel virtual address diff -puN kernel/wait.c~init-wait-bit-key kernel/wait.c --- linux-2.6.20-rc1/kernel/wait.c~init-wait-bit-key 2006-12-21 08:45:46.0 +0530 +++ linux-2.6.20-rc1-root/kernel/wait.c 2006-12-28
09:32:44.0 +0530 @@ -139,16 +139,9 @@ EXPORT_SYMBOL(autoremove_wake_function); int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg) { - struct wait_bit_key *key = arg; - struct wait_bit_queue *wait_bit - = container_of(wait, struct wait_bit_queue, wait); - - if (wait_bit->key.flags != key->flags || - wait_bit->key.bit_nr != key->bit_nr || - test_bit(key->bit_nr, key->flags)) + if (!test_wait_bit_key(wait, arg)) return 0; - else - return autoremove_wake_function(wait, mode, sync, key); + return autoremove_wake_function(wait, mode, sync, arg); } EXPORT_SYMBOL(wake_bit_function); _
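The factored-out test_wait_bit_key() is what lets one hashed wait queue serve many sleepers: a wakeup carries a (flags word, bit number) key, and a waiter's callback proceeds only if that key matches its own and the bit really is clear by now. A compact userspace model of that filter (hypothetical names; the kernel version compares key.flags pointers and calls test_bit on the live word):

```c
#include <assert.h>

/* Model of struct wait_bit_key: which bit of which word is waited on */
struct bit_key {
    unsigned long *word;  /* address of the flags word */
    int bit_nr;           /* which bit within it */
};

/* Model of test_wait_bit_key(): wake only the waiter whose key matches
 * this wakeup, and only if the bit is actually clear at wakeup time. */
static int key_matches(const struct bit_key *waiter, const struct bit_key *wakeup)
{
    return waiter->word == wakeup->word &&
           waiter->bit_nr == wakeup->bit_nr &&
           !((*wakeup->word >> wakeup->bit_nr) & 1UL);
}
```

The third condition is the subtle one: since bit waitqueues are shared (hashed), the bit may have been set again between wake_up_bit() and the callback running, in which case the waiter must stay asleep.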
[FSAIO][PATCH 2/8] Rename __lock_page to lock_page_slow
In order to allow for interruptible and asynchronous versions of lock_page in conjunction with the wait_on_bit changes, we need to define low-level lock page routines which take an additional argument, i.e a wait queue entry and may return non-zero status, e.g -EINTR, -EIOCBRETRY, -EWOULDBLOCK etc. This patch renames __lock_page to lock_page_slow, so that __lock_page and __lock_page_slow can denote the versions which take a wait queue parameter. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/pagemap.h |4 ++-- linux-2.6.20-rc1-root/mm/filemap.c|8 2 files changed, 6 insertions(+), 6 deletions(-) diff -puN include/linux/pagemap.h~lock_page_slow include/linux/pagemap.h --- linux-2.6.20-rc1/include/linux/pagemap.h~lock_page_slow 2006-12-21 08:45:40.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/pagemap.h 2006-12-28 09:32:39.0 +0530 @@ -133,7 +133,7 @@ static inline pgoff_t linear_page_index( return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); } -extern void FASTCALL(__lock_page(struct page *page)); +extern void FASTCALL(lock_page_slow(struct page *page)); extern void FASTCALL(__lock_page_nosync(struct page *page)); extern void FASTCALL(unlock_page(struct page *page)); @@ -144,7 +144,7 @@ static inline void lock_page(struct page { might_sleep(); if (TestSetPageLocked(page)) - __lock_page(page); + lock_page_slow(page); } /* diff -puN mm/filemap.c~lock_page_slow mm/filemap.c --- linux-2.6.20-rc1/mm/filemap.c~lock_page_slow2006-12-21 08:45:40.0 +0530 +++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:32:39.0 +0530 @@ -556,7 +556,7 @@ void end_page_writeback(struct page *pag EXPORT_SYMBOL(end_page_writeback); /** - * __lock_page - get a lock on the page, assuming we need to sleep to get it + * lock_page_slow - get a lock on the page, assuming we need to sleep to get it * @page: the page to lock * * Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary. 
If some @@ -564,14 +564,14 @@ EXPORT_SYMBOL(end_page_writeback); * chances are that on the second loop, the block layer's plug list is empty, * so sync_page() will then return in state TASK_UNINTERRUPTIBLE. */ -void fastcall __lock_page(struct page *page) +void fastcall lock_page_slow(struct page *page) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, TASK_UNINTERRUPTIBLE); } -EXPORT_SYMBOL(__lock_page); +EXPORT_SYMBOL(lock_page_slow); /* * Variant of lock_page that does not require the caller to hold a reference @@ -647,7 +647,7 @@ repeat: page_cache_get(page); if (TestSetPageLocked(page)) { read_unlock_irq(&mapping->tree_lock); - __lock_page(page); + lock_page_slow(page); read_lock_irq(&mapping->tree_lock); /* Has the page been truncated while we slept? */ _
[FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine
Add a wait queue parameter to the action routine called by __wait_on_bit to allow it to determine whether to block or not. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/fs/buffer.c |2 +- linux-2.6.20-rc1-root/fs/inode.c |2 +- linux-2.6.20-rc1-root/fs/nfs/inode.c |2 +- linux-2.6.20-rc1-root/fs/nfs/nfs4proc.c|2 +- linux-2.6.20-rc1-root/fs/nfs/pagelist.c|2 +- linux-2.6.20-rc1-root/include/linux/sunrpc/sched.h |3 ++- linux-2.6.20-rc1-root/include/linux/wait.h | 18 -- linux-2.6.20-rc1-root/include/linux/writeback.h|2 +- linux-2.6.20-rc1-root/kernel/wait.c| 14 -- linux-2.6.20-rc1-root/mm/filemap.c |2 +- linux-2.6.20-rc1-root/net/sunrpc/sched.c |5 +++-- 11 files changed, 32 insertions(+), 22 deletions(-) diff -puN fs/buffer.c~modify-wait-bit-action-args fs/buffer.c --- linux-2.6.20-rc1/fs/buffer.c~modify-wait-bit-action-args2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/fs/buffer.c 2006-12-21 08:45:34.0 +0530 @@ -55,7 +55,7 @@ init_buffer(struct buffer_head *bh, bh_e bh->b_private = private; } -static int sync_buffer(void *word) +static int sync_buffer(void *word, wait_queue_t *wait) { struct block_device *bd; struct buffer_head *bh diff -puN fs/inode.c~modify-wait-bit-action-args fs/inode.c --- linux-2.6.20-rc1/fs/inode.c~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/fs/inode.c2006-12-21 08:45:34.0 +0530 @@ -1279,7 +1279,7 @@ void remove_dquot_ref(struct super_block #endif -int inode_wait(void *word) +int inode_wait(void *word, wait_queue_t *wait) { schedule(); return 0; diff -puN include/linux/wait.h~modify-wait-bit-action-args include/linux/wait.h --- linux-2.6.20-rc1/include/linux/wait.h~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/wait.h 2006-12-28 09:32:57.0 +0530 @@ -145,11 +145,15 @@ void FASTCALL(__wake_up(wait_queue_head_ extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int 
mode)); extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr)); void FASTCALL(__wake_up_bit(wait_queue_head_t *, void *, int)); -int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned)); -int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned)); +int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, + int (*)(void *, wait_queue_t *), unsigned)); +int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, + int (*)(void *, wait_queue_t *), unsigned)); void FASTCALL(wake_up_bit(void *, int)); -int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned)); -int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned)); +int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *, + wait_queue_t *), unsigned)); +int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *, + wait_queue_t *), unsigned)); wait_queue_head_t *FASTCALL(bit_waitqueue(void *, int)); #define wake_up(x) __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, NULL) @@ -427,7 +431,8 @@ int wake_bit_function(wait_queue_t *wait * but has no intention of setting it. */ static inline int wait_on_bit(void *word, int bit, - int (*action)(void *), unsigned mode) + int (*action)(void *, wait_queue_t *), + unsigned mode) { if (!test_bit(bit, word)) return 0; @@ -451,7 +456,8 @@ static inline int wait_on_bit(void *word * clear with the intention of setting it, and when done, clearing it. 
*/ static inline int wait_on_bit_lock(void *word, int bit, - int (*action)(void *), unsigned mode) + int (*action)(void *, wait_queue_t *), + unsigned mode) { if (!test_and_set_bit(bit, word)) return 0; diff -puN include/linux/writeback.h~modify-wait-bit-action-args include/linux/writeback.h --- linux-2.6.20-rc1/include/linux/writeback.h~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/writeback.h 2006-12-21 08:45:34.0 +0530 @@ -66,7 +66,7 @@ struct writeback_control { */ void writeback_inodes(struct writeback_control *wbc); void wake_up_inode(struct inode *inode); -int inode_wait(void *); +int inode_wait(void *, wait_queue_t *); void sync_inodes_sb(struct super_block *, i
[PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
Currently native linux AIO is properly supported (in the sense of actually being asynchronous) only for files opened with O_DIRECT. While this suffices for a major (and most visible) user of AIO, i.e. databases, other types of users like Samba require AIO support for regular file IO. Also, for glibc POSIX AIO to be able to switch to using native AIO instead of the current simulation using threads, it needs/expects asynchronous behaviour for both O_DIRECT and buffered file AIO. This patchset implements changes to make filesystem AIO read and write asynchronous for the non O_DIRECT case. This is mainly relevant in the case of reads of uncached or partially cached files, and O_SYNC writes. Instead of translating regular IO to [AIO + wait], it translates AIO to [regular IO - blocking + retries]. The intent of implementing it this way is to avoid modifying or slowing down normal usage, by keeping it pretty much the way it is without AIO, while avoiding code duplication. Instead we make AIO vs regular IO checks inside io_schedule(), i.e. at the blocking points. The low-level unit of distinction is a wait queue entry, which in the AIO case is contained in an iocb and in the synchronous IO case is associated with the calling task. The core idea is that we complete as much IO as we can in a non-blocking fashion, and then continue the remaining part of the transfer again when woken up asynchronously via a wait queue callback when pages are ready ... thus each iteration progresses through more of the request until it is completed. The interesting part here is that owing largely to the idempotence in the way radix-tree page cache traversal happens, every iteration is simply a smaller read/write. Almost all of the iocb manipulation and advancement in the AIO case happens in the high-level AIO code, rather than in regular VFS/filesystem paths.
The following is a sampling of comparative aio-stress results with the patches (each run starts with uncached files):

aio-stress throughput comparisons (in MB/s):
file size 1GB, record size 64KB, depth 64, ios per iteration 8, max io_submit 8, buffer alignment set to 4KB
4-way Pentium III SMP box, Adaptec AIC-7896/7 Ultra2 SCSI, 40 MB/s
Filesystem: ext2

                          Buffered (non O_DIRECT)      O_DIRECT
                          Vanilla      Patched         Vanilla      Patched
Random-Read               10.08        23.91           18.91        18.98
Random-O_SYNC-Write        8.86        15.84           16.51        16.53
Sequential-Read           31.49        33.00           31.86        31.79
Sequential-O_SYNC-Write    8.68        32.60           31.45        32.44
Random-Write              31.09 (19.65)  30.90 (19.65)
Sequential-Write          30.84 (28.94)  30.09 (28.39)

Regards
Suparna
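The "[regular IO - blocking + retries]" translation described above can be pictured as a driver loop: each retry performs as much of the transfer as is possible without blocking, returns a "retry later" status when it would block, and is re-entered with the iocb advanced past the bytes already done. A self-contained userspace sketch of that state machine (hypothetical names; in the real kernel the re-entry happens from aio wait queue callbacks via kick_iocb(), not a loop):

```c
#include <assert.h>
#include <string.h>

#define EIOCBRETRY 530  /* kernel-internal "retry later" status (2.6 value) */

struct fake_iocb {
    const char *src;   /* data source, stands in for the page cache */
    char *buf;         /* destination user buffer */
    size_t left;       /* bytes still to transfer (models ki_left) */
    size_t done;       /* bytes transferred so far */
};

/* One retry step: copy at most 'chunk' bytes (modelling "as much as is
 * ready without blocking"), report -EIOCBRETRY if more remains. */
static long retry_step(struct fake_iocb *io, size_t chunk)
{
    size_t n = io->left < chunk ? io->left : chunk;
    memcpy(io->buf + io->done, io->src + io->done, n);
    io->done += n;
    io->left -= n;
    return io->left ? -EIOCBRETRY : (long)io->done;
}

/* Driver: in the kernel this re-entry is triggered by wakeups on the
 * iocb's wait queue entry; here we simply loop until completion. */
static long run_to_completion(struct fake_iocb *io, size_t chunk)
{
    long ret;
    while ((ret = retry_step(io, chunk)) == -EIOCBRETRY)
        ;
    return ret;
}
```

Note how "every iteration is simply a smaller read/write": each step restarts from io->done, which is the idempotence property the cover letter attributes to radix-tree page cache traversal.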