Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> > Well, if you see the modes proposed using the above flags :
> >
> > #define FA_ALLOCATE	0
> > #define FA_DEALLOCATE	FA_FL_DEALLOC
> > #define FA_RESV_SPACE	FA_FL_KEEP_SIZE
> > #define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> >
> > FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> > for preallocation, FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> > flag. Hence preallocation will never delete data. This mode is required
> > only for FA_UNRESV_SPACE, which is a deallocation mode, to support any
> > existing XFS-aware applications/usage scenarios.
>
> Sorry, but this doesn't make any sense. There is no need to put every
> feature in the XFS ioctls in the syscalls. The XFS ioctls will need to
> be supported forever anyway - as I suggested before, they really should
> be moved to generic code.
>
> What needs to be supported is what makes sense as an interface.
> A punch-a-hole interface does make sense, but trying to hack this into
> a preallocation system call is just madness. We're not IRIX or Windows,
> which fit things into random subcalls just because there was some space
> left to squeeze them in.
>
> > > > > FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> > > > > FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
> > > >
> > > > NACK to these as well. If i_size changes, c/mtime need updates; if the
> > > > size doesn't change, they don't. No need to add more flags for this.
> > >
> > > This requirement was from the point of view of HSM applications. Hope
> > > you saw Andreas' previous post and are keeping that in mind.
>
> HSMs need this basically for every system call, which screams for an
> open flag like O_INVISIBLE anyway. Adding this in a generic way is
> a good idea, but hacking bits and pieces that won't fit into the global
> design is completely wrong.

Why don't we just merge the interface for preallocation (essentially enough
to satisfy posix_fallocate() and the simple XFS requirement for space
reservation without changing file size), which there is clear agreement on
(I hope :)). After all, this was all that we set out to do when we started.

And leave all the dealloc/punch/HSM type features for separate future
patches/debates; those really shouldn't hold up the basic fallocate
interface. I agree with Christoph that we are just diverging too much in
trying to club those decisions here.

Dave, Andreas, Ted ?

Regards
Suparna

> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] dio: remove bogus refcounting BUG_ON
On Wed, Jul 04, 2007 at 07:25:10PM -0700, Badari Pulavarty wrote:
> On Tue, 2007-07-03 at 15:28 -0700, Zach Brown wrote:
> > Linus, Andrew, please apply the bug fix patch at the end of this reply
> > for .22.
> >
> > > > One of our perf. team ran into this while doing some runs.
> > > > I didn't see anything obvious - it looks like we converted
> > > > async IO to a synchronous one. I didn't spend much time digging
> > > > around.
> >
> > OK, I think this BUG_ON() is just broken. I wasn't able to find any
> > obvious bugs from reading the code which would cause the BUG_ON() to
> > fire. If it's reproducible, I'd love to hear what the recipe is.
> >
> > I did notice that this BUG_ON() is evaluating dio after having dropped
> > its ref :/. So it's not completely absurd to fear that it's a race
> > with the dio's memory being reused, but that'd be a pretty tight race.
> >
> > Let's remove this stupid BUG_ON and see if that test box still has
> > trouble. It might just hit the valid BUG_ON a few lines down, but this
> > unsafe BUG_ON needs to go.
>
> I went through the code multiple times; I can't find how we can trigger
> the BUG_ON(). But unfortunately, our perf. team is able to reproduce the
> problem. Debug indicated that ret2 == 1 :(
>
> Not sure how that can happen. Ideas ?

Does it trigger even if you avoid referencing dio in the BUG_ON(), i.e.
with something like ...

--- direct-io.c	2007-07-02 01:24:24.0 +0530
+++ direct-io-debug.c	2007-07-05 09:18:56.0 +0530
@@ -1104,9 +1104,10 @@ direct_io_worker(int rw, struct kiocb *i
 	 * decide to wake the submission path atomically.
 	 */
 	spin_lock_irqsave(&dio->bio_lock, flags);
+	is_async = dio->is_async;
 	ret2 = --dio->refcount;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
-	BUG_ON(!dio->is_async && ret2 != 0);
+	BUG_ON(!is_async && ret2 != 0);
 	if (ret2 == 0) {
 		ret = dio_complete(dio, offset, ret);
 		kfree(dio);

> Thanks,
> Badari
>
> > ---
> >
> > dio: remove bogus refcounting BUG_ON
> >
> > Badari Pulavarty reported a case of this BUG_ON triggering during
> > testing. It's completely bogus and should be removed.
> >
> > It's trying to notice if we left references to the dio hanging around
> > in the sync case. They should have been dropped as IO completed while
> > this path was in dio_await_completion(). This condition will also be
> > checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
> > lines lower. So to start, this BUG_ON() is redundant.
> >
> > More fatally, it's dereferencing the dio after having dropped its
> > reference. It's only safe to dereference the dio after releasing the
> > lock if the final reference was just dropped. Another CPU might free
> > the dio in bio completion and reuse the memory after this path drops the
> > dio lock but before the BUG_ON() is evaluated.
> >
> > This patch passed aio+dio regression unit tests and aio-stress on ext3.
> >
> > Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
> > Cc: Badari Pulavarty <[EMAIL PROTECTED]>
> >
> > diff -r 509ce354ae1b fs/direct-io.c
> > --- a/fs/direct-io.c	Sun Jul 01 22:00:49 2007 +0000
> > +++ b/fs/direct-io.c	Tue Jul 03 14:56:41 2007 -0700
> > @@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i
> > 	spin_lock_irqsave(&dio->bio_lock, flags);
> > 	ret2 = --dio->refcount;
> > 	spin_unlock_irqrestore(&dio->bio_lock, flags);
> > -	BUG_ON(!dio->is_async && ret2 != 0);
> > +
> > 	if (ret2 == 0) {
> > 		ret = dio_complete(dio, offset, ret);
> > 		kfree(dio);

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 0/2] file capabilities: Introduction
On Mon, May 14, 2007 at 08:00:11PM +0000, Pavel Machek wrote:
> Hi!
>
> > "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:
> > >
> > > Following are two patches which have been sitting for some time in -mm.
> >
> > Where "some time" == "nearly six months".
> >
> > We need help considering, reviewing and testing this code, please.
>
> I did a quick scan, and it looks OK. Plus, it means we can finally start
> using that old capabilities subsystem... so I think we should do it.

FWIW, I looked through it recently as well, and it looked reasonable enough
to me, though I'm not a security expert. I did have a question about testing
corner cases etc., which Serge has tried to address.

Serge, are you planning to post an update without STRICTXATTR ? That should
simplify the second patch.

Regards
Suparna

> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 05/10] Linux Kernel Markers - i386 optimized version
On Fri, May 11, 2007 at 10:27:29AM +0530, Ananth N Mavinakayanahalli wrote:
> On Thu, May 10, 2007 at 12:59:18PM -0400, Mathieu Desnoyers wrote:
> > * Alan Cox ([EMAIL PROTECTED]) wrote:
> > ...
> > > > * Third issue : Scalability. Changing code will stop every CPU on the
> > > > system for a while. Compared to this, the int3-based approach will run
> > > > through the breakpoint handler "if" one of the CPUs happens to execute
> > > > this code at the wrong time. The standard case is just an IPI (to
> > >
> > > If I read the errata right then patching in an int3 will itself trigger
> > > the errata so anything could happen.
> > >
> > > I believe there are other safe sequences for doing code patching - perhaps
> > > one of the Intel folk can advise ?
>
> IIRC, when the first implementation of what exists now as kprobes was
> done (as part of the dprobes framework), this question did come up. I
> think the conclusion was that the errata applies only to multi-byte
> modifications and single-byte changes are guaranteed to be atomic.
> Given that int3 on Intel is just 1 byte, we are safe.
>
> > I'll let the Intel guys confirm this. I don't have the reference nearby
> > (I got this information by talking with the kprobe team members, and
> > they got this information directly from Intel developers), but the
> > int3 is the one special case to which the errata does not apply.
> > Otherwise, kprobes and gdb would have a big, big issue.
>
> Perhaps Richard/Suparna can confirm.

I just tried digging up past discussions on this from Richard, about int3
being safe:

http://sourceware.org/ml/systemtap/2005-q3/msg00208.html
http://lkml.org/lkml/2006/9/20/30

Regards
Suparna

> Ananth

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities
On Thu, May 10, 2007 at 01:01:27PM -0700, Andreas Dilger wrote:
> On May 08, 2007 16:49 -0500, Serge E. Hallyn wrote:
> > Quoting Andreas Dilger ([EMAIL PROTECTED]):
> > > One of the important use cases I can see today is the ability to
> > > split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more
> > > fine-grained attributes.
> >
> > Sounds plausible, though it suffers from both making capabilities far
> > more cumbersome (i.e. finding the right capability for what you wanted
> > to do) and backward compatibility. Perhaps at that point we should
> > introduce security.capabilityv2 xattrs. A binary can then carry
> > security.capability=CAP_SYS_ADMIN=p, and
> > security.capabilityv2=cap_may_clone_mntns=p.
>
> Well, the overhead of each EA is non-trivial (16 bytes/EA) for storing
> 12 bytes worth of data, so it is probably just better to keep extending
> the original capability fields as was in the proposal.
>
> > > What we definitely do NOT want to happen is an application that needs
> > > privileged access (e.g. e2fsck, mount) to stop running because the
> > > new capabilities _would_ have been granted by the new kernel and are
> > > not by the old kernel and STRICTXATTR is used.
> > >
> > > To me it would seem that having extra capabilities on an old kernel
> > > is relatively harmless if the old kernel doesn't know what they are.
> > > It's like having a key to a door that you don't know where it is.
> >
> > If we ditch the STRICTXATTR option do the semantics seem sane to you?
>
> Seems reasonable. It would simplify the code as well, which is good.

This does mean no sanity checking of fcaps; I am not sure if that matters.
I'm guessing it should be similar to the case for other security attributes.

Regards
Suparna

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > > I have the updated patches ready which take care of Andrew's
> > > > comments. Will run some tests and post them soon. But, before
> > > > submitting these patches, I think it will be better to finalize on
> > > > certain things which might be worth some discussion here:
> > > >
> > > > 1) Should the file size change when preallocation is done beyond
> > > > EOF ? Andreas and Chris Wedgwood are in favor of not changing the
> > > > file size in this case. I also tend to agree with them. Does anyone
> > > > have an argument in favor of changing the filesize ? If not, I will
> > > > remove the code which changes the filesize, before I resubmit the
> > > > concerned ext4 patch.
> > >
> > > I think there needs to be both. If we don't have a mechanism to
> > > atomically change the file size with the preallocation, then
> > > applications that use stat() to work out if they need to preallocate
> > > more space will end up racing.
> >
> > By "both" above, do you mean we should give the user the flexibility
> > of whether the filesize is changed or not ? It can be done by having
> > *two* modes for preallocation in the system call - say FA_PREALLOCATE
> > and FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will
> > allocate blocks, but will not change the filesize and [cm]time. If
> > FA_ALLOCATE mode is used, fallocate() will change the filesize if
> > required (i.e. when allocation is beyond EOF) and also update
> > [cm]time. This way, the application can decide what it wants.
>
> Yes, that's right.
>
> > This will be helpful for the partial allocation scenario also. Think
> > of the case when we do not change the filesize in fallocate() and
> > expect applications/posix_fallocate() to do ftruncate() after
> > fallocate() for this. Now if fallocate() results in a partial
> > allocation with -ENOSPC error returned, applications/posix_fallocate()
> > will not know for what length ftruncate() has to be called. :(
>
> Well, posix_fallocate() either gets all the space or it fails. If you
> truncate to extend the file size after an ENOSPC, then that is a buggy
> implementation. The same could be said for any application, or even the
> fallocate() call itself if it changes the filesize without having
> completely preallocated the space asked for.
>
> > Hence it may be a good idea to give the user the flexibility of
> > whether to atomically change the file size with preallocation or not.
> > But, with more flexibility there comes inconsistency in behavior,
> > which is worth considering.
>
> We've got different modes to specify different behaviour. That's what
> the mode field was put there for in the first place - the interface is
> *designed* to support different preallocation behaviours.
>
> > > > 2) For FA_UNALLOCATE mode, should the file system allow
> > > > unallocation of normal (non-preallocated) blocks (blocks allocated
> > > > via regular write/truncate operations) also (i.e. work as
> > > > punch()) ?
> > >
> > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP,
> > > and what I did for FA_UNALLOCATE as well.
> >
> > Ok. But, some people may not expect/like this. I think, we can keep it
> > on the backburner for a while, till other issues are sorted out.
>
> How can it be a backburner issue when it defines the implementation?
> I've already implemented something in XFS that sort of does what I
> think the interface is supposed to do, but I need that interface to be
> nailed down before proceeding any further. All I'm really interested in
> right now is that the fallocate _interface_ can be used as a *complete
> replacement* for the pre-existing XFS-specific ioctls that are already
> used by applications. What ext4 can or can't do right now is irrelevant
> to this discussion - the interface definition needs to take priority
> over implementation.

Would you like to write up an interface definition description (likely a
man page) and post it for review, possibly with a mention of the apps using
it today ?

One reason for introducing the mode parameter was to allow the interface
to evolve incrementally as more options / semantic questions are proposed,
so that we don't have to make all the decisions right now. So it would be
good to start with a *minimal* definition, even just one mode. The rest
could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

> Cheers,
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
>
> > > This looks like it will have the same problem on s390 as
> > > sys_sync_file_range. Maybe the prototype should be:
> > >
> > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> >
> > Yes, but the trouble is that there was a contrary viewpoint preferring
> > that fd first be maintained as a convention like other syscalls (see
> > the following posts)
>
> Of course the interface used by an application program would have the
> fd first. Glibc can do the translation.

I think that was understood.

Regards
Suparna

> Paul.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
> Andrew Morton writes:
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > >
> > > ...
> > >
> > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> >
> > Please add a comment over this function which specifies its behaviour.
> > Really it should be enough material from which a full manpage can be
> > written.
>
> This looks like it will have the same problem on s390 as
> sys_sync_file_range. Maybe the prototype should be:
>
> asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that
fd first be maintained as a convention like other syscalls (see the
following posts)

http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy)

So we are kind of deadlocked, aren't we ?

The debates on the proposed solution for s390
http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2
http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2
http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2

Are there any better ideas ?

Regards
Suparna

> Paul.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: ChunkFS - measuring cross-chunk references
On Wed, Apr 25, 2007 at 05:50:55AM +0530, Karuna sagar K wrote:
> On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote:
> >On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
> .
> >It would also be good to distinguish between directories referencing
> >files in another chunk, and directories referencing subdirectories in
> >another chunk (which would be simpler to handle, given the topological
> >restrictions on directories, as compared to files and hard links).
> >
>
> Modified the tool to distinguish between
> 1. cross references between directories and files
> 2. cross references between directories and sub directories
> 3. cross references within a file (due to huge file size)

One more set of numbers to calculate would be an estimate of cross-references
across chunks of block groups -- 1 (=128MB), 2 (=256MB), 4 (=512MB), 8 (=1GB)
as suggested by Kalpak.

Once we have that, it would be nice if we can get data on results with the
tool from other people, especially with larger filesystem sizes.

Regards
Suparna

> Below is the result from / partition of ext3 file system:
>
> Number of files = 221794
> Number of directories = 24457
> Total size = 8193116 KB
> Total data stored = 7187392 KB
> Size of block groups = 131072 KB
> Number of inodes per block group = 16288
> No. of cross references between directories and sub-directories = 7791
> No. of cross references between directories and file = 657
> Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)
>
> Thanks for the suggestions.
>
> >There may also be special things we will need to do to handle
> >scenarios such as BackupPC, where if it looks like a directory
> >contains a huge number of hard links to a particular chunk, we'll need
> >to make sure that directory is either created in the right chunk
> >(possibly with hints from the application) or migrated to the right
> >chunk (but this might cause the inode number of the directory to
> >change --- maybe we allow this as long as the directory has never been
> >stat'ed, so that the inode number has never been observed).
> >
> >The other thing which we should consider is that chunkfs really
> >requires a 64-bit inode number space, which means either we only allow
> >it on 64-bit systems, or we need to consider a migration so that even
> >on 32-bit platforms, stat() functions like stat64(), insofar that it
> >uses a stat structure which returns a 64-bit ino_t.
> >
> > - Ted
>
> Thanks,
> Karuna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote: > On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote: > > > > This is an initial implementation of ChunkFS technique, briefly discussed > > at: http://lwn.net/Articles/190222 and > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > > > This implementation is done within ext2 driver. Every chunk is an > > independent ext2 file system. The knowledge about chunks is kept within > > ext2 and 'continuation inodes', which are used to allow files and > > directories span across multiple chunks, are managed within ext2. > > > > At mount time, super blocks for all the chunks are created and linked with > > the global super_blocks list maintained by VFS. This allows independent > > behavior or individual chunks and also helps writebacks to happen > > seamlessly. > > > > Apart from this, chunkfs code in ext2 effectively only provides knowledge > > of: > > > > - what inode's which block number to look for, for a given file's logical > > block number > > - in which chunk to allocate next inode / block > > - number of inodes to scan when a directory is being read > > > > To maintain the ext2's inode number uniqueness property, 8 msb bits of > > inode number are used to indicate the chunk number in which it resides. > > > > As said, this is a preliminary implementation and lots of changes are > > expected before this code is even sanely usable. Some known issues and > > obvious optimizations are listed in the TODO file in the chunkfs patch. > > > > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch > > - one big patch > > - applies to 2.6.18 > > > Could you send this out as a patch to ext2 codebase, so we can just look > at the changes for chunkfs ? That might also make it small enough > to inline your patch in email for review. Sorry, I missed the part about ext2-chunkfs-diff below. Regards suparna > > What kind of results are you planning to gather to evaluate/optimize this ? 
> > Regards > Suparna > > > > > Attached - ext2-chunkfs-diff.patch.gz > > - since the code is a spin-off of ext2, this patch explains better what > > has changed from the ext2. > > > > git://cislinux.cis.ksu.edu/chunkfs-tools > > - mkfs, and fsck for chunkfs. > > > > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml > > - config file used; tested mostly on UML with loopback file systems. > > > > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP > > should be "no" for clean compile. > > > > > > Please comment, suggest, criticize. Patches most welcome. > > > > > > Best, > > AG > > -- > > May the source be with you. > > http://www.cis.ksu.edu/~gud > > > > -- > Suparna Bhattacharya ([EMAIL PROTECTED]) > Linux Technology Center > IBM Software Lab, India > -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote: > > This is an initial implementation of ChunkFS technique, briefly discussed > at: http://lwn.net/Articles/190222 and > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > This implementation is done within ext2 driver. Every chunk is an > independent ext2 file system. The knowledge about chunks is kept within > ext2 and 'continuation inodes', which are used to allow files and > directories span across multiple chunks, are managed within ext2. > > At mount time, super blocks for all the chunks are created and linked with > the global super_blocks list maintained by VFS. This allows independent > behavior or individual chunks and also helps writebacks to happen > seamlessly. > > Apart from this, chunkfs code in ext2 effectively only provides knowledge > of: > > - what inode's which block number to look for, for a given file's logical > block number > - in which chunk to allocate next inode / block > - number of inodes to scan when a directory is being read > > To maintain the ext2's inode number uniqueness property, 8 msb bits of > inode number are used to indicate the chunk number in which it resides. > > As said, this is a preliminary implementation and lots of changes are > expected before this code is even sanely usable. Some known issues and > obvious optimizations are listed in the TODO file in the chunkfs patch. > > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch > - one big patch > - applies to 2.6.18 Could you send this out as a patch to ext2 codebase, so we can just look at the changes for chunkfs ? That might also make it small enough to inline your patch in email for review. What kind of results are you planning to gather to evaluate/optimize this ? Regards Suparna > > Attached - ext2-chunkfs-diff.patch.gz > - since the code is a spin-off of ext2, this patch explains better what > has changed from the ext2. 
> > git://cislinux.cis.ksu.edu/chunkfs-tools > - mkfs, and fsck for chunkfs. > > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml > - config file used; tested mostly on UML with loopback file systems. > > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP > should be "no" for clean compile. > > > Please comment, suggest, criticize. Patches most welcome. > > > Best, > AG > -- > May the source be with you. > http://www.cis.ksu.edu/~gud -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Tue, Feb 27, 2007 at 10:42:11AM +0100, Jens Axboe wrote:
> On Tue, Feb 27 2007, Suparna Bhattacharya wrote:
> > On Mon, Feb 26, 2007 at 03:45:48PM +0100, Jens Axboe wrote:
> > > On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> > > > On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> > > > >
> > > > > Some more results, using a larger number of processes and io
> > > > > depths. A repeat of the tests from friday, with added depth 2 for
> > > > > syslet and libaio:
> > > > >
> > > > > Engine          Depth   Processes       Bw (MiB/sec)
> > > > >
> > > > > libaio          1       1               602
> > > > > syslet          1       1               759
> > > > > sync            1       1               776
> > > > > libaio          32      1               832
> > > > > syslet          32      1               898
> > > > > libaio          2       1               581
> > > > > syslet          2       1               609
> > > > >
> > > > > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on
> > > > > ramfs with 100 processes each with a depth of 200, reading a
> > > > > per-process private file of 10mb (need to fit in my ram...) 10
> > > > > times each. IOW, doing 10,000MiB of IO in total:
> > > >
> > > > But, why ramfs ? Don't we want to exercise the case where O_DIRECT
> > > > actually blocks ? Or am I missing something here ?
> > >
> > > Just overhead numbers for that test case, lets try something like your
> > > described job.
> > >
> > > Test case is doing random reads from /dev/sdb, in chunks of 64kb:
> > >
> > > Engine          Depth   Processes       Bw (KiB/sec)
> > >
> > > libaio          200     100             2813
> > > syslet          200     100             3944
> > > libaio          2       1               2793
> > > syslet          2       1               3854
> > > sync (*)        2       1               2866
> > >
> > > deadline was used for IO scheduling, to minimize impact. Not sure why
> > > syslet actually does so much better here, looking at vmstat the rate is
> > > steady and all runs are basically 50/50 idle/wait. One difference is
> > > that the submission itself takes a long time on libaio, since the
> > > io_submit() will block on request allocation. The generated IO pattern
> > > from each process is the same for all runs. The drive is a lousy sata
> > > that doesn't even do queuing, FWIW.
> >
> > I tried the latest fio code with syslet v4, and my results are a little
> > different - have yet to figure out why or what to make of it.
> > I hope I have all the right pieces now.
> >
> > This is an ext2 filesystem, SCSI AIC7xxx.
> >
> > I used an iodepth_batch size of 8 to limit the number of ios in a single
> > io_submit (thanks for adding that parameter to fio !), like we did in
> > aio-stress.
> >
> > Engine          Depth   Batch   Bw (KiB/sec)
> >
> > libaio          64      8       17,226
> > syslet          64      8       17,620
> > libaio          2       8       18,552
> > syslet          2       8       14,935
> >
> > Which is not bad, actually.
>
> It's not bad for such a high depth/batch setting, but I still wonder why
> our results are so different. I'll look around for an x86 box with some
> TCQ/NCQ enabled storage attached for testing. Can you pass me your
> command line or job file (whatever you use) so we are on the same page?

Sure - I used variations of the following job file (e.g. engine=syslet-rw,
iodepth=2). Also the io scheduler on my system is set to Anticipatory by
default. FWIW it is a 4 way SMP (PIII, 700MHz)

; aio-stress -l -O -o3 <1GB file>
[global]
ioengine=libaio
buffered=0
rw=randread
bs=64k
size=1024m
directory=/kdump/suparna

[testfile2]
iodepth=64
iodepth_batch=8

> > If I do not specify the iodepth_batch (i.e. default to depth), then the
> > difference becomes more pronounced at higher depths. However, I doubt
> > whether anyone would be using such high batch sizes in practice ...
> >
> > Engine          Depth   Batch   Bw (KiB/sec)
> >
> > libaio          64      default 17,429
> > syslet          64      default 16,155
> > libaio          2       default 15,494
> > syslet          2       default  7,971
>
> If iodepth_batch isn't set, the syslet queued io will be serialized and
> not take advantage of queueing.

I see, so then this particular setting is not very meaningful.

> How does the job file perform with ioengine=sync?

Just tried it now : 9,027KiB/s

> > Often times it is the application tuning that makes all the difference,
> > so am not really sure how much to read into these results.
> > That's always been the hard part of async io ...
>
> Yes I agree, it's handy to get an overview though.

True, at least some of this helps us gain a little more understanding
about the boundaries and how to tune it to be most effective.

Regards
Suparna

> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Mon, Feb 26, 2007 at 03:45:48PM +0100, Jens Axboe wrote:
> On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> > On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> > >
> > > Some more results, using a larger number of processes and io depths. A
> > > repeat of the tests from friday, with added depth 2 for syslet and
> > > libaio:
> > >
> > > Engine          Depth   Processes       Bw (MiB/sec)
> > >
> > > libaio          1       1               602
> > > syslet          1       1               759
> > > sync            1       1               776
> > > libaio          32      1               832
> > > syslet          32      1               898
> > > libaio          2       1               581
> > > syslet          2       1               609
> > >
> > > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> > > with 100 processes each with a depth of 200, reading a per-process
> > > private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> > > doing 10,000MiB of IO in total:
> >
> > But, why ramfs ? Don't we want to exercise the case where O_DIRECT
> > actually blocks ? Or am I missing something here ?
>
> Just overhead numbers for that test case, lets try something like your
> described job.
>
> Test case is doing random reads from /dev/sdb, in chunks of 64kb:
>
> Engine          Depth   Processes       Bw (KiB/sec)
>
> libaio          200     100             2813
> syslet          200     100             3944
> libaio          2       1               2793
> syslet          2       1               3854
> sync (*)        2       1               2866
>
> deadline was used for IO scheduling, to minimize impact. Not sure why
> syslet actually does so much better here, looking at vmstat the rate is
> steady and all runs are basically 50/50 idle/wait. One difference is
> that the submission itself takes a long time on libaio, since the
> io_submit() will block on request allocation. The generated IO pattern
> from each process is the same for all runs. The drive is a lousy sata
> that doesn't even do queuing, FWIW.

I tried the latest fio code with syslet v4, and my results are a little
different - have yet to figure out why or what to make of it.
I hope I have all the right pieces now.

This is an ext2 filesystem, SCSI AIC7xxx.

I used an iodepth_batch size of 8 to limit the number of ios in a single
io_submit (thanks for adding that parameter to fio !), like we did in
aio-stress.

Engine          Depth   Batch   Bw (KiB/sec)

libaio          64      8       17,226
syslet          64      8       17,620
libaio          2       8       18,552
syslet          2       8       14,935

Which is not bad, actually.

If I do not specify the iodepth_batch (i.e. default to depth), then the
difference becomes more pronounced at higher depths. However, I doubt
whether anyone would be using such high batch sizes in practice ...

Engine          Depth   Batch   Bw (KiB/sec)

libaio          64      default 17,429
syslet          64      default 16,155
libaio          2       default 15,494
syslet          2       default  7,971

Often times it is the application tuning that makes all the difference,
so am not really sure how much to read into these results. That's always
been the hard part of async io ...

Regards
Suparna

> [*] Just for comparison, the depth is obviously really 1 at the kernel
> side since it's sync.
>
> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
>
> Some more results, using a larger number of processes and io depths. A
> repeat of the tests from friday, with added depth 2 for syslet and
> libaio:
>
> Engine   Depth   Processes   Bw (MiB/sec)
> libaio     1         1           602
> syslet     1         1           759
> sync       1         1           776
> libaio    32         1           832
> syslet    32         1           898
> libaio     2         1           581
> syslet     2         1           609
>
> syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> with 100 processes each with a depth of 200, reading a per-process
> private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> doing 10,000MiB of IO in total:

But, why ramfs ? Don't we want to exercise the case where O_DIRECT actually
blocks ? Or am I missing something here ?

Regards
Suparna

>
> Engine   Depth   Processes   Bw (MiB/sec)
> libaio    200       100         1488
> syslet    200       100         1714
>
> Results are stable to within approx +/- 10MiB/sec. The syslet case
> completes a whole second faster than libaio (~6 vs ~7 seconds). Testing
> was done with fio HEAD eb7c8ae27bc301b77490b3586dd5ccab7c95880a, and it
> uses the v4 patch series.
>
> Engine   Depth   Processes   Bw (MiB/sec)
> libaio    200       100         1488
> syslet    200       100         1714
> sync      200       100         1843
>
> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Fri, Feb 23, 2007 at 05:25:08PM +0100, Jens Axboe wrote:
> On Fri, Feb 23 2007, Suparna Bhattacharya wrote:
> > On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
> > >
> > > * Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > As a really crude (and not very realistic) example of the potential
> > > > impact of large numbers of outstanding IOs, I tried some quick direct
> > > > IO comparisons using fio:
> > > >
> > > > [global]
> > > > ioengine=syslet-rw
> > > > buffered=0
> > > > rw=randread
> > > > bs=64k
> > > > size=1024m
> > > > iodepth=64
> > >
> > > could you please try those iodepth=2 tests with the latest
> > > fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> > > plugin for FIO. You'll need the v3 syslet kernel plus:
> > >
> > > git-clone git://git.kernel.dk/data/git/fio.git
> > > cd fio
> > > git-checkout syslet-testing
> > >
> > > my expectation is that it should behave better with iodepth=2
> > > (although i havent tried that yet).
> >
> > I picked up the fio snapshot from 22nd Feb (fio-git-2007012513.tar.gz)
> > and used the v3 syslet patches from your web-site.
> >
> > Do I still need to get something more recent ?
>
> Yes, you need to test the syslet-testing branch that Ingo referenced.
> Your test above is not totally fair right now, since you are doing
> significantly less system calls with libaio. So to compare apples with
> apples, try the syslet-testing branch. If you can't get it because of
> firewall problems, check http://brick.kernel.dk/snaps/ for the latest
> fio snapshot. If it has the syslet-testing branch, then that is
> recent enough.

I have a feeling this is getting to be a little more bleeding edge than
I had anticipated :), so will just hold off for a bit until this
crystallizes a bit.
Regards
Suparna

> --
> Jens Axboe

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
>
> * Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > As a really crude (and not very realistic) example of the potential
> > impact of large numbers of outstanding IOs, I tried some quick direct
> > IO comparisons using fio:
> >
> > [global]
> > ioengine=syslet-rw
> > buffered=0
> > rw=randread
> > bs=64k
> > size=1024m
> > iodepth=64
>
> could you please try those iodepth=2 tests with the latest
> fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> plugin for FIO. You'll need the v3 syslet kernel plus:
>
> git-clone git://git.kernel.dk/data/git/fio.git
> cd fio
> git-checkout syslet-testing
>
> my expectation is that it should behave better with iodepth=2
> (although i havent tried that yet).

I picked up the fio snapshot from 22nd Feb (fio-git-2007012513.tar.gz)
and used the v3 syslet patches from your web-site.

Do I still need to get something more recent ?

Regards
Suparna

> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
On Thu, Feb 22, 2007 at 03:36:58PM +0100, Ingo Molnar wrote:
>
> * Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > > maybe it will, maybe it wont. Lets try? There is no true difference
> > > between having a 'request structure' that represents the current
> > > state of the HTTP connection plus a statemachine that moves that
> > > request between various queues, and a 'kernel stack' that goes in
> > > and out of runnable state and carries its processing state in its
> > > stack - other than the amount of RAM they take. (the kernel stack is
> > > 4K at a minimum - so with a million outstanding requests they would
> > > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > > - that's acceptable.)
> >
> > At what point are the cachemiss threads destroyed ? In other words how
> > well does this adapt to load variations ? For example, would this 80MB
> > of RAM continue to be locked down even during periods of lighter loads
> > thereafter ?
>
> you can destroy them at will from user-space too - just start a slow
> timer that zaps them if load goes down. I can add a
> sys_async_thread_exit(nr_threads) API to be able to drive this without
> knowing the TIDs of those threads, and/or i can add a kernel-internal
> mechanism to zap inactive threads. It would be rather easy and
> low-overhead - the v2 code already had a max_nr_threads tunable, i can
> reintroduce it. So the size of the pool of contexts does not have to be
> permanent at all.

If you can find a way to do this without additional tunables burden on
the administrator that would certainly help ! IIRC, performance problems
linked to having too many or too few AIO kernel threads have been a
commonly reported issue elsewhere - it would be nice to be able to avoid
repeating the crux of that (mistake) in Linux. To me, any need to
manually tune the number has always seemed to defeat the very benefit of
adaptability to varying loads that AIO intrinsically provides.
Regards
Suparna

> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: A quick fio test (was Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)
On Fri, Feb 23, 2007 at 01:52:47PM +0100, Jens Axboe wrote:
> On Wed, Feb 21 2007, Ingo Molnar wrote:
> > this is the v3 release of the syslet/threadlet subsystem:
> >
> >   http://redhat.com/~mingo/syslet-patches/
>
> [snip]
>
> Ingo, some testing of the experimental syslet queueing stuff, in the
> syslet-testing branch of fio.
>
> Fio job file:
>
> [global]
> bs=8k
> size=1g
> direct=0
> ioengine=syslet-rw
> iodepth=32
> rw=read
>
> [file]
> filename=/ramfs/testfile
>
> Only changes between runs was changing ioengine and iodepth as indicated
> in the table below.
>
> Results:
>
> Engine   Depth   Bw (MiB/sec)
> libaio     1        441
> syslet     1        574
> sync       1        589
> libaio    32        613
> syslet    32        681
>
> Results are stable to within +/- 1MiB/sec. So you can see that syslet
> are still a bit slower than sync for depth 1, but beats the pants off
> libaio for equal depths. Note that this is buffered IO, I'll be out for
> the weekend, but I'll hack some direct IO testing up next week to
> compare "real" queuing.
>
> Just a quick microbenchmark to gauge current overhead...

This is just ramfs, to gauge pure overheads, is that correct ?

BTW, I'm not surprised at Ingo's initial results of syslet vs libaio
overheads, for aio-stress/fio type streaming io runs, because these
cases do not involve large numbers of outstanding ios. So the overhead
of thread creation with syslets is amortized across the entire run of io
submissions because of the reuse of already created async threads. While
in the libaio case there is a setup and teardown of kiocbs per request.

What I have been concerned about instead in the past when considering
thread based AIO implementations is the resource (memory) consumption
impact on overall system performance and adaptability to varying loads.
It is nice that we can avoid that for the cached cases, but for the
general blocking cases, it is still not clear to me whether we have
addressed this well enough yet.
I used to think that even the kiocb was too heavyweight for its purpose
... especially in terms of scaling to larger loads.

As a really crude (and not very realistic) example of the potential
impact of large numbers of outstanding IOs, I tried some quick direct IO
comparisons using fio:

[global]
ioengine=syslet-rw
buffered=0
rw=randread
bs=64k
size=1024m
iodepth=64

Engine   Depth   Bw (MiB/sec)
libaio    64       17.323
syslet    64       17.524
libaio     2       15.226
syslet     2       11.015

Regards
Suparna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)

At what point are the cachemiss threads destroyed ? In other words how
well does this adapt to load variations ? For example, would this 80MB
of RAM continue to be locked down even during periods of lighter loads
thereafter ?

Regards
Suparna

> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
>
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
[...]
> sys_async_register()/unregister() has been removed as it is not
> needed anymore. sys_async_exec() can be called straight away.
>
> - there is no kernel-side resource used up by async completion rings at
>   all (all the state is in user-space), so an arbitrary number of
>   completion rings are supported.
>
> plus lots of bugs were fixed and a good number of cleanups were done as
> well. The v3 code is ABI-incompatible with v2, due to these fundamental
> changes.
>
> As always, comments, suggestions, reports are welcome.
>
> Ingo

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
[...]
> + * the first result code given to aio_complete(). Use this only for
> + * errors which must overwrite whatever the return code might have
> + * been. The first non-zero new_err given to this function for a given
> + * iocb will be returned to userspace.
> + */
> +static inline int aio_propogate_error(struct kiocb *iocb, int existing_err,
> +                                      int new_err)
> +{
> +     if (existing_err != -EIOCBQUEUED)
> +             return new_err;
> +     if (!iocb->ki_pending_err)
> +             iocb->ki_pending_err = new_err;
> +     return -EIOCBQUEUED;
> +}
> +
>  /* for sysctl: */
>  extern unsigned long aio_nr;
>  extern unsigned long aio_max_nr;
>
> diff -r 8a740eb579d4 mm/filemap.c
> --- a/mm/filemap.c    Mon Feb 19 13:12:20 2007 -0800
> +++ b/mm/filemap.c    Mon Feb 19 13:16:00 2007 -0800
> @@ -2031,7 +2031,7 @@ generic_file_direct_write(struct kiocb *
>               ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>               int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
>               if (err < 0)
> -                     written = err;
> +                     written = aio_propogate_error(iocb, written, err);
>       }
>       return written;
>  }
> @@ -2396,7 +2396,7 @@ generic_file_direct_IO(int rw, struct ki
>                       int err = invalidate_inode_pages2_range(mapping,
>                                       offset >> PAGE_CACHE_SHIFT, end);
>                       if (err)
> -                             retval = err;
> +                             retval = aio_propogate_error(iocb, retval, err);
>       }
>       return retval;

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH] aio: fix kernel bug when page is temporally busy
On Fri, Feb 09, 2007 at 11:40:27AM +0100, Jiri Kosina wrote:
> On Fri, 9 Feb 2007, Andrew Morton wrote:
>
> > > @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb, const
> > > struct iovec *iov,
> > >           do_generic_file_read(filp,ppos,&desc,file_read_actor);
> > >           retval += desc.written;
> > >           if (desc.error) {
> > > -                 retval = retval ?: desc.error;
> > > +                 retval = desc.error;
> > >                   break;
> > >           }
> >
> > Nope. On error the read() syscall must return the number of bytes which
> > were successfully read.
>
> You are right.
>
> In current mainline this even is not a problem, because no one seems to
> be setting the error values to EIOCBRETRY. But it still stinks a bit, as
> there are tests for EIOCBRETRY.

We'd want retries to act only on the remaining bytes which haven't been
transferred as yet, so even in the EIOCBRETRY case the right thing to do
is to return the number of bytes that were successfully read without
blocking. The high level AIO code (see aio_rw_vect_retry) has the
ability to handle this. So this is still correct.

Is there a real bug that you are seeing here ?

Regards
Suparna

> --
> Jiri Kosina

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH] aio: fix kernel bug when page is temporally busy
On Fri, Feb 09, 2007 at 08:41:41AM +0300, Ananiev, Leonid I wrote:
> > invalidate_inode_pages2() has other callers. I suspect with this change
> > we'll end up leaking EIOCBRETRY back to userspace.
>
> EIOCBRETRY is used and caught already in do_sync_read() and
> do_sync_readv_writev().
>
> Below fixed patch against kernel 2.6.20.

I agree with Andrew, this doesn't look like the right solution. EIOCBRETRY
isn't the same as EAGAIN or "try again later". EIOCBRETRY means "I have
queued up an action to notify (wakeup) the callback to retry the operation
once data is ready".

Regards
Suparna

> >From Leonid Ananiev
>
> Fix kernel bug when IO page is temporally busy:
> invalidate_inode_pages2() returns EIOCBRETRY but not EIO.
>
> Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
> ---
> --- linux-2.6.20/mm/truncate.c	2007-02-04 10:44:54.000000000 -0800
> +++ linux-2.6.20p/mm/truncate.c	2007-02-08 11:38:11.000000000 -0800
> @@ -366,7 +366,7 @@ static int do_launder_page(struct addres
>  * Any pages which are found to be mapped into pagetables are unmapped
>  * prior to invalidation.
>  *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
>  */
> int invalidate_inode_pages2_range(struct address_space *mapping,
> 		pgoff_t start, pgoff_t end)
> @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
> 		}
> 		ret = do_launder_page(mapping, page);
> 		if (ret == 0 && !invalidate_complete_page2(mapping, page))
> -			ret = -EIO;
> +			ret = -EIOCBRETRY;
> 		unlock_page(page);
> 	}
> 	pagevec_release(&pvec);
> @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
>  * Any pages which are found to be mapped into pagetables are unmapped
>  * prior to invalidation.
>  *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
>  */
> int invalidate_inode_pages2(struct address_space *mapping)
> {

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
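The objection here is that EIOCBRETRY carries a promise — a wakeup has been queued — so a helper with many callers cannot simply start returning it in place of -EIO. A minimal sketch of that boundary rule, with toy codes and illustrative names (not the kernel's actual functions):

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel's error codes. */
#define TOY_EIO        5
#define TOY_EIOCBRETRY 529

/* A sample retry callback that succeeds, returning bytes done. */
static long toy_retry_succeeds(void)
{
	return 128;
}

/* EIOCBRETRY means "a wakeup is queued; retry when it fires", so it is
 * only meaningful to callers that actually queued such a wakeup.  A
 * synchronous entry point (cf. do_sync_read) must absorb it; it must
 * never leak to userspace or to callers with no retry pending. */
static long toy_sync_entry(long ret, bool retry_queued,
			   long (*retry_fn)(void))
{
	while (ret == -TOY_EIOCBRETRY) {
		if (!retry_queued)
			return -TOY_EIO;  /* no wakeup pending: plain EIO */
		ret = retry_fn();	  /* wait for wakeup, then retry */
	}
	return ret;
}
```

In this model, changing a shared helper to return the retry code would hand -TOY_EIOCBRETRY to callers for which `retry_queued` is false — exactly the leak Andrew is worried about.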
Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
On Fri, Feb 02, 2007 at 04:56:22PM -0800, Linus Torvalds wrote:
> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> >
> > Well, in my picture, 'only if you block' is a pure thread utilization
> > decision: bounce a piece of work to another thread if this thread cannot
> > complete it. (if the kernel is lucky enough that the user context told
> > it "it's fine to do that".)
>
> Sure, you can do it that way too. But at that point, your argument that we
> shouldn't do it with fibrils is wrong: you'd still need basically the
> exact same setup that Zach does in his fibril stuff, and the exact same
> hook in the scheduler, testing the exact same value ("do we have a pending
> queue of work").
>
> So at that point, you really are arguing about a rather small detail in
> the implementation, I think.
>
> Which is fair enough.
>
> But I actually think the *bigger* argument and problems are elsewhere,
> namely in the interface details. Notably, I think the *real* issues end up
> being how we handle synchronization, and how we handle signalling. Those
> are in many ways (I think) more important than whether we actually can
> schedule these trivial things on multiple CPU's concurrently or not.
>
> For example, I think serialization is potentially a much more expensive
> issue. Could we, for example, allow users to serialize with these things
> *without* having to go through the expense of doing a system call? Again,
> I'm thinking of the case of no IO happening, in which case there also
> won't be any actual threading taking place, in which case it's a total
> waste of time to do a system call at all.
>
> And trying to do that actually has implications for the interfaces (like
> possibly returning a zero cookie for the async() system call if it was
> doable totally synchronously?)

This would be useful - the application wouldn't have to set up state to
remember for handling completions of operations that complete synchronously.
I know the Samba folks would like that.

The laio_syscall implementation (Lazy Asynchronous I/O) seems to have
experimented with such an interface:
http://www.usenix.org/events/usenix04/tech/general/elmeleegy.html

Regards
Suparna

> Signal handling is similar: I actually think that an "async()" system call
> should be interruptible within the context of the caller, since we would
> want to *try* to execute it synchronously. That automatically means that
> we have semantic meaning for fibrils and signal handling.
>
> Finally, can we actually get POSIX aio semantics with this? Can we
> implement the current aio_xyzzy() system calls using this same feature?
> And most importantly - does it perform well enough that we really can do
> that?
>
> THOSE are to me bigger questions than what happens inside the kernel, and
> whether we actually end up using another thread if we end up doing it
> non-synchronously.
>
> 		Linus

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
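The interface idea under discussion — submission returns a zero cookie when the operation completed synchronously, so no completion event needs to be reaped — can be sketched in user-space terms. Everything here (names, types, the `would_block` flag) is hypothetical, modeling the proposal rather than any real syscall:

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t next_cookie = 1;

struct toy_op {
	bool would_block;  /* would this op need async execution? */
	long result;	   /* result if it can complete synchronously */
};

/* Returns 0 if the operation completed synchronously (result valid
 * immediately, no completion event will follow), or a nonzero cookie
 * that the caller later matches against completion events. */
static uint64_t toy_submit(struct toy_op *op, long *result_out)
{
	if (!op->would_block) {
		*result_out = op->result; /* done in the caller's context */
		return 0;		  /* zero cookie: nothing to reap */
	}
	return next_cookie++;		  /* caller waits for completion */
}
```

The saving Suparna points at is visible in the zero-cookie branch: the caller can consume the result on the spot and never allocates per-operation completion-tracking state.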
Re: [patch 0/9] buffered write deadlock fix
On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> >
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
>
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.

I think Ingo made a point earlier about letting the old co-exist with the
new. Fibrils + kevents have great potential for a next generation solution,
but we need to give the whole story some time to play out and prove it in
practice, debate and benchmark the alternative combinations, optimize it
for various workloads etc. It will also take more work on top before we can
get the whole POSIX AIO implementation supported on top of this. I'll be
very happy when that happens ... it is just that it is still too early to
be sure.

Since this is going to be a new interface, not the existing Linux AIO
interface, I do not see any conflict between the two. Samba4 already uses
fsaio, and we now have the ability to do POSIX AIO over kernel AIO (which
depends on fsaio). The more we delay real world usage, the longer we take
to learn about the application patterns that matter. And it is those
patterns that are key.

> How bad is the clash with the readahead patches?
>
> Clashes with git-block are likely, too.
>
> Bugfixes come first, so I will drop readahead and fsaio and git-block to
> get this work completed if needed - please work against mainline.

If you need help with fixing the clashes, please let me know.

Regards
Suparna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
On Thu, Feb 01, 2007 at 11:50:06AM -0800, Trond Myklebust wrote:
> On Thu, 2007-02-01 at 16:43 +0530, Suparna Bhattacharya wrote:
> > Wooo ...hold on ... I think this is swinging out of perspective :)
> >
> > I have said some of this before, but let me try again.
> >
> > As you already discovered when going down the fibril path, there are
> > two kinds of accesses to current-> state, (1) common state
> > for a given call chain (e.g. journal info etc), and (2) for
> > various validations against the caller's process (uid, ulimit etc).
> >
> > (1) is not an issue when it comes to execution in background threads
> > (the VFS already uses background writeback for example).
> >
> > As for (2), such checks need to happen upfront at the time of IO
> > submission, so again are not an issue.
>
> Wrong! These checks can and do occur well after the time of I/O
> submission in the case of remote filesystems with asynchronous writeback
> support.
>
> Consider, for instance, the cases where the server reboots and loses all
> state. Then there is the case of failover and/or migration events, where
> the entire filesystem gets moved from one server to another, and again
> you may have to recover state, etc...
>
> > I don't see any other reason why IO paths should be assuming that they
> > are running in the original caller's context, midway through doing the
> > IO. If that were the case background writeouts and readaheads could be
> > fragile as well (or ptrace). The reason it isn't is because of this
> > conceptual division of responsibility.
>
> The problem with this is that the security context is getting
> progressively more heavy as we add more and more features. In addition
> to the original uid/gid/fsuid/fsgid/groups, we now have stuff like
> keyrings to carry around. Then there is all the context needed to
> support selinux,...

Isn't that kind of information supposed to be captured in nfs_open_context?
Which is associated with the open file instance ...

I know this has been a traditional issue with network filesystems, and I
haven't kept up with the latest code and decisions in that respect, but how
would you do background writeback if there is an assumption of running in
the context of the original submitter?

Regards
Suparna

> In the end, you end up recreating most of struct task_struct...
>
> Cheers
>   Trond

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
On Thu, Feb 01, 2007 at 02:18:55PM -0800, Zach Brown wrote:
> > Wooo ...hold on ... I think this is swinging out of perspective :)
>
> I'm sorry, but I don't. I think using the EIOCBRETRY method in
> complicated code paths requires too much maintenance cost to justify
> its benefits.

We can agree to disagree on that judgement :).

I don't disagree about the limitations of the EIOCBRETRY approach, nor do I
recommend it for all sorts of complicated code paths. It is only good as an
approximation for specific blocking points involving idempotent behaviour,
and I was trying to emphasise that that is *all* it was ever intended for.
I certainly do not see it as a viable path to make all syscalls
asynchronous, or to address all blocking points in filesystem IO. And I do
like the general direction of your approach; we only need to think deeper
about details like how to reduce the stack per IO request so this can scale
better. So we don't disagree as much as you think :)

The point where we seem to disagree is that I think there is goodness in
maintaining conceptual clarity about which parts of the operation assume
that they are executing in the original submitter's context. For the IO
paths, this is what allows things like readahead and writeback to work and
to cluster operations which may end up to/from multiple submitters. This
means that if there is some context that is needed thereafter, it could be
associated with the IO request (as an argument or in some other manner), so
that this division is still maintained.

Regards
Suparna

> - z

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
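The division of responsibility argued for here — validate against the caller and capture what is needed at submission time, then let whoever services the request rely only on state carried in the request itself — can be sketched as follows. All of the structs and functions are illustrative inventions, not kernel code:

```c
#include <stdbool.h>

/* Minimal stand-in for submitter identity (cf. uid/fsuid checks). */
struct toy_cred {
	unsigned uid;
	unsigned gid;
};

struct toy_request {
	struct toy_cred cred;	/* captured at submission time */
	long offset, len;
};

/* Submission: checks against the caller's process happen here, up
 * front, and the needed context is copied into the request. */
static bool toy_submit_io(struct toy_request *req,
			  const struct toy_cred *submitter,
			  long offset, long len)
{
	if (offset < 0 || len < 0)
		return false;		/* validate in caller's context */
	req->cred = *submitter;		/* carry context with the request */
	req->offset = offset;
	req->len = len;
	return true;
}

/* Service: a background thread (cf. writeback/readahead, or NFS's
 * nfs_open_context) acts on the captured identity, never assuming it
 * runs as the original submitter. */
static unsigned toy_service_io(const struct toy_request *req)
{
	return req->cred.uid;
}
```

Trond's counterpoint maps onto this model directly: as `toy_cred` grows to include keyrings, SELinux context and so on, the captured structure approaches a copy of the whole task state.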
Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
> ing to a state machine just to the VFS? If you look no further than some
> hypothetical AIO iscsi/aoe/nbd/whatever target you obviously include
> networking. Probably splice() if you're aggressive :).
>
> Let's be clear. I would be thrilled if AIO was implemented by native
> non-blocking handler implementations. I don't think it will happen.
> Not because we don't think it sounds great on paper, but because it's
> a hugely complex endeavor that would take development and maintenance
> effort away from the task of keeping basic functionality working.
>
> So the hope with fibrils is that we lower the barrier to getting AIO
> syscall support across the board at an acceptable cost.
>
> It doesn't *stop* us from migrating very important paths (storage,
> networking) to wildly optimized AIO implementations. But it also
> doesn't force AIO support to wait for that.
>
> > Your patches don't look that complicated yet but you openly
> > admitted you waved away many of the more tricky issues (like
> > signals etc.) and I bet there are yet-unknown side effects
> > of this too that will need more changes.
>
> To quibble, "waved away" implies that they've been dismissed. That's
> not right. It's a work in progress, so yes, there will be more
> fiddly details discovered and addressed over time. The hope is that
> when it's said and done it'll still be worth merging. If at some
> point it gets to be too much, well, at least we'll have this work to
> reference as a decisive attempt.
>
> > I'm not sure the fibrils thing will be that much faster than
> > a possibly somewhat fast pathed for this case clone+syscall+exit.
>
> I'll try and get some numbers for you sooner rather than later.
>
> Thanks for being diligent, this is exactly the kind of hard look I
> want this work to get.

BTW, I like the way you are approaching this with a cautiously critical
eye, cognizant of lurking details/issues, despite the obvious (and
justified) excitement/eureka feeling. AIO _is_ hard !

Regards
Suparna

> - z

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: 2.6.20-rc6-mm1
On Sun, Jan 28, 2007 at 03:01:33PM -0800, Andrew Morton wrote: > On Sun, 28 Jan 2007 08:56:08 -0800 > "Martin J. Bligh" <[EMAIL PROTECTED]> wrote: > > > > - It seems that people have been busy creating the need for this. I had > > > to > > > apply over sixty patches to this tree to fix post-2.6.20-rc4-mm1 > > > compilation > > > errors. And a number of patches were dropped due to no-compile or to > > > runtime errors. Heaven knows how many runtime bugs were added. > > > > dbench seems to panic on xfs / cfq ? > > OK, I'll dump git-block.patch. That means that the fsaio patches get > temporarily dropped as well as they depend on git-block changes somewhat. Would you like me to rework these back again to not depend on git-block ? > > btw, I didn't actually include the list-aio patches in rc6-mm1 due to > various discouraging-looking review comments - I'll be awaiting version 2 > there. Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
On Wed, Jan 17, 2007 at 05:39:51PM +0300, Evgeniy Polyakov wrote: > On Wed, Jan 17, 2007 at 07:21:42PM +0530, Suparna Bhattacharya ([EMAIL > PROTECTED]) wrote: > > > > Since you are implementing new APIs here, have you considered doing an > > aio_sendfilev to be able to send a header with the data ? > > It is doable, but why people do not like corking? > With Linux less than microsecond syscall overhead it is better and more > flexible solution, doesn't it? That is what I used to think as well. However ... The problem as I understand it now is not about bunching data together, but of ensuring some sort of atomicity between the header and the data, when there can be multiple outstanding aio requests on the same socket - i.e ensuring strict ordering without other data coming in between, when data to be sent is not already in cache, and in the meantime another sendfile or aio write requests comes in for the same socket. Without having to lock the socket when reading data from disk. There are alternate ways to address this, aio_sendfilev is one of the options I have heard people requesting. Regards Suparna > > I'm not saying - 'no, there will not be any *v variants', just getting > more info. > > > Regards > > Suparna > > -- > Evgeniy Polyakov -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Vectored AIO breakage for sockets and pipes ?
The call to aio_advance_iovec() in aio_rw_vect_retry() becomes problematic when it comes to pipe and socket operations which internally modify/advance the iovec themselves. As a result AIO writes to sockets fail to return the correct result. I'm not sure what the best way to fix this is. One option is to always make a copy of the iovec and pass that down. Any other thoughts ? Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
> +	}
> +
> +	u = file->private_data;
> +
> +	memset(&uk, 0, sizeof(struct ukevent));
> +
> +	uk.event = KEVENT_MASK_ALL;
> +	uk.type = KEVENT_AIO;
> +
> +	preempt_disable();
> +	uk.id.raw[0] = smp_processor_id();
> +	cnt = &__get_cpu_var(kaio_req_counter);
> +	uk.id.raw[1] = *cnt;
> +	*cnt = *cnt + 1;
> +	preempt_enable();
> +
> +	uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
> +	uk.ptr = req;
> +
> +	err = kevent_user_add_ukevent(&uk, u);
> +	if (err)
> +		goto err_out_fput;
> +
> +	kevent_user_get(u);
> +
> +	fput_light(file, need_fput);
> +
> +	return 0;
> +
> +err_out_fput:
> +	fput_light(file, need_fput);
> +err_out:
> +	return err;
> +}
> +
> +static void kaio_destructor(struct kaio_req *req)
> +{
> +	struct kaio_private *priv = req->priv;
> +	struct socket *sock = priv->dptr;
> +	struct file *file = priv->sptr;
> +
> +	fput(file);
> +	sockfd_put(sock);
> +
> +	kevent_storage_ready(&priv->kevent_user->st, NULL, KEVENT_MASK_ALL);
> +	kevent_user_put(priv->kevent_user);
> +
> +	kmem_cache_free(kaio_priv_cache, req->priv);
> +}
> +
> +static struct kaio_req *kaio_sendfile(int kevent_fd, int sock_fd, struct file *file, off_t offset, size_t count)
> +{
> +	struct kaio_req *req;
> +	struct socket *sock;
> +	struct kaio_private *priv;
> +	int err;
> +
> +	sock = sockfd_lookup(sock_fd, &err);
> +	if (!sock)
> +		goto err_out_exit;
> +
> +	priv = kmem_cache_alloc(kaio_priv_cache, GFP_KERNEL);
> +	if (!priv)
> +		goto err_out_sput;
> +
> +	priv->sptr = file;
> +	priv->dptr = sock;
> +	priv->offset = offset;
> +	priv->count = min_t(u64, i_size_read(file->f_dentry->d_inode), count);
> +	priv->processed = offset;
> +	priv->limit = 128;
> +
> +	req = kaio_add_call(NULL, &kaio_read_send_pages, -1, GFP_KERNEL);
> +	if (!req)
> +		goto err_out_free;
> +
> +	req->destructor = kaio_destructor;
> +	req->priv = priv;
> +
> +	err = kaio_add_kevent(kevent_fd, req);
> +
> +	dprintk("%s: req: %p, priv: %p, call: %p [%p], offset: %Lu, processed: %Lu, count: %Lu, err: %d.\n",
> +			__func__, req, priv, &kaio_read_send_pages,
> +			kaio_read_send_pages, priv->offset, priv->processed,
> +			priv->count, err);
> +
> +	if (err)
> +		goto err_out_remove;
> +
> +	kaio_schedule_req(req);
> +
> +	return req;
> +
> +err_out_remove:
> +	/* It is safe to just free the object since it is guaranteed that it was not
> +	 * queued for processing.
> +	 */
> +	kmem_cache_free(kaio_req_cache, req);
> +err_out_free:
> +	kmem_cache_free(kaio_priv_cache, priv);
> +err_out_sput:
> +	sockfd_put(sock);
> +err_out_exit:
> +	return NULL;
> +
> +}
> +
> +asmlinkage long sys_aio_sendfile(int kevent_fd, int sock_fd, int in_fd, off_t offset, size_t count)
> +{
> +	struct kaio_req *req;
> +	struct file *file;
> +	int err;
> +
> +	file = fget(in_fd);
> +	if (!file) {
> +		err = -EBADF;
> +		goto err_out_exit;
> +	}
> +
> +	req = kaio_sendfile(kevent_fd, sock_fd, file, offset, count);
> +	if (!req) {
> +		err = -EINVAL;
> +		goto err_out_fput;
> +	}
> +
> +	return (long)req;
> +
> +err_out_fput:
> +	fput(file);
> +err_out_exit:
> +	return err;
> +}
> +
> +asmlinkage long sys_aio_sendfile_path(int kevent_fd, int sock_fd, char __user *filename, off_t offset, size_t count)
> +{
> +	char *tmp = getname(filename);
> +	int fd = PTR_ERR(tmp);
> +	int flags = O_RDONLY, err;
> +	struct nameidata nd;
> +	struct file *file;
> +	struct kaio_req *req;
> +
> +	if (force_o_largefile())
> +		flags = O_LARGEFILE;
> +
> +	if (IS_ERR(tmp)) {
> +		err = fd;
> +		goto err_out_exit;
> +	}
> +
> +	fd = get_unused_fd();
> +	if (fd < 0) {
> +		err = fd;
> +		goto err_out_put_name;
> +	}
> +
> +	if ((flags+1) & O_ACCMODE)
> +		flags++;
> +
> +	err = open_namei(AT_FDCWD, tmp, flags, 0400, &nd);
> +	if (err)
> +		goto err_out_fdput;
> +
> +	file = nameidata_to_filp(&nd, flags);
> +	if (!file)
> +		goto err_out_fdput;
> +
> +	/* One reference will be released in sys_close(),
> +	 * second one through req->destructor()
> +	 */
> +	atomic_inc(&file->f_count);
> +
> +	req = kaio_sendfile(kevent_fd, sock_fd, file, offset, count);
> +	if (!req) {
> +		err = -EINVAL;
> +		goto err_out_fput;
> +	}
> +
> +	fd_install(fd, file);
> +
> +	return fd;
> +
> +err_out_fput:
> +	fput(file);
> +	fput(file);
> +err_out_fdput:
> +	put_unused_fd(fd);
> +err_out_put_name:
> +	putname(tmp);
> +err_out_exit:
> +	return err;
> +}
> +
> +static int kevent_aio_callback(struct kevent *k)
> +{
> +	struct kaio_req *req = k->event.ptr;
> +	struct kaio_private *priv = req->priv;
> +
> +	if (!priv->count) {
> +		__u32 *processed = (__u32 *)&priv->processed;
> +		k->event.ret_data[0] = processed[0];
> +		k->event.ret_data[1] = processed[1];
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +int kevent_aio_enqueue(struct kevent *k)
> +{
> +	int err;
> +	struct kaio_req *req = k->event.ptr;
> +	struct kaio_private *priv = req->priv;
> +
> +	err = kevent_storage_enqueue(&k->user->st, k);
> +	if (err)
> +		goto err_out_exit;
> +
> +	priv->kevent_user = k->user;
> +	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE)
> +		kevent_requeue(k);
> +
> +	return 0;
> +
> +err_out_exit:
> +	return err;
> +}
> +
> +int kevent_aio_dequeue(struct kevent *k)
> +{
> +	kevent_storage_dequeue(k->st, k);
> +
> +	return 0;
> +}
> +
> +static void kaio_thread_stop(struct kaio_thread *th)
> +{
> +	kthread_stop(th->thread);
> +	kfree(th);
> +}
> +
> +static int kaio_thread_start(struct kaio_thread **thp, int num)
> +{
> +	struct kaio_thread *th;
> +
> +	th = kzalloc(sizeof(struct kaio_thread), GFP_KERNEL);
> +	if (!th)
> +		return -ENOMEM;
> +
> +	th->refcnt = 1;
> +	spin_lock_init(&th->req_lock);
> +	INIT_LIST_HEAD(&th->req_list);
> +	init_waitqueue_head(&th->wait);
> +
> +	th->thread = kthread_run(kaio_thread_process, th, "kaio/%d", num);
> +	if (IS_ERR(th->thread)) {
> +		int err = PTR_ERR(th->thread);
> +		kfree(th);
> +		return err;
> +	}
> +
> +	*thp = th;
> +	wmb();
> +
> +	return 0;
> +}
> +
> +static int __init kevent_init_aio(void)
> +{
> +	struct kevent_callbacks sc = {
> +		.callback = &kevent_aio_callback,
> +		.enqueue = &kevent_aio_enqueue,
> +		.dequeue = &kevent_aio_dequeue,
> +		.flags = 0,
> +	};
> +	int err, i;
> +
> +	kaio_req_cache = kmem_cache_create("kaio_req", sizeof(struct kaio_req),
> +			0, SLAB_PANIC, NULL, NULL);
> +	kaio_priv_cache = kmem_cache_create("kaio_priv", sizeof(struct kaio_private),
> +			0, SLAB_PANIC, NULL, NULL);
> +
> +	memset(kaio_threads, 0, sizeof(kaio_threads));
> +
> +	for (i = 0; i < KAIO_THREAD_NUM; ++i) {
> +		err = kaio_thread_start(&kaio_threads[i], i);
> +		if (err)
> +			goto err_out_stop;
> +	}
> +
> +	err = kevent_add_callbacks(&sc, KEVENT_AIO);
> +	if (err)
> +		goto err_out_stop;
> +
> +	return 0;
> +
> +err_out_stop:
> +	while (--i >= 0) {
> +		struct kaio_thread *th = kaio_threads[i];
> +
> +		kaio_threads[i] = NULL;
> +		wmb();
> +
> +		kaio_thread_stop(th);
> +	}
> +	return err;
> +}
> +module_init(kevent_init_aio);
> - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: spurious sparse warnings from linux/aio.h (was: 2.6.20-rc4-mm1)
On Fri, Jan 12, 2007 at 12:55:18PM +0100, Tilman Schmidt wrote: > Andrew Morton schrieb: > > - Merged the "filesystem AIO patches". > > This construct: > > > --- linux-2.6.20-rc4/include/linux/aio.h 2007-01-06 > > 23:34:08.0 -0800 > > +++ devel/include/linux/aio.h 2007-01-11 21:36:16.0 -0800 > > > @@ -237,7 +243,8 @@ do { > > \ > > } \ > > } while (0) > > > > -#define io_wait_to_kiocb(wait) container_of(wait, struct kiocb, ki_wait) > > +#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait, \ > > + struct wait_bit_queue, wait), struct kiocb, ki_wait) > > > > #include <linux/aio_abi.h> > > > > causes a sparse warning: > > > include/linux/sched.h:1313:29: warning: symbol '__mptr' shadows an earlier > > one > > include/linux/sched.h:1313:29: originally declared here > > for every source file referencing <linux/sched.h>. > Could that be avoided please? So ... the nested container_of() is a problem ? I guess changing io_wait_to_kiocb() to be an inline function instead of a macro could help ? Regards Suparna > > Thanks > Tilman > > -- > Tilman Schmidt E-Mail: [EMAIL PROTECTED] > Bonn, Germany > Diese Nachricht besteht zu 100% aus wiederverwerteten Bits. > Ungeöffnet mindestens haltbar bis: (siehe Rückseite) > -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Wed, Jan 10, 2007 at 05:08:29PM -0800, Andrew Morton wrote: > On Wed, 10 Jan 2007 11:14:19 +0530 > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote: > > > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote: > > > On Thu, 4 Jan 2007 10:26:21 +0530 > > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote: > > > > > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote: > > > > > > Patches against next -mm would be appreciated, please. Sorry about that. > > > > I have updated the patchset against 2620-rc3-mm1, incorporated various > > cleanups suggested during last review. Please let me know if I have missed > > anything: > > The s/lock_page_slow/lock_page_blocking/ got lost. I redid it. I thought the lock_page_blocking was an alternative you had suggested to the __lock_page vs lock_page_async discussion which got resolved later. That is why I didn't make the change in this patchset. The call does not block in the async case, hence the choice of the _slow suffix (like in fs/buffer.c). But if lock_page_blocking() sounds more intuitive to you, that's OK. > > For the record, patches-via-http are very painful. Please always always > email them. > > As a result, these patches ended up with titles which are derived from their > filenames, which are cryptic. Sorry about that - I wanted to ask if you'd prefer my resending them to the list, but missed doing so. Some people have found it easier to download the series as a whole when they intend to apply it, so I ended up maintaining it that way all this while. Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote: > On Thu, 4 Jan 2007 10:26:21 +0530 > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote: > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote: > > Patches against next -mm would be appreciated, please. Sorry about that. I have updated the patchset against 2620-rc3-mm1, incorporated various cleanups suggested during last review. Please let me know if I have missed anything: It should show up at www.kernel.org:/pub/linux/kernel/people/suparna/aio/2620-rc3-mm1 Brief changelog: - Reworked against the block layer unplug changes - Switched from defines to inlines for init_wait_bit* etc (per akpm) - Better naming: __lock_page to lock_page_async (per hch, npiggin) - Kill lock_page_slow wrapper and rename __lock_page_slow to lock_page_slow (per hch) - Use a helper function aio_restarted() (per hch) - Replace combined if/assignment (per hch) - fix resetting of current->io_wait after ->retry in aio_run_iocb (per zab) I have run my usual aio-stress variations script (www.kernel.org:/pub/linux/kernel/people/suparna/aio/aio-results.sh) Regards Suparna -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: How git affects kernel.org performance
On Sun, Jan 07, 2007 at 01:15:42AM -0800, Andrew Morton wrote: > On Sun, 7 Jan 2007 09:55:26 +0100 > Willy Tarreau <[EMAIL PROTECTED]> wrote: > > > On Sat, Jan 06, 2007 at 09:39:42PM -0800, Linus Torvalds wrote: > > > > > > > > > On Sat, 6 Jan 2007, H. Peter Anvin wrote: > > > > > > > > During extremely high load, it appears that what slows kernel.org down > > > > more > > > > than anything else is the time that each individual getdents() call > > > > takes. > > > > When I've looked this I've observed times from 200 ms to almost 2 > > > > seconds! > > > > Since an unpacked *OR* unpruned git tree adds 256 directories to a > > > > cleanly > > > > packed tree, you can do the math yourself. > > > > > > "getdents()" is totally serialized by the inode semaphore. It's one of the > > > most expensive system calls in Linux, partly because of that, and partly > > > because it has to call all the way down into the filesystem in a way that > > > almost no other common system call has to (99% of all filesystem calls can > > > be handled basically at the VFS layer with generic caches - but not > > > getdents()). > > > > > > So if there are concurrent readdirs on the same directory, they get > > > serialized. If there is any file creation/deletion activity in the > > > directory, it serializes getdents(). > > > > > > To make matters worse, I don't think it has any read-ahead at all when you > > > use hashed directory entries. So if you have cold-cache case, you'll read > > > every single block totally individually, and serialized. One block at a > > > time (I think the non-hashed case is likely also suspect, but that's a > > > separate issue) > > > > > > In other words, I'm not at all surprised it hits on filldir time. > > > Especially on ext3. > > > > At work, we had the same problem on a file server with ext3. 
We use rsync > > to make backups to a local IDE disk, and we noticed that getdents() took > > about the same time as Peter reports (0.2 to 2 seconds), especially in > > maildir directories. We tried many things to fix it with no result, > > including enabling dirindexes. Finally, we made a full backup, and switched > > over to XFS and the problem totally disappeared. So it seems that the > > filesystem matters a lot here when there are lots of entries in a > > directory, and that ext3 is not suitable for usages with thousands > > of entries in directories with millions of files on disk. I'm not > > certain it would be that easy to try other filesystems on kernel.org > > though :-/ > > > > Yeah, slowly-growing directories will get splattered all over the disk. > > Possible short-term fixes would be to just allocate up to (say) eight > blocks when we grow a directory by one block. Or teach the > directory-growth code to use ext3 reservations. > > Longer-term people are talking about things like on-disk reservations. > But I expect directories are being forgotten about in all of that. By on-disk reservations, do you mean persistent file preallocation ? (that is explicit preallocation of blocks to a given file) If so, you are right, we haven't really given any thought to the possibility of directories needing that feature. Regards Suparna > > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Fri, Jan 05, 2007 at 08:02:33AM +0100, Jens Axboe wrote:
> On Fri, Jan 05 2007, Suparna Bhattacharya wrote:
> > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > > On Thu, 4 Jan 2007 10:26:21 +0530
> > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > This patchset implements changes to make filesystem AIO read
> > > > > > and write asynchronous for the non O_DIRECT case.
> > > > >
> > > > > Unfortunately the unplugging changes in Jens' block tree have
> > > > > trashed these patches to a degree that I'm not confident in my
> > > > > repair attempts. So I'll drop the fsaio patches from -mm.
> > > >
> > > > I took a quick look and the conflicts seem pretty minor to me, the
> > > > unplugging changes mostly touch nearby code.
> > >
> > > Well... the conflicts (both mechanical and conceptual) are such that a
> > > round of retesting is needed.
> > >
> > > > Please let me know how you want this fixed up.
> > > >
> > > > From what I can tell the comments in the unplug patches seem to say
> > > > that it needs more work and testing, so perhaps a separate fixup
> > > > patch may be a better idea rather than make the fsaio patchset
> > > > dependent on this.
> > >
> > > Patches against next -mm would be appreciated, please. Sorry about that.
> > >
> > > I _assume_ Jens is targeting 2.6.21?
> >
> > When is the next -mm likely to be out ?
> >
> > I was considering regenerating the blk unplug patches against the
> > fsaio changes instead of the other way around, if Jens were willing to
> > accept that. But if the next -mm is just around the corner then it's
> > not an issue.
>
> I don't really care much, but I work against mainline and anything but
> occasional one-off generations of a patch against a different base is
> not very likely.
>
> The -mm order should just reflect the merge order of the patches, what
> is the fsaio target?

2.6.21 was what I had in mind, to enable the glibc folks to proceed with
conversion to native AIO. Regenerating my patches against the unplug stuff
is not a problem; I only worry about being queued up behind something that
may take longer to stabilize and is likely to change ... If that is not
the case, I don't mind.

Regards
Suparna

> --
> Jens Axboe
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [EMAIL PROTECTED] For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [EMAIL PROTECTED]

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> On Thu, 4 Jan 2007 10:26:21 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > This patchset implements changes to make filesystem AIO read
> > > > and write asynchronous for the non O_DIRECT case.
> > >
> > > Unfortunately the unplugging changes in Jens' block tree have trashed
> > > these patches to a degree that I'm not confident in my repair attempts.
> > > So I'll drop the fsaio patches from -mm.
> >
> > I took a quick look and the conflicts seem pretty minor to me, the
> > unplugging changes mostly touch nearby code.
>
> Well... the conflicts (both mechanical and conceptual) are such that a
> round of retesting is needed.
>
> > Please let me know how you want this fixed up.
> >
> > From what I can tell the comments in the unplug patches seem to say that
> > it needs more work and testing, so perhaps a separate fixup patch may be
> > a better idea rather than make the fsaio patchset dependent on this.
>
> Patches against next -mm would be appreciated, please. Sorry about that.
>
> I _assume_ Jens is targeting 2.6.21?

When is the next -mm likely to be out ?

I was considering regenerating the blk unplug patches against the fsaio
changes instead of the other way around, if Jens were willing to accept
that. But if the next -mm is just around the corner then it's not an issue.

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
[PATCHSET 4][PATCH 1/1] AIO fallback for pipes, sockets and pollable fds
	msg->msg_iovlen = nr_segs;
-	msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+	msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+			? MSG_DONTWAIT : 0;
 	return __sock_recvmsg(iocb, sock, msg, size, msg->msg_flags);
 }
@@ -741,7 +742,8 @@ static ssize_t do_sock_write(struct msgh
 	msg->msg_controllen = 0;
 	msg->msg_iov = (struct iovec *)iov;
 	msg->msg_iovlen = nr_segs;
-	msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+	msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+			? MSG_DONTWAIT : 0;
 	if (sock->type == SOCK_SEQPACKET)
 		msg->msg_flags |= MSG_EOR;
diff -puN fs/pipe.c~aio-fallback-nonblock fs/pipe.c
--- linux-2.6.20-rc1/fs/pipe.c~aio-fallback-nonblock	2007-01-03 19:16:36.0 +0530
+++ linux-2.6.20-rc1-root/fs/pipe.c	2007-01-03 19:16:36.0 +0530
@@ -226,14 +226,16 @@ pipe_read(struct kiocb *iocb, const stru
 	struct pipe_inode_info *pipe;
 	int do_wakeup;
 	ssize_t ret;
-	struct iovec *iov = (struct iovec *)_iov;
+	struct iovec iov_array[nr_segs];
+	struct iovec *iov = iov_array;
 	size_t total_len;
 
-	total_len = iov_length(iov, nr_segs);
+	total_len = iov_length(_iov, nr_segs);
 	/* Null read succeeds. */
 	if (unlikely(total_len == 0))
 		return 0;
 
+	memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
 	do_wakeup = 0;
 	ret = 0;
 	mutex_lock(&inode->i_mutex);
@@ -302,7 +304,8 @@ redo:
 		 */
 		if (ret)
 			break;
-		if (filp->f_flags & O_NONBLOCK) {
+		if (filp->f_flags & O_NONBLOCK ||
+		    !is_sync_kiocb(iocb)) {
 			ret = -EAGAIN;
 			break;
 		}
@@ -339,15 +342,17 @@ pipe_write(struct kiocb *iocb, const str
 	struct pipe_inode_info *pipe;
 	ssize_t ret;
 	int do_wakeup;
-	struct iovec *iov = (struct iovec *)_iov;
+	struct iovec iov_array[nr_segs];
+	struct iovec *iov = iov_array;
 	size_t total_len;
 	ssize_t chars;
 
-	total_len = iov_length(iov, nr_segs);
+	total_len = iov_length(_iov, nr_segs);
 	/* Null write succeeds. */
 	if (unlikely(total_len == 0))
 		return 0;
 
+	memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
 	do_wakeup = 0;
 	ret = 0;
 	mutex_lock(&inode->i_mutex);
@@ -473,7 +478,7 @@ redo2:
 		}
 		if (bufs < PIPE_BUFFERS)
 			continue;
-		if (filp->f_flags & O_NONBLOCK) {
+		if (filp->f_flags & O_NONBLOCK || !is_sync_kiocb(iocb)) {
 			if (!ret)
 				ret = -EAGAIN;
 			break;
_

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
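A note on the semantics the patch above builds on (this example is illustrative and not part of the patch): for pipes, an AIO request (where !is_sync_kiocb() holds) is now handled the same way as O_NONBLOCK - instead of sleeping, the operation fails with -EAGAIN and the AIO core retries it later. The userspace analogue of that return is a read from an empty nonblocking pipe:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a pipe, make the read end nonblocking, and attempt a read while
 * the pipe is empty.  Returns the errno the read produced (EAGAIN on
 * Linux), 0 if the read unexpectedly succeeded, or -1 on setup failure. */
static int empty_pipe_read_errno(void)
{
	int fds[2];
	char c;
	ssize_t r;
	int err;

	if (pipe(fds) < 0)
		return -1;
	fcntl(fds[0], F_SETFL, O_NONBLOCK);
	errno = 0;
	r = read(fds[0], &c, 1);	/* nothing written yet: fails immediately */
	err = (r < 0) ? errno : 0;
	close(fds[0]);
	close(fds[1]);
	return err;
}
```

In the patched kernel path the same -EAGAIN is not returned to the caller; it is intercepted by the AIO retry machinery, which re-drives the iocb when the pipe becomes readable.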
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 05:50:11PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> > On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> >
> > > So long as AIO threads do the same, there would be no problem (plugging
> > > is optional, of course).
> >
> > Yup, the AIO threads run the same code as for regular IO, i.e. in the
> > rare situations where they actually end up submitting IO, so there
> > should be no problem. And you have already added plug/unplug at the
> > appropriate places in those paths, so things should just work.
>
> Yes I think it should.
>
> > > This (is supposed to) give a number of improvements over the
> > > traditional plugging (although some downsides too). Most notably for
> > > me, the VM gets cleaner ;)
> > >
> > > However AIO could be an interesting case to test for explicit plugging
> > > because of the way they interact. What kind of improvements do you see
> > > with samba and do you have any benchmark setups?
> >
> > I think aio-stress would be a good way to test/benchmark this sort of
> > stuff, at least for a start.
> > Samba (if I understand this correctly based on my discussions with
> > Tridge) is less likely to generate the kind of io patterns that could
> > benefit from explicit plugging (because the file server has no way to
> > tell what the next request is going to be, it ends up submitting each
> > independently instead of batching iocbs).
>
> OK, but I think that after IO submission, you do not run sync_page to
> unplug the block device, like the normal IO path would (via lock_page,
> before the explicit plug patches).

In the buffered AIO case, we do run sync_page like normal IO ... just that
we don't block in io_schedule(); everything else is pretty much similar.
In the case of AIO-DIO, the path is just like non-AIO DIO; there is a call
to blk_run_address_space() after submission.

> However, with explicit plugging, AIO requests will be started immediately.
> Maybe this won't be noticeable if the device is always busy, but I would
> like to know there isn't a regression.
>
> > In future there may be optimization possibilities to consider when
> > submitting batches of iocbs, i.e. on the io submission path. Maybe
> > AIO - O_DIRECT would be interesting to play with first in this regard ?
>
> Well I've got some simple per-process batching in there now, each process
> has a list of pending requests. Request merging is done locklessly against
> the last request added; and submission at unplug time is batched under a
> single block device lock.
>
> I'm sure more merging or batching could be done, but also consider that
> most programs will not ever make use of any added complexity.

I guess I didn't express myself well - by batching I meant being able to
surround submission of a batch of iocbs with explicit plug/unplug instead
of explicit plug/unplug for each iocb separately. Of course there is no
easy way to do that, since at the io_submit() level we do not know about
the block device (each iocb could be directed to a different fd and not
just block devices). So it may not be worth thinking about.

> Regarding your patches, I've just had a quick look and have a question --
> what do you do about blocking in page reclaim and dirty balancing? Aren't
> those major points of blocking with buffered IO? Did your test cases
> dirty enough to start writeout or cause a lot of reclaim? (admittedly,
> blocking in reclaim will now be much less common since the dirty mapping
> accounting).

In my earlier versions of patches I actually had converted these waits to
be async retriable, but then I came to the conclusion that the additional
complexity wasn't worth it.

For one, it didn't seem to make a difference compared to the other bigger
cases, and I was looking primarily at handling the gross blocking points
(say, to enable an application to keep device queues busy) and not making
everything asynchronous; for another, we had a long discussion thread way
back about not making AIO submitters exempt from throttling or memory
availability waits.

Regards
Suparna

> --
> SUSE Labs, Novell Inc.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
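To make the batching point above concrete - an illustrative sketch, not from the thread, using the raw Linux AIO syscalls (the helper name, temp-file path, and sizes are arbitrary): a single io_submit() call carries a whole batch of iocbs, which is the level at which any plug/unplug bracketing of a batch would have to happen.

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Submit two buffered reads of a temp file as ONE io_submit() batch, so
 * the kernel sees both iocbs in a single syscall.  Returns the number of
 * completed events (2 on success) or -1 on error. */
static int submit_batch_of_two(void)
{
	aio_context_t ctx = 0;
	struct iocb cb[2], *cbs[2];
	struct io_event ev[2];
	char buf0[8], buf1[8];
	char tmpl[] = "/tmp/fsaio-demo-XXXXXX";
	int fd = mkstemp(tmpl);
	int i;
	long done;

	if (fd < 0)
		return -1;
	unlink(tmpl);
	if (write(fd, "abcdefghijklmnop", 16) != 16)
		return -1;
	if (syscall(SYS_io_setup, 8, &ctx) < 0)
		return -1;

	memset(cb, 0, sizeof(cb));
	for (i = 0; i < 2; i++) {
		cb[i].aio_lio_opcode = IOCB_CMD_PREAD;
		cb[i].aio_fildes = fd;
		cb[i].aio_nbytes = 8;
		cb[i].aio_offset = i * 8;	/* two disjoint 8-byte reads */
		cbs[i] = &cb[i];
	}
	cb[0].aio_buf = (unsigned long)buf0;
	cb[1].aio_buf = (unsigned long)buf1;

	/* one syscall, two iocbs: the whole batch reaches the kernel at once */
	if (syscall(SYS_io_submit, ctx, 2, cbs) != 2)
		return -1;
	done = syscall(SYS_io_getevents, ctx, 2, 2, ev, NULL);
	syscall(SYS_io_destroy, ctx);
	close(fd);
	return (int)done;
}
```

As the mail notes, each iocb in such a batch may target a different fd, so io_submit() itself cannot know which block device (if any) a plug would belong to - which is why the per-iocb plug/unplug was deemed not worth pursuing.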
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> >
> > > Plus Jens's unplugging changes add more reliance upon context inside
> > > *current, for the plugging and unplugging operations. I expect that
> > > the fsaio patches will need to be aware of the protocol which those
> > > proposed changes add.
> >
> > Whatever logic applies to background writeout etc. should also just
> > apply as is to aio worker threads, shouldn't it ? At least at a quick
> > glance I don't see anything special that needs to be done for fsaio,
> > but it's good to be aware of this anyway, thanks !
>
> The submitting process plugs itself, submits all its IO, then unplugs
> itself (ie. so the plug is now on the process, rather than the block
> device).
>
> So long as AIO threads do the same, there would be no problem (plugging
> is optional, of course).

Yup, the AIO threads run the same code as for regular IO, i.e. in the rare
situations where they actually end up submitting IO, so there should be no
problem. And you have already added plug/unplug at the appropriate places
in those paths, so things should just work.

> This (is supposed to) give a number of improvements over the traditional
> plugging (although some downsides too). Most notably for me, the VM gets
> cleaner ;)
>
> However AIO could be an interesting case to test for explicit plugging
> because of the way they interact. What kind of improvements do you see
> with samba and do you have any benchmark setups?

I think aio-stress would be a good way to test/benchmark this sort of
stuff, at least for a start.
Samba (if I understand this correctly based on my discussions with Tridge)
is less likely to generate the kind of io patterns that could benefit from
explicit plugging (because the file server has no way to tell what the
next request is going to be, it ends up submitting each independently
instead of batching iocbs).

In future there may be optimization possibilities to consider when
submitting batches of iocbs, i.e. on the io submission path. Maybe
AIO - O_DIRECT would be interesting to play with first in this regard ?

Regards
Suparna

> Thanks,
> Nick
>
> --
> SUSE Labs, Novell Inc.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 13:53:08 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
>
> > This patchset implements changes to make filesystem AIO read
> > and write asynchronous for the non O_DIRECT case.
>
> Unfortunately the unplugging changes in Jens' block tree have trashed
> these patches to a degree that I'm not confident in my repair attempts.
> So I'll drop the fsaio patches from -mm.

I took a quick look and the conflicts seem pretty minor to me, the
unplugging changes mostly touch nearby code. Please let me know how you
want this fixed up.

From what I can tell the comments in the unplug patches seem to say that
it needs more work and testing, so perhaps a separate fixup patch may be
a better idea rather than make the fsaio patchset dependent on this.

> Zach's observations regarding this code's reliance upon things at *current
> sounded pretty serious, so I expect we'd be seeing changes for that anyway?

Not really, at least nothing that I can see needing a change. As I
mentioned, there is no reliance on *current in the code that runs in the
aio threads that we need to worry about. The generic_write_checks etc.
that Zach was referring to all happen in the context of the submitting
process, not in retry context. The model is to perform all validation at
the time of io submission. And of course things like copy_to_user() are
already taken care of by use_mm().

Let's look at it this way - the kernel already has the ability to do
background writeout on behalf of a task from a kernel thread, and likewise
to read(ahead) pages that may be consumed by another task. There is also
the ability to operate on another task's address space (as used by
ptrace). So there is nothing groundbreaking here. In fact on most
occasions, all the IO is initiated in the context of the submitting task,
so the aio threads mainly deal with checking for completion and
transferring completed data to user space.

> Plus Jens's unplugging changes add more reliance upon context inside
> *current, for the plugging and unplugging operations. I expect that the
> fsaio patches will need to be aware of the protocol which those proposed
> changes add.

Whatever logic applies to background writeout etc. should also just apply
as is to aio worker threads, shouldn't it ? At least at a quick glance I
don't see anything special that needs to be done for fsaio, but it's good
to be aware of this anyway, thanks !

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
[PATCHSET 2][PATCH 1/1] Combining epoll and disk file AIO
On Wed, Dec 27, 2006 at 09:08:56PM +0530, Suparna Bhattacharya wrote:
> (2) Most of these other applications need the ability to process both
> network events (epoll) and disk file AIO in the same loop. With POSIX AIO
> they could at least sort of do this using signals (yeah, and all associated
> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner. Tridge has written a test harness to
> try out the Samba4 event library modifications to use this. Jeff Moyer
> has a modified version of pipetest for comparison.

Enable epoll wait to be unified with io_getevents

From: Zach Brown, Jeff Moyer, Suparna Bhattacharya

Previously there have been (complicated and scary) attempts to funnel individual aio events down epoll or vice versa. This instead lets one issue an entire sys_epoll_wait() as an aio op. You'd set up epoll as usual and then issue epoll_wait aio ops which would complete once epoll events had been copied. This will enable a single io_getevents() event loop to process both disk AIO and epoll notifications.

From an application standpoint a typical flow works like this:
- Use epoll_ctl as usual to add/remove epoll registrations
- Instead of issuing sys_epoll_wait, set up an iocb using io_prep_epoll_wait (see examples below), specifying the epoll events buffer to fill up with epoll notifications. Submit the iocb using io_submit
- Now io_getevents can be used to wait for both epoll waits and disk aio completion

If the returned AIO event is of type IO_CMD_EPOLL_WAIT, then the corresponding result value indicates the number of epoll notifications in the iocb's event buffer, which can now be processed just as one would process results from a sys_epoll_wait().

There are a couple of sample applications:
- Andrew Tridgell has implemented a little test harness using an aio events library implementation intended for samba4: http://samba.org/~tridge/etest (The -e aio option uses aio epoll wait and can issue disk aio as well)
- An updated version of pipetest from Jeff Moyer has a --aio-epoll option: http://people.redhat.com/jmoyer/aio/epoll/pipetest.c

There is obviously a little overhead compared to using sys_epoll_wait(), due to the extra step of submitting the epoll wait iocb, most noticeable when there are very few events processed per loop. However, the goal here is not to build an epoll alternative but merely to allow network and disk I/O to be processed in the same event loop, which is where the efficiencies really come from. Picking up more epoll events in each loop can amortize the overhead across many operations to mitigate the impact.

Thanks to Arjan van de Ven for helping figure out how to resolve the lockdep complaints. Both ctx->lock and ep->lock can be held in certain wait queue callback routines, thus being nested inside q->lock. However, this excludes ctx->wait or ep->wq wait queues, which can safely be nested inside ctx->lock or ep->lock respectively. So we teach lockdep to recognize these as distinct classes.
Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Signed-off-by: Jeff Moyer <[EMAIL PROTECTED]>
Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
---
 linux-2.6.20-rc1-root/fs/aio.c                  |   54 +
 linux-2.6.20-rc1-root/fs/eventpoll.c            |   95 +---
 linux-2.6.20-rc1-root/include/linux/aio.h       |    2
 linux-2.6.20-rc1-root/include/linux/aio_abi.h   |    1
 linux-2.6.20-rc1-root/include/linux/eventpoll.h |   31 +++
 linux-2.6.20-rc1-root/include/linux/sched.h     |    2
 linux-2.6.20-rc1-root/kernel/timer.c            |   21 +
 7 files changed, 196 insertions(+), 10 deletions(-)

diff -puN fs/aio.c~aio-epoll-wait fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-epoll-wait	2006-12-28 14:22:52.0 +0530
+++ linux-2.6.20-rc1-root/fs/aio.c	2007-01-03 11:45:40.0 +0530
@@ -30,6 +30,7 @@
 #include <linux/highmem.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
+#include <linux/eventpoll.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -193,6 +194,8 @@ static int aio_setup_ring(struct kioctx
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+static struct lock_class_key ioctx_wait_queue_head_lock_key;
+
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -224,6 +227,8 @@ static struct kioctx *ioctx_alloc(unsign
 	spin_lock_init(&ctx->ctx_lock);
 	spin_lock_init(&ctx->ring_info.ring_lock);
 	init_waitqueue_head(&ctx->wait);
+	/* Teach lockdep to recognize this lock as a different class */
+	lockdep_set_class(&ctx->wait.lock, &ioctx_wait_queue_head_lock_key);
 	INIT_LIST_HEAD(&ctx->active_reqs);
 	INIT_LIST_HEAD(&ctx->run_list);
@@ -1401,6 +1406,42 @@ static ssize_t aio_setup_single_vector(s
 	return 0;
 }
 
+/* Uses
Re: [RFC] Heads up on a series of AIO patchsets
On Tue, Jan 02, 2007 at 03:56:09PM -0800, Zach Brown wrote:
> Sorry for the delay, I'm finally back from the holiday break :)

Welcome back !

> > (1) The filesystem AIO patchset, attempts to address one part of the problem
> > which is to make regular file IO, (without O_DIRECT) asynchronous (mainly
> > the case of reads of uncached or partially cached files, and O_SYNC writes).
>
> One of the properties of the currently implemented EIOCBRETRY aio
> path is that ->mm is the only field in current which matches the
> submitting task_struct while inside the retry path.

Yes, and that, as I guess you know, is to enable the aio worker thread to operate on the caller's address space for copy_from/to_user. The actual io setup and associated checks are expected to have been handled at submission time.

> It looks like a retry-based aio write path would be broken because of
> this. generic_write_checks() could run in the aio thread and get its
> task_struct instead of that of the submitter. The wrong rlimit will
> be tested and SIGXFSZ won't be raised. remove_suid() could check the
> capabilities of the aio thread instead of those of the submitter.

generic_write_checks() are done in the submission path, not repeated during retries, so such types of checks are not intended to run in the aio thread. Did I miss something here ?

> I don't think EIOCBRETRY is the way to go because of this increased
> (and subtle!) complexity. What are the chances that we would have
> ever found those bugs outside code review? How do we make sure that
> current references don't sneak back in after having initially audited
> the paths?

The EIOCBRETRY route is not something that is intended to be used blindly; it is just one alternative to implement an aio operation by splitting up responsibility between the submitter and aio threads, where aio threads can run in the caller's address space.

> Take the io_cmd_epoll_wait patch..
>
> > issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> > modifications from Jeff Moyer and me) addresses this problem for native
> > linux aio in a simple manner.
>
> It's simple looking, sure. This current flipping didn't even occur
> to me while throwing the patch together!
>
> But that patch ends up calling ->poll (and poll_table->qproc) and
> writing to userspace (so potentially calling ->nopage) from the aio

Yes of course, but why is that a problem ? The copy_from/to_user/put_user constructs are designed to handle soft failures, and we are already using the caller's ->mm. Do you see a need for any additional asserts() ? If there is something that is needed by ->nopage etc. which is not abstracted out within the ->mm, then we would need to fix that instead, for correctness anyway, isn't that so ?

Now it is possible that there are minor blocking points in the code, and the effect of these would be to hold up / delay subsequent queued aio operations; which is an efficiency issue, but not a correctness concern.

> threads. Are we sure that none of them will behave surprisingly
> because current changed under them?

My take is that we should fix the problems that we see. It is likely that what manifests relatively more easily with AIO is also a subtle problem in other cases.

> It might be safe now, but that isn't really the point. I'd rather we
> didn't have yet one more subtle invariant to audit and maintain.
>
> At the risk of making myself vulnerable to the charge of mentioning
> vapourware, I will admit that I've been working on a (slightly mad)
> implementation of async syscalls. I've been quiet about it because I
> don't want to whip up complicated discussion without being able to
> show code that works, even if barely. I mention it now only to make
> it clear that I want to be constructive, not just critical :).

That is great and I look forward to it :) I am, however, assuming that whatever implementation you come up with will have a different interface from current linux aio -- i.e. a next generation aio model, that will be easily integratable with kevents etc. Which takes me back to Ingo's point - let's have the new evolve in parallel with the old, if we can, and not hold up the patches for POSIX AIO to start using kernel AIO, or for epoll to integrate with AIO.

OK, I just took a quick look at your blog and I see that you are basically implementing Linus' microthreads scheduling approach - one year since we had that discussion. Glad to see that you found a way to make it workable ... (I'm guessing that you are copying over the part of the stack in use at the time of every switch, is that correct ? At what point do you do the allocation of the saved stacks ? Sorry, I should hold off all these questions till your patch comes out)

Regards
Suparna
Re: [FSAIO][PATCH 7/8] Filesystem AIO read
On Thu, Dec 28, 2006 at 11:57:47AM +, Christoph Hellwig wrote:
> > +	if (in_aio()) {
> > +		/* Avoid repeat readahead */
> > +		if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
> > +			next_index = last_index;
> > +	}
>
> Every place we use kiocbTryRestart in this and the next patch it's in
> this form, so we should add a little helper for it:
>
> int aio_try_restart(void)
> {
> 	struct wait_queue_head_t *wq = current->io_wait;
>
> 	if (!is_sync_wait(wq) && kiocbTryRestart(io_wait_to_kiocb(wq)))
> 		return 1;
> 	return 0;
> }

Yes, we can do that -- how about aio_restarted() as an alternate name ?

> with a big kerneldoc comment explaining this idiom (and possibly a better
> name for the function ;-))
>
> > +
> > +	if ((error = __lock_page(page, current->io_wait))) {
> > +		goto readpage_error;
> > +	}
>
> This should be
>
> 	error = __lock_page(page, current->io_wait);
> 	if (error)
> 		goto readpage_error;
>
> Plus possible naming updates discussed in the last mail. Also do we
> really need to pass current->io_wait here? Isn't the waitqueue in
> the kiocb always guaranteed to be the same? Now that all pagecache

We don't have the kiocb available to this routine. Using current->io_wait avoids the need to pass the iocb down to deeper levels just for the sync vs async checks, also allowing such routines to be shared by other code which does not use iocbs (e.g. generic_file_sendfile->do_generic_file_read->do_generic_mapping_read) without having to set up dummy iocbs. Does that clarify ? We could abstract this away within a lock page wrapper, but I don't know if that makes a difference.

> I/O goes through the ->aio_read/->aio_write routines I'd prefer to
> get rid of the task_struct field cludges and pass all this around in
> the kiocb.
Regards
Suparna
Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page
On Thu, Dec 28, 2006 at 11:55:10AM +, Christoph Hellwig wrote:
> On Thu, Dec 28, 2006 at 02:11:49PM +0530, Suparna Bhattacharya wrote:
> > -extern void FASTCALL(lock_page_slow(struct page *page));
> > +extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait));
> >  extern void FASTCALL(__lock_page_nosync(struct page *page));
> >  extern void FASTCALL(unlock_page(struct page *page));
> >
> >  /*
> >   * lock_page may only be called if we have the page's inode pinned.
> >   */
> > -static inline void lock_page(struct page *page)
> > +static inline int __lock_page(struct page *page, wait_queue_t *wait)
> >  {
> >  	might_sleep();
> >  	if (TestSetPageLocked(page))
> > -		lock_page_slow(page);
> > +		return __lock_page_slow(page, wait);
> > +	return 0;
> >  }
> >
> > +#define lock_page(page)	__lock_page(page, &current->__wait.wait)
> > +#define lock_page_slow(page)	__lock_page_slow(page, &current->__wait.wait)
>
> Can we please simply kill your lock_page_slow wrapper and rename the
> arguments taking __lock_page_slow to lock_page_slow? All too many
> variants of the locking functions aren't all that useful and there's
> very few users.

OK.

> Similarly I don't really think __lock_page is an all that useful name here.
> What about lock_page_wq? or aio_lock_page to denote it has special

I am really bad with names :( I tried using the _wq suffixes earlier and that seemed confusing to some, but if no one else objects I'm happy to use that. I thought aio_lock_page() might be misleading because it is synchronous if a regular wait queue entry is passed in, but again it may not be too bad. What's your preference ? Does anything more intuitive come to mind ?

> meaning in aio context? Then again, because of these special semantics
> we need a bunch of really verbose kerneldoc comments for this function
> family.
Regards
Suparna
Re: [FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine
Sorry, this should have read [PATCH 1/8] instead of [PATCH 1/6]

Regards
Suparna
[FSAIO][PATCH 8/8] AIO O_SYNC filesystem write
AIO support for O_SYNC buffered writes, built over O_SYNC-speedup. It uses the tagged radix tree lookups to write out just the pages pertaining to this request, and retries instead of blocking for writeback to complete on the same range. All the writeout is issued at the time of io submission, and there is a check to make sure that retries skip straight to the wait_on_page_writeback_range.

Limitations: Extending file writes or hole overwrites with O_SYNC may still block because we have yet to convert generic_osync_inode to be asynchronous. For non O_SYNC writes, writeout happens in the background and so typically appears async to the caller, except for memory throttling and non-block-aligned writes involving read-modify-write.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/aio.h                      |    0
 linux-2.6.20-rc1-root/include/linux/fs.h |   13 +-
 linux-2.6.20-rc1-root/mm/filemap.c       |   61 +--
 3 files changed, 54 insertions(+), 20 deletions(-)

diff -puN include/linux/aio.h~aio-fs-write include/linux/aio.h
diff -puN mm/filemap.c~aio-fs-write mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~aio-fs-write	2006-12-21 08:46:21.0 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c	2006-12-21 08:46:21.0 +0530
@@ -239,10 +239,11 @@ EXPORT_SYMBOL(filemap_flush);
  * @end:	ending page index
  *
  * Wait for writeback to complete against pages indexed by start->end
- * inclusive
+ * inclusive. In AIO context, this may queue an async notification
+ * and retry callback and return, instead of blocking the caller.
  */
-int wait_on_page_writeback_range(struct address_space *mapping,
-				pgoff_t start, pgoff_t end)
+int __wait_on_page_writeback_range(struct address_space *mapping,
+				pgoff_t start, pgoff_t end, wait_queue_t *wait)
 {
 	struct pagevec pvec;
 	int nr_pages;
@@ -254,20 +255,20 @@ int wait_on_page_writeback_range(struct
 	pagevec_init(&pvec, 0);
 	index = start;
-	while ((index <= end) &&
+	while (!ret && (index <= end) &&
 			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
 			PAGECACHE_TAG_WRITEBACK,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
 		unsigned i;
 
-		for (i = 0; i < nr_pages; i++) {
+		for (i = 0; !ret && (i < nr_pages); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* until radix tree lookup accepts end_index */
 			if (page->index > end)
 				continue;
 
-			wait_on_page_writeback(page);
+			ret = __wait_on_page_writeback(page, wait);
 			if (PageError(page))
 				ret = -EIO;
 		}
@@ -303,18 +304,27 @@ int sync_page_range(struct inode *inode,
 {
 	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
 	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
+	int ret = 0;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
 		return 0;
+	if (in_aio()) {
+		/* Already issued writeouts for this iocb ? */
+		if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+			goto do_wait;	/* just need to check if done */
+	}
 	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
+
-	if (ret == 0) {
+	if (ret >= 0) {
 		mutex_lock(&inode->i_mutex);
 		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
 		mutex_unlock(&inode->i_mutex);
 	}
-	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
+do_wait:
+	if (ret >= 0) {
+		ret = __wait_on_page_writeback_range(mapping, start, end,
+				current->io_wait);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(sync_page_range);
@@ -335,15 +345,23 @@ int sync_page_range_nolock(struct inode
 {
 	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
 	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
+	int ret = 0;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
 		return 0;
+	if (in_aio()) {
+		/* Already issued writeouts for this iocb ? */
+		if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+			goto do_wait;	/* just need to check if done */
+	}
 	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-	if (ret == 0)
+	if (ret >= 0)
 		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-	if (
[FSAIO][PATCH 7/8] Filesystem AIO read
Converts the wait for page to become uptodate (lock page) after readahead/readpage (in do_generic_mapping_read) to a retry exit, to make buffered filesystem AIO reads actually synchronous. The patch avoids exclusive wakeups with AIO, a problem originally spotted by Chris Mason, though the reasoning for why it is an issue is now much clearer (see explanation in the comment below in aio.c), and the solution is perhaps slightly simpler. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/fs/aio.c| 11 ++- linux-2.6.20-rc1-root/include/linux/aio.h |5 + linux-2.6.20-rc1-root/mm/filemap.c| 19 --- 3 files changed, 31 insertions(+), 4 deletions(-) diff -puN fs/aio.c~aio-fs-read fs/aio.c --- linux-2.6.20-rc1/fs/aio.c~aio-fs-read 2006-12-21 08:46:13.0 +0530 +++ linux-2.6.20-rc1-root/fs/aio.c 2006-12-28 09:26:30.0 +0530 @@ -1529,7 +1529,16 @@ static int aio_wake_function(wait_queue_ list_del_init(>task_list); kick_iocb(iocb); - return 1; + /* +* Avoid exclusive wakeups with retries since an exclusive wakeup +* may involve implicit expectations of waking up the next waiter +* and there is no guarantee that the retry will take a path that +* would do so. For example if a page has become up-to-date, then +* a retried read may end up straightaway performing a copyout +* and not go through a lock_page - unlock_page that would have +* passed the baton to the next waiter. 
+*/ + return 0; } int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, diff -puN mm/filemap.c~aio-fs-read mm/filemap.c --- linux-2.6.20-rc1/mm/filemap.c~aio-fs-read 2006-12-21 08:46:13.0 +0530 +++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:31:48.0 +0530 @@ -909,6 +909,11 @@ void do_generic_mapping_read(struct addr if (!isize) goto out; + if (in_aio()) { + /* Avoid repeat readahead */ + if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait))) + next_index = last_index; + } end_index = (isize - 1) >> PAGE_CACHE_SHIFT; for (;;) { struct page *page; @@ -978,7 +983,10 @@ page_ok: page_not_up_to_date: /* Get exclusive access to the page ... */ - lock_page(page); + + if ((error = __lock_page(page, current->io_wait))) { + goto readpage_error; + } /* Did it get truncated before we got the lock? */ if (!page->mapping) { @@ -1006,7 +1014,8 @@ readpage: } if (!PageUptodate(page)) { - lock_page(page); + if ((error = __lock_page(page, current->io_wait))) + goto readpage_error; if (!PageUptodate(page)) { if (page->mapping == NULL) { /* @@ -1052,7 +1061,11 @@ readpage: goto page_ok; readpage_error: - /* UHHUH! A synchronous read error occurred. Report it */ + /* We don't have uptodate data in the page yet */ + /* Could be due to an error or because we need to +* retry when we get an async i/o notification. +* Report the reason. 
+*/ desc->error = error; page_cache_release(page); goto out; diff -puN include/linux/aio.h~aio-fs-read include/linux/aio.h --- linux-2.6.20-rc1/include/linux/aio.h~aio-fs-read 2006-12-21 08:46:13.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/aio.h 2006-12-28 09:26:27.0 +0530 @@ -34,21 +34,26 @@ struct kioctx; /* #define KIF_LOCKED 0 */ #define KIF_KICKED 1 #define KIF_CANCELLED 2 +#define KIF_RESTARTED 3 #define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags) +#define kiocbTryRestart(iocb) test_and_set_bit(KIF_RESTARTED, &(iocb)->ki_flags) #define kiocbSetLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags) #define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags) +#define kiocbSetRestarted(iocb) set_bit(KIF_RESTARTED, &(iocb)->ki_flags) #define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags) #define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)-
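The kiocbTryRestart() guard introduced above leans entirely on test_and_set_bit() semantics: the first retry pass finds KIF_RESTARTED clear, sets it, and gets back 0 (so readahead is issued exactly once); every later pass gets back 1 and jumps straight to the completion check. A minimal userspace sketch of that idiom (plain C with hypothetical names, not the kernel code — the real kiocbTryRestart uses the atomic bitops in the patch):

```c
#include <assert.h>

/* Hypothetical stand-in for the kiocb flag word (illustration only) */
enum { KIF_RESTARTED = 3 };

struct fake_kiocb {
    unsigned long ki_flags;
};

/* Non-atomic userspace model of test_and_set_bit(): set the bit,
 * return its previous value. */
static int test_and_set_bit_model(int nr, unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    int old = (*addr & mask) != 0;
    *addr |= mask;
    return old;
}

/* Model of kiocbTryRestart(): 0 on the first call (issue readahead),
 * nonzero on every retry (skip readahead, just check completion). */
static int kiocb_try_restart(struct fake_kiocb *iocb)
{
    return test_and_set_bit_model(KIF_RESTARTED, &iocb->ki_flags);
}
```

The "try" naming mirrors kiocbTryLock/kiocbTryKick: the test and the set are one step, so a retry can never race itself into issuing readahead twice.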
[FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page
Define low-level page wait and lock page routines which take a wait queue entry pointer as an additional parameter and return status (which may be non-zero when the wait queue parameter signifies an asynchronous wait, typically during AIO). Synchronous IO waits become a special case where the wait queue parameter is the running task's default io wait context. Asynchronous IO waits happen when the wait queue parameter is the io wait context of a kiocb. Code paths which choose to execute synchronous or asynchronous behaviour depending on the called context specify the current io wait context (which points to sync or async context as the case may be) as the wait parameter. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/pagemap.h | 30 ++--- linux-2.6.20-rc1-root/include/linux/sched.h |1 linux-2.6.20-rc1-root/kernel/sched.c | 14 +++ linux-2.6.20-rc1-root/mm/filemap.c| 31 +++--- 4 files changed, 55 insertions(+), 21 deletions(-) diff -puN include/linux/pagemap.h~aio-wait-page include/linux/pagemap.h --- linux-2.6.20-rc1/include/linux/pagemap.h~aio-wait-page 2006-12-21 08:46:02.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/pagemap.h 2006-12-21 08:46:02.0 +0530 @@ -133,20 +133,24 @@ static inline pgoff_t linear_page_index( return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); } -extern void FASTCALL(lock_page_slow(struct page *page)); +extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait)); extern void FASTCALL(__lock_page_nosync(struct page *page)); extern void FASTCALL(unlock_page(struct page *page)); /* * lock_page may only be called if we have the page's inode pinned. 
*/ -static inline void lock_page(struct page *page) +static inline int __lock_page(struct page *page, wait_queue_t *wait) { might_sleep(); if (TestSetPageLocked(page)) - lock_page_slow(page); + return __lock_page_slow(page, wait); + return 0; } +#define lock_page(page) __lock_page(page, &current->__wait.wait) +#define lock_page_slow(page) __lock_page_slow(page, &current->__wait.wait) + /* * lock_page_nosync should only be used if we can't pin the page's inode. * Doesn't play quite so well with block device plugging. @@ -162,7 +166,8 @@ static inline void lock_page_nosync(stru * This is exported only for wait_on_page_locked/wait_on_page_writeback. * Never use this directly! */ -extern void FASTCALL(wait_on_page_bit(struct page *page, int bit_nr)); +extern int FASTCALL(wait_on_page_bit(struct page *page, int bit_nr, + wait_queue_t *wait)); /* * Wait for a page to be unlocked. @@ -171,21 +176,30 @@ extern void FASTCALL(wait_on_page_bit(st * ie with increased "page->count" so that the page won't * go away during the wait..
*/ -static inline void wait_on_page_locked(struct page *page) +static inline int __wait_on_page_locked(struct page *page, wait_queue_t *wait) { if (PageLocked(page)) - wait_on_page_bit(page, PG_locked); + return wait_on_page_bit(page, PG_locked, wait); + return 0; } +#define wait_on_page_locked(page) \ + __wait_on_page_locked(page, >__wait.wait) + /* * Wait for a page to complete writeback */ -static inline void wait_on_page_writeback(struct page *page) +static inline int __wait_on_page_writeback(struct page *page, + wait_queue_t *wait) { if (PageWriteback(page)) - wait_on_page_bit(page, PG_writeback); + return wait_on_page_bit(page, PG_writeback, wait); + return 0; } +#define wait_on_page_writeback(page) \ + __wait_on_page_writeback(page, >__wait.wait) + extern void end_page_writeback(struct page *page); /* diff -puN include/linux/sched.h~aio-wait-page include/linux/sched.h --- linux-2.6.20-rc1/include/linux/sched.h~aio-wait-page2006-12-21 08:46:02.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:26:27.0 +0530 @@ -216,6 +216,7 @@ extern void show_stack(struct task_struc void io_schedule(void); long io_schedule_timeout(long timeout); +int io_wait_schedule(wait_queue_t *wait); extern void cpu_init (void); extern void trap_init(void); diff -puN kernel/sched.c~aio-wait-page kernel/sched.c --- linux-2.6.20-rc1/kernel/sched.c~aio-wait-page 2006-12-21 08:46:02.0 +0530 +++ linux-2.6.20-rc1-root/kernel/sched.c2006-12-21 08:46:02.0 +0530 @@ -4743,6 +4743,20 @@ long __sched io_schedule_timeout(long ti return ret; } +/* + * Sleep only if the wait context passed is not async, + * otherwise return so that a retry can be issued later. + */ +int __sched io_wait_schedule(wait_queue_t *wait) +{ + if (
[FSAIO][PATCH 5/8] Enable wait bit based filtered wakeups to work for AIO
Enable wait bit based filtered wakeups to work for AIO. Replaces the wait queue entry in the kiocb with a wait bit structure, to allow enough space for the wait bit key. This adds an extra level of indirection in references to the wait queue entry in the iocb. Also, an extra check had to be added in aio_wake_function to allow for other kinds of waiters which do not require wait bit, based on the assumption that the key passed in would be NULL in such cases. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/fs/aio.c| 21 ++--- linux-2.6.20-rc1-root/include/linux/aio.h |7 --- linux-2.6.20-rc1-root/kernel/wait.c | 17 ++--- 3 files changed, 32 insertions(+), 13 deletions(-) diff -puN fs/aio.c~aio-wait-bit fs/aio.c --- linux-2.6.20-rc1/fs/aio.c~aio-wait-bit 2006-12-21 08:45:57.0 +0530 +++ linux-2.6.20-rc1-root/fs/aio.c 2006-12-28 09:32:27.0 +0530 @@ -719,13 +719,13 @@ static ssize_t aio_run_iocb(struct kiocb * cause the iocb to be kicked for continuation (through * the aio_wake_function callback). */ - BUG_ON(current->io_wait != NULL); - current->io_wait = &iocb->ki_wait; + BUG_ON(!is_sync_wait(current->io_wait)); + current->io_wait = &iocb->ki_wait.wait; ret = retry(iocb); current->io_wait = NULL; if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) { - BUG_ON(!list_empty(&iocb->ki_wait.task_list)); + BUG_ON(!list_empty(&iocb->ki_wait.wait.task_list)); aio_complete(iocb, ret, 0); } out: @@ -883,7 +883,7 @@ static void try_queue_kicked_iocb(struct * than retry has happened before we could queue the iocb. This also * means that the retry could have completed and freed our iocb, no * good.
*/ - BUG_ON((!list_empty(&iocb->ki_wait.task_list))); + BUG_ON((!list_empty(&iocb->ki_wait.wait.task_list))); spin_lock_irqsave(&ctx->ctx_lock, flags); /* set this inside the lock so that we can't race with aio_run_iocb() @@ -1519,7 +1519,13 @@ static ssize_t aio_setup_iocb(struct kio static int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key) { - struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait); + struct wait_bit_queue *wait_bit + = container_of(wait, struct wait_bit_queue, wait); + struct kiocb *iocb = container_of(wait_bit, struct kiocb, ki_wait); + + /* Assumes that a non-NULL key implies wait bit filtering */ + if (key && !test_wait_bit_key(wait, key)) + return 0; list_del_init(&wait->task_list); kick_iocb(iocb); @@ -1574,8 +1580,9 @@ int fastcall io_submit_one(struct kioctx req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf; req->ki_left = req->ki_nbytes = iocb->aio_nbytes; req->ki_opcode = iocb->aio_lio_opcode; - init_waitqueue_func_entry(&req->ki_wait, aio_wake_function); - INIT_LIST_HEAD(&req->ki_wait.task_list); + init_waitqueue_func_entry(&req->ki_wait.wait, aio_wake_function); + INIT_LIST_HEAD(&req->ki_wait.wait.task_list); + req->ki_run_list.next = req->ki_run_list.prev = NULL; ret = aio_setup_iocb(req); diff -puN include/linux/aio.h~aio-wait-bit include/linux/aio.h --- linux-2.6.20-rc1/include/linux/aio.h~aio-wait-bit 2006-12-21 08:45:57.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/aio.h 2006-12-28 09:32:27.0 +0530 @@ -102,7 +102,7 @@ struct kiocb { } ki_obj; __u64 ki_user_data; /* user's data for completion */ - wait_queue_t ki_wait; + struct wait_bit_queue ki_wait; loff_t ki_pos; atomic_t ki_bio_count; /* num bio used for this iocb */ @@ -135,7 +135,7 @@ struct kiocb { (x)->ki_dtor = NULL;\ (x)->ki_obj.tsk = tsk; \ (x)->ki_user_data = 0; \ - init_wait((&(x)->ki_wait)); \ + init_wait_bit_task((&(x)->ki_wait), current);\ } while (0) #define AIO_RING_MAGIC 0xa10a10a1 @@ -237,7 +237,8 @@ do { \ } \ } while (0) -#define io_wait_to_kiocb(wait)
container_of(wait, struct kiocb, ki_wait) +#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait, \ + struct wait_bit_queue, wait), struct kiocb, ki_wait) #include diff -puN kernel/wait.c~aio-wait-bit kernel/wait.c --- linux-2.6.20-rc1/kernel/wait.c~aio-wait-bit 2006-12-21 08:45:
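With the kiocb now embedding a wait_bit_queue instead of a bare wait_queue_t, the reworked io_wait_to_kiocb() must hop through two nested container_of() steps: wait_queue_t -> wait_bit_queue -> kiocb. A userspace sketch of that pointer arithmetic using offsetof (cut-down hypothetical struct layouts, illustration only — the real types live in linux/wait.h and linux/aio.h):

```c
#include <assert.h>
#include <stddef.h>

/* container_of() as found in the kernel, expressed with offsetof */
#define container_of_model(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-ins for the real structures */
struct wait_entry { int dummy; };                       /* models wait_queue_t */
struct wait_bit_q { int key; struct wait_entry wait; }; /* models wait_bit_queue */
struct fake_kiocb { long pos; struct wait_bit_q ki_wait; };

/* Model of the new io_wait_to_kiocb(): two hops, innermost member first */
static struct fake_kiocb *io_wait_to_kiocb_model(struct wait_entry *w)
{
    struct wait_bit_q *wb = container_of_model(w, struct wait_bit_q, wait);
    return container_of_model(wb, struct fake_kiocb, ki_wait);
}
```

The double hop is why the patch description calls this "an extra level of indirection": every site that used `&iocb->ki_wait` now needs `&iocb->ki_wait.wait`, and the reverse mapping needs both container_of() steps.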
[FSAIO][PATCH 4/8] Add a default io wait bit field in task struct
Allocates space for the default io wait queue entry (actually a wait bit entry) in the task struct. Doing so simplifies the patches for AIO wait page allowing for cleaner and more efficient implementation, at the cost of 28 additional bytes in task struct vs allocation on demand on-stack. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/sched.h | 11 +++ linux-2.6.20-rc1-root/kernel/fork.c |3 ++- 2 files changed, 9 insertions(+), 5 deletions(-) diff -puN include/linux/sched.h~tsk-default-io-wait include/linux/sched.h --- linux-2.6.20-rc1/include/linux/sched.h~tsk-default-io-wait 2006-12-21 08:45:51.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:32:39.0 +0530 @@ -1006,11 +1006,14 @@ struct task_struct { unsigned long ptrace_message; siginfo_t *last_siginfo; /* For ptrace use. */ + +/* Space for default IO wait bit entry used for synchronous IO waits */ + struct wait_bit_queue __wait; /* - * current io wait handle: wait queue entry to use for io waits - * If this thread is processing aio, this points at the waitqueue - * inside the currently handled kiocb. It may be NULL (i.e. default - * to a stack based synchronous wait) if its doing sync IO. + * Current IO wait handle: wait queue entry to use for IO waits + * If this thread is processing AIO, this points at the waitqueue + * inside the currently handled kiocb. Otherwise, points to the + * default IO wait field (i.e &__wait.wait above). 
*/ wait_queue_t *io_wait; /* i/o counters(bytes read/written, #syscalls */ diff -puN kernel/fork.c~tsk-default-io-wait kernel/fork.c --- linux-2.6.20-rc1/kernel/fork.c~tsk-default-io-wait 2006-12-21 08:45:51.0 +0530 +++ linux-2.6.20-rc1-root/kernel/fork.c 2006-12-21 08:45:51.0 +0530 @@ -1056,7 +1056,8 @@ static struct task_struct *copy_process( do_posix_clock_monotonic_gettime(&p->start_time); p->security = NULL; p->io_context = NULL; - p->io_wait = NULL; + init_wait_bit_task(&p->__wait, p); + p->io_wait = &p->__wait.wait; p->audit_context = NULL; cpuset_fork(p); #ifdef CONFIG_NUMA _ -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[FSAIO][PATCH 3/8] Routines to initialize and test a wait bit key
init_wait_bit_key() initializes the key field in an already allocated wait bit structure, useful for async wait bit support. Also separate out the wait bit test to a common routine which can be used by different kinds of wakeup callbacks. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/wait.h | 24 linux-2.6.20-rc1-root/kernel/wait.c| 11 ++- 2 files changed, 26 insertions(+), 9 deletions(-) diff -puN include/linux/wait.h~init-wait-bit-key include/linux/wait.h --- linux-2.6.20-rc1/include/linux/wait.h~init-wait-bit-key 2006-12-21 08:45:46.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/wait.h 2006-12-21 08:45:46.0 +0530 @@ -108,6 +108,17 @@ static inline int waitqueue_active(wait_ return !list_empty(&q->task_list); } +static inline int test_wait_bit_key(wait_queue_t *wait, + struct wait_bit_key *key) +{ + struct wait_bit_queue *wait_bit + = container_of(wait, struct wait_bit_queue, wait); + + return (wait_bit->key.flags == key->flags && + wait_bit->key.bit_nr == key->bit_nr && + !test_bit(key->bit_nr, key->flags)); +} + /* * Used to distinguish between sync and async io wait context: * sync i/o typically specifies a NULL wait queue entry or a wait @@ -416,6 +427,19 @@ int wake_bit_function(wait_queue_t *wait INIT_LIST_HEAD(&(wait)->task_list); \ } while (0) +#define init_wait_bit_key(waitbit, word, bit) \ + do {\ + (waitbit)->key.flags = word;\ + (waitbit)->key.bit_nr = bit;\ + } while (0) + +#define init_wait_bit_task(waitbit, tsk) \ + do {\ + (waitbit)->wait.private = tsk; \ + (waitbit)->wait.func = wake_bit_function; \ + INIT_LIST_HEAD(&(waitbit)->wait.task_list); \ + } while (0) + /** * wait_on_bit - wait for a bit to be cleared * @word: the word being waited on, a kernel virtual address diff -puN kernel/wait.c~init-wait-bit-key kernel/wait.c --- linux-2.6.20-rc1/kernel/wait.c~init-wait-bit-key 2006-12-21 08:45:46.0 +0530 +++ linux-2.6.20-rc1-root/kernel/wait.c 2006-12-28
09:32:44.0 +0530 @@ -139,16 +139,9 @@ EXPORT_SYMBOL(autoremove_wake_function); int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg) { - struct wait_bit_key *key = arg; - struct wait_bit_queue *wait_bit - = container_of(wait, struct wait_bit_queue, wait); - - if (wait_bit->key.flags != key->flags || - wait_bit->key.bit_nr != key->bit_nr || - test_bit(key->bit_nr, key->flags)) + if (!test_wait_bit_key(wait, arg)) return 0; - else - return autoremove_wake_function(wait, mode, sync, key); + return autoremove_wake_function(wait, mode, sync, arg); } EXPORT_SYMBOL(wake_bit_function); _
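The factored-out test_wait_bit_key() is what lets one hashed wait queue serve many sleepers: a wakeup carries a (flags word, bit number) key, and a waiter's callback proceeds only if that key matches its own and the bit really is clear by now. A compact userspace model of that filter (hypothetical names; the kernel version compares key.flags pointers and calls test_bit on the live word):

```c
#include <assert.h>

/* Model of struct wait_bit_key: which bit of which word is waited on */
struct bit_key {
    unsigned long *word;  /* address of the flags word */
    int bit_nr;           /* which bit within it */
};

/* Model of test_wait_bit_key(): wake only the waiter whose key matches
 * this wakeup, and only if the bit is actually clear at wakeup time. */
static int key_matches(const struct bit_key *waiter, const struct bit_key *wakeup)
{
    return waiter->word == wakeup->word &&
           waiter->bit_nr == wakeup->bit_nr &&
           !((*wakeup->word >> wakeup->bit_nr) & 1UL);
}
```

The third condition is the subtle one: since bit waitqueues are shared (hashed), the bit may have been set again between wake_up_bit() and the callback running, in which case the waiter must stay asleep.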
[FSAIO][PATCH 2/8] Rename __lock_page to lock_page_slow
In order to allow for interruptible and asynchronous versions of lock_page in conjunction with the wait_on_bit changes, we need to define low-level lock page routines which take an additional argument, i.e a wait queue entry and may return non-zero status, e.g -EINTR, -EIOCBRETRY, -EWOULDBLOCK etc. This patch renames __lock_page to lock_page_slow, so that __lock_page and __lock_page_slow can denote the versions which take a wait queue parameter. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/include/linux/pagemap.h |4 ++-- linux-2.6.20-rc1-root/mm/filemap.c|8 2 files changed, 6 insertions(+), 6 deletions(-) diff -puN include/linux/pagemap.h~lock_page_slow include/linux/pagemap.h --- linux-2.6.20-rc1/include/linux/pagemap.h~lock_page_slow 2006-12-21 08:45:40.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/pagemap.h 2006-12-28 09:32:39.0 +0530 @@ -133,7 +133,7 @@ static inline pgoff_t linear_page_index( return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); } -extern void FASTCALL(__lock_page(struct page *page)); +extern void FASTCALL(lock_page_slow(struct page *page)); extern void FASTCALL(__lock_page_nosync(struct page *page)); extern void FASTCALL(unlock_page(struct page *page)); @@ -144,7 +144,7 @@ static inline void lock_page(struct page { might_sleep(); if (TestSetPageLocked(page)) - __lock_page(page); + lock_page_slow(page); } /* diff -puN mm/filemap.c~lock_page_slow mm/filemap.c --- linux-2.6.20-rc1/mm/filemap.c~lock_page_slow2006-12-21 08:45:40.0 +0530 +++ linux-2.6.20-rc1-root/mm/filemap.c 2006-12-28 09:32:39.0 +0530 @@ -556,7 +556,7 @@ void end_page_writeback(struct page *pag EXPORT_SYMBOL(end_page_writeback); /** - * __lock_page - get a lock on the page, assuming we need to sleep to get it + * lock_page_slow - get a lock on the page, assuming we need to sleep to get it * @page: the page to lock * * Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary. 
If some @@ -564,14 +564,14 @@ EXPORT_SYMBOL(end_page_writeback); * chances are that on the second loop, the block layer's plug list is empty, * so sync_page() will then return in state TASK_UNINTERRUPTIBLE. */ -void fastcall __lock_page(struct page *page) +void fastcall lock_page_slow(struct page *page) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, TASK_UNINTERRUPTIBLE); } -EXPORT_SYMBOL(__lock_page); +EXPORT_SYMBOL(lock_page_slow); /* * Variant of lock_page that does not require the caller to hold a reference @@ -647,7 +647,7 @@ repeat: page_cache_get(page); if (TestSetPageLocked(page)) { read_unlock_irq(&mapping->tree_lock); - __lock_page(page); + lock_page_slow(page); read_lock_irq(&mapping->tree_lock); /* Has the page been truncated while we slept? */ _
[FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine
Add a wait queue parameter to the action routine called by __wait_on_bit to allow it to determine whether to block or not. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> --- linux-2.6.20-rc1-root/fs/buffer.c |2 +- linux-2.6.20-rc1-root/fs/inode.c |2 +- linux-2.6.20-rc1-root/fs/nfs/inode.c |2 +- linux-2.6.20-rc1-root/fs/nfs/nfs4proc.c|2 +- linux-2.6.20-rc1-root/fs/nfs/pagelist.c|2 +- linux-2.6.20-rc1-root/include/linux/sunrpc/sched.h |3 ++- linux-2.6.20-rc1-root/include/linux/wait.h | 18 -- linux-2.6.20-rc1-root/include/linux/writeback.h|2 +- linux-2.6.20-rc1-root/kernel/wait.c| 14 -- linux-2.6.20-rc1-root/mm/filemap.c |2 +- linux-2.6.20-rc1-root/net/sunrpc/sched.c |5 +++-- 11 files changed, 32 insertions(+), 22 deletions(-) diff -puN fs/buffer.c~modify-wait-bit-action-args fs/buffer.c --- linux-2.6.20-rc1/fs/buffer.c~modify-wait-bit-action-args2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/fs/buffer.c 2006-12-21 08:45:34.0 +0530 @@ -55,7 +55,7 @@ init_buffer(struct buffer_head *bh, bh_e bh->b_private = private; } -static int sync_buffer(void *word) +static int sync_buffer(void *word, wait_queue_t *wait) { struct block_device *bd; struct buffer_head *bh diff -puN fs/inode.c~modify-wait-bit-action-args fs/inode.c --- linux-2.6.20-rc1/fs/inode.c~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/fs/inode.c2006-12-21 08:45:34.0 +0530 @@ -1279,7 +1279,7 @@ void remove_dquot_ref(struct super_block #endif -int inode_wait(void *word) +int inode_wait(void *word, wait_queue_t *wait) { schedule(); return 0; diff -puN include/linux/wait.h~modify-wait-bit-action-args include/linux/wait.h --- linux-2.6.20-rc1/include/linux/wait.h~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/wait.h 2006-12-28 09:32:57.0 +0530 @@ -145,11 +145,15 @@ void FASTCALL(__wake_up(wait_queue_head_ extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int 
mode)); extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr)); void FASTCALL(__wake_up_bit(wait_queue_head_t *, void *, int)); -int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned)); -int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned)); +int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, + int (*)(void *, wait_queue_t *), unsigned)); +int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, + int (*)(void *, wait_queue_t *), unsigned)); void FASTCALL(wake_up_bit(void *, int)); -int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned)); -int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned)); +int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *, + wait_queue_t *), unsigned)); +int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *, + wait_queue_t *), unsigned)); wait_queue_head_t *FASTCALL(bit_waitqueue(void *, int)); #define wake_up(x) __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, NULL) @@ -427,7 +431,8 @@ int wake_bit_function(wait_queue_t *wait * but has no intention of setting it. */ static inline int wait_on_bit(void *word, int bit, - int (*action)(void *), unsigned mode) + int (*action)(void *, wait_queue_t *), + unsigned mode) { if (!test_bit(bit, word)) return 0; @@ -451,7 +456,8 @@ static inline int wait_on_bit(void *word * clear with the intention of setting it, and when done, clearing it. 
*/ static inline int wait_on_bit_lock(void *word, int bit, - int (*action)(void *), unsigned mode) + int (*action)(void *, wait_queue_t *), + unsigned mode) { if (!test_and_set_bit(bit, word)) return 0; diff -puN include/linux/writeback.h~modify-wait-bit-action-args include/linux/writeback.h --- linux-2.6.20-rc1/include/linux/writeback.h~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530 +++ linux-2.6.20-rc1-root/include/linux/writeback.h 2006-12-21 08:45:34.0 +0530 @@ -66,7 +66,7 @@ struct writeback_control { */ void writeback_inodes(struct writeback_control *wbc); void wake_up_inode(struct inode *inode); -int inode_wait(void *); +int inode_wait(void *, wait_queue_t *); void sync_inodes_sb(struct super_block *, i
[PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
Currently native linux AIO is properly supported (in the sense of actually being asynchronous) only for files opened with O_DIRECT. While this suffices for a major (and most visible) user of AIO, i.e. databases, other types of users like Samba require AIO support for regular file IO. Also, for glibc POSIX AIO to be able to switch to using native AIO instead of the current simulation using threads, it needs/expects asynchronous behaviour for both O_DIRECT and buffered file AIO. This patchset implements changes to make filesystem AIO read and write asynchronous for the non O_DIRECT case. This is mainly relevant in the case of reads of uncached or partially cached files, and O_SYNC writes. Instead of translating regular IO to [AIO + wait], it translates AIO to [regular IO - blocking + retries]. The intent of implementing it this way is to avoid modifying or slowing down normal usage, by keeping it pretty much the way it is without AIO, while avoiding code duplication. Instead we make AIO vs regular IO checks inside io_schedule(), i.e. at the blocking points. The low-level unit of distinction is a wait queue entry, which in the AIO case is contained in an iocb and in the synchronous IO case is associated with the calling task. The core idea is that we complete as much IO as we can in a non-blocking fashion, and then continue the remaining part of the transfer again when woken up asynchronously via a wait queue callback when pages are ready ... thus each iteration progresses through more of the request until it is completed. The interesting part here is that owing largely to the idempotence in the way radix-tree page cache traversal happens, every iteration is simply a smaller read/write. Almost all of the iocb manipulation and advancement in the AIO case happens in the high-level AIO code, rather than in regular VFS/filesystem paths.
The following is a sampling of comparative aio-stress results with the patches (each run starts with uncached files):

aio-stress throughput comparisons (in MB/s):
file size 1GB, record size 64KB, depth 64, ios per iteration 8, max io_submit 8, buffer alignment set to 4KB
4-way Pentium III SMP box, Adaptec AIC-7896/7 Ultra2 SCSI, 40 MB/s
Filesystem: ext2

                          Buffered (non O_DIRECT)      O_DIRECT
                          Vanilla      Patched         Vanilla      Patched
Random-Read               10.08        23.91           18.91        18.98
Random-O_SYNC-Write        8.86        15.84           16.51        16.53
Sequential-Read           31.49        33.00           31.86        31.79
Sequential-O_SYNC-Write    8.68        32.60           31.45        32.44
Random-Write              31.09 (19.65)  30.90 (19.65)
Sequential-Write          30.84 (28.94)  30.09 (28.39)

Regards
Suparna
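The "[regular IO - blocking + retries]" translation described above can be pictured as a driver loop: each retry performs as much of the transfer as is possible without blocking, returns a "retry later" status when it would block, and is re-entered with the iocb advanced past the bytes already done. A self-contained userspace sketch of that state machine (hypothetical names; in the real kernel the re-entry happens from aio wait queue callbacks via kick_iocb(), not a loop):

```c
#include <assert.h>
#include <string.h>

#define EIOCBRETRY 530  /* kernel-internal "retry later" status (2.6 value) */

struct fake_iocb {
    const char *src;   /* data source, stands in for the page cache */
    char *buf;         /* destination user buffer */
    size_t left;       /* bytes still to transfer (models ki_left) */
    size_t done;       /* bytes transferred so far */
};

/* One retry step: copy at most 'chunk' bytes (modelling "as much as is
 * ready without blocking"), report -EIOCBRETRY if more remains. */
static long retry_step(struct fake_iocb *io, size_t chunk)
{
    size_t n = io->left < chunk ? io->left : chunk;
    memcpy(io->buf + io->done, io->src + io->done, n);
    io->done += n;
    io->left -= n;
    return io->left ? -EIOCBRETRY : (long)io->done;
}

/* Driver: in the kernel this re-entry is triggered by wakeups on the
 * iocb's wait queue entry; here we simply loop until completion. */
static long run_to_completion(struct fake_iocb *io, size_t chunk)
{
    long ret;
    while ((ret = retry_step(io, chunk)) == -EIOCBRETRY)
        ;
    return ret;
}
```

Note how "every iteration is simply a smaller read/write": each step restarts from io->done, which is the idempotence property the cover letter attributes to radix-tree page cache traversal.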