Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private
On Wed, 17 Jan 2007, Benjamin LaHaise wrote: On Mon, Jan 15, 2007 at 08:25:15PM -0800, Nate Diller wrote: the right thing to do from a design perspective. Hopefully it enables a new architecture that can reduce context switches in I/O completion, and reduce overhead. That's the real motive ;) And it's a broken motive. Context switches per se are not bad, as they make it possible to properly schedule code in a busy system (which is *very* important when realtime concerns come into play). Have a look at how things were done in the 2.4 aio code to see how completion would get done with a non-retry method, typically in interrupt context. I had code that did direct I/O rather differently by sharing code with the read/write code paths at some point, the catch being that it was pretty invasive, which meant that it never got merged with the changes to handle writeback pressure and other work that happened during 2.5. I'm having some trouble understanding your concern. From my perspective, any unnecessary context switch represents not only performance loss, but extra complexity in the code. In this case, I'm not suggesting that the aio.c code causes problems, quite the opposite. The code I'd like to change is FS and md levels, where context switches happen because of timers, workqueues, and worker threads. For sync I/O, these layers could be doing their completion work in process context, but because waiting on sync I/O is done in layers above, they must resort to other means, even for the common case. The dm-crypt module is the most straightforward example. I took a look at some 2.4.18 aio patches in kernel.org/.../bcrl/aio/, and if I understand what you did, you were basically operating at the aops level rather than f_ops. I actually like that idea, it's nicer than having the direct-io code do its work seperately from the aio code. Part of where I'm going with this patch is a better integration between the block layer (make_request), page layer (aops), and FS layer (f_ops), particularly in the completion paths. The direct-io code is an improvement over the common code on that point, do_readahead() and friends all wait on individual pages to become uptodate. I'd like to bring some improvements from the directIO architecture into use in the common case, which I hope will help performance. I know that might seem somewhat unrelated, but I don't think it is. This change goes hand in hand with using completion handlers in the aops. That will link together the completion callback in the bio with the aio callback, so that the whole stack can finish its work in one context. That said, you can't make kiocb private without completely removing the ability of the rest of the kernel to complete an aio sanely from irq context. You need some form of i/o descriptor, and a kiocb is just that. Adding more layering is just going to make things messier and slower for no real gain. This patchset does not change how or when I/O completion happens, aio_complete() will still get called from direct-io.c, nfs-direct.c, et al. The iocb structure is still passed to aio_complete, just like before. The only difference is that the lower level code doesn't know that it's got an iocb, all it sees is an opaque cookie. It's more like enforcing a layer that's already in place, and I think things got simpler rather than messier. Whether things are slower or not remains to be seen, but I expect no measurable changes either way with this patch. I'm releasing a new version of the patch soon, it will use a new iodesc structure to keep track of iovec state, which simplifies things further. It also will have a new version of the usb gadget code, and some general cleanups. I hope you'll take a look at it. NATE - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops
On 1/15/07, David Brownell [EMAIL PROTECTED] wrote: On Monday 15 January 2007 5:54 pm, Nate Diller wrote: This removes the aio implementation from the usb gadget file system. NAK. I see a deep mis-understanding here. Aside from making very creative (!) use of the aio retry path, it can't be of any use performance-wise Other than the basic win of letting one userspace thread keep an I/O stream active while at the same time processing the data it reads or writes?? That's the async part of AIO. There's a not-so-little thing called I/O overlap ... which is the only way to prevent wasting bandwidth between (non-cacheable) I/O requests, and thus is the only way to let userspace code achieve anything close to the maximum I/O bandwidth the hardware can achieve. We want to see the host side usbfs evolve to support AIO like this too, for the same reasons. (Currently it has fairly ugly AIO code that looks unlike any other AIO code in Linux. Recent updates to support a file-per-endpoint device model are a necessary precursor to switching over to standard AIO syscalls.) because it always kmalloc()s a bounce buffer for the *whole* I/O size. By and large that's a negligible factor compared to being able to achieve I/O overlap. ISTR the reason for not doing fancy DMA magic was that the cost of this style AIO was under 1 KByte object code on ARM, which was easy to justify ... while DMA magic to do that sort of stuff would be much fatter, as well as more error prone. (And that's why the creative use of the retry path. As I've observed before, retry is a misnomer in the general sense of an async I/O framework. It's more of a semi-completion callback; I/O can't in general be retried on error or fault, and even in the current usage it's not really a retry.) Now that high speed peripheral hardware is becoming more common on embedded Linuxes -- TI has DaVinci, OMAP 2430, TUSB6010 (as found in the new Nokia 800 tablets); Atmel AVR32 AP7000; at least a couple parts that should be able to use the same musb_hdrc driver as those TI parts; and a few other chips I've heard of -- there may be some virtue in eliminating the memcpy, since those CPUs don't have many MIPS to waste. (Iff the memcpy turns out to be a real issue...) Perhaps the only reason to keep it around is the ability to cancel I/O requests, which only applies when using the user space async I/O interface. It's good to have almost the complete kernel API functionality exposed to userspace, and having I/O cancelation is an inevitable consequence of a complete AIO framework ... but that particular issue was not a driving concern. The reason for AIO is to have a *STANDARD* userspace interface for *ASYNC I/O* which otherwise can't exist. You know, the kind of I/O interface that can't be implemented with read() and write() syscalls, which for non-buffered I/O necessarily preclude all I/O overlap. AIO itself is a direct match to most I/O frameworks' primitives. (AIOCB being directly analagous to peripheral side struct usb_request and host side struct urb.) You know, I've always thought that one reason the AIO discussions seemed strange is that they weren't really focussed on I/O (the lowlevel after-the-caches stuff) so much as filesystems (several layers up in the stack, with intervening caching frameworks). The first several implementations of AIO that I saw were restricted to real I/O and not applicable to disk backed files. So while I was glad the Linux approach didn't make that mistake, it's seemed that it might be wanting to make a converse mistake: neglecting I/O that isn't aimed at data stored on disks. I highly doubt that is enough incentive to justify the extra complexity here or in user-space, so I think it's a safe bet to remove this. If that feature still desired, it would be possible to implement a sync interface that does an interruptible sleep. What's needed is an async, non-sleeeping, interface ... with I/O overlap. That's antithetical to using read()/write() calls, so your proposed approach couldn't possibly work. haha, wow ok you convinced me :) I got a bit impatient when I was working on this, it took some time just to figure out the intention of the code, and I'm trying to hold to a bit of a schedule here. Without any clear (to me) reason, I didn't want to spend a lot of effort fixing this up. There's really no big difference between the usb drivers here and the disk I/O scheduler queue, AFAICT, so it seems like the solution I want is to do a kmap() on the user buffer and then do the I/O straight out of that. That will eliminate the need for the bounce buffer. I'll post a new version along with the iodesc changes later this week. NATE - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t
On 1/15/07, David Brownell [EMAIL PROTECTED] wrote: On Monday 15 January 2007 5:54 pm, Nate Diller wrote: --- a/drivers/usb/gadget/inode.c 2007-01-12 14:42:29.0 -0800 +++ b/drivers/usb/gadget/inode.c 2007-01-12 14:25:34.0 -0800 @@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i return value; } -static ssize_t ep_aio_read_retry(struct kiocb *iocb) +static int ep_aio_read_retry(struct kiocb *iocb) { struct kiocb_priv *priv = iocb-private; - ssize_t len, total; - int i; + ssize_t total; + int i, err = 0; /* we retry to get the right mm context for this: */ /* copy stuff into user buffers */ total = priv-actual; - len = 0; for (i=0; i priv-nr_segs; i++) { ssize_t this = min((ssize_t)(priv-iv[i].iov_len), total); if (copy_to_user(priv-iv[i].iov_base, priv-buf, this)) { - if (len == 0) - len = -EFAULT; + err = -EFAULT; Discarding the capability to report partial success, e.g. that the first N bytes were properly transferred? I don't see any virtue in that change. Quite the opposite in fact. I think you're also expecting that if N bytes were requested, that's always how many will be received. That's not true for packetized I/O such as USB isochronous transfers ... where it's quite legit (and in some cases routine) for the other end to send packets that are shorter than the maximum allowed. Sending a zero length packet is not the same as sending no packet at all, for another example. I will convert this (usb) code to use the standard completion path, which you will notice *gained* the ability to properly report both an error and a partial success as part of this patch. In fact, fixing this up was my intention when I wrote this patch, and the later patch was a compromise intended to get this whole bundle out for review in a timely manner :) NATE - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -mm 0/10][RFC] aio: make struct kiocb private
This series is an attempt to generalize the async I/O paths to be implementation agnostic. It completely eliminates knowledge of the kiocb structure in the generic code and makes it private within the current aio code. Things get noticeably cleaner without that layering violation. The new interface takes a file_endio_t function pointer, and a private data pointer, which would normally be aio_complete and a kiocb pointer, respectively. If the aio submission function gets back EIOCBQUEUED, that is a guarantee that the endio function will be called, or *already has been called*. If the file_endio_t pointer provided to aio_[read|write] is NULL, the FS must block on I/O completion, then return either the number of bytes read, or an error. I had to touch more areas that I had originally expected, so there are changes in a corner of the socket code, and a slight behavior change in the direct-io completion path with affects XFS and OCFS2. I would appreciate further review there, so I copied some extra people I hope can help. This patch is against 2.6.20-rc4-mm1. It has been compile-tested at each stage. It needs some runtime testing yet, but I prefer to get it out for commentary and test later. These patches are for RFC only and have not yet been signed off. NATE --- Documentation/filesystems/Locking | 11 + Documentation/filesystems/vfs.txt | 11 + arch/s390/hypfs/inode.c | 16 +- drivers/net/pppoe.c |8 - drivers/net/tun.c | 13 +- drivers/usb/gadget/inode.c| 239 +- fs/aio.c | 74 ++- fs/bad_inode.c| 10 - fs/block_dev.c| 109 +++-- fs/cifs/cifsfs.c | 10 - fs/compat.c | 56 fs/direct-io.c| 92 -- fs/ecryptfs/file.c| 16 +- fs/ext2/inode.c | 12 - fs/ext3/file.c|9 - fs/ext3/inode.c | 11 - fs/ext4/file.c|9 - fs/ext4/inode.c | 11 - fs/fat/inode.c| 12 - fs/fuse/dev.c | 13 +- fs/gfs2/ops_address.c | 14 +- fs/hfs/inode.c| 13 -- fs/hfsplus/inode.c| 13 -- fs/jfs/inode.c| 12 - fs/nfs/direct.c | 92 +++--- fs/nfs/file.c | 62 + fs/ntfs/file.c| 71 ++- fs/ocfs2/aops.c | 24 +-- fs/ocfs2/aops.h |8 - fs/ocfs2/file.c | 44 +++--- fs/ocfs2/inode.h |2 fs/pipe.c | 12 - fs/read_write.c | 225 --- fs/read_write.h |8 - fs/reiserfs/inode.c | 13 -- fs/smbfs/file.c | 28 ++-- fs/udf/file.c | 13 +- fs/xfs/linux-2.6/xfs_aops.c | 44 +++--- fs/xfs/linux-2.6/xfs_file.c | 58 + fs/xfs/linux-2.6/xfs_lrw.c| 29 ++-- fs/xfs/linux-2.6/xfs_lrw.h| 10 - fs/xfs/linux-2.6/xfs_vnode.h | 20 +-- include/linux/aio.h | 11 - include/linux/fs.h| 114 +- include/linux/net.h | 18 +- include/linux/nfs_fs.h| 12 - include/net/bluetooth/bluetooth.h |2 include/net/inet_common.h |3 include/net/scm.h |2 include/net/sock.h| 45 +-- include/net/tcp.h |6 include/net/udp.h |3 mm/filemap.c | 109 - net/appletalk/ddp.c |5 net/atm/common.c |6 net/atm/common.h |7 - net/ax25/af_ax25.c|7 - net/bluetooth/af_bluetooth.c |4 net/bluetooth/hci_sock.c |7 - net/bluetooth/l2cap.c |2 net/bluetooth/rfcomm/sock.c |8 - net/bluetooth/sco.c |3 net/core/sock.c | 12 - net/dccp/dccp.h |8 - net/dccp/probe.c |3 net/dccp/proto.c |7 - net/decnet/af_decnet.c|7 - net/econet/af_econet.c|7 - net/ipv4/af_inet.c|5 net/ipv4/raw.c|8 - net/ipv4/tcp.c|7 - net/ipv4/tcp_probe.c |3 net/ipv4/udp.c|9 - net/ipv4/udp_impl.h |2 net/ipv6/raw.c|6 net/ipv6/udp.c| 10 - net/ipv6/udp_impl.h |6 net/ipx/af_ipx.c |7 - net/irda/af_irda.c| 29 ++-- net/key/af_key.c
[PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t
Define a new function typedef for I/O completion at the file/iovec level -- typedef void (file_endio_t)(void *endio_data, ssize_t count, int err); and convert aio_complete and all its callers to this new prototype. --- drivers/usb/gadget/inode.c | 24 +++--- fs/aio.c | 59 - fs/block_dev.c |8 +- fs/direct-io.c | 18 + fs/nfs/direct.c|9 ++ include/linux/aio.h| 11 +++- include/linux/fs.h |2 + 7 files changed, 61 insertions(+), 70 deletions(-) --- diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c --- a/drivers/usb/gadget/inode.c2007-01-12 14:42:29.0 -0800 +++ b/drivers/usb/gadget/inode.c2007-01-12 14:25:34.0 -0800 @@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i return value; } -static ssize_t ep_aio_read_retry(struct kiocb *iocb) +static int ep_aio_read_retry(struct kiocb *iocb) { struct kiocb_priv *priv = iocb-private; - ssize_t len, total; - int i; + ssize_t total; + int i, err = 0; /* we retry to get the right mm context for this: */ /* copy stuff into user buffers */ total = priv-actual; - len = 0; for (i=0; i priv-nr_segs; i++) { ssize_t this = min((ssize_t)(priv-iv[i].iov_len), total); if (copy_to_user(priv-iv[i].iov_base, priv-buf, this)) { - if (len == 0) - len = -EFAULT; + err = -EFAULT; break; } total -= this; - len += this; if (total == 0) break; } kfree(priv-buf); kfree(priv); aio_put_req(iocb); - return len; + return err; } static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req) @@ -610,9 +607,7 @@ static void ep_aio_complete(struct usb_e if (unlikely(kiocbIsCancelled(iocb))) aio_put_req(iocb); else - aio_complete(iocb, - req-actual ? req-actual : req-status, - req-status); + aio_complete(iocb, req-actual, req-status); } else { /* retry() won't report both; so we hide some faults */ if (unlikely(0 != req-status)) @@ -702,16 +697,17 @@ ep_aio_read(struct kiocb *iocb, const st { struct ep_data *epdata = iocb-ki_filp-private_data; char*buf; + size_t len = iov_length(iov, nr_segs); if (unlikely(epdata-desc.bEndpointAddress USB_DIR_IN)) return -EINVAL; - buf = kmalloc(iocb-ki_left, GFP_KERNEL); + buf = kmalloc(len, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; iocb-ki_retry = ep_aio_read_retry; - return ep_aio_rwtail(iocb, buf, iocb-ki_left, epdata, iov, nr_segs); + return ep_aio_rwtail(iocb, buf, len, epdata, iov, nr_segs); } static ssize_t @@ -726,7 +722,7 @@ ep_aio_write(struct kiocb *iocb, const s if (unlikely(!(epdata-desc.bEndpointAddress USB_DIR_IN))) return -EINVAL; - buf = kmalloc(iocb-ki_left, GFP_KERNEL); + buf = kmalloc(iov_length(iov, nr_segs), GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; diff -urpN -X dontdiff a/fs/aio.c b/fs/aio.c --- a/fs/aio.c 2007-01-12 14:42:29.0 -0800 +++ b/fs/aio.c 2007-01-12 14:29:20.0 -0800 @@ -658,16 +658,16 @@ static inline int __queue_kicked_iocb(st * simplifies the coding of individual aio operations as * it avoids various potential races. */ -static ssize_t aio_run_iocb(struct kiocb *iocb) +static void aio_run_iocb(struct kiocb *iocb) { struct kioctx *ctx = iocb-ki_ctx; - ssize_t (*retry)(struct kiocb *); + int (*retry)(struct kiocb *); wait_queue_t *io_wait = current-io_wait; - ssize_t ret; + int err; if (!(retry = iocb-ki_retry)) { printk(aio_run_iocb: iocb-ki_retry = NULL\n); - return 0; + return; } /* @@ -702,8 +702,8 @@ static ssize_t aio_run_iocb(struct kiocb /* Quit retrying if the i/o has been cancelled */ if (kiocbIsCancelled(iocb)) { - ret = -EINTR; - aio_complete(iocb, ret, 0); + err = -EINTR; + aio_complete(iocb, iocb-ki_nbytes - iocb-ki_left, err); /* must not access the iocb after this */ goto out; } @@ -720,17 +720,17 @@ static ssize_t aio_run_iocb(struct kiocb */
[PATCH -mm 5/10][RFC] aio: make blk_directIO use file_endio_t
Convert the internals of blkdev_direct_IO to use a generic endio function, instead of directly calling aio_complete. This may also fix some bugs/races in this code, for instance it checks bio-bi_size instead of assuming it's zero, and it atomically accumulates the bytes_done counter (assuming that the bio completion handler can't race with itself *might* be valid here, but the direct-io code makes no such assumption). I'm also pretty sure that the address_space-directIO functions aren't supposed to mess with the iocb-ki_pos or -ki_left. --- diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c --- a/fs/block_dev.c2007-01-12 20:26:25.0 -0800 +++ b/fs/block_dev.c2007-01-12 20:23:55.0 -0800 @@ -131,10 +131,32 @@ blkdev_get_block(struct inode *inode, se return 0; } -static int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error) +struct bdev_aio { + atomic_tiocount;/* refcount */ + atomic_tbytes_done; /* byte counter */ + int err;/* error handling */ + file_endio_t*endio; /* end I/O notify fn */ + void*endio_data;/* notify fn private data */ +}; + +static void blk_io_put(struct bdev_aio *io) +{ + if (!atomic_dec_and_test(io-iocount)) + return; + + if (!io-endio) + return complete((struct completion*)io-endio_data); + + io-endio(io-endio_data, atomic_read(io-bytes_done), io-err); + kfree(io); +} + +static int blk_bio_endio(struct bio *bio, unsigned int bytes_done, int error) { - struct kiocb *iocb = bio-bi_private; - atomic_t *bio_count = iocb-ki_bio_count; + struct bdev_aio *io = bio-bi_private; + + if (bio-bi_size) + return 1; if (bio_data_dir(bio) == READ) bio_check_pages_dirty(bio); @@ -143,16 +165,21 @@ static int blk_end_aio(struct bio *bio, bio_put(bio); } - /* iocb-ki_nbytes stores error code from LLDD */ - if (error) - iocb-ki_nbytes = -EIO; - - if (atomic_dec_and_test(bio_count)) - aio_complete(iocb, iocb-ki_left, iocb-ki_nbytes); + if (error) + io-err = error; + atomic_add(bytes_done, io-bytes_done); + blk_io_put(io); return 0; } +static void blk_io_init(struct bdev_aio *io) +{ + atomic_set(io-iocount, 1); + atomic_set(io-bytes_done, 0); + io-err = 0; +} + #define VEC_SIZE 16 struct pvec { unsigned short nr; @@ -208,24 +235,33 @@ blkdev_direct_IO(int rw, struct kiocb *i unsigned long addr; /* user iovec address */ size_t count; /* user iovec len */ - size_t nbytes = iocb-ki_nbytes = iocb-ki_left; /* total xfer size */ + size_t nbytes; /* total xfer size */ loff_t size;/* size of block device */ struct bio *bio; - atomic_t *bio_count = iocb-ki_bio_count; + struct bdev_aio stack_io, *io; + file_endio_t *endio = aio_complete; + void *endio_data = iocb; struct page *page; struct pvec pvec; pvec.nr = 0; pvec.idx = 0; + io = stack_io; + if (endio) { + io = kmalloc(sizeof(struct bdev_aio), GFP_KERNEL); + if (!io) + return -ENOMEM; + } + blk_io_init(io); + if (pos blocksize_mask) return -EINVAL; + nbytes = iov_length(iov, nr_segs); size = i_size_read(inode); - if (pos + nbytes size) { + if (pos + nbytes size) nbytes = size - pos; - iocb-ki_left = nbytes; - } /* * check first non-zero iov alignment, the remaining @@ -237,7 +273,6 @@ blkdev_direct_IO(int rw, struct kiocb *i if (addr blocksize_mask || count blocksize_mask) return -EINVAL; } while (!count ++seg nr_segs); - atomic_set(bio_count, 1); while (nbytes) { /* roughly estimate number of bio vec needed */ @@ -248,8 +283,8 @@ blkdev_direct_IO(int rw, struct kiocb *i /* bio_alloc should not fail with GFP_KERNEL flag */ bio = bio_alloc(GFP_KERNEL, nvec); bio-bi_bdev = I_BDEV(inode); - bio-bi_end_io = blk_end_aio; - bio-bi_private = iocb; + bio-bi_end_io = blk_bio_endio; + bio-bi_private = io; bio-bi_sector = pos blkbits; same_bio: cur_off = addr ~PAGE_MASK; @@ -289,18 +324,27 @@ same_bio: /* bio is ready, submit it */ if (rw == READ) bio_set_pages_dirty(bio); - atomic_inc(bio_count); + atomic_inc(io-iocount); submit_bio(rw, bio); } completion: -
[PATCH -mm 6/10][RFC] aio: make nfs_directIO use file_endio_t
This converts the iternals of nfs's directIO support to use a generic endio function, instead of directly calling aio_complete. It's pretty easy because it already has a pretty abstracted completion path. --- diff -urpN -X dontdiff a/fs/nfs/direct.c b/fs/nfs/direct.c --- a/fs/nfs/direct.c 2007-01-12 14:53:48.0 -0800 +++ b/fs/nfs/direct.c 2007-01-12 15:02:30.0 -0800 @@ -68,7 +68,6 @@ struct nfs_direct_req { /* I/O parameters */ struct nfs_open_context *ctx; /* file open context info */ - struct kiocb * iocb; /* controlling i/o request */ struct inode * inode; /* target file of i/o */ /* completion state */ @@ -77,6 +76,8 @@ struct nfs_direct_req { ssize_t count, /* bytes actually processed */ error; /* any reported error */ struct completion completion; /* wait for i/o completion */ + file_endio_t*endio; /* async completion function */ + void*endio_data;/* private completion data */ /* commit state */ struct list_headrewrite_list; /* saved nfs_write_data structs */ @@ -151,7 +152,7 @@ static inline struct nfs_direct_req *nfs kref_get(dreq-kref); init_completion(dreq-completion); INIT_LIST_HEAD(dreq-rewrite_list); - dreq-iocb = NULL; + dreq-endio = NULL; dreq-ctx = NULL; spin_lock_init(dreq-lock); atomic_set(dreq-io_count, 0); @@ -179,7 +180,7 @@ static ssize_t nfs_direct_wait(struct nf ssize_t result = -EIOCBQUEUED; /* Async requests don't wait here */ - if (dreq-iocb) + if (!dreq-endio) goto out; result = wait_for_completion_interruptible(dreq-completion); @@ -194,14 +195,10 @@ out: return (ssize_t) result; } -/* - * Synchronous I/O uses a stack-allocated iocb. Thus we can't trust - * the iocb is still valid here if this is a synchronous request. - */ static void nfs_direct_complete(struct nfs_direct_req *dreq) { - if (dreq-iocb) - aio_complete(dreq-iocb, dreq-count, dreq-error); + if (dreq-endio) + dreq-endio(dreq-endio_data, dreq-count, dreq-error); complete_all(dreq-completion); @@ -332,11 +329,13 @@ static ssize_t nfs_direct_read_schedule( return result 0 ? (ssize_t) result : -EFAULT; } -static ssize_t nfs_direct_read(struct kiocb *iocb, unsigned long user_addr, size_t count, loff_t pos) +static ssize_t nfs_direct_read(struct file *file, unsigned long user_addr, + size_t count, loff_t pos, + file_endio_t *endio, void *endio_data) { ssize_t result = 0; sigset_t oldset; - struct inode *inode = iocb-ki_filp-f_mapping-host; + struct inode *inode = file-f_mapping-host; struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; @@ -345,9 +344,9 @@ static ssize_t nfs_direct_read(struct ki return -ENOMEM; dreq-inode = inode; - dreq-ctx = get_nfs_open_context((struct nfs_open_context *)iocb-ki_filp-private_data); - if (!is_sync_kiocb(iocb)) - dreq-iocb = iocb; + dreq-ctx = get_nfs_open_context((struct nfs_open_context *)file-private_data); + dreq-endio = endio; + dreq-endio_data = endio_data; nfs_add_stats(inode, NFSIOS_DIRECTREADBYTES, count); rpc_clnt_sigmask(clnt, oldset); @@ -663,11 +662,13 @@ static ssize_t nfs_direct_write_schedule return result 0 ? (ssize_t) result : -EFAULT; } -static ssize_t nfs_direct_write(struct kiocb *iocb, unsigned long user_addr, size_t count, loff_t pos) +static ssize_t nfs_direct_write(struct file *file, unsigned long user_addr, + size_t count, loff_t pos, + file_endio_t *endio, void *endio_data) { ssize_t result = 0; sigset_t oldset; - struct inode *inode = iocb-ki_filp-f_mapping-host; + struct inode *inode = file-f_mapping-host; struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; size_t wsize = NFS_SERVER(inode)-wsize; @@ -682,9 +683,9 @@ static ssize_t nfs_direct_write(struct k sync = FLUSH_STABLE; dreq-inode = inode; - dreq-ctx = get_nfs_open_context((struct nfs_open_context *)iocb-ki_filp-private_data); - if (!is_sync_kiocb(iocb)) - dreq-iocb = iocb; + dreq-ctx = get_nfs_open_context((struct nfs_open_context *)file-private_data); + dreq-endio = endio; + dreq-endio_data = endio_data; nfs_add_stats(inode, NFSIOS_DIRECTWRITTENBYTES, count); @@ -701,10 +702,12 @@ static ssize_t nfs_direct_write(struct k /** * nfs_file_direct_read - file direct read operation for NFS
[PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left
Convert code using iocb-ki_left to use the more generic iov_length() call. --- diff -urpN -X dontdiff a/fs/ocfs2/file.c b/fs/ocfs2/file.c --- a/fs/ocfs2/file.c 2007-01-10 11:50:26.0 -0800 +++ b/fs/ocfs2/file.c 2007-01-10 12:42:09.0 -0800 @@ -1157,7 +1157,7 @@ static ssize_t ocfs2_file_aio_write(stru filp-f_path.dentry-d_name.name); /* happy write of zero bytes */ - if (iocb-ki_left == 0) + if (iov_length(iov, nr_segs) == 0) return 0; mutex_lock(inode-i_mutex); @@ -1177,7 +1177,7 @@ static ssize_t ocfs2_file_aio_write(stru } ret = ocfs2_prepare_inode_for_write(filp-f_path.dentry, iocb-ki_pos, - iocb-ki_left, appending); + iov_length(iov, nr_segs), appending); if (ret 0) { mlog_errno(ret); goto out; diff -urpN -X dontdiff a/fs/smbfs/file.c b/fs/smbfs/file.c --- a/fs/smbfs/file.c 2007-01-10 11:50:28.0 -0800 +++ b/fs/smbfs/file.c 2007-01-10 12:42:09.0 -0800 @@ -222,7 +222,7 @@ smb_file_aio_read(struct kiocb *iocb, co ssize_t status; VERBOSE(file %s/%s, [EMAIL PROTECTED], DENTRY_PATH(dentry), - (unsigned long) iocb-ki_left, (unsigned long) pos); + (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos); status = smb_revalidate_inode(dentry); if (status) { @@ -328,7 +328,7 @@ smb_file_aio_write(struct kiocb *iocb, c VERBOSE(file %s/%s, [EMAIL PROTECTED], DENTRY_PATH(dentry), - (unsigned long) iocb-ki_left, (unsigned long) pos); + (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos); result = smb_revalidate_inode(dentry); if (result) { @@ -341,7 +341,7 @@ smb_file_aio_write(struct kiocb *iocb, c if (result) goto out; - if (iocb-ki_left 0) { + if (iov_length(iov, nr_segs) 0) { result = generic_file_aio_write(iocb, iov, nr_segs, pos); VERBOSE(pos=%ld, size=%ld, mtime=%ld, atime=%ld\n, (long) file-f_pos, (long) dentry-d_inode-i_size, diff -urpN -X dontdiff a/fs/udf/file.c b/fs/udf/file.c --- a/fs/udf/file.c 2007-01-10 11:53:02.0 -0800 +++ b/fs/udf/file.c 2007-01-10 12:42:09.0 -0800 @@ -109,7 +109,7 @@ static ssize_t udf_file_aio_write(struct struct file *file = iocb-ki_filp; struct inode *inode = file-f_path.dentry-d_inode; int err, pos; - size_t count = iocb-ki_left; + size_t count = iov_length(iov, nr_segs); if (UDF_I_ALLOCTYPE(inode) == ICBTAG_FLAG_AD_IN_ICB) { diff -urpN -X dontdiff a/net/socket.c b/net/socket.c --- a/net/socket.c 2007-01-10 12:40:54.0 -0800 +++ b/net/socket.c 2007-01-10 12:42:09.0 -0800 @@ -632,7 +632,7 @@ static ssize_t sock_aio_read(struct kioc if (pos != 0) return -ESPIPE; - if (iocb-ki_left == 0) /* Match SYS5 behaviour */ + if (iov_length(iov, nr_segs) == 0) /* Match SYS5 behaviour */ return 0; for (i = 0; i nr_segs; i++) @@ -660,7 +660,7 @@ static ssize_t sock_aio_write(struct kio if (pos != 0) return -ESPIPE; - if (iocb-ki_left == 0) /* Match SYS5 behaviour */ + if (iov_length(iov, nr_segs) == 0) /* Match SYS5 behaviour */ return 0; for (i = 0; i nr_segs; i++) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops
This removes the aio implementation from the usb gadget file system. Aside from making very creative (!) use of the aio retry path, it can't be of any use performance-wise because it always kmalloc()s a bounce buffer for the *whole* I/O size. Perhaps the only reason to keep it around is the ability to cancel I/O requests, which only applies when using the user space async I/O interface. I highly doubt that is enough incentive to justify the extra complexity here or in user-space, so I think it's a safe bet to remove this. If that feature still desired, it would be possible to implement a sync interface that does an interruptible sleep. I can be convinced otherwise, but the alternatives are difficult. See for example the fuse, get_user_pages, flush_anon_page, aliasing caches and all that again LKML thread recently for why it's waaay easier to kmalloc a bounce buffer here, and (ab)use the retry interface. --- diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c --- a/drivers/usb/gadget/inode.c2007-01-10 13:23:46.0 -0800 +++ b/drivers/usb/gadget/inode.c2007-01-10 16:56:09.0 -0800 @@ -527,218 +527,6 @@ static int ep_ioctl (struct inode *inode /*--*/ -/* ASYNCHRONOUS ENDPOINT I/O OPERATIONS (bulk/intr/iso) */ - -struct kiocb_priv { - struct usb_request *req; - struct ep_data *epdata; - void*buf; - const struct iovec *iv; - unsigned long nr_segs; - unsignedactual; -}; - -static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e) -{ - struct kiocb_priv *priv = iocb-private; - struct ep_data *epdata; - int value; - - local_irq_disable(); - epdata = priv-epdata; - // spin_lock(epdata-dev-lock); - kiocbSetCancelled(iocb); - if (likely(epdata epdata-ep priv-req)) - value = usb_ep_dequeue (epdata-ep, priv-req); - else - value = -EINVAL; - // spin_unlock(epdata-dev-lock); - local_irq_enable(); - - aio_put_req(iocb); - return value; -} - -static int ep_aio_read_retry(struct kiocb *iocb) -{ - struct kiocb_priv *priv = iocb-private; - ssize_t total; - int i, err = 0; - - /* we retry to get the right mm context for this: */ - - /* copy stuff into user buffers */ - total = priv-actual; - for (i=0; i priv-nr_segs; i++) { - ssize_t this = min((ssize_t)(priv-iv[i].iov_len), total); - - if (copy_to_user(priv-iv[i].iov_base, priv-buf, this)) { - err = -EFAULT; - break; - } - - total -= this; - if (total == 0) - break; - } - kfree(priv-buf); - kfree(priv); - aio_put_req(iocb); - return err; -} - -static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req) -{ - struct kiocb*iocb = req-context; - struct kiocb_priv *priv = iocb-private; - struct ep_data *epdata = priv-epdata; - - /* lock against disconnect (and ideally, cancel) */ - spin_lock(epdata-dev-lock); - priv-req = NULL; - priv-epdata = NULL; - if (priv-iv == NULL - || unlikely(req-actual == 0) - || unlikely(kiocbIsCancelled(iocb))) { - kfree(req-buf); - kfree(priv); - iocb-private = NULL; - /* aio_complete() reports bytes-transferred _and_ faults */ - if (unlikely(kiocbIsCancelled(iocb))) - aio_put_req(iocb); - else - aio_complete(iocb, req-actual, req-status); - } else { - /* retry() won't report both; so we hide some faults */ - if (unlikely(0 != req-status)) - DBG(epdata-dev, %s fault %d len %d\n, - ep-name, req-status, req-actual); - - priv-buf = req-buf; - priv-actual = req-actual; - kick_iocb(iocb); - } - spin_unlock(epdata-dev-lock); - - usb_ep_free_request(ep, req); - put_ep(epdata); -} - -static ssize_t -ep_aio_rwtail( - struct kiocb*iocb, - char*buf, - size_t len, - struct ep_data *epdata, - const struct iovec *iv, - unsigned long nr_segs -) -{ - struct kiocb_priv *priv; - struct usb_request *req; - ssize_t value; - - priv = kmalloc(sizeof *priv, GFP_KERNEL); - if (!priv) { - value = -ENOMEM; -fail: - kfree(buf); - return value; - } - iocb-private = priv; -
[PATCH -mm 7/10][RFC] aio: make __blockdev_direct_IO use file_endio_t
This converts the internals of __blockdev_direct_IO in fs/direct-io.c to use a generic endio function, instead of directly calling aio_complete. It also changes the semantics of dio_iodone to be more friendly to its only users, xfs and ocfs2. This allows the caller to know how to release locks and tear down data structures on error. It also converts the _own_locking and _no_locking variants of blockdev_direct_IO to use a generic endio function. --- fs/direct-io.c | 74 ++-- fs/gfs2/ops_address.c |6 +-- fs/ocfs2/aops.c | 15 ++-- fs/ocfs2/aops.h |8 fs/ocfs2/file.c | 18 -- fs/ocfs2/inode.h|2 - fs/xfs/linux-2.6/xfs_aops.c | 33 +++ include/linux/fs.h | 57 ++--- 8 files changed, 104 insertions(+), 109 deletions(-) --- diff -urpN -X dontdiff a/fs/direct-io.c b/fs/direct-io.c --- a/fs/direct-io.c2007-01-12 14:53:48.0 -0800 +++ b/fs/direct-io.c2007-01-12 15:06:44.0 -0800 @@ -67,7 +67,7 @@ struct dio { struct bio *bio;/* bio under assembly */ struct inode *inode; int rw; - loff_t i_size; /* i_size when submitted */ + unsigned max_to_read; /* (i_size when submitted) - offset */ int lock_type; /* doesn't change */ unsigned blkbits; /* doesn't change */ unsigned blkfactor; /* When we're using an alignment which @@ -89,6 +89,7 @@ struct dio { int reap_counter; /* rate limit reaping */ get_block_t *get_block; /* block mapping function */ dio_iodone_t *end_io; /* IO completion function */ + void *destructor_data; /* private data for completion fn */ sector_t final_block_in_bio;/* current final block in bio + 1 */ sector_t next_block_for_io; /* next block to be put under IO, in dio_blocks units */ @@ -127,7 +128,8 @@ struct dio { struct task_struct *waiter; /* waiting task (NULL if none) */ /* AIO related stuff */ - struct kiocb *iocb; /* kiocb */ + file_endio_t *file_endio; /* aio completion function */ + void *endio_data; /* private data for aio completion */ int is_async; /* is IO async ? */ int io_error; /* IO error in completion path */ ssize_t result; /* IO result */ @@ -222,7 +224,7 @@ static struct page *dio_get_page(struct * filesystems can use it to hold additional state between get_block calls and * dio_complete. */ -static int dio_complete(struct dio *dio, loff_t offset, int ret) +static int dio_complete(struct dio *dio, int ret) { /* * AIO submission can race with bio completion to get here while @@ -232,25 +234,21 @@ static int dio_complete(struct dio *dio, */ if (ret == -EIOCBQUEUED) ret = 0; + if (ret == 0) + ret = dio-page_errors; + if (ret == 0) + ret = dio-io_error; if (dio-result) { /* Check for short read case */ - if ((dio-rw == READ) ((offset + dio-result) dio-i_size)) - dio-result = dio-i_size - offset; + if ((dio-rw == READ) (dio-result dio-max_to_read)) + dio-result = dio-max_to_read; } - if (dio-end_io dio-result) - dio-end_io(dio-iocb, offset, dio-result, - dio-map_bh.b_private); if (dio-lock_type == DIO_LOCKING) /* lockdep: non-owner release */ up_read_non_owner(dio-inode-i_alloc_sem); - if (ret == 0) - ret = dio-page_errors; - if (ret == 0) - ret = dio-io_error; - return ret; } @@ -277,8 +275,11 @@ static int dio_bio_end_aio(struct bio *b spin_unlock_irqrestore(dio-bio_lock, flags); if (remaining == 0) { - int err = dio_complete(dio, dio-iocb-ki_pos, 0); - aio_complete(dio-iocb, dio-result, err); + int err = dio_complete(dio, 0); + if (dio-end_io) + dio-end_io(dio-destructor_data, dio-result, + dio-map_bh.b_private); + dio-file_endio(dio-endio_data, dio-result, err); kfree(dio); } @@ -944,10 +945,11 @@ out: * Releases both i_mutex and i_alloc_sem */ static ssize_t -direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, +direct_io_worker(int rw, struct file *file, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, unsigned blkbits, get_block_t get_block,
[PATCH -mm 8/10][RFC] aio: make direct_IO aops use file_endio_t
This converts the _locking variant of blockdev_direct_IO to use a generic endio function, and updates all the FS callsites. --- Documentation/filesystems/Locking |5 +++-- Documentation/filesystems/vfs.txt |5 +++-- fs/block_dev.c|9 - fs/ext2/inode.c | 12 +--- fs/ext3/inode.c | 11 +-- fs/ext4/inode.c | 11 +-- fs/fat/inode.c| 12 ++-- fs/gfs2/ops_address.c |8 fs/hfs/inode.c| 13 ++--- fs/hfsplus/inode.c| 13 ++--- fs/jfs/inode.c| 12 +--- fs/nfs/direct.c |8 +--- fs/ocfs2/aops.c |9 + fs/reiserfs/inode.c | 13 + fs/xfs/linux-2.6/xfs_aops.c | 11 ++- fs/xfs/linux-2.6/xfs_lrw.c|4 ++-- include/linux/fs.h| 28 +--- include/linux/nfs_fs.h|4 ++-- mm/filemap.c | 34 ++ 19 files changed, 108 insertions(+), 114 deletions(-) --- diff -urpN -X dontdiff a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking --- a/Documentation/filesystems/Locking 2007-01-12 20:26:06.0 -0800 +++ b/Documentation/filesystems/Locking 2007-01-12 20:42:37.0 -0800 @@ -169,8 +169,9 @@ prototypes: sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); - int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + int (*direct_IO)(int, struct file *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, + file_endio_t *endio, void *endio_data); int (*launder_page) (struct page *); locking rules: diff -urpN -X dontdiff a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt --- a/Documentation/filesystems/vfs.txt 2007-01-12 20:26:06.0 -0800 +++ b/Documentation/filesystems/vfs.txt 2007-01-12 20:42:37.0 -0800 @@ -537,8 +537,9 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); - ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + ssize_t (*direct_IO)(int, struct file *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, + file_endio_t *endio, void *endio_data); struct page* (*get_xip_page)(struct address_space *, sector_t, int); /* migrate the contents of a page to the specified target */ diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c --- a/fs/block_dev.c2007-01-12 20:29:02.0 -0800 +++ b/fs/block_dev.c2007-01-12 20:42:37.0 -0800 @@ -222,10 +222,11 @@ static void blk_unget_page(struct page * } static ssize_t -blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, -loff_t pos, unsigned long nr_segs) +blkdev_direct_IO(int rw, struct file *file, const struct iovec *iov, +loff_t pos, unsigned long nr_segs, file_endio_t *endio, +void *endio_data) { - struct inode *inode = iocb-ki_filp-f_mapping-host; + struct inode *inode = file-f_mapping-host; unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode))); unsigned blocksize_mask = (1 blkbits) - 1; unsigned long seg = 0; /* iov segment iterator */ @@ -239,8 +240,6 @@ blkdev_direct_IO(int rw, struct kiocb *i loff_t size;/* size of block device */ struct bio *bio; struct bdev_aio stack_io, *io; - file_endio_t *endio = aio_complete; - void *endio_data = iocb; struct page *page; struct pvec pvec; diff -urpN -X dontdiff a/fs/ext2/inode.c b/fs/ext2/inode.c --- a/fs/ext2/inode.c 2007-01-12 20:26:06.0 -0800 +++ b/fs/ext2/inode.c 2007-01-12 20:42:37.0 -0800 @@ -752,14 +752,12 @@ static sector_t ext2_bmap(struct address } static ssize_t -ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, - loff_t offset, unsigned long nr_segs) +ext2_direct_IO(int rw, struct file *file, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, file_endio_t *endio, + void *endio_data) { - struct file *file = iocb-ki_filp; - struct inode *inode = file-f_mapping-host; - - return blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev, iov, -
[PATCH -mm 2/10][RFC] aio: net use struct socket for io
Remove unused arg from socket operations The sendmsg and recvmsg socket operations take a kiocb pointer, but none of the functions actually use it. There's really no need even theoretically, it's really quite ugly having it there at all. Also, removing it will pave the way for a more generic completion path in the file_operations. --- drivers/net/pppoe.c |8 +++ include/linux/net.h | 18 +++-- include/net/bluetooth/bluetooth.h |2 - include/net/inet_common.h |3 -- include/net/sock.h| 19 -- include/net/tcp.h |6 ++--- include/net/udp.h |3 -- net/appletalk/ddp.c |5 +--- net/atm/common.c |6 + net/atm/common.h |7 ++ net/ax25/af_ax25.c|7 ++ net/bluetooth/af_bluetooth.c |4 +-- net/bluetooth/hci_sock.c |7 ++ net/bluetooth/l2cap.c |2 - net/bluetooth/rfcomm/sock.c |8 +++ net/bluetooth/sco.c |3 -- net/core/sock.c | 12 --- net/dccp/dccp.h |8 +++ net/dccp/probe.c |3 -- net/dccp/proto.c |7 ++ net/decnet/af_decnet.c|7 ++ net/econet/af_econet.c|7 ++ net/ipv4/af_inet.c|5 +--- net/ipv4/raw.c|8 ++- net/ipv4/tcp.c|7 ++ net/ipv4/tcp_probe.c |3 -- net/ipv4/udp.c|9 +++- net/ipv4/udp_impl.h |2 - net/ipv6/raw.c|6 + net/ipv6/udp.c| 10 +++-- net/ipv6/udp_impl.h |6 + net/ipx/af_ipx.c |7 ++ net/irda/af_irda.c| 29 +--- net/key/af_key.c |6 + net/llc/af_llc.c |7 ++ net/netlink/af_netlink.c |6 + net/netrom/af_netrom.c|7 ++ net/packet/af_packet.c| 11 -- net/rose/af_rose.c|7 ++ net/sctp/socket.c |9 +++- net/socket.c | 32 ++- net/tipc/socket.c | 28 +-- net/unix/af_unix.c| 39 +++--- net/wanrouter/af_wanpipe.c|7 ++ net/x25/af_x25.c |6 + 45 files changed, 166 insertions(+), 243 deletions(-) --- diff -urpN -X dontdiff a/drivers/net/pppoe.c b/drivers/net/pppoe.c --- a/drivers/net/pppoe.c 2007-01-12 11:18:47.244855016 -0800 +++ b/drivers/net/pppoe.c 2007-01-12 11:29:21.179177108 -0800 @@ -746,8 +746,8 @@ static int pppoe_ioctl(struct socket *so } -static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len) +static int pppoe_sendmsg(struct socket *sock, struct msghdr *m, +size_t total_len) { struct sk_buff *skb = NULL; struct sock *sk = sock-sk; @@ -912,8 +912,8 @@ static struct ppp_channel_ops pppoe_chan .start_xmit = pppoe_xmit, }; -static int pppoe_recvmsg(struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len, int flags) +static int pppoe_recvmsg(struct socket *sock, struct msghdr *m, +size_t total_len, int flags) { struct sock *sk = sock-sk; struct sk_buff *skb = NULL; diff -urpN -X dontdiff a/include/linux/net.h b/include/linux/net.h --- a/include/linux/net.h 2007-01-12 11:18:56.683629587 -0800 +++ b/include/linux/net.h 2007-01-12 11:29:21.185175058 -0800 @@ -118,7 +118,6 @@ struct socket { struct vm_area_struct; struct page; -struct kiocb; struct sockaddr; struct msghdr; struct module; @@ -156,11 +155,10 @@ struct proto_ops { int optname, char __user *optval, int optlen); int (*compat_getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen); - int (*sendmsg) (struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len); - int (*recvmsg) (struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len, - int flags); + int (*sendmsg) (struct socket *sock, struct msghdr *m, + size_t total_len); + int (*recvmsg) (struct socket *sock, struct msghdr *m, + size_t total_len, int flags); int
Re: [take22 0/4] kevent: Generic event handling mechanism.
On 11/1/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller ([EMAIL PROTECTED]) wrote: Indesiciveness has certainly been an issue here, but I remember akpm and Ulrich both giving concrete suggestions. I was particularly interested in Andrew's request to explain and justify the differences between kevent and BSD's kqueue interface. Was there a discussion that I missed? I am very interested to see your work on this mechanism merged, because you've clearly emphasized performance and shown impressive results. But it seems like we lose out on a lot by throwing out all the applications that already use kqueue. It looks you missed that discussion - freebsd kqueue has fields in the kevent structure which have diffent sizes in 32 and 64 bit environments. Are you saying that the *only* reason we choose not to be source-compatible with BSD is the 32 bit userland on 64 bit arch problem? I've followed every thread that gmail 'kqueue' search returns, which thread are you referring to? Nicholas Miell, in The Proposed Linux kevent API thread, seems to think that there are no advantages over kqueue to justify the incompatibility, an argument you made no effort to refute. I've also read the Kevent wiki at linux-net.osdl.org, but it too is lacking in any direct comparisons (even theoretical, let alone benchmarks) of the flexibility, performance, etc. between the two. I'm not arguing that you've done a bad design, I'm asking you to brag about the things you improved on vs. kqueue. Your emphasis on unifying all the different event types into one interface is really cool, fill me in on why that can't be effectively done with the kqueue compatability and I also will advocate for kevent inclusion. NATE - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take22 0/4] kevent: Generic event handling mechanism.
On 11/1/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: On Wed, Nov 01, 2006 at 06:20:43PM +, Oleg Verych ([EMAIL PROTECTED]) wrote: Hallo, Evgeniy Polyakov. Hello, Oleg. On 2006-11-01, you wrote: [] Quantifying how much more scalable would be nice, as would be some example where it is useful. (It makes my webserver twice as fast on monster 64-cpu box). Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with [...] Seriously. I'm seeing that patches also. New, shiny, always ready for inclusion. But considering kernel (linux in this case) as not thing for itself, i want to ask following question. Where's real-life application to do configure make make install? Your real life or mine as developer? I fortunately do not know anything about your real life, but my real life applications can be found on project's homepage. There is a link to archive there, where you can find plenty of sources. You likely do not know, but it is a bit risky business to patch all existing applications to show that approach is correct, if implementation is not completed. You likely do not know, but after I first time announced kevents in February I changed interfaces 4 times - and it is just interfaces, not including numerous features added/removed by developer's requests. There were some comments about laking much of such programs, answers were was in prev. e-mail, need to update them, something like that. Trivial web server sources url, mentioned in benchmark isn't pointed in patch advertisement. If it was, should i actually try that new *trivial* wheel? Answer is trivial - there is archive where one can find a source code (filenames are posted regulary). Should I create a rpm? For what glibc version? Saying that, i want to give you some short examples, i know. *Linux kernel - userspace*: o Alexey Kuznetsov networking - (excellent) iproute set of utilities; iproute documentation was way too bad when Alexey presented it first time :) o Maxim Krasnyansky tun net driver - vtun daemon application; *Glibc with mister Drepper* has huge set of tests, please search for `tst*' files in the sources. Btw, show me splice() 'shiny' application? Does lighttpd use it? Or move_pages(). To make a little hint to you, Evgeniy, why don't you find a little animal in the open source zoo to implement little interface to proposed kernel subsystem and then show it to The Big Jury (not me), we have here? And i can not see, how you've managed to implement something like that having almost nothing on the test basket. Very *suspicious* ch. There are always people who do not like something, what can I do with it? I present the code, we discuss it, I ask for inclusion (since it is the only way to get feedback), something requires changes, it is changed and so on - it is development process. I created 'little animal in the open source zoo' by myself to show how simple kevents are. One, that comes in mind is lighthttpd http://www.lighttpd.net/. It had sub-interface for event systems like select,poll,epoll, when i checked its sources last time. And it is mature, btw. As I already told several times, I changed only interfaces 4 times already, since no one seems to know what we really want and how interface should look like. Indesiciveness has certainly been an issue here, but I remember akpm and Ulrich both giving concrete suggestions. I was particularly interested in Andrew's request to explain and justify the differences between kevent and BSD's kqueue interface. Was there a discussion that I missed? I am very interested to see your work on this mechanism merged, because you've clearly emphasized performance and shown impressive results. But it seems like we lose out on a lot by throwing out all the applications that already use kqueue. NATE - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/20] vm deadlock avoidance for NFS, NBD and iSCSI (take 7)
On 9/12/06, Linus Torvalds [EMAIL PROTECTED] wrote: On Tue, 12 Sep 2006, Peter Zijlstra wrote: Linus, when I mentioned swap over network to you in Ottawa, you said it was a valid use case, that people actually do and want this. Can you agree with the approach taken in these patches? Well, in all honesty, I don't think I really said valid, but that I said that some crazy people want to do it, and that we should try to allow them their foibles. So I'd be nervous to do any _guarantees_. I think that good VM policies should make it be something that works in general (the dirty mapping limits in particular), but I'd be a bit nervous about anybody taking it _too_ seriously. Crazy people are still crazy, they just might be right under certain reasonably-well-controlled circumstances. (oops, forgot to cc: the list) Personally, I'm a little unhappy with the added complexity here, I'm not convinced that this extra feature is worth it. In particular, adding to the address_space_operations, the block_device_operations, and creating a new swap index/offset interface just for this seems questionable. I feel like interface bloat should be reserved for features that have widespread use and benefit. Not that I'm opposed to this feature, just that I think this patch is too invasive interface-wise. NATE - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html