Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private

2007-01-17 Thread Nate Diller

On Wed, 17 Jan 2007, Benjamin LaHaise wrote:


On Mon, Jan 15, 2007 at 08:25:15PM -0800, Nate Diller wrote:

the right thing to do from a design perspective.  Hopefully it enables
a new architecture that can reduce context switches in I/O completion,
and reduce overhead.  That's the real motive ;)


And it's a broken motive.  Context switches per se are not bad, as they
make it possible to properly schedule code in a busy system (which is
*very* important when realtime concerns come into play).  Have a look
at how things were done in the 2.4 aio code to see how completion would
get done with a non-retry method, typically in interrupt context.  I had
code that did direct I/O rather differently by sharing code with the
read/write code paths at some point, the catch being that it was pretty
invasive, which meant that it never got merged with the changes to handle
writeback pressure and other work that happened during 2.5.


I'm having some trouble understanding your concern.  From my perspective,
any unnecessary context switch represents not only performance loss, but
extra complexity in the code.  In this case, I'm not suggesting that the
aio.c code causes problems, quite the opposite.  The code I'd like to change
is FS and md levels, where context switches happen because of timers,
workqueues, and worker threads.  For sync I/O, these layers could be doing
their completion work in process context, but because waiting on sync I/O is
done in layers above, they must resort to other means, even for the common
case.  The dm-crypt module is the most straightforward example.

I took a look at some 2.4.18 aio patches in kernel.org/.../bcrl/aio/, and if
I understand what you did, you were basically operating at the aops level
rather than f_ops.  I actually like that idea, it's nicer than having the
direct-io code do its work seperately from the aio code.  Part of where I'm
going with this patch is a better integration between the block layer
(make_request), page layer (aops), and FS layer (f_ops), particularly in the
completion paths.  The direct-io code is an improvement over the common code
on that point, do_readahead() and friends all wait on individual pages to
become uptodate.  I'd like to bring some improvements from the directIO
architecture into use in the common case, which I hope will help
performance.

I know that might seem somewhat unrelated, but I don't think it is.  This
change goes hand in hand with using completion handlers in the aops.  That
will link together the completion callback in the bio with the aio callback,
so that the whole stack can finish its work in one context.


That said, you can't make kiocb private without completely removing the
ability of the rest of the kernel to complete an aio sanely from irq context.
You need some form of i/o descriptor, and a kiocb is just that.  Adding more
layering is just going to make things messier and slower for no real gain.


This patchset does not change how or when I/O completion happens,
aio_complete() will still get called from direct-io.c, nfs-direct.c, et al. 
The iocb structure is still passed to aio_complete, just like before.  The

only difference is that the lower level code doesn't know that it's got an
iocb, all it sees is an opaque cookie.  It's more like enforcing a layer
that's already in place, and I think things got simpler rather than messier. 
Whether things are slower or not remains to be seen, but I expect no

measurable changes either way with this patch.

I'm releasing a new version of the patch soon, it will use a new iodesc
structure to keep track of iovec state, which simplifies things further.  It
also will have a new version of the usb gadget code, and some general
cleanups.  I hope you'll take a look at it.

NATE
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops

2007-01-16 Thread Nate Diller

On 1/15/07, David Brownell [EMAIL PROTECTED] wrote:

On Monday 15 January 2007 5:54 pm, Nate Diller wrote:
 This removes the aio implementation from the usb gadget file system.

NAK.  I see a deep mis-understanding here.


 Aside
 from making very creative (!) use of the aio retry path, it can't be of any
 use performance-wise

Other than the basic win of letting one userspace thread keep an I/O
stream active while at the same time processing the data it reads or
writes??  That's the async part of AIO.

There's a not-so-little thing called I/O overlap ... which is the only
way to prevent wasting bandwidth between (non-cacheable) I/O requests,
and thus is the only way to let userspace code achieve anything close
to the maximum I/O bandwidth the hardware can achieve.

We want to see the host side usbfs evolve to support AIO like this
too, for the same reasons.  (Currently it has fairly ugly AIO code
that looks unlike any other AIO code in Linux.  Recent updates to
support a file-per-endpoint device model are a necessary precursor
to switching over to standard AIO syscalls.)


 because it always kmalloc()s a bounce buffer for the
 *whole* I/O size.

By and large that's a negligible factor compared to being able to
achieve I/O overlap.  ISTR the reason for not doing fancy DMA magic
was that the cost of this style AIO was under 1 KByte object code
on ARM, which was easy to justify ... while DMA magic to do that
sort of stuff would be much fatter, as well as more error prone.

(And that's why the creative use of the retry path.  As I've
observed before, retry is a misnomer in the general sense of
an async I/O framework.  It's more of a semi-completion callback;
I/O can't in general be retried on error or fault, and even in
the current usage it's not really a retry.)


Now that high speed peripheral hardware is becoming more common on
embedded Linuxes -- TI has DaVinci, OMAP 2430, TUSB6010 (as found
in the new Nokia 800 tablets); Atmel AVR32 AP7000; at least a couple
parts that should be able to use the same musb_hdrc driver as those
TI parts; and a few other chips I've heard of -- there may be some
virtue in eliminating the memcpy, since those CPUs don't have many
MIPS to waste.  (Iff the memcpy turns out to be a real issue...)


 Perhaps the only reason to keep it around is the ability
 to cancel I/O requests, which only applies when using the user space async
 I/O interface.

It's good to have almost the complete kernel API functionality
exposed to userspace, and having I/O cancelation is an inevitable
consequence of a complete AIO framework ... but that particular
issue was not a driving concern.


The reason for AIO is to have a *STANDARD* userspace interface
for *ASYNC I/O* which otherwise can't exist.  You know, the kind
of I/O interface that can't be implemented with read() and write()
syscalls, which for non-buffered I/O necessarily preclude all I/O
overlap.  AIO itself is a direct match to most I/O frameworks'
primitives.  (AIOCB being directly analagous to peripheral side
struct usb_request and host side struct urb.)


You know, I've always thought that one reason the AIO discussions
seemed strange is that they weren't really focussed on I/O (the
lowlevel after-the-caches stuff) so much as filesystems (several
layers up in the stack, with intervening caching frameworks).

The first several implementations of AIO that I saw were restricted
to real I/O and not applicable to disk backed files.  So while I
was glad the Linux approach didn't make that mistake, it's seemed
that it might be wanting to make a converse mistake: neglecting I/O
that isn't aimed at data stored on disks.


 I highly doubt that is enough incentive to justify the extra
 complexity here or in user-space, so I think it's a safe bet to remove this.
 If that feature still desired, it would be possible to implement a sync
 interface that does an interruptible sleep.

What's needed is an async, non-sleeeping, interface ... with I/O
overlap.  That's antithetical to using read()/write() calls, so
your proposed approach couldn't possibly work.


haha, wow ok you convinced me :)

I got a bit impatient when I was working on this, it took some time
just to figure out the intention of the code, and I'm trying to hold
to a bit of a schedule here.  Without any clear (to me) reason, I
didn't want to spend a lot of effort fixing this up.

There's really no big difference between the usb drivers here and the
disk I/O scheduler queue, AFAICT, so it seems like the solution I want
is to do a kmap() on the user buffer and then do the I/O straight out
of that.  That will eliminate the need for the bounce buffer.  I'll
post a new version along with the iodesc changes later this week.

NATE
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t

2007-01-16 Thread Nate Diller

On 1/15/07, David Brownell [EMAIL PROTECTED] wrote:

On Monday 15 January 2007 5:54 pm, Nate Diller wrote:
 --- a/drivers/usb/gadget/inode.c  2007-01-12 14:42:29.0 -0800
 +++ b/drivers/usb/gadget/inode.c  2007-01-12 14:25:34.0 -0800
 @@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i
   return value;
  }

 -static ssize_t ep_aio_read_retry(struct kiocb *iocb)
 +static int ep_aio_read_retry(struct kiocb *iocb)
  {
   struct kiocb_priv   *priv = iocb-private;
 - ssize_t len, total;
 - int i;
 + ssize_t total;
 + int i, err = 0;

   /* we retry to get the right mm context for this: */

   /* copy stuff into user buffers */
   total = priv-actual;
 - len = 0;
   for (i=0; i  priv-nr_segs; i++) {
   ssize_t this = min((ssize_t)(priv-iv[i].iov_len), total);

   if (copy_to_user(priv-iv[i].iov_base, priv-buf, this)) {
 - if (len == 0)
 - len = -EFAULT;
 + err = -EFAULT;

Discarding the capability to report partial success, e.g. that the first N
bytes were properly transferred?  I don't see any virtue in that change.
Quite the opposite in fact.

I think you're also expecting that if N bytes were requested, that's always
how many will be received.  That's not true for packetized I/O such as USB
isochronous transfers ... where it's quite legit (and in some cases routine)
for the other end to send packets that are shorter than the maximum allowed.
Sending a zero length packet is not the same as sending no packet at all,
for another example.


I will convert this (usb) code to use the standard completion path,
which you will notice *gained* the ability to properly report both an
error and a partial success as part of this patch.  In fact, fixing
this up was my intention when I wrote this patch, and the later patch
was a compromise intended to get this whole bundle out for review in a
timely manner :)

NATE
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -mm 0/10][RFC] aio: make struct kiocb private

2007-01-15 Thread Nate Diller
This series is an attempt to generalize the async I/O paths to be
implementation agnostic.  It completely eliminates knowledge of
the kiocb structure in the generic code and makes it private within the
current aio code.  Things get noticeably cleaner without that layering
violation.

The new interface takes a file_endio_t function pointer, and a private data
pointer, which would normally be aio_complete and a kiocb pointer,
respectively.  If the aio submission function gets back EIOCBQUEUED, that is
a guarantee that the endio function will be called, or *already has been
called*.  If the file_endio_t pointer provided to aio_[read|write] is NULL,
the FS must block on I/O completion, then return either the number of bytes
read, or an error.

I had to touch more areas that I had originally expected, so there are
changes in a corner of the socket code, and a slight behavior change in the
direct-io completion path with affects XFS and OCFS2.  I would appreciate
further review there, so I copied some extra people I hope can help.

This patch is against 2.6.20-rc4-mm1.  It has been compile-tested at each
stage.  It needs some runtime testing yet, but I prefer to get it out for
commentary and test later.

These patches are for RFC only and have not yet been signed off.

NATE

---

 Documentation/filesystems/Locking |   11 +
 Documentation/filesystems/vfs.txt |   11 +
 arch/s390/hypfs/inode.c   |   16 +-
 drivers/net/pppoe.c   |8 -
 drivers/net/tun.c |   13 +-
 drivers/usb/gadget/inode.c|  239 +-
 fs/aio.c  |   74 ++-
 fs/bad_inode.c|   10 -
 fs/block_dev.c|  109 +++--
 fs/cifs/cifsfs.c  |   10 -
 fs/compat.c   |   56 
 fs/direct-io.c|   92 --
 fs/ecryptfs/file.c|   16 +-
 fs/ext2/inode.c   |   12 -
 fs/ext3/file.c|9 -
 fs/ext3/inode.c   |   11 -
 fs/ext4/file.c|9 -
 fs/ext4/inode.c   |   11 -
 fs/fat/inode.c|   12 -
 fs/fuse/dev.c |   13 +-
 fs/gfs2/ops_address.c |   14 +-
 fs/hfs/inode.c|   13 --
 fs/hfsplus/inode.c|   13 --
 fs/jfs/inode.c|   12 -
 fs/nfs/direct.c   |   92 +++---
 fs/nfs/file.c |   62 +
 fs/ntfs/file.c|   71 ++-
 fs/ocfs2/aops.c   |   24 +--
 fs/ocfs2/aops.h   |8 -
 fs/ocfs2/file.c   |   44 +++---
 fs/ocfs2/inode.h  |2 
 fs/pipe.c |   12 -
 fs/read_write.c   |  225 ---
 fs/read_write.h   |8 -
 fs/reiserfs/inode.c   |   13 --
 fs/smbfs/file.c   |   28 ++--
 fs/udf/file.c |   13 +-
 fs/xfs/linux-2.6/xfs_aops.c   |   44 +++---
 fs/xfs/linux-2.6/xfs_file.c   |   58 +
 fs/xfs/linux-2.6/xfs_lrw.c|   29 ++--
 fs/xfs/linux-2.6/xfs_lrw.h|   10 -
 fs/xfs/linux-2.6/xfs_vnode.h  |   20 +--
 include/linux/aio.h   |   11 -
 include/linux/fs.h|  114 +-
 include/linux/net.h   |   18 +-
 include/linux/nfs_fs.h|   12 -
 include/net/bluetooth/bluetooth.h |2 
 include/net/inet_common.h |3 
 include/net/scm.h |2 
 include/net/sock.h|   45 +--
 include/net/tcp.h |6 
 include/net/udp.h |3 
 mm/filemap.c  |  109 -
 net/appletalk/ddp.c   |5 
 net/atm/common.c  |6 
 net/atm/common.h  |7 -
 net/ax25/af_ax25.c|7 -
 net/bluetooth/af_bluetooth.c  |4 
 net/bluetooth/hci_sock.c  |7 -
 net/bluetooth/l2cap.c |2 
 net/bluetooth/rfcomm/sock.c   |8 -
 net/bluetooth/sco.c   |3 
 net/core/sock.c   |   12 -
 net/dccp/dccp.h   |8 -
 net/dccp/probe.c  |3 
 net/dccp/proto.c  |7 -
 net/decnet/af_decnet.c|7 -
 net/econet/af_econet.c|7 -
 net/ipv4/af_inet.c|5 
 net/ipv4/raw.c|8 -
 net/ipv4/tcp.c|7 -
 net/ipv4/tcp_probe.c  |3 
 net/ipv4/udp.c|9 -
 net/ipv4/udp_impl.h   |2 
 net/ipv6/raw.c|6 
 net/ipv6/udp.c|   10 -
 net/ipv6/udp_impl.h   |6 
 net/ipx/af_ipx.c  |7 -
 net/irda/af_irda.c|   29 ++--
 net/key/af_key.c 

[PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t

2007-01-15 Thread Nate Diller
Define a new function typedef for I/O completion at the file/iovec level --

typedef void (file_endio_t)(void *endio_data, ssize_t count, int err);

and convert aio_complete and all its callers to this new prototype.

---

 drivers/usb/gadget/inode.c |   24 +++---
 fs/aio.c   |   59 -
 fs/block_dev.c |8 +-
 fs/direct-io.c |   18 +
 fs/nfs/direct.c|9 ++
 include/linux/aio.h|   11 +++-
 include/linux/fs.h |2 +
 7 files changed, 61 insertions(+), 70 deletions(-)

---

diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
--- a/drivers/usb/gadget/inode.c2007-01-12 14:42:29.0 -0800
+++ b/drivers/usb/gadget/inode.c2007-01-12 14:25:34.0 -0800
@@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i
return value;
 }
 
-static ssize_t ep_aio_read_retry(struct kiocb *iocb)
+static int ep_aio_read_retry(struct kiocb *iocb)
 {
struct kiocb_priv   *priv = iocb-private;
-   ssize_t len, total;
-   int i;
+   ssize_t total;
+   int i, err = 0;
 
/* we retry to get the right mm context for this: */
 
/* copy stuff into user buffers */
total = priv-actual;
-   len = 0;
for (i=0; i  priv-nr_segs; i++) {
ssize_t this = min((ssize_t)(priv-iv[i].iov_len), total);
 
if (copy_to_user(priv-iv[i].iov_base, priv-buf, this)) {
-   if (len == 0)
-   len = -EFAULT;
+   err = -EFAULT;
break;
}
 
total -= this;
-   len += this;
if (total == 0)
break;
}
kfree(priv-buf);
kfree(priv);
aio_put_req(iocb);
-   return len;
+   return err;
 }
 
 static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
@@ -610,9 +607,7 @@ static void ep_aio_complete(struct usb_e
if (unlikely(kiocbIsCancelled(iocb)))
aio_put_req(iocb);
else
-   aio_complete(iocb,
-   req-actual ? req-actual : req-status,
-   req-status);
+   aio_complete(iocb, req-actual, req-status);
} else {
/* retry() won't report both; so we hide some faults */
if (unlikely(0 != req-status))
@@ -702,16 +697,17 @@ ep_aio_read(struct kiocb *iocb, const st
 {
struct ep_data  *epdata = iocb-ki_filp-private_data;
char*buf;
+   size_t  len = iov_length(iov, nr_segs);
 
if (unlikely(epdata-desc.bEndpointAddress  USB_DIR_IN))
return -EINVAL;
 
-   buf = kmalloc(iocb-ki_left, GFP_KERNEL);
+   buf = kmalloc(len, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
iocb-ki_retry = ep_aio_read_retry;
-   return ep_aio_rwtail(iocb, buf, iocb-ki_left, epdata, iov, nr_segs);
+   return ep_aio_rwtail(iocb, buf, len, epdata, iov, nr_segs);
 }
 
 static ssize_t
@@ -726,7 +722,7 @@ ep_aio_write(struct kiocb *iocb, const s
if (unlikely(!(epdata-desc.bEndpointAddress  USB_DIR_IN)))
return -EINVAL;
 
-   buf = kmalloc(iocb-ki_left, GFP_KERNEL);
+   buf = kmalloc(iov_length(iov, nr_segs), GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
diff -urpN -X dontdiff a/fs/aio.c b/fs/aio.c
--- a/fs/aio.c  2007-01-12 14:42:29.0 -0800
+++ b/fs/aio.c  2007-01-12 14:29:20.0 -0800
@@ -658,16 +658,16 @@ static inline int __queue_kicked_iocb(st
  * simplifies the coding of individual aio operations as
  * it avoids various potential races.
  */
-static ssize_t aio_run_iocb(struct kiocb *iocb)
+static void aio_run_iocb(struct kiocb *iocb)
 {
struct kioctx   *ctx = iocb-ki_ctx;
-   ssize_t (*retry)(struct kiocb *);
+   int (*retry)(struct kiocb *);
wait_queue_t *io_wait = current-io_wait;
-   ssize_t ret;
+   int err;
 
if (!(retry = iocb-ki_retry)) {
printk(aio_run_iocb: iocb-ki_retry = NULL\n);
-   return 0;
+   return;
}
 
/*
@@ -702,8 +702,8 @@ static ssize_t aio_run_iocb(struct kiocb
 
/* Quit retrying if the i/o has been cancelled */
if (kiocbIsCancelled(iocb)) {
-   ret = -EINTR;
-   aio_complete(iocb, ret, 0);
+   err = -EINTR;
+   aio_complete(iocb, iocb-ki_nbytes - iocb-ki_left, err);
/* must not access the iocb after this */
goto out;
}
@@ -720,17 +720,17 @@ static ssize_t aio_run_iocb(struct kiocb
 */
   

[PATCH -mm 5/10][RFC] aio: make blk_directIO use file_endio_t

2007-01-15 Thread Nate Diller
Convert the internals of blkdev_direct_IO to use a generic endio function,
instead of directly calling aio_complete.  This may also fix some bugs/races
in this code, for instance it checks bio-bi_size instead of assuming it's
zero, and it atomically accumulates the bytes_done counter (assuming that
the bio completion handler can't race with itself *might* be valid here, but
the direct-io code makes no such assumption).  I'm also pretty sure that
the address_space-directIO functions aren't supposed to mess with the
iocb-ki_pos or -ki_left.

---

diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c
--- a/fs/block_dev.c2007-01-12 20:26:25.0 -0800
+++ b/fs/block_dev.c2007-01-12 20:23:55.0 -0800
@@ -131,10 +131,32 @@ blkdev_get_block(struct inode *inode, se
return 0;
 }
 
-static int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
+struct bdev_aio {
+   atomic_tiocount;/* refcount */
+   atomic_tbytes_done; /* byte counter */
+   int err;/* error handling */
+   file_endio_t*endio; /* end I/O notify fn */
+   void*endio_data;/* notify fn private data */
+};
+
+static void blk_io_put(struct bdev_aio *io)
+{
+   if (!atomic_dec_and_test(io-iocount))
+   return;
+
+   if (!io-endio)
+   return complete((struct completion*)io-endio_data);
+
+   io-endio(io-endio_data, atomic_read(io-bytes_done), io-err);
+   kfree(io);
+}
+
+static int blk_bio_endio(struct bio *bio, unsigned int bytes_done, int error)
 {
-   struct kiocb *iocb = bio-bi_private;
-   atomic_t *bio_count = iocb-ki_bio_count;
+   struct bdev_aio *io = bio-bi_private;
+
+   if (bio-bi_size)
+   return 1;
 
if (bio_data_dir(bio) == READ)
bio_check_pages_dirty(bio);
@@ -143,16 +165,21 @@ static int blk_end_aio(struct bio *bio, 
bio_put(bio);
}
 
-   /* iocb-ki_nbytes stores error code from LLDD */
-   if (error)
-   iocb-ki_nbytes = -EIO;
-
-   if (atomic_dec_and_test(bio_count))
-   aio_complete(iocb, iocb-ki_left, iocb-ki_nbytes);
+   if (error)
+   io-err = error;
+   atomic_add(bytes_done, io-bytes_done);
 
+   blk_io_put(io);
return 0;
 }
 
+static void blk_io_init(struct bdev_aio *io)
+{
+   atomic_set(io-iocount, 1);
+   atomic_set(io-bytes_done, 0);
+   io-err = 0;
+}
+
 #define VEC_SIZE   16
 struct pvec {
unsigned short nr;
@@ -208,24 +235,33 @@ blkdev_direct_IO(int rw, struct kiocb *i
 
unsigned long addr; /* user iovec address */
size_t count;   /* user iovec len */
-   size_t nbytes = iocb-ki_nbytes = iocb-ki_left; /* total xfer size */
+   size_t nbytes;   /* total xfer size */
loff_t size;/* size of block device */
struct bio *bio;
-   atomic_t *bio_count = iocb-ki_bio_count;
+   struct bdev_aio stack_io, *io;
+   file_endio_t *endio = aio_complete;
+   void *endio_data = iocb;
struct page *page;
struct pvec pvec;
 
pvec.nr = 0;
pvec.idx = 0;
 
+   io = stack_io;
+   if (endio) {
+   io = kmalloc(sizeof(struct bdev_aio), GFP_KERNEL);
+   if (!io)
+   return -ENOMEM;
+   }
+   blk_io_init(io);
+
if (pos  blocksize_mask)
return -EINVAL;
 
+   nbytes = iov_length(iov, nr_segs);
size = i_size_read(inode);
-   if (pos + nbytes  size) {
+   if (pos + nbytes  size)
nbytes = size - pos;
-   iocb-ki_left = nbytes;
-   }
 
/*
 * check first non-zero iov alignment, the remaining
@@ -237,7 +273,6 @@ blkdev_direct_IO(int rw, struct kiocb *i
if (addr  blocksize_mask || count  blocksize_mask)
return -EINVAL;
} while (!count  ++seg  nr_segs);
-   atomic_set(bio_count, 1);
 
while (nbytes) {
/* roughly estimate number of bio vec needed */
@@ -248,8 +283,8 @@ blkdev_direct_IO(int rw, struct kiocb *i
/* bio_alloc should not fail with GFP_KERNEL flag */
bio = bio_alloc(GFP_KERNEL, nvec);
bio-bi_bdev = I_BDEV(inode);
-   bio-bi_end_io = blk_end_aio;
-   bio-bi_private = iocb;
+   bio-bi_end_io = blk_bio_endio;
+   bio-bi_private = io;
bio-bi_sector = pos  blkbits;
 same_bio:
cur_off = addr  ~PAGE_MASK;
@@ -289,18 +324,27 @@ same_bio:
/* bio is ready, submit it */
if (rw == READ)
bio_set_pages_dirty(bio);
-   atomic_inc(bio_count);
+   atomic_inc(io-iocount);
submit_bio(rw, bio);
}
 
 completion:
-  

[PATCH -mm 6/10][RFC] aio: make nfs_directIO use file_endio_t

2007-01-15 Thread Nate Diller
This converts the iternals of nfs's directIO support to use a generic endio
function, instead of directly calling aio_complete.  It's pretty easy
because it already has a pretty abstracted completion path.

---

diff -urpN -X dontdiff a/fs/nfs/direct.c b/fs/nfs/direct.c
--- a/fs/nfs/direct.c   2007-01-12 14:53:48.0 -0800
+++ b/fs/nfs/direct.c   2007-01-12 15:02:30.0 -0800
@@ -68,7 +68,6 @@ struct nfs_direct_req {
 
/* I/O parameters */
struct nfs_open_context *ctx;   /* file open context info */
-   struct kiocb *  iocb;   /* controlling i/o request */
struct inode *  inode;  /* target file of i/o */
 
/* completion state */
@@ -77,6 +76,8 @@ struct nfs_direct_req {
ssize_t count,  /* bytes actually processed */
error;  /* any reported error */
struct completion   completion; /* wait for i/o completion */
+   file_endio_t*endio; /* async completion function */
+   void*endio_data;/* private completion data */
 
/* commit state */
struct list_headrewrite_list;   /* saved nfs_write_data structs 
*/
@@ -151,7 +152,7 @@ static inline struct nfs_direct_req *nfs
kref_get(dreq-kref);
init_completion(dreq-completion);
INIT_LIST_HEAD(dreq-rewrite_list);
-   dreq-iocb = NULL;
+   dreq-endio = NULL;
dreq-ctx = NULL;
spin_lock_init(dreq-lock);
atomic_set(dreq-io_count, 0);
@@ -179,7 +180,7 @@ static ssize_t nfs_direct_wait(struct nf
ssize_t result = -EIOCBQUEUED;
 
/* Async requests don't wait here */
-   if (dreq-iocb)
+   if (!dreq-endio)
goto out;
 
result = wait_for_completion_interruptible(dreq-completion);
@@ -194,14 +195,10 @@ out:
return (ssize_t) result;
 }
 
-/*
- * Synchronous I/O uses a stack-allocated iocb.  Thus we can't trust
- * the iocb is still valid here if this is a synchronous request.
- */
 static void nfs_direct_complete(struct nfs_direct_req *dreq)
 {
-   if (dreq-iocb)
-   aio_complete(dreq-iocb, dreq-count, dreq-error);
+   if (dreq-endio)
+   dreq-endio(dreq-endio_data, dreq-count, dreq-error);
 
complete_all(dreq-completion);
 
@@ -332,11 +329,13 @@ static ssize_t nfs_direct_read_schedule(
return result  0 ? (ssize_t) result : -EFAULT;
 }
 
-static ssize_t nfs_direct_read(struct kiocb *iocb, unsigned long user_addr, 
size_t count, loff_t pos)
+static ssize_t nfs_direct_read(struct file *file, unsigned long user_addr,
+  size_t count, loff_t pos,
+  file_endio_t *endio, void *endio_data)
 {
ssize_t result = 0;
sigset_t oldset;
-   struct inode *inode = iocb-ki_filp-f_mapping-host;
+   struct inode *inode = file-f_mapping-host;
struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
 
@@ -345,9 +344,9 @@ static ssize_t nfs_direct_read(struct ki
return -ENOMEM;
 
dreq-inode = inode;
-   dreq-ctx = get_nfs_open_context((struct nfs_open_context 
*)iocb-ki_filp-private_data);
-   if (!is_sync_kiocb(iocb))
-   dreq-iocb = iocb;
+   dreq-ctx = get_nfs_open_context((struct nfs_open_context 
*)file-private_data);
+   dreq-endio = endio;
+   dreq-endio_data = endio_data;
 
nfs_add_stats(inode, NFSIOS_DIRECTREADBYTES, count);
rpc_clnt_sigmask(clnt, oldset);
@@ -663,11 +662,13 @@ static ssize_t nfs_direct_write_schedule
return result  0 ? (ssize_t) result : -EFAULT;
 }
 
-static ssize_t nfs_direct_write(struct kiocb *iocb, unsigned long user_addr, 
size_t count, loff_t pos)
+static ssize_t nfs_direct_write(struct file *file, unsigned long user_addr,
+   size_t count, loff_t pos,
+   file_endio_t *endio, void *endio_data)
 {
ssize_t result = 0;
sigset_t oldset;
-   struct inode *inode = iocb-ki_filp-f_mapping-host;
+   struct inode *inode = file-f_mapping-host;
struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
size_t wsize = NFS_SERVER(inode)-wsize;
@@ -682,9 +683,9 @@ static ssize_t nfs_direct_write(struct k
sync = FLUSH_STABLE;
 
dreq-inode = inode;
-   dreq-ctx = get_nfs_open_context((struct nfs_open_context 
*)iocb-ki_filp-private_data);
-   if (!is_sync_kiocb(iocb))
-   dreq-iocb = iocb;
+   dreq-ctx = get_nfs_open_context((struct nfs_open_context 
*)file-private_data);
+   dreq-endio = endio;
+   dreq-endio_data = endio_data;
 
nfs_add_stats(inode, NFSIOS_DIRECTWRITTENBYTES, count);
 
@@ -701,10 +702,12 @@ static ssize_t nfs_direct_write(struct k
 
 /**
  * nfs_file_direct_read - file direct read operation for NFS 

[PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left

2007-01-15 Thread Nate Diller
Convert code using iocb-ki_left to use the more generic iov_length() call. 

---

diff -urpN -X dontdiff a/fs/ocfs2/file.c b/fs/ocfs2/file.c
--- a/fs/ocfs2/file.c   2007-01-10 11:50:26.0 -0800
+++ b/fs/ocfs2/file.c   2007-01-10 12:42:09.0 -0800
@@ -1157,7 +1157,7 @@ static ssize_t ocfs2_file_aio_write(stru
   filp-f_path.dentry-d_name.name);
 
/* happy write of zero bytes */
-   if (iocb-ki_left == 0)
+   if (iov_length(iov, nr_segs) == 0)
return 0;
 
mutex_lock(inode-i_mutex);
@@ -1177,7 +1177,7 @@ static ssize_t ocfs2_file_aio_write(stru
}
 
ret = ocfs2_prepare_inode_for_write(filp-f_path.dentry, iocb-ki_pos,
-   iocb-ki_left, appending);
+   iov_length(iov, nr_segs), appending);
if (ret  0) {
mlog_errno(ret);
goto out;
diff -urpN -X dontdiff a/fs/smbfs/file.c b/fs/smbfs/file.c
--- a/fs/smbfs/file.c   2007-01-10 11:50:28.0 -0800
+++ b/fs/smbfs/file.c   2007-01-10 12:42:09.0 -0800
@@ -222,7 +222,7 @@ smb_file_aio_read(struct kiocb *iocb, co
ssize_t status;
 
VERBOSE(file %s/%s, [EMAIL PROTECTED], DENTRY_PATH(dentry),
-   (unsigned long) iocb-ki_left, (unsigned long) pos);
+   (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos);
 
status = smb_revalidate_inode(dentry);
if (status) {
@@ -328,7 +328,7 @@ smb_file_aio_write(struct kiocb *iocb, c
 
VERBOSE(file %s/%s, [EMAIL PROTECTED],
DENTRY_PATH(dentry),
-   (unsigned long) iocb-ki_left, (unsigned long) pos);
+   (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos);
 
result = smb_revalidate_inode(dentry);
if (result) {
@@ -341,7 +341,7 @@ smb_file_aio_write(struct kiocb *iocb, c
if (result)
goto out;
 
-   if (iocb-ki_left  0) {
+   if (iov_length(iov, nr_segs)  0) {
result = generic_file_aio_write(iocb, iov, nr_segs, pos);
VERBOSE(pos=%ld, size=%ld, mtime=%ld, atime=%ld\n,
(long) file-f_pos, (long) dentry-d_inode-i_size,
diff -urpN -X dontdiff a/fs/udf/file.c b/fs/udf/file.c
--- a/fs/udf/file.c 2007-01-10 11:53:02.0 -0800
+++ b/fs/udf/file.c 2007-01-10 12:42:09.0 -0800
@@ -109,7 +109,7 @@ static ssize_t udf_file_aio_write(struct
struct file *file = iocb-ki_filp;
struct inode *inode = file-f_path.dentry-d_inode;
int err, pos;
-   size_t count = iocb-ki_left;
+   size_t count = iov_length(iov, nr_segs);
 
if (UDF_I_ALLOCTYPE(inode) == ICBTAG_FLAG_AD_IN_ICB)
{
diff -urpN -X dontdiff a/net/socket.c b/net/socket.c
--- a/net/socket.c  2007-01-10 12:40:54.0 -0800
+++ b/net/socket.c  2007-01-10 12:42:09.0 -0800
@@ -632,7 +632,7 @@ static ssize_t sock_aio_read(struct kioc
if (pos != 0)
return -ESPIPE;
 
-   if (iocb-ki_left == 0) /* Match SYS5 behaviour */
+   if (iov_length(iov, nr_segs) == 0)  /* Match SYS5 behaviour */
return 0;
 
for (i = 0; i  nr_segs; i++)
@@ -660,7 +660,7 @@ static ssize_t sock_aio_write(struct kio
if (pos != 0)
return -ESPIPE;
 
-   if (iocb-ki_left == 0) /* Match SYS5 behaviour */
+   if (iov_length(iov, nr_segs) == 0)  /* Match SYS5 behaviour */
return 0;
 
for (i = 0; i  nr_segs; i++)
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops

2007-01-15 Thread Nate Diller
This removes the aio implementation from the usb gadget file system.  Aside
from making very creative (!) use of the aio retry path, it can't be of any
use performance-wise because it always kmalloc()s a bounce buffer for the
*whole* I/O size.  Perhaps the only reason to keep it around is the ability
to cancel I/O requests, which only applies when using the user space async
I/O interface.  I highly doubt that is enough incentive to justify the extra
complexity here or in user-space, so I think it's a safe bet to remove this. 
If that feature still desired, it would be possible to implement a sync
interface that does an interruptible sleep.

I can be convinced otherwise, but the alternatives are difficult.  See for
example the fuse, get_user_pages, flush_anon_page, aliasing caches and all
that again LKML thread recently for why it's waaay easier to kmalloc a
bounce buffer here, and (ab)use the retry interface.

---

diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
--- a/drivers/usb/gadget/inode.c2007-01-10 13:23:46.0 -0800
+++ b/drivers/usb/gadget/inode.c2007-01-10 16:56:09.0 -0800
@@ -527,218 +527,6 @@ static int ep_ioctl (struct inode *inode
 
 /*--*/
 
-/* ASYNCHRONOUS ENDPOINT I/O OPERATIONS (bulk/intr/iso) */
-
-struct kiocb_priv {
-   struct usb_request  *req;
-   struct ep_data  *epdata;
-   void*buf;
-   const struct iovec  *iv;
-   unsigned long   nr_segs;
-   unsignedactual;
-};
-
-static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
-{
-   struct kiocb_priv   *priv = iocb-private;
-   struct ep_data  *epdata;
-   int value;
-
-   local_irq_disable();
-   epdata = priv-epdata;
-   // spin_lock(epdata-dev-lock);
-   kiocbSetCancelled(iocb);
-   if (likely(epdata  epdata-ep  priv-req))
-   value = usb_ep_dequeue (epdata-ep, priv-req);
-   else
-   value = -EINVAL;
-   // spin_unlock(epdata-dev-lock);
-   local_irq_enable();
-
-   aio_put_req(iocb);
-   return value;
-}
-
-static int ep_aio_read_retry(struct kiocb *iocb)
-{
-   struct kiocb_priv   *priv = iocb-private;
-   ssize_t total;
-   int i, err = 0;
-
-   /* we retry to get the right mm context for this: */
-
-   /* copy stuff into user buffers */
-   total = priv-actual;
-   for (i=0; i  priv-nr_segs; i++) {
-   ssize_t this = min((ssize_t)(priv-iv[i].iov_len), total);
-
-   if (copy_to_user(priv-iv[i].iov_base, priv-buf, this)) {
-   err = -EFAULT;
-   break;
-   }
-
-   total -= this;
-   if (total == 0)
-   break;
-   }
-   kfree(priv-buf);
-   kfree(priv);
-   aio_put_req(iocb);
-   return err;
-}
-
-static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
-{
-   struct kiocb*iocb = req-context;
-   struct kiocb_priv   *priv = iocb-private;
-   struct ep_data  *epdata = priv-epdata;
-
-   /* lock against disconnect (and ideally, cancel) */
-   spin_lock(epdata-dev-lock);
-   priv-req = NULL;
-   priv-epdata = NULL;
-   if (priv-iv == NULL
-   || unlikely(req-actual == 0)
-   || unlikely(kiocbIsCancelled(iocb))) {
-   kfree(req-buf);
-   kfree(priv);
-   iocb-private = NULL;
-   /* aio_complete() reports bytes-transferred _and_ faults */
-   if (unlikely(kiocbIsCancelled(iocb)))
-   aio_put_req(iocb);
-   else
-   aio_complete(iocb, req-actual, req-status);
-   } else {
-   /* retry() won't report both; so we hide some faults */
-   if (unlikely(0 != req-status))
-   DBG(epdata-dev, %s fault %d len %d\n,
-   ep-name, req-status, req-actual);
-
-   priv-buf = req-buf;
-   priv-actual = req-actual;
-   kick_iocb(iocb);
-   }
-   spin_unlock(epdata-dev-lock);
-
-   usb_ep_free_request(ep, req);
-   put_ep(epdata);
-}
-
-static ssize_t
-ep_aio_rwtail(
-   struct kiocb*iocb,
-   char*buf,
-   size_t  len,
-   struct ep_data  *epdata,
-   const struct iovec *iv,
-   unsigned long   nr_segs
-)
-{
-   struct kiocb_priv   *priv;
-   struct usb_request  *req;
-   ssize_t value;
-
-   priv = kmalloc(sizeof *priv, GFP_KERNEL);
-   if (!priv) {
-   value = -ENOMEM;
-fail:
-   kfree(buf);
-   return value;
-   }
-   iocb-private = priv;
-   

[PATCH -mm 7/10][RFC] aio: make __blockdev_direct_IO use file_endio_t

2007-01-15 Thread Nate Diller
This converts the internals of __blockdev_direct_IO in fs/direct-io.c to use
a generic endio function, instead of directly calling aio_complete.  It also
changes the semantics of dio_iodone to be more friendly to its only users,
xfs and ocfs2.  This allows the caller to know how to release locks and tear
down data structures on error.

It also converts the _own_locking and _no_locking variants of
blockdev_direct_IO to use a generic endio function.

---

 fs/direct-io.c  |   74 ++--
 fs/gfs2/ops_address.c   |6 +--
 fs/ocfs2/aops.c |   15 ++--
 fs/ocfs2/aops.h |8 
 fs/ocfs2/file.c |   18 --
 fs/ocfs2/inode.h|2 -
 fs/xfs/linux-2.6/xfs_aops.c |   33 +++
 include/linux/fs.h  |   57 ++---
 8 files changed, 104 insertions(+), 109 deletions(-)

---

diff -urpN -X dontdiff a/fs/direct-io.c b/fs/direct-io.c
--- a/fs/direct-io.c2007-01-12 14:53:48.0 -0800
+++ b/fs/direct-io.c2007-01-12 15:06:44.0 -0800
@@ -67,7 +67,7 @@ struct dio {
struct bio *bio;/* bio under assembly */
struct inode *inode;
int rw;
-   loff_t i_size;  /* i_size when submitted */
+   unsigned max_to_read;   /* (i_size when submitted) - offset */
int lock_type;  /* doesn't change */
unsigned blkbits;   /* doesn't change */
unsigned blkfactor; /* When we're using an alignment which
@@ -89,6 +89,7 @@ struct dio {
int reap_counter;   /* rate limit reaping */
get_block_t *get_block; /* block mapping function */
dio_iodone_t *end_io;   /* IO completion function */
+   void *destructor_data;  /* private data for completion fn */
sector_t final_block_in_bio;/* current final block in bio + 1 */
sector_t next_block_for_io; /* next block to be put under IO,
   in dio_blocks units */
@@ -127,7 +128,8 @@ struct dio {
struct task_struct *waiter; /* waiting task (NULL if none) */
 
/* AIO related stuff */
-   struct kiocb *iocb; /* kiocb */
+   file_endio_t *file_endio;   /* aio completion function */
+   void *endio_data;   /* private data for aio completion */
int is_async;   /* is IO async ? */
int io_error;   /* IO error in completion path */
ssize_t result; /* IO result */
@@ -222,7 +224,7 @@ static struct page *dio_get_page(struct 
  * filesystems can use it to hold additional state between get_block calls and
  * dio_complete.
  */
-static int dio_complete(struct dio *dio, loff_t offset, int ret)
+static int dio_complete(struct dio *dio, int ret)
 {
/*
 * AIO submission can race with bio completion to get here while
@@ -232,25 +234,21 @@ static int dio_complete(struct dio *dio,
 */
if (ret == -EIOCBQUEUED)
ret = 0;
+   if (ret == 0)
+   ret = dio-page_errors;
+   if (ret == 0)
+   ret = dio-io_error;
 
if (dio-result) {
/* Check for short read case */
-   if ((dio-rw == READ)  ((offset + dio-result)  dio-i_size))
-   dio-result = dio-i_size - offset;
+   if ((dio-rw == READ)  (dio-result  dio-max_to_read))
+   dio-result = dio-max_to_read;
}
 
-   if (dio-end_io  dio-result)
-   dio-end_io(dio-iocb, offset, dio-result,
-   dio-map_bh.b_private);
if (dio-lock_type == DIO_LOCKING)
/* lockdep: non-owner release */
up_read_non_owner(dio-inode-i_alloc_sem);
 
-   if (ret == 0)
-   ret = dio-page_errors;
-   if (ret == 0)
-   ret = dio-io_error;
-
return ret;
 }
 
@@ -277,8 +275,11 @@ static int dio_bio_end_aio(struct bio *b
spin_unlock_irqrestore(dio-bio_lock, flags);
 
if (remaining == 0) {
-   int err = dio_complete(dio, dio-iocb-ki_pos, 0);
-   aio_complete(dio-iocb, dio-result, err);
+   int err = dio_complete(dio, 0);
+   if (dio-end_io)
+   dio-end_io(dio-destructor_data, dio-result,
+   dio-map_bh.b_private);
+   dio-file_endio(dio-endio_data, dio-result, err);
kfree(dio);
}
 
@@ -944,10 +945,11 @@ out:
  * Releases both i_mutex and i_alloc_sem
  */
 static ssize_t
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
+direct_io_worker(int rw, struct file *file, struct inode *inode, 
const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
unsigned blkbits, get_block_t get_block, 

[PATCH -mm 8/10][RFC] aio: make direct_IO aops use file_endio_t

2007-01-15 Thread Nate Diller
This converts the _locking variant of blockdev_direct_IO to use a generic
endio function, and updates all the FS callsites.

---

 Documentation/filesystems/Locking |5 +++--
 Documentation/filesystems/vfs.txt |5 +++--
 fs/block_dev.c|9 -
 fs/ext2/inode.c   |   12 +---
 fs/ext3/inode.c   |   11 +--
 fs/ext4/inode.c   |   11 +--
 fs/fat/inode.c|   12 ++--
 fs/gfs2/ops_address.c |8 
 fs/hfs/inode.c|   13 ++---
 fs/hfsplus/inode.c|   13 ++---
 fs/jfs/inode.c|   12 +---
 fs/nfs/direct.c   |8 +---
 fs/ocfs2/aops.c   |9 +
 fs/reiserfs/inode.c   |   13 +
 fs/xfs/linux-2.6/xfs_aops.c   |   11 ++-
 fs/xfs/linux-2.6/xfs_lrw.c|4 ++--
 include/linux/fs.h|   28 +---
 include/linux/nfs_fs.h|4 ++--
 mm/filemap.c  |   34 ++
 19 files changed, 108 insertions(+), 114 deletions(-)

---

diff -urpN -X dontdiff a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
--- a/Documentation/filesystems/Locking 2007-01-12 20:26:06.0 -0800
+++ b/Documentation/filesystems/Locking 2007-01-12 20:42:37.0 -0800
@@ -169,8 +169,9 @@ prototypes:
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
-   int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
-   loff_t offset, unsigned long nr_segs);
+   int (*direct_IO)(int, struct file *, const struct iovec *iov,
+   loff_t offset, unsigned long nr_segs,
+   file_endio_t *endio, void *endio_data);
int (*launder_page) (struct page *);
 
 locking rules:
diff -urpN -X dontdiff a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
--- a/Documentation/filesystems/vfs.txt 2007-01-12 20:26:06.0 -0800
+++ b/Documentation/filesystems/vfs.txt 2007-01-12 20:42:37.0 -0800
@@ -537,8 +537,9 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
-   ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
-   loff_t offset, unsigned long nr_segs);
+   ssize_t (*direct_IO)(int, struct file *, const struct iovec *iov,
+   loff_t offset, unsigned long nr_segs,
+   file_endio_t *endio, void *endio_data);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
/* migrate the contents of a page to the specified target */
diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c
--- a/fs/block_dev.c2007-01-12 20:29:02.0 -0800
+++ b/fs/block_dev.c2007-01-12 20:42:37.0 -0800
@@ -222,10 +222,11 @@ static void blk_unget_page(struct page *
 }
 
 static ssize_t
-blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-loff_t pos, unsigned long nr_segs)
+blkdev_direct_IO(int rw, struct file *file, const struct iovec *iov,
+loff_t pos, unsigned long nr_segs, file_endio_t *endio,
+void *endio_data)
 {
-   struct inode *inode = iocb-ki_filp-f_mapping-host;
+   struct inode *inode = file-f_mapping-host;
unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
unsigned blocksize_mask = (1  blkbits) - 1;
unsigned long seg = 0;  /* iov segment iterator */
@@ -239,8 +240,6 @@ blkdev_direct_IO(int rw, struct kiocb *i
loff_t size;/* size of block device */
struct bio *bio;
struct bdev_aio stack_io, *io;
-   file_endio_t *endio = aio_complete;
-   void *endio_data = iocb;
struct page *page;
struct pvec pvec;
 
diff -urpN -X dontdiff a/fs/ext2/inode.c b/fs/ext2/inode.c
--- a/fs/ext2/inode.c   2007-01-12 20:26:06.0 -0800
+++ b/fs/ext2/inode.c   2007-01-12 20:42:37.0 -0800
@@ -752,14 +752,12 @@ static sector_t ext2_bmap(struct address
 }
 
 static ssize_t
-ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-   loff_t offset, unsigned long nr_segs)
+ext2_direct_IO(int rw, struct file *file, const struct iovec *iov,
+  loff_t offset, unsigned long nr_segs, file_endio_t *endio,
+  void *endio_data)
 {
-   struct file *file = iocb-ki_filp;
-   struct inode *inode = file-f_mapping-host;
-
-   return blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev, iov,
-   

[PATCH -mm 2/10][RFC] aio: net use struct socket for io

2007-01-15 Thread Nate Diller
Remove unused arg from socket operations

The sendmsg and recvmsg socket operations take a kiocb pointer, but none of
the functions actually use it.  There's really no need even theoretically,
it's really quite ugly having it there at all.  Also, removing it will pave
the way for a more generic completion path in the file_operations.

---

 drivers/net/pppoe.c   |8 +++
 include/linux/net.h   |   18 +++--
 include/net/bluetooth/bluetooth.h |2 -
 include/net/inet_common.h |3 --
 include/net/sock.h|   19 --
 include/net/tcp.h |6 ++---
 include/net/udp.h |3 --
 net/appletalk/ddp.c   |5 +---
 net/atm/common.c  |6 +
 net/atm/common.h  |7 ++
 net/ax25/af_ax25.c|7 ++
 net/bluetooth/af_bluetooth.c  |4 +--
 net/bluetooth/hci_sock.c  |7 ++
 net/bluetooth/l2cap.c |2 -
 net/bluetooth/rfcomm/sock.c   |8 +++
 net/bluetooth/sco.c   |3 --
 net/core/sock.c   |   12 ---
 net/dccp/dccp.h   |8 +++
 net/dccp/probe.c  |3 --
 net/dccp/proto.c  |7 ++
 net/decnet/af_decnet.c|7 ++
 net/econet/af_econet.c|7 ++
 net/ipv4/af_inet.c|5 +---
 net/ipv4/raw.c|8 ++-
 net/ipv4/tcp.c|7 ++
 net/ipv4/tcp_probe.c  |3 --
 net/ipv4/udp.c|9 +++-
 net/ipv4/udp_impl.h   |2 -
 net/ipv6/raw.c|6 +
 net/ipv6/udp.c|   10 +++--
 net/ipv6/udp_impl.h   |6 +
 net/ipx/af_ipx.c  |7 ++
 net/irda/af_irda.c|   29 +---
 net/key/af_key.c  |6 +
 net/llc/af_llc.c  |7 ++
 net/netlink/af_netlink.c  |6 +
 net/netrom/af_netrom.c|7 ++
 net/packet/af_packet.c|   11 --
 net/rose/af_rose.c|7 ++
 net/sctp/socket.c |9 +++-
 net/socket.c  |   32 ++-
 net/tipc/socket.c |   28 +--
 net/unix/af_unix.c|   39 +++---
 net/wanrouter/af_wanpipe.c|7 ++
 net/x25/af_x25.c  |6 +
 45 files changed, 166 insertions(+), 243 deletions(-)

---

diff -urpN -X dontdiff a/drivers/net/pppoe.c b/drivers/net/pppoe.c
--- a/drivers/net/pppoe.c   2007-01-12 11:18:47.244855016 -0800
+++ b/drivers/net/pppoe.c   2007-01-12 11:29:21.179177108 -0800
@@ -746,8 +746,8 @@ static int pppoe_ioctl(struct socket *so
 }
 
 
-static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len)
+static int pppoe_sendmsg(struct socket *sock, struct msghdr *m,
+size_t total_len)
 {
struct sk_buff *skb = NULL;
struct sock *sk = sock-sk;
@@ -912,8 +912,8 @@ static struct ppp_channel_ops pppoe_chan
.start_xmit = pppoe_xmit,
 };
 
-static int pppoe_recvmsg(struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len, int flags)
+static int pppoe_recvmsg(struct socket *sock, struct msghdr *m,
+size_t total_len, int flags)
 {
struct sock *sk = sock-sk;
struct sk_buff *skb = NULL;
diff -urpN -X dontdiff a/include/linux/net.h b/include/linux/net.h
--- a/include/linux/net.h   2007-01-12 11:18:56.683629587 -0800
+++ b/include/linux/net.h   2007-01-12 11:29:21.185175058 -0800
@@ -118,7 +118,6 @@ struct socket {
 
 struct vm_area_struct;
 struct page;
-struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
@@ -156,11 +155,10 @@ struct proto_ops {
  int optname, char __user *optval, int 
optlen);
int (*compat_getsockopt)(struct socket *sock, int level,
  int optname, char __user *optval, int 
__user *optlen);
-   int (*sendmsg)   (struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len);
-   int (*recvmsg)   (struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len,
- int flags);
+   int (*sendmsg)   (struct socket *sock, struct msghdr *m,
+ size_t total_len);
+   int (*recvmsg)   (struct socket *sock, struct msghdr *m,
+ size_t total_len, int flags);
int

Re: [take22 0/4] kevent: Generic event handling mechanism.

2006-11-02 Thread Nate Diller

On 11/1/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller ([EMAIL PROTECTED]) wrote:
 Indesiciveness has certainly been an issue here, but I remember akpm
 and Ulrich both giving concrete suggestions.  I was particularly
 interested in Andrew's request to explain and justify the differences
 between kevent and BSD's kqueue interface.  Was there a discussion
 that I missed?  I am very interested to see your work on this
 mechanism merged, because you've clearly emphasized performance and
 shown impressive results.  But it seems like we lose out on a lot by
 throwing out all the applications that already use kqueue.

It looks you missed that discussion - freebsd kqueue has fields in the
kevent structure which have diffent sizes in 32 and 64 bit environments.


Are you saying that the *only* reason we choose not to be
source-compatible with BSD is the 32 bit userland on 64 bit arch
problem?  I've followed every thread that gmail 'kqueue' search
returns, which thread are you referring to?  Nicholas Miell, in The
Proposed Linux kevent API thread, seems to think that there are no
advantages over kqueue to justify the incompatibility, an argument you
made no effort to refute.  I've also read the Kevent wiki at
linux-net.osdl.org, but it too is lacking in any direct comparisons
(even theoretical, let alone benchmarks) of the flexibility,
performance, etc. between the two.

I'm not arguing that you've done a bad design, I'm asking you to brag
about the things you improved on vs. kqueue.  Your emphasis on
unifying all the different event types into one interface is really
cool, fill me in on why that can't be effectively done with the kqueue
compatability and I also will advocate for kevent inclusion.

NATE
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [take22 0/4] kevent: Generic event handling mechanism.

2006-11-01 Thread Nate Diller

On 11/1/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

On Wed, Nov 01, 2006 at 06:20:43PM +, Oleg Verych ([EMAIL PROTECTED]) wrote:

 Hallo, Evgeniy Polyakov.

Hello, Oleg.

 On 2006-11-01, you wrote:
 []
  Quantifying how much more scalable would be nice, as would be some
  example where it is useful. (It makes my webserver twice as fast on
  monster 64-cpu box).
 
  Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
 [...]

 Seriously. I'm seeing that patches also. New, shiny, always ready for
 inclusion. But considering kernel (linux in this case) as not thing
 for itself, i want to ask following question.

 Where's real-life application to do configure  make  make install?

Your real life or mine as developer?
I fortunately do not know anything about your real life, but my real life
applications can be found on project's homepage.
There is a link to archive there, where you can find plenty of sources.
You likely do not know, but it is a bit risky business to patch all
existing applications to show that approach is correct, if
implementation is not completed.
You likely do not know, but after I first time announced kevents in
February I changed interfaces 4 times - and it is just interfaces, not
including numerous features added/removed by developer's requests.

 There were some comments about laking much of such programs, answers were
 was in prev. e-mail, need to update them, something like that.
 Trivial web server sources url, mentioned in benchmark isn't pointed
 in patch advertisement. If it was, should i actually try that new
 *trivial* wheel?

Answer is trivial - there is archive where one can find a source code
(filenames are posted regulary). Should I create a rpm? For what glibc
version?

 Saying that, i want to give you some short examples, i know.
 *Linux kernel - userspace*:
 o Alexey Kuznetsov  networking - (excellent) iproute set of utilities;

iproute documentation was way too bad when Alexey presented it first
time :)

 o Maxim Krasnyansky tun net driver - vtun daemon application;

 *Glibc with mister Drepper* has huge set of tests, please search for
 `tst*' files in the sources.

Btw, show me splice() 'shiny' application? Does lighttpd use it?
Or move_pages().

 To make a little hint to you, Evgeniy, why don't you find a little
 animal in the open source zoo to implement little interface to
 proposed kernel subsystem and then show it to The Big Jury (not me),
 we have here? And i can not see, how you've managed to implement
 something like that having almost nothing on the test basket.
 Very *suspicious* ch.

There are always people who do not like something, what can I do with
it? I present the code, we discuss it, I ask for inclusion (since it is
the only way to get feedback), something requires changes, it is changed
and so on - it is development process.
I created 'little animal in the open source zoo' by myself to show how
simple kevents are.

 One, that comes in mind is lighthttpd http://www.lighttpd.net/.
 It had sub-interface for event systems like select,poll,epoll, when i
 checked its sources last time. And it is mature, btw.

As I already told several times, I changed only interfaces 4 times
already, since no one seems to know what we really want and how
interface should look like.


Indesiciveness has certainly been an issue here, but I remember akpm
and Ulrich both giving concrete suggestions.  I was particularly
interested in Andrew's request to explain and justify the differences
between kevent and BSD's kqueue interface.  Was there a discussion
that I missed?  I am very interested to see your work on this
mechanism merged, because you've clearly emphasized performance and
shown impressive results.  But it seems like we lose out on a lot by
throwing out all the applications that already use kqueue.

NATE
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/20] vm deadlock avoidance for NFS, NBD and iSCSI (take 7)

2006-09-12 Thread Nate Diller

On 9/12/06, Linus Torvalds [EMAIL PROTECTED] wrote:



On Tue, 12 Sep 2006, Peter Zijlstra wrote:

 Linus, when I mentioned swap over network to you in Ottawa, you said it was
 a valid use case, that people actually do and want this. Can you agree with
 the approach taken in these patches?

Well, in all honesty, I don't think I really said valid, but that I said
that some crazy people want to do it, and that we should try to allow them
their foibles.

So I'd be nervous to do any _guarantees_. I think that good VM policies
should make it be something that works in general (the dirty mapping
limits in particular), but I'd be a bit nervous about anybody taking it
_too_ seriously. Crazy people are still crazy, they just might be right
under certain reasonably-well-controlled circumstances.


(oops, forgot to cc: the list)

Personally, I'm a little unhappy with the added complexity here, I'm
not convinced that this extra feature is worth it.  In particular,
adding to the address_space_operations, the block_device_operations,
and creating a new swap index/offset interface just for this seems
questionable.  I feel like interface bloat should be reserved for
features that have widespread use and benefit.

Not that I'm opposed to this feature, just that I think this patch is
too invasive interface-wise.

NATE
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html