Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Kevin Wolf
On 28.11.2014 at 13:57, Kevin Wolf wrote:
> On 27.11.2014 at 10:50, Peter Lieven wrote:
> > On 26.11.2014 15:46, Kevin Wolf wrote:
> > >This improves the performance of requests because an ACB doesn't need to
> > >be allocated on the heap any more. It also makes the code nicer and
> > >smaller.
> > >
> > >As a side effect, the codepath taken by aio=threads is changed to use
> > >paio_submit_co(). This doesn't change the performance at this point.
> > >
> > >Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
> > >
> > >       |  aio=native           |  aio=threads
> > >       | before   | with patch | before   | with patch
> > > ------+----------+------------+----------+-----------
> > > run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> > > run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> > > run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> > > run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> > > run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
> > >
> > >TODO: Do some more serious benchmarking in VMs with less variance.
> > >Results of a quick fio run are vaguely positive.
> > 
> > I still see the main-loop spun warnings with these patches applied to master.
> > They weren't there with the original patch from August.
> > 
> > ~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
> > Sending 1000 requests, 4096 bytes each, 64 in parallel
> > main-loop: WARNING: I/O thread spun for 1000 iterations
> > Run completed in 31.947 seconds.
> 
Yes, I still need to bisect that. The 'qemu-img bench' numbers above are
actually also from August; we have meanwhile regressed by about a
second, and I also haven't found the reason for that yet.

Did the first part of this now. The commit that introduced the "spun"
message is 2cdff7f6 ('linux-aio: avoid deadlock in nested aio_poll()
calls').

The following patch doesn't make it go away completely, but now I only
see it occasionally, roughly every other run, instead of immediately
after starting qemu-img bench. It's probably a (very) minor performance
optimisation, too.

Kevin


diff --git a/block/linux-aio.c b/block/linux-aio.c
index fd8f0e4..1a0ec62 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -136,6 +136,8 @@ static void qemu_laio_completion_bh(void *opaque)
 
 qemu_laio_process_completion(s, laiocb);
 }
+
+qemu_bh_cancel(s->completion_bh);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -143,7 +145,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
 struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
 
 if (event_notifier_test_and_clear(&s->e)) {
-qemu_bh_schedule(s->completion_bh);
+qemu_laio_completion_bh(s);
 }
 }



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Kevin Wolf
On 27.11.2014 at 10:50, Peter Lieven wrote:
> On 26.11.2014 15:46, Kevin Wolf wrote:
> >This improves the performance of requests because an ACB doesn't need to
> >be allocated on the heap any more. It also makes the code nicer and
> >smaller.
> >
> >As a side effect, the codepath taken by aio=threads is changed to use
> >paio_submit_co(). This doesn't change the performance at this point.
> >
> >Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
> >
> >   |  aio=native   | aio=threads
> >   | before   | with patch | before   | with patch
> >--+--++--+
> >run 1 | 29.921s  | 26.932s| 35.286s  | 35.447s
> >run 2 | 29.793s  | 26.252s| 35.276s  | 35.111s
> >run 3 | 30.186s  | 27.114s| 35.042s  | 34.921s
> >run 4 | 30.425s  | 26.600s| 35.169s  | 34.968s
> >run 5 | 30.041s  | 26.263s| 35.224s  | 35.000s
> >
> >TODO: Do some more serious benchmarking in VMs with less variance.
> >Results of a quick fio run are vaguely positive.
> 
> I still see the main-loop spun warnings with these patches applied to master.
> They weren't there with the original patch from August.
> 
> ~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
> Sending 1000 requests, 4096 bytes each, 64 in parallel
> main-loop: WARNING: I/O thread spun for 1000 iterations
> Run completed in 31.947 seconds.

Yes, I still need to bisect that. The 'qemu-img bench' numbers above are
actually also from August; we have meanwhile regressed by about a
second, and I also haven't found the reason for that yet.

Kevin



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Kevin Wolf
On 28.11.2014 at 03:59, Ming Lei wrote:
> Hi Kevin,
> 
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
> > This improves the performance of requests because an ACB doesn't need to
> > be allocated on the heap any more. It also makes the code nicer and
> > smaller.
> 
> I am not sure this is a good way to optimize Linux AIO:
> 
> - for a raw image with some constraints, the coroutine can be avoided,
> since io_submit() won't sleep most of the time
> 
> - handling a coroutine once takes more time than a malloc, memset and
> free on a small buffer, per the following test data:
> 
>  --   241ns per coroutine
>  --   61ns per (malloc, memset, free for 128 bytes)

Please finally stop making comparisons between completely unrelated
things and trying to make a case against coroutines out of it. It simply
doesn't make any sense.

The truth is that in the 'qemu-img bench' case, as well as in the highest
performing VM setup for Peter and me, the practically existing coroutine
based git branches perform better than the practically existing bypass
branches. If you think that theoretically the bypass branches must be
better, show us the patches and benchmarks.

If you can't, let's merge the coroutine improvements (which improve
more than just the case of raw images using no block layer features,
including cases that benefit the average user) and be done.

> I still think we should figure out a fast path to avoid coroutines
> for linux-aio with raw images, otherwise it can't scale well for high
> IOPS devices.
> 
> Also we can use a simple buffer pool to avoid the dynamic allocation
> easily, can't we?

Yes, the change to g_slice_alloc() was a bad move performance-wise.

> > As a side effect, the codepath taken by aio=threads is changed to use
> > paio_submit_co(). This doesn't change the performance at this point.
> >
> > Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
> >
> >       |  aio=native           |  aio=threads
> >       | before   | with patch | before   | with patch
> > ------+----------+------------+----------+-----------
> > run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> > run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> > run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> > run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> > run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
> >
> > TODO: Do some more serious benchmarking in VMs with less variance.
> > Results of a quick fio run are vaguely positive.
> 
> I will do the test with Paolo's fast path approach under
> VM I/O situation.

Currently, the best thing to compare it against is probably Peter's git
branch at https://github.com/plieven/qemu.git perf_master2. This patch
is only a first step in a whole series of possible optimisations.

Kevin



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Paolo Bonzini


On 28/11/2014 03:59, Ming Lei wrote:
> Hi Kevin,
> 
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
>> This improves the performance of requests because an ACB doesn't need to
>> be allocated on the heap any more. It also makes the code nicer and
>> smaller.
> 
> I am not sure this is a good way to optimize Linux AIO:
> 
> - for a raw image with some constraints, the coroutine can be avoided,
> since io_submit() won't sleep most of the time
> 
> - handling a coroutine once takes more time than a malloc, memset and
> free on a small buffer, per the following test data:
> 
>  --   241ns per coroutine
>  --   61ns per (malloc, memset, free for 128 bytes)
> 
> I still think we should figure out a fast path to avoid coroutines
> for linux-aio with raw images, otherwise it can't scale well for high
> IOPS devices.

sigsetjmp/siglongjmp are just ~60 instructions, so they cannot account
for 180ns (600 clock cycles).  The cost of creating and destroying the
coroutine must come from somewhere else.

Let's just try something else.  Let's remove the pool mutex, as suggested
by Peter but in a way that works even with non-ioeventfd backends.

I still believe we will end up with some kind of coroutine bypass scheme
(even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv
directly can help), but hey, it cannot hurt to optimize hot code.

The patch below has a single pool where coroutines are placed on
destruction, and a per-thread allocation pool.  Whenever the destruction
pool is big enough, the next thread that runs out of coroutines will
steal from it instead of making a new coroutine.  If this works, it
would be beautiful in two ways:

1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.

2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also removes qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous).

The patch below is very raw, and untested beyond tests/test-coroutine.
There may be some premature optimization (not using GPrivate, even though
I need it to run the per-thread destructor) but it was easy enough.  Ming,
Kevin, can you benchmark it?

Related to this, we can see if __thread beats GPrivate in coroutine-ucontext.c.
Every clock cycle counts (600 clock cycles are a 2% improvement at 3 GHz
and 100 kiops) so we can see what we get.

Paolo

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index d433b90..6a01e2f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -191,6 +191,17 @@ struct {                                               \
 #define QSLIST_INSERT_HEAD(head, elm, field) do {\
 (elm)->field.sle_next = (head)->slh_first;  \
 (head)->slh_first = (elm);  \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {   \
+   do {   \
+  (elm)->field.sle_next = (head)->slh_first;  \
+   } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next, \
+  (elm)) != (elm)->field.sle_next);\
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_MOVE_ATOMIC(dest, src) do {   \
+   (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);\
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do { \
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index bd574aa..60d761f 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -15,35 +15,52 @@
 #include "trace.h"
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/atomic.h"
 #include "block/coroutine.h"
 #include "block/coroutine_int.h"
 
 enum {
-POOL_DEFAULT_SIZE = 64,
+POOL_BATCH_SIZE = 64,
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
 static unsigned int pool_size;
-static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+
+/* The GPrivate is only used to invoke coroutine_pool_cleanup.  */
+static void coroutine_pool_cleanup(void *value);
+static GPrivate dummy_key = G_PRIVATE_INIT(coroutine_pool_cleanup);
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
 Coroutine *co = NULL;
 
 if (CONFIG_COROUTINE_POOL) {
-qemu_mut

Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Ming Lei
On 11/28/14, Markus Armbruster  wrote:
> Ming Lei  writes:
>
>> On 11/28/14, Markus Armbruster  wrote:
>>> Ming Lei  writes:
>>>
 Hi Kevin,

 On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
> This improves the performance of requests because an ACB doesn't need
> to be allocated on the heap any more. It also makes the code nicer and
> smaller.

 I am not sure this is a good way to optimize Linux AIO:

 - for a raw image with some constraints, the coroutine can be avoided,
 since io_submit() won't sleep most of the time

 - handling a coroutine once takes more time than a malloc, memset and
 free on a small buffer, per the following test data:

  --   241ns per coroutine
>>>
>>> What do you mean by "coroutine" here?  Create + destroy?  Yield?
>>
>> Please see perf_cost() in tests/test-coroutine.c
>
> static __attribute__((noinline)) void perf_cost_func(void *opaque)
> {
> qemu_coroutine_yield();
> }
>
> static void perf_cost(void)
> {
> const unsigned long maxcycles = 40000000;
> unsigned long i = 0;
> double duration;
> unsigned long ops;
> Coroutine *co;
>
> g_test_timer_start();
> while (i++ < maxcycles) {
> co = qemu_coroutine_create(perf_cost_func);
> qemu_coroutine_enter(co, &i);
> qemu_coroutine_enter(co, NULL);
> }
> duration = g_test_timer_elapsed();
> ops = (long)(maxcycles / (duration * 1000));
>
> g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
>"%luns per coroutine",
>maxcycles,
>duration, ops,
>(unsigned long)(1000000000.0 * duration) / maxcycles);
> }
>
> This tests create, enter, yield, reenter, terminate, destroy.  The cost
> of create + destroy may well dominate.

Actually, there shouldn't be much cost from create and destroy, thanks
to the coroutine pool.

>
> If we create and destroy coroutines for each AIO request, we're doing it
> wrong.  I doubt Kevin's doing it *that* wrong ;)
>
> Anyway, let's benchmark the real code instead of putting undue trust in
> tests/test-coroutine.c micro-benchmarks.

I don't think the micro-benchmark is being given undue trust.

It measures the direct cost of a coroutine, and that cost cannot be
avoided at all, not to mention the cost of switching stacks.

If you look up the test data I posted previously, it shows that
bypassing coroutines can increase throughput by ~50% for raw images
with Linux AIO. That is a real test case, not a micro-benchmark.


Thanks,
Ming Lei



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Markus Armbruster
Ming Lei  writes:

> On 11/28/14, Markus Armbruster  wrote:
>> Ming Lei  writes:
>>
>>> Hi Kevin,
>>>
>>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
 This improves the performance of requests because an ACB doesn't need to
 be allocated on the heap any more. It also makes the code nicer and
 smaller.
>>>
>>> I am not sure this is a good way to optimize Linux AIO:
>>>
>>> - for a raw image with some constraints, the coroutine can be avoided,
>>> since io_submit() won't sleep most of the time
>>>
>>> - handling a coroutine once takes more time than a malloc, memset and
>>> free on a small buffer, per the following test data:
>>>
>>>  --   241ns per coroutine
>>
>> What do you mean by "coroutine" here?  Create + destroy?  Yield?
>
> Please see perf_cost() in tests/test-coroutine.c

static __attribute__((noinline)) void perf_cost_func(void *opaque)
{
qemu_coroutine_yield();
}

static void perf_cost(void)
{
const unsigned long maxcycles = 40000000;
unsigned long i = 0;
double duration;
unsigned long ops;
Coroutine *co;

g_test_timer_start();
while (i++ < maxcycles) {
co = qemu_coroutine_create(perf_cost_func);
qemu_coroutine_enter(co, &i);
qemu_coroutine_enter(co, NULL);
}
duration = g_test_timer_elapsed();
ops = (long)(maxcycles / (duration * 1000));

g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
   "%luns per coroutine",
   maxcycles,
   duration, ops,
   (unsigned long)(1000000000.0 * duration) / maxcycles);
}

This tests create, enter, yield, reenter, terminate, destroy.  The cost
of create + destroy may well dominate.

If we create and destroy coroutines for each AIO request, we're doing it
wrong.  I doubt Kevin's doing it *that* wrong ;)

Anyway, let's benchmark the real code instead of putting undue trust in
tests/test-coroutine.c micro-benchmarks.



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Ming Lei
On 11/28/14, Markus Armbruster  wrote:
> Ming Lei  writes:
>
>> Hi Kevin,
>>
>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
>>> This improves the performance of requests because an ACB doesn't need to
>>> be allocated on the heap any more. It also makes the code nicer and
>>> smaller.
>>
>> I am not sure this is a good way to optimize Linux AIO:
>>
>> - for a raw image with some constraints, the coroutine can be avoided,
>> since io_submit() won't sleep most of the time
>>
>> - handling a coroutine once takes more time than a malloc, memset and
>> free on a small buffer, per the following test data:
>>
>>  --   241ns per coroutine
>
> What do you mean by "coroutine" here?  Create + destroy?  Yield?

Please see perf_cost() in tests/test-coroutine.c

Thanks,
Ming Lei



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-27 Thread Markus Armbruster
Ming Lei  writes:

> Hi Kevin,
>
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
>> This improves the performance of requests because an ACB doesn't need to
>> be allocated on the heap any more. It also makes the code nicer and
>> smaller.
>
> I am not sure this is a good way to optimize Linux AIO:
>
> - for a raw image with some constraints, the coroutine can be avoided,
> since io_submit() won't sleep most of the time
>
> - handling a coroutine once takes more time than a malloc, memset and
> free on a small buffer, per the following test data:
>
>  --   241ns per coroutine

What do you mean by "coroutine" here?  Create + destroy?  Yield?

>  --   61ns per (malloc, memset, free for 128 bytes)
>
> I still think we should figure out a fast path to avoid coroutines
> for linux-aio with raw images, otherwise it can't scale well for high
> IOPS devices.
>
> Also we can use a simple buffer pool to avoid the dynamic allocation
> easily, can't we?
[...]



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-27 Thread Ming Lei
Hi Kevin,

On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
> This improves the performance of requests because an ACB doesn't need to
> be allocated on the heap any more. It also makes the code nicer and
> smaller.

I am not sure this is a good way to optimize Linux AIO:

- for a raw image with some constraints, the coroutine can be avoided,
since io_submit() won't sleep most of the time

- handling a coroutine once takes more time than a malloc, memset and
free on a small buffer, per the following test data:

 --   241ns per coroutine
 --   61ns per (malloc, memset, free for 128 bytes)

I still think we should figure out a fast path to avoid coroutines
for linux-aio with raw images, otherwise it can't scale well for high
IOPS devices.

Also we can use a simple buffer pool to avoid the dynamic allocation
easily, can't we?

>
> As a side effect, the codepath taken by aio=threads is changed to use
> paio_submit_co(). This doesn't change the performance at this point.
>
> Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
>
>       |  aio=native           |  aio=threads
>       | before   | with patch | before   | with patch
> ------+----------+------------+----------+-----------
> run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
>
> TODO: Do some more serious benchmarking in VMs with less variance.
> Results of a quick fio run are vaguely positive.

I will do the test with Paolo's fast path approach under
VM I/O situation.

Thanks,
Ming Lei



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-27 Thread Peter Lieven

On 26.11.2014 15:46, Kevin Wolf wrote:

This improves the performance of requests because an ACB doesn't need to
be allocated on the heap any more. It also makes the code nicer and
smaller.

As a side effect, the codepath taken by aio=threads is changed to use
paio_submit_co(). This doesn't change the performance at this point.

Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:

   |  aio=native   | aio=threads
   | before   | with patch | before   | with patch
------+----------+------------+----------+-----------
run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance.
Results of a quick fio run are vaguely positive.


I still see the main-loop spun warnings with these patches applied to master.
They weren't there with the original patch from August.

~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
Sending 1000 requests, 4096 bytes each, 64 in parallel
main-loop: WARNING: I/O thread spun for 1000 iterations
Run completed in 31.947 seconds.

Peter



Signed-off-by: Kevin Wolf 
---
  block/linux-aio.c | 70 +--
  block/raw-aio.h   |  5 ++--
  block/raw-posix.c | 62 ++--
  3 files changed, 52 insertions(+), 85 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..99b259d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -12,6 +12,7 @@
  #include "qemu/queue.h"
  #include "block/raw-aio.h"
  #include "qemu/event_notifier.h"
+#include "block/coroutine.h"
  
  #include 
  
@@ -28,7 +29,7 @@

  #define MAX_QUEUED_IO  128
  
  struct qemu_laiocb {

-BlockAIOCB common;
+Coroutine *co;
  struct qemu_laio_state *ctx;
  struct iocb iocb;
  ssize_t ret;
@@ -86,9 +87,9 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
  }
  }
  }
-laiocb->common.cb(laiocb->common.opaque, ret);
  
-qemu_aio_unref(laiocb);

+laiocb->ret = ret;
+qemu_coroutine_enter(laiocb->co, NULL);
  }
  
  /* The completion BH fetches completed I/O requests and invokes their

@@ -146,30 +147,6 @@ static void qemu_laio_completion_cb(EventNotifier *e)
  }
  }
  
-static void laio_cancel(BlockAIOCB *blockacb)

-{
-struct qemu_laiocb *laiocb = (struct qemu_laiocb *)blockacb;
-struct io_event event;
-int ret;
-
-if (laiocb->ret != -EINPROGRESS) {
-return;
-}
-ret = io_cancel(laiocb->ctx->ctx, &laiocb->iocb, &event);
-laiocb->ret = -ECANCELED;
-if (ret != 0) {
-/* iocb is not cancelled, cb will be called by the event loop later */
-return;
-}
-
-laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
-}
-
-static const AIOCBInfo laio_aiocb_info = {
-.aiocb_size = sizeof(struct qemu_laiocb),
-.cancel_async   = laio_cancel,
-};
-
  static void ioq_init(LaioQueue *io_q)
  {
  io_q->size = MAX_QUEUED_IO;
@@ -243,23 +220,21 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
  return ret;
  }
  
-BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,

-int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-BlockCompletionFunc *cb, void *opaque, int type)
+int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
+int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type)
  {
  struct qemu_laio_state *s = aio_ctx;
-struct qemu_laiocb *laiocb;
-struct iocb *iocbs;
  off_t offset = sector_num * 512;
  
-laiocb = qemu_aio_get(&laio_aiocb_info, bs, cb, opaque);

-laiocb->nbytes = nb_sectors * 512;
-laiocb->ctx = s;
-laiocb->ret = -EINPROGRESS;
-laiocb->is_read = (type == QEMU_AIO_READ);
-laiocb->qiov = qiov;
-
-iocbs = &laiocb->iocb;
+struct qemu_laiocb laiocb = {
+.co = qemu_coroutine_self(),
+.nbytes = nb_sectors * 512,
+.ctx = s,
+.is_read = (type == QEMU_AIO_READ),
+.qiov = qiov,
+};
+struct iocb *iocbs = &laiocb.iocb;
+int ret;
  
  switch (type) {

  case QEMU_AIO_WRITE:
@@ -272,22 +247,21 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
  default:
  fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
  __func__, type);
-goto out_free_aiocb;
+return -EIO;
  }
-io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
+io_set_eventfd(&laiocb.iocb, event_notifier_get_fd(&s->e));
  
  if (!s->io_q.plugged) {

-if (io_submit(s->ctx, 1, &iocbs) < 0) {
-goto out_free_aiocb;
+ret = io_submit(s->ctx, 1, &iocbs);
+if (ret < 0) {
+return r