Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 28.11.2014 um 13:57 hat Kevin Wolf geschrieben:
> Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben:
> > On 26.11.2014 15:46, Kevin Wolf wrote:
> > > This improves the performance of requests because an ACB doesn't need to
> > > be allocated on the heap any more. It also makes the code nicer and
> > > smaller.
> > >
> > > As a side effect, the codepath taken by aio=threads is changed to use
> > > paio_submit_co(). This doesn't change the performance at this point.
> > >
> > > Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
> > >
> > >       | aio=native           | aio=threads
> > >       | before  | with patch | before  | with patch
> > > ------+---------+------------+---------+------------
> > > run 1 | 29.921s | 26.932s    | 35.286s | 35.447s
> > > run 2 | 29.793s | 26.252s    | 35.276s | 35.111s
> > > run 3 | 30.186s | 27.114s    | 35.042s | 34.921s
> > > run 4 | 30.425s | 26.600s    | 35.169s | 34.968s
> > > run 5 | 30.041s | 26.263s    | 35.224s | 35.000s
> > >
> > > TODO: Do some more serious benchmarking in VMs with less variance.
> > > Results of a quick fio run are vaguely positive.
> >
> > I still see the main-loop spun warnings with this patch applied to master.
> > It wasn't there with the original patch from August.
> >
> > ~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
> > Sending 1000 requests, 4096 bytes each, 64 in parallel
> > main-loop: WARNING: I/O thread spun for 1000 iterations
> > Run completed in 31.947 seconds.
>
> Yes, I still need to bisect that. The 'qemu-img bench' numbers above are
> actually also from August; we have regressed by about a second since, and
> I haven't found the reason for that yet.

Did the first part of this now. The commit that introduced the "spun"
message is 2cdff7f6 ('linux-aio: avoid deadlock in nested aio_poll()
calls'). The following patch doesn't make it go away completely, but now I
only see it occasionally, during perhaps every other run, instead of
immediately after starting qemu-img bench. It's probably a (very) minor
performance optimisation, too.

Kevin

diff --git a/block/linux-aio.c b/block/linux-aio.c
index fd8f0e4..1a0ec62 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -136,6 +136,8 @@ static void qemu_laio_completion_bh(void *opaque)
         qemu_laio_process_completion(s, laiocb);
     }
+
+    qemu_bh_cancel(s->completion_bh);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -143,7 +145,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
 
     if (event_notifier_test_and_clear(&s->e)) {
-        qemu_bh_schedule(s->completion_bh);
+        qemu_laio_completion_bh(s);
     }
 }
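The effect of that qemu_bh_cancel() can be illustrated outside QEMU with a
toy event loop. This is only a sketch with made-up names (BH,
loop_run_once, etc. are not QEMU APIs): it shows why processing completions
directly in the notifier callback and cancelling the still-scheduled bottom
half saves the extra main-loop iteration that fed the spun warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a QEMU-style bottom half: the event loop runs the
 * callback once in every iteration in which it is scheduled. */
typedef struct BH {
    void (*cb)(void *opaque);
    void *opaque;
    bool scheduled;
} BH;

static int loop_iterations;        /* event-loop spins consumed */
static int completions_processed;  /* times the completion code ran */

static void bh_schedule(BH *bh) { bh->scheduled = true; }
static void bh_cancel(BH *bh)   { bh->scheduled = false; }

/* One main-loop iteration: run the BH if it is pending. */
static void loop_run_once(BH *bh)
{
    loop_iterations++;
    if (bh->scheduled) {
        bh->scheduled = false;
        bh->cb(bh->opaque);
    }
}

static BH completion_bh;

/* Like the patched qemu_laio_completion_bh(): drain completions, then
 * cancel any still-pending schedule of ourselves. */
static void completion_bh_cb(void *opaque)
{
    (void)opaque;
    completions_processed++;
    bh_cancel(&completion_bh);
}

/* How many loop iterations does one completion cost when the notifier
 * callback processes it inline (the patch) versus only scheduling the
 * BH (the old code)? */
static int iterations_per_completion(bool inline_processing)
{
    loop_iterations = 0;
    completions_processed = 0;
    completion_bh.cb = completion_bh_cb;

    /* the eventfd fires: */
    if (inline_processing) {
        completion_bh_cb(NULL);        /* qemu_laio_completion_bh(s) */
    } else {
        bh_schedule(&completion_bh);   /* qemu_bh_schedule() */
    }
    /* spin the loop until nothing is pending */
    while (completion_bh.scheduled) {
        loop_run_once(&completion_bh);
    }
    return loop_iterations;
}
```

In this model the inline path needs zero extra loop iterations per
completion, while schedule-only needs one, which is exactly the work the
spun-warning heuristic counts.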
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben:
> On 26.11.2014 15:46, Kevin Wolf wrote:
> > This improves the performance of requests because an ACB doesn't need to
> > be allocated on the heap any more. It also makes the code nicer and
> > smaller.
> >
> > As a side effect, the codepath taken by aio=threads is changed to use
> > paio_submit_co(). This doesn't change the performance at this point.
> >
> > Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
> >
> >       | aio=native           | aio=threads
> >       | before  | with patch | before  | with patch
> > ------+---------+------------+---------+------------
> > run 1 | 29.921s | 26.932s    | 35.286s | 35.447s
> > run 2 | 29.793s | 26.252s    | 35.276s | 35.111s
> > run 3 | 30.186s | 27.114s    | 35.042s | 34.921s
> > run 4 | 30.425s | 26.600s    | 35.169s | 34.968s
> > run 5 | 30.041s | 26.263s    | 35.224s | 35.000s
> >
> > TODO: Do some more serious benchmarking in VMs with less variance.
> > Results of a quick fio run are vaguely positive.
>
> I still see the main-loop spun warnings with this patch applied to master.
> It wasn't there with the original patch from August.
>
> ~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
> Sending 1000 requests, 4096 bytes each, 64 in parallel
> main-loop: WARNING: I/O thread spun for 1000 iterations
> Run completed in 31.947 seconds.

Yes, I still need to bisect that. The 'qemu-img bench' numbers above are
actually also from August; we have regressed by about a second since, and I
haven't found the reason for that yet.

Kevin
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 28.11.2014 um 03:59 hat Ming Lei geschrieben:
> Hi Kevin,
>
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf wrote:
> > This improves the performance of requests because an ACB doesn't need to
> > be allocated on the heap any more. It also makes the code nicer and
> > smaller.
>
> I am not sure it is a good way for linux-aio optimisation:
>
> - for a raw image with some constraints, the coroutine can be avoided
>   since io_submit() won't sleep most of the time
>
> - handling one coroutine takes much more time than a malloc, memset and
>   free on a small buffer, per the following test data:
>
>   -- 241ns per coroutine
>   -- 61ns per (malloc, memset, free for 128 bytes)

Please finally stop making comparisons between completely unrelated things
and trying to make a case against coroutines out of it. It simply doesn't
make any sense.

The truth is that in the 'qemu-img bench' case, as well as in the highest
performing VM setup for Peter and me, the practically existing coroutine
based git branches perform better than the practically existing bypass
branches.

If you think that theoretically the bypass branches must be better, show us
the patches and benchmarks. If you can't, let's merge the coroutine
improvements (which improve more than just the case of raw images using no
block layer features, including cases that benefit the average user) and be
done.

> I still think we should figure out a fast path to avoid the coroutine
> for linux-aio with raw images, otherwise it can't scale well for high
> IOPS devices.
>
> Also we can use a simple buffer pool to avoid the dynamic allocation
> easily, can't we?

Yes, the change to g_slice_alloc() was a bad move performance-wise.

> > As a side effect, the codepath taken by aio=threads is changed to use
> > paio_submit_co(). This doesn't change the performance at this point.
> >
> > Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
> >
> >       | aio=native           | aio=threads
> >       | before  | with patch | before  | with patch
> > ------+---------+------------+---------+------------
> > run 1 | 29.921s | 26.932s    | 35.286s | 35.447s
> > run 2 | 29.793s | 26.252s    | 35.276s | 35.111s
> > run 3 | 30.186s | 27.114s    | 35.042s | 34.921s
> > run 4 | 30.425s | 26.600s    | 35.169s | 34.968s
> > run 5 | 30.041s | 26.263s    | 35.224s | 35.000s
> >
> > TODO: Do some more serious benchmarking in VMs with less variance.
> > Results of a quick fio run are vaguely positive.
>
> I will do the test with Paolo's fast path approach under a VM I/O
> situation.

Currently, the best thing to compare it against is probably Peter's git
branch at https://github.com/plieven/qemu.git, branch perf_master2. This
patch is only a first step in a whole series of possible optimisations.

Kevin
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 28/11/2014 03:59, Ming Lei wrote:
> Hi Kevin,
>
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf wrote:
>> This improves the performance of requests because an ACB doesn't need to
>> be allocated on the heap any more. It also makes the code nicer and
>> smaller.
>
> I am not sure it is a good way for linux-aio optimisation:
>
> - for a raw image with some constraints, the coroutine can be avoided
>   since io_submit() won't sleep most of the time
>
> - handling one coroutine takes much more time than a malloc, memset and
>   free on a small buffer, per the following test data:
>
>   -- 241ns per coroutine
>   -- 61ns per (malloc, memset, free for 128 bytes)
>
> I still think we should figure out a fast path to avoid the coroutine
> for linux-aio with raw images, otherwise it can't scale well for high
> IOPS devices.

sigsetjmp/siglongjmp are just ~60 instructions; they cannot account for
180ns (600 clock cycles). The cost of creating and destroying the
coroutine must come from somewhere else.

Let's just try something else: let's remove the pool mutex, as suggested by
Peter, but in a way that works even with non-ioeventfd backends. I still
believe we will end up with some kind of coroutine bypass scheme (even
coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv directly can
help), but hey, it cannot hurt to optimize hot code.

The patch below has a single pool where coroutines are placed on
destruction, and a per-thread allocation pool. Whenever the destruction
pool is big enough, the next thread that runs out of coroutines will steal
from it instead of making a new coroutine.

If this works, it would be beautiful in two ways:

1) the allocation does not have to do any atomic operation in the fast
   path; it's entirely using thread-local storage. Once every
   POOL_BATCH_SIZE allocations it will do a single atomic_xchg. Release
   does an atomic_cmpxchg loop, which hopefully doesn't cause any
   starvation, and an atomic_inc.

2) in theory this should be completely adaptive. The number of coroutines
   around should be a little more than POOL_BATCH_SIZE * number of
   allocating threads, so this also removes
   qemu_coroutine_adjust_pool_size. (The previous pool size was
   POOL_BATCH_SIZE * number of block backends, so it was a bit more
   generous.)

The patch below is very raw, and untested beyond tests/test-coroutine.
There may be some premature optimization (not using GPrivate, even though I
need it to run the per-thread destructor), but it was easy enough.

Ming, Kevin, can you benchmark it?

Related to this, we can see if __thread beats GPrivate in
coroutine-ucontext.c. Every clock cycle counts (600 clock cycles are a 2%
improvement at 3 GHz and 100 kiops), so we can see what we get.

Paolo

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index d433b90..6a01e2f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -191,6 +191,17 @@ struct {                                                \
 #define QSLIST_INSERT_HEAD(head, elm, field) do {                        \
     (elm)->field.sle_next = (head)->slh_first;                           \
     (head)->slh_first = (elm);                                           \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                 \
+    do {                                                                 \
+        (elm)->field.sle_next = (head)->slh_first;                       \
+    } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next,   \
+                            (elm)) != (elm)->field.sle_next);            \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_MOVE_ATOMIC(dest, src) do {                               \
+    (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);            \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index bd574aa..60d761f 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -15,35 +15,52 @@
 #include "trace.h"
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/atomic.h"
 #include "block/coroutine.h"
 #include "block/coroutine_int.h"
 
 enum {
-    POOL_DEFAULT_SIZE = 64,
+    POOL_BATCH_SIZE = 64,
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
 static unsigned int pool_size;
-static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+
+/* The GPrivate is only used to invoke coroutine_pool_cleanup. */
+static void coroutine_pool_cleanup(void *value);
+static GPrivate dummy_key = G_PRIVATE_INIT(coroutine_pool_cleanup);
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mut
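The allocation scheme in the (truncated) patch above can be sketched
self-contained with C11 atomics in place of QEMU's atomic_*() macros and
QSLIST. Names (PoolItem, pool_alloc, pool_release) are illustrative, not
QEMU's; the point is the two-level structure: a thread-local free list on
the fast path, and a global release pool touched only with a lock-free
Treiber-stack push on free and a single atomic exchange to steal the whole
pool in one go.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <assert.h>

#define POOL_BATCH_SIZE 64

typedef struct PoolItem {
    struct PoolItem *next;
} PoolItem;

/* Global release pool: objects freed by any thread land here. */
static _Atomic(PoolItem *) release_pool;
static atomic_uint release_pool_size;

/* Per-thread allocation pool: no atomics needed on the fast path. */
static _Thread_local PoolItem *alloc_pool;

static void pool_release(PoolItem *item)
{
    /* QSLIST_INSERT_HEAD_ATOMIC: retry the CAS until our push lands. */
    PoolItem *old = atomic_load(&release_pool);
    do {
        item->next = old;
    } while (!atomic_compare_exchange_weak(&release_pool, &old, item));
    atomic_fetch_add(&release_pool_size, 1);   /* the atomic_inc */
}

static PoolItem *pool_alloc(void)
{
    if (!alloc_pool &&
        atomic_load(&release_pool_size) > POOL_BATCH_SIZE) {
        /* QSLIST_MOVE_ATOMIC: steal the whole release pool at once. */
        alloc_pool = atomic_exchange(&release_pool, NULL);
        atomic_store(&release_pool_size, 0);
    }
    if (alloc_pool) {                 /* fast path: thread-local pop */
        PoolItem *item = alloc_pool;
        alloc_pool = item->next;
        return item;
    }
    return calloc(1, sizeof(PoolItem));   /* slow path: really allocate */
}
```

A releasing thread does one CAS loop plus one increment; an allocating
thread touches shared state only once per POOL_BATCH_SIZE allocations,
which is exactly the property Paolo claims for the fast path.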
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 11/28/14, Markus Armbruster wrote:
> Ming Lei writes:
>
>> On 11/28/14, Markus Armbruster wrote:
>>> Ming Lei writes:
>>>
>>> Hi Kevin,
>>>
>>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf wrote:
>>>> This improves the performance of requests because an ACB doesn't need
>>>> to be allocated on the heap any more. It also makes the code nicer and
>>>> smaller.
>>>
>>> I am not sure it is a good way for linux-aio optimisation:
>>>
>>> - for a raw image with some constraints, the coroutine can be avoided
>>>   since io_submit() won't sleep most of the time
>>>
>>> - handling one coroutine takes much more time than a malloc, memset and
>>>   free on a small buffer, per the following test data:
>>>
>>>   -- 241ns per coroutine
>>
>> What do you mean by "coroutine" here? Create + destroy? Yield?
>
> Please see perf_cost() in tests/test-coroutine.c

> static __attribute__((noinline)) void perf_cost_func(void *opaque)
> {
>     qemu_coroutine_yield();
> }
>
> static void perf_cost(void)
> {
>     const unsigned long maxcycles = 4000;
>     unsigned long i = 0;
>     double duration;
>     unsigned long ops;
>     Coroutine *co;
>
>     g_test_timer_start();
>     while (i++ < maxcycles) {
>         co = qemu_coroutine_create(perf_cost_func);
>         qemu_coroutine_enter(co, &i);
>         qemu_coroutine_enter(co, NULL);
>     }
>     duration = g_test_timer_elapsed();
>     ops = (long)(maxcycles / (duration * 1000));
>
>     g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
>                    "%luns per coroutine",
>                    maxcycles,
>                    duration, ops,
>                    (unsigned long)(10 * duration) / maxcycles);
> }
>
> This tests create, enter, yield, reenter, terminate, destroy. The cost
> of create + destroy may well dominate.

Actually there shouldn't have been much cost from create and destroy,
thanks to the coroutine pool.

> If we create and destroy coroutines for each AIO request, we're doing it
> wrong. I doubt Kevin's doing it *that* wrong ;)
>
> Anyway, let's benchmark the real code instead of putting undue trust in
> tests/test-coroutine.c micro-benchmarks.

It isn't undue trust in the micro-benchmark. That is the direct cost of the
coroutine, and that cost can't be avoided at all, not to mention the cost
of switching stacks.

If you google the test data I posted previously, it shows that bypassing
the coroutine can increase throughput by ~50% for raw images in the case of
linux-aio. That is a real test case, not a micro-benchmark.

Thanks,
Ming Lei
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Ming Lei writes:

> On 11/28/14, Markus Armbruster wrote:
>> Ming Lei writes:
>>
>>> Hi Kevin,
>>>
>>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf wrote:
>>>> This improves the performance of requests because an ACB doesn't need
>>>> to be allocated on the heap any more. It also makes the code nicer and
>>>> smaller.
>>>
>>> I am not sure it is a good way for linux-aio optimisation:
>>>
>>> - for a raw image with some constraints, the coroutine can be avoided
>>>   since io_submit() won't sleep most of the time
>>>
>>> - handling one coroutine takes much more time than a malloc, memset and
>>>   free on a small buffer, per the following test data:
>>>
>>>   -- 241ns per coroutine
>>
>> What do you mean by "coroutine" here? Create + destroy? Yield?
>
> Please see perf_cost() in tests/test-coroutine.c

static __attribute__((noinline)) void perf_cost_func(void *opaque)
{
    qemu_coroutine_yield();
}

static void perf_cost(void)
{
    const unsigned long maxcycles = 4000;
    unsigned long i = 0;
    double duration;
    unsigned long ops;
    Coroutine *co;

    g_test_timer_start();
    while (i++ < maxcycles) {
        co = qemu_coroutine_create(perf_cost_func);
        qemu_coroutine_enter(co, &i);
        qemu_coroutine_enter(co, NULL);
    }
    duration = g_test_timer_elapsed();
    ops = (long)(maxcycles / (duration * 1000));

    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
                   "%luns per coroutine",
                   maxcycles,
                   duration, ops,
                   (unsigned long)(10 * duration) / maxcycles);
}

This tests create, enter, yield, reenter, terminate, destroy. The cost of
create + destroy may well dominate.

If we create and destroy coroutines for each AIO request, we're doing it
wrong. I doubt Kevin's doing it *that* wrong ;)

Anyway, let's benchmark the real code instead of putting undue trust in
tests/test-coroutine.c micro-benchmarks.
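The phases perf_cost() lumps together can be separated with a minimal
coroutine of one's own. Below is a sketch using POSIX ucontext (assuming a
Linux/glibc environment; names like Coro, coro_enter and coro_yield are
invented, and QEMU's coroutine-ucontext.c additionally switches via
sigsetjmp/siglongjmp and keeps a pool, so its per-phase costs differ). It
makes explicit that one perf_cost() iteration is create + enter + yield +
re-enter + terminate + destroy.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/* Minimal ucontext-based coroutine, just enough to mimic what
 * tests/test-coroutine.c's perf_cost() exercises. */
typedef struct Coro {
    ucontext_t caller, ctx;
    char stack[64 * 1024];
    void (*entry)(struct Coro *);
    int done;
} Coro;

static Coro *current;

static void coro_trampoline(void)
{
    Coro *co = current;
    co->entry(co);
    co->done = 1;
    swapcontext(&co->ctx, &co->caller);   /* final return to caller */
}

static Coro *coro_create(void (*entry)(Coro *))
{
    Coro *co = calloc(1, sizeof(*co));    /* the "create" cost */
    co->entry = entry;
    getcontext(&co->ctx);
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = sizeof(co->stack);
    co->ctx.uc_link = NULL;
    makecontext(&co->ctx, coro_trampoline, 0);
    return co;
}

static void coro_enter(Coro *co)          /* the "switch" cost */
{
    current = co;
    swapcontext(&co->caller, &co->ctx);
}

static void coro_yield(Coro *co)
{
    swapcontext(&co->ctx, &co->caller);
}

/* Same shape as perf_cost_func(): yield once mid-"request", like a
 * request handing control back while its I/O is in flight. */
static int steps;
static void entry_fn(Coro *co)
{
    steps++;           /* "submit" */
    coro_yield(co);
    steps++;           /* "complete" */
}
```

Timing coro_create/free separately from the two coro_enter calls would
answer Markus's question of whether create + destroy dominates the 241ns.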
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 11/28/14, Markus Armbruster wrote:
> Ming Lei writes:
>
>> Hi Kevin,
>>
>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf wrote:
>>> This improves the performance of requests because an ACB doesn't need to
>>> be allocated on the heap any more. It also makes the code nicer and
>>> smaller.
>>
>> I am not sure it is a good way for linux-aio optimisation:
>>
>> - for a raw image with some constraints, the coroutine can be avoided
>>   since io_submit() won't sleep most of the time
>>
>> - handling one coroutine takes much more time than a malloc, memset and
>>   free on a small buffer, per the following test data:
>>
>>   -- 241ns per coroutine
>
> What do you mean by "coroutine" here? Create + destroy? Yield?

Please see perf_cost() in tests/test-coroutine.c

Thanks,
Ming Lei
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Ming Lei writes:

> Hi Kevin,
>
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf wrote:
>> This improves the performance of requests because an ACB doesn't need to
>> be allocated on the heap any more. It also makes the code nicer and
>> smaller.
>
> I am not sure it is a good way for linux-aio optimisation:
>
> - for a raw image with some constraints, the coroutine can be avoided
>   since io_submit() won't sleep most of the time
>
> - handling one coroutine takes much more time than a malloc, memset and
>   free on a small buffer, per the following test data:
>
>   -- 241ns per coroutine

What do you mean by "coroutine" here? Create + destroy? Yield?

>   -- 61ns per (malloc, memset, free for 128 bytes)
>
> I still think we should figure out a fast path to avoid the coroutine
> for linux-aio with raw images, otherwise it can't scale well for high
> IOPS devices.
>
> Also we can use a simple buffer pool to avoid the dynamic allocation
> easily, can't we?

[...]
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Hi Kevin,

On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf wrote:
> This improves the performance of requests because an ACB doesn't need to
> be allocated on the heap any more. It also makes the code nicer and
> smaller.

I am not sure it is a good way for linux-aio optimisation:

- for a raw image with some constraints, the coroutine can be avoided
  since io_submit() won't sleep most of the time

- handling one coroutine takes much more time than a malloc, memset and
  free on a small buffer, per the following test data:

  -- 241ns per coroutine
  -- 61ns per (malloc, memset, free for 128 bytes)

I still think we should figure out a fast path to avoid the coroutine
for linux-aio with raw images, otherwise it can't scale well for high
IOPS devices.

Also we can use a simple buffer pool to avoid the dynamic allocation
easily, can't we?

> As a side effect, the codepath taken by aio=threads is changed to use
> paio_submit_co(). This doesn't change the performance at this point.
>
> Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
>
>       | aio=native           | aio=threads
>       | before  | with patch | before  | with patch
> ------+---------+------------+---------+------------
> run 1 | 29.921s | 26.932s    | 35.286s | 35.447s
> run 2 | 29.793s | 26.252s    | 35.276s | 35.111s
> run 3 | 30.186s | 27.114s    | 35.042s | 34.921s
> run 4 | 30.425s | 26.600s    | 35.169s | 34.968s
> run 5 | 30.041s | 26.263s    | 35.224s | 35.000s
>
> TODO: Do some more serious benchmarking in VMs with less variance.
> Results of a quick fio run are vaguely positive.

I will do the test with Paolo's fast path approach under a VM I/O
situation.

Thanks,
Ming Lei
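The "simple buffer pool" idea can be sketched as a freelist threaded
through a static slab. This is a hypothetical, single-threaded
illustration (names and sizes are invented; 128 bytes just matches the
figure quoted above, and a real QEMU version would have to fit the
AioContext/threading model): the common case reuses a buffer with a pointer
pop instead of a malloc/free pair.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 128
#define POOL_CAP 64

/* While a buffer is free, its first bytes hold the freelist link;
 * while in use, the caller owns all BUF_SIZE bytes. */
typedef union PoolBuf {
    union PoolBuf *next;
    char data[BUF_SIZE];
} PoolBuf;

static PoolBuf slab[POOL_CAP];
static PoolBuf *free_list;
static int initialized;

static void *buf_get(void)
{
    if (!initialized) {               /* lazily thread the slab once */
        for (int i = 0; i < POOL_CAP; i++) {
            slab[i].next = free_list;
            free_list = &slab[i];
        }
        initialized = 1;
    }
    if (free_list) {                  /* fast path: pop a recycled buffer */
        PoolBuf *b = free_list;
        free_list = b->next;
        memset(b, 0, BUF_SIZE);       /* callers expect zeroed memory */
        return b;
    }
    return calloc(1, BUF_SIZE);       /* pool exhausted: fall back */
}

static void buf_put(void *p)
{
    PoolBuf *b = p;
    if (b >= slab && b < slab + POOL_CAP) {
        b->next = free_list;          /* back onto the free list */
        free_list = b;
    } else {
        free(b);                      /* came from the fallback path */
    }
}
```

This removes the allocator from the per-request path, but note it only
addresses the 61ns malloc/memset/free side of Ming Lei's comparison, not
the coroutine cost the thread is arguing about.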
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 26.11.2014 15:46, Kevin Wolf wrote:

This improves the performance of requests because an ACB doesn't need to
be allocated on the heap any more. It also makes the code nicer and
smaller.

As a side effect, the codepath taken by aio=threads is changed to use
paio_submit_co(). This doesn't change the performance at this point.

Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:

      | aio=native           | aio=threads
      | before  | with patch | before  | with patch
------+---------+------------+---------+------------
run 1 | 29.921s | 26.932s    | 35.286s | 35.447s
run 2 | 29.793s | 26.252s    | 35.276s | 35.111s
run 3 | 30.186s | 27.114s    | 35.042s | 34.921s
run 4 | 30.425s | 26.600s    | 35.169s | 34.968s
run 5 | 30.041s | 26.263s    | 35.224s | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance.
Results of a quick fio run are vaguely positive.

I still see the main-loop spun warnings with this patch applied to master.
It wasn't there with the original patch from August.

~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
Sending 1000 requests, 4096 bytes each, 64 in parallel
main-loop: WARNING: I/O thread spun for 1000 iterations
Run completed in 31.947 seconds.
Peter

Signed-off-by: Kevin Wolf
---
 block/linux-aio.c | 70 +--
 block/raw-aio.h   |  5 ++--
 block/raw-posix.c | 62 ++--
 3 files changed, 52 insertions(+), 85 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..99b259d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -12,6 +12,7 @@
 #include "qemu/queue.h"
 #include "block/raw-aio.h"
 #include "qemu/event_notifier.h"
+#include "block/coroutine.h"
 
 #include <libaio.h>
 
@@ -28,7 +29,7 @@
 #define MAX_QUEUED_IO 128
 
 struct qemu_laiocb {
-    BlockAIOCB common;
+    Coroutine *co;
     struct qemu_laio_state *ctx;
     struct iocb iocb;
     ssize_t ret;
@@ -86,9 +87,9 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
             }
         }
     }
-    laiocb->common.cb(laiocb->common.opaque, ret);
 
-    qemu_aio_unref(laiocb);
+    laiocb->ret = ret;
+    qemu_coroutine_enter(laiocb->co, NULL);
 }
 
 /* The completion BH fetches completed I/O requests and invokes their
@@ -146,30 +147,6 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     }
 }
 
-static void laio_cancel(BlockAIOCB *blockacb)
-{
-    struct qemu_laiocb *laiocb = (struct qemu_laiocb *)blockacb;
-    struct io_event event;
-    int ret;
-
-    if (laiocb->ret != -EINPROGRESS) {
-        return;
-    }
-    ret = io_cancel(laiocb->ctx->ctx, &laiocb->iocb, &event);
-    laiocb->ret = -ECANCELED;
-    if (ret != 0) {
-        /* iocb is not cancelled, cb will be called by the event loop later */
-        return;
-    }
-
-    laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
-}
-
-static const AIOCBInfo laio_aiocb_info = {
-    .aiocb_size   = sizeof(struct qemu_laiocb),
-    .cancel_async = laio_cancel,
-};
-
 static void ioq_init(LaioQueue *io_q)
 {
     io_q->size = MAX_QUEUED_IO;
@@ -243,23 +220,21 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
     return ret;
 }
 
-BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
-        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-        BlockCompletionFunc *cb, void *opaque, int type)
+int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type)
 {
     struct qemu_laio_state *s = aio_ctx;
-    struct qemu_laiocb *laiocb;
-    struct iocb *iocbs;
     off_t offset = sector_num * 512;
 
-    laiocb = qemu_aio_get(&laio_aiocb_info, bs, cb, opaque);
-    laiocb->nbytes = nb_sectors * 512;
-    laiocb->ctx = s;
-    laiocb->ret = -EINPROGRESS;
-    laiocb->is_read = (type == QEMU_AIO_READ);
-    laiocb->qiov = qiov;
-
-    iocbs = &laiocb->iocb;
+    struct qemu_laiocb laiocb = {
+        .co = qemu_coroutine_self(),
+        .nbytes = nb_sectors * 512,
+        .ctx = s,
+        .is_read = (type == QEMU_AIO_READ),
+        .qiov = qiov,
+    };
+    struct iocb *iocbs = &laiocb.iocb;
+    int ret;
 
     switch (type) {
     case QEMU_AIO_WRITE:
@@ -272,22 +247,21 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
     default:
         fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
                 __func__, type);
-        goto out_free_aiocb;
+        return -EIO;
     }
-    io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
+    io_set_eventfd(&laiocb.iocb, event_notifier_get_fd(&s->e));
 
     if (!s->io_q.plugged) {
-        if (io_submit(s->ctx, 1, &iocbs) < 0) {
-            goto out_free_aiocb;
+        ret = io_submit(s->ctx, 1, &iocbs);
+        if (ret < 0) {
+            return r