Hello Wangzhike,

I failed to update the case with the latest analysis, so here it is. I was
able to isolate the offending commit with a bisection. Feedback on this
would also be appreciated.

## BISECTION

# git bisection log
#
# bad: [a8c40fa2d667e585382080db36ac44e216b37a1c] Update version for v2.5.0 release
# good: [e5b3a24181ea0cebf1c5b20f44d016311b7048f0] Update version for v2.3.0 release
# bad: [6b324b3e5906fd9a9ce7f4f24decd1f1c7afde97] Merge remote-tracking branch 'remotes/stefanha/tags/net-pull-request' into staging
# good: [a3d586f704609a45b6037534cb2f34da5dfd8895] dma/rc4030: create custom DMA address space
# good: [54f3223730736fca1e6e89bb7f99c4f8432fdabb] ahci: factor ncq_finish out of ncq_cb
# good: [e46e1a74ef482f1ef773e750df9654ef4442ca29] target-arm: Fix broken SCTLR_EL3 reset
# bad: [776f87845137a9b300a4815ba6bf6879310795aa] Merge remote-tracking branch 'remotes/mjt/tags/pull-trivial-patches-2015-07-27' into staging
# good: [21a03d17f2edb1e63f7137d97ba355cc6f19d79f] AioContext: fix broken placement of event_notifier_test_and_clear
# bad: [e40db4c6d391419c0039fe274c74df32a6ca1a28] Merge remote-tracking branch 'remotes/jnsnow/tags/cve-2015-5154-pull-request' into staging
# bad: [30fdfae49d53cfc678859095e49ac60b79562d6f] Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20150723' into staging
# bad: [12e21eb088a51161c78ee39ed54ac56ebcff4243] Merge remote-tracking branch 'remotes/ehabkost/tags/numa-pull-request' into staging
# bad: [dc94bd9166af5236a56bd5bb06845911915a925c] Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging
# good: [b9c46307996856d03ddc1527468ff5401ac03a79] Merge remote-tracking branch 'remotes/mdroth/tags/qga-pull-2015-07-21-tag' into staging
# bad: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
# first bad commit: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
#

commit 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3 (HEAD, refs/bisect/bad)
Author: Paolo Bonzini <[email protected]>
Date:   Tue Jul 21 16:07:53 2015 +0200

    AioContext: optimize clearing the EventNotifier

    It is pretty rare for aio_notify to actually set the EventNotifier.  It
    can happen with worker threads such as thread-pool.c's, but otherwise it
    should never be set thanks to the ctx->notify_me optimization.  The
    previous patch, unfortunately, added an unconditional call to
    event_notifier_test_and_clear; now add a userspace fast path that
    avoids the call.

    Note that it is not possible to do the same with event_notifier_set;
    it would break, as proved (again) by the included formal model.

    This patch survived over 3000 reboots on aarch64 KVM.

    Signed-off-by: Paolo Bonzini <[email protected]>
    Reviewed-by: Fam Zheng <[email protected]>
    Tested-by: Richard W.M. Jones <[email protected]>
    Message-id: [email protected]
    Signed-off-by: Stefan Hajnoczi <[email protected]>

## UNDERSTANDING

The basic logic of AioContext before this change:

QEMU has its own asynchronous I/O implementation, whose master structure
is referred to as the "AioContext". This asynchronous I/O subsystem is
used by different parts of QEMU (the main loop and the newer I/O threads,
which are also based on AioContext). AioContext is a GSource
implementation (from GLib): it manages different kinds of file descriptor
structures (polling them) and calls their callbacks when work needs to be
re-scheduled (bottom halves), when data is present on the file
descriptors (pollfd), or after timeouts. There are callbacks for
RFifoLocks (similar to mutexes, but recursive), AioHandlers (which use
poll to monitor file descriptors), QEMU's deferred-work implementation
called "bottom halves" (much like the kernel's bottom-half idea), and
notification events.

The mechanism AioContext used, in a simplified form, is this:

- aio is basically a scheduler where you can register callbacks.
- aio first tries to dispatch bottom halves and then the aio handlers.
- the aio structure has fields for attention/notification: notify_me, notified and notifier (see the field sketch right after this list).
- through these notification fields the aio subsystem knows, for example, whether aio_poll() may block or needs to be woken up.
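
For reference, here is a trimmed-down field sketch (my own, not copied from QEMU; the real AioContext has many more members and somewhat different types):

#include <stdbool.h>
#include <stdint.h>

/* Simplified view of the notification-related fields only. */
typedef struct {
    int rfd, wfd;                 /* fd pair (or eventfd) used to wake up poll() */
} EventNotifierSketch;

typedef struct {
    /* ... bottom-half list, fd handlers, timers, locks ... */
    uint32_t notify_me;           /* non-zero while aio_poll()/prepare may block */
    bool notified;                /* set by aio_notify(), cleared by the "ack" */
    EventNotifierSketch notifier; /* written by aio_notify() to wake up poll() */
} AioContextSketch;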

Example (a self-contained code sketch of this flow follows the list):

- an aio user creates a bottom half with a specific callback and schedules it.
- the qemu_bh_schedule() function already calls aio_notify():
  - it sets the "notified" variable to true.
  - it signals the event notifier (notifier) by writing a "1" into it.
- the aio subsystem "acks" this event when:
  - aio_ctx_check() runs before the next iteration (GSource model).
  - aio_poll(), the core loop function, runs.
- the "ack" consists of:
  - setting "notified" back to false.
  - clearing the event notifier (notifier).
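
To make that flow concrete, here is a small standalone sketch (my own code, not QEMU's; it uses a plain Linux eventfd where QEMU has its EventNotifier abstraction) of the pre-change notify/ack pattern:

/* notify_ack_sketch.c - illustration only; build with: cc notify_ack_sketch.c */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

static int efd;                       /* stands in for ctx->notifier */
static _Atomic bool notified;         /* stands in for ctx->notified */

/* Roughly what qemu_bh_schedule() -> aio_notify() does. */
static void notify(void)
{
    uint64_t one = 1;
    atomic_store(&notified, true);
    if (write(efd, &one, sizeof(one)) < 0) {   /* wakes up a blocked poll() */
        perror("write");
    }
}

/* Roughly the "ack" done from aio_ctx_check()/aio_poll() before the change:
 * the notifier is drained unconditionally. */
static void ack(void)
{
    uint64_t cnt;
    atomic_store(&notified, false);
    if (read(efd, &cnt, sizeof(cnt)) < 0) {    /* drain the notifier */
        perror("read");
    }
}

int main(void)
{
    efd = eventfd(0, EFD_NONBLOCK);
    notify();                         /* a bottom half was scheduled */
    ack();                            /* the event loop acknowledged it */
    printf("notified is now %d\n", (int)atomic_load(&notified));
    return 0;
}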

With that said, when observing the change more carefully:

- event_notifier_test_and_clear() is no longer called directly.
- the event notifier is now cleaned up through a different logic (sketched after this list).
- that logic by itself does not seem to be the problem:
  - the event notifier is a struct with 2 file descriptors (to be read) in it.
  - clearing it right when notified or on the next poll() makes no real difference.
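
Continuing the earlier sketch, this is my reading of the new cleanup logic (a reconstruction based on the "+" lines quoted in the next section, not a verbatim copy of the commit): the notifier is only drained when a notification was actually recorded.

/* Post-change "ack" for the sketch above: atomic_exchange() stands in for
 * QEMU's atomic_xchg(), read() for event_notifier_test_and_clear().  The
 * common case (no notification pending) now skips the read() entirely. */
static void ack_new(void)
{
    if (atomic_exchange(&notified, false)) {
        uint64_t cnt;
        if (read(efd, &cnt, sizeof(cnt)) < 0) {
            perror("read");
        }
    }
}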

## ANALYSIS (based on assumptions, since I'm not reproducing this locally)

# Probable cause

Unfortunately, synchronizing the new variable required some barriers:

+ atomic_mb_set(&ctx->notified, true);

#ifndef atomic_mb_set
#define atomic_mb_set(ptr, i)  do {         \
    smp_wmb();                              \
    atomic_set(ptr, i);                     \
    smp_mb();                               \
} while (0)
#endif

The "smp_wmb()" macro is either 1 or 2 compiler barrier (depending on
the architecture). I believe that x86_64 is using __sync_synchronize()
directive, causing a full HW barrier to happen (not only an out-of-order
correction, but, a real barrier). The "smp_mb()" is the same thing, 1
full HW barrier. So, for this particular change, we added at least two
full HW barriers (bigger cost for sensitive workload).
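
As an illustration of that cost, here is the quoted macro rewritten with C11 atomics (my own rendering, not QEMU code; the exact mapping of smp_wmb()/smp_mb() depends on the architecture and the build, but on x86_64 the trailing full fence is the expensive part, typically an MFENCE or a LOCK-prefixed instruction):

#include <stdatomic.h>
#include <stdbool.h>

/* Approximate equivalent of atomic_mb_set(&ctx->notified, true). */
static void mb_set_sketch(_Atomic bool *ptr, bool value)
{
    atomic_thread_fence(memory_order_release);   /* plays the role of smp_wmb() */
    atomic_store_explicit(ptr, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* plays the role of smp_mb() */
}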

and

+ if (atomic_xchg(&ctx->notified, false)) {

This is a macro that translates to the __atomic_exchange() builtin,
which likely makes GCC emit another memory barrier to keep the exchange
atomic and coherent across the different cache levels.
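
For comparison, here is the consumer side rendered the same way (again my own C11 stand-in, not QEMU code). On x86_64 an atomic exchange compiles to an implicitly locked XCHG, which is itself a fully serializing instruction, so this path pays a barrier-equivalent cost too:

#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for "if (atomic_xchg(&ctx->notified, false)) { ... }". */
static bool take_notification(_Atomic bool *notified)
{
    return atomic_exchange(notified, false);     /* locked XCHG on x86_64 */
}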

# VirtIO

This commit has probably made a difference for the end user's tests
because virtio-net uses the QEMU bottom-half implementation from AIO
instead of the "timer" implementation for its deferred work. I'm not
sure whether the timer variant (the tx=timer option of the virtio-net
device) would behave better or worse.

https://pastebin.ubuntu.com/26417646/

Calling qemu_bh_new() creates a bottom half for "virtio_net_tx_bh",
which is the function responsible for re-scheduling the bottom half with
"qemu_bh_schedule", so virtio_net_flush_tx() can run again and again
(causing the barriers to be executed again and again).
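
A sketch of the pattern I mean (everything here is hypothetical stand-in code, not the real virtio-net implementation; only qemu_bh_schedule()'s role is taken from the description above):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical transmit queue with some pending work. */
typedef struct { int pending; } TxQueueSketch;

/* Stand-in for qemu_bh_schedule(): in QEMU this ends up in aio_notify(),
 * i.e. in the barriers discussed in the previous section. */
static void schedule_bh_stub(void)
{
    printf("bottom half re-armed -> aio_notify() -> barriers\n");
}

/* Hypothetical flush; returns true while more packets remain. */
static bool flush_tx_sketch(TxQueueSketch *q)
{
    if (q->pending > 0) {
        q->pending--;
    }
    return q->pending > 0;
}

/* In the spirit of virtio_net_tx_bh(): re-arm the bottom half while the
 * flush has not finished, paying the notification cost every time. */
static void tx_bh_sketch(TxQueueSketch *q)
{
    if (flush_tx_sketch(q)) {
        schedule_bh_stub();
    }
}

int main(void)
{
    TxQueueSketch q = { .pending = 3 };
    for (int i = 0; i < 3; i++) {
        tx_bh_sketch(&q);
    }
    return 0;
}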

# Observations

If we confirm that this is the actual problem (likely), we may be able
to make this change more efficient, or even remove it, but that will
probably depend on upstream accepting whatever we propose. The reason is
simple: they may choose to keep the barriers even if that means lower
throughput for some workloads.

-- 
https://bugs.launchpad.net/bugs/1668829

Title:
  Performance regression from qemu 2.3 to 2.5 for vhost-user with ovs +
  dpdk
