[PATCH] util/aio: Defer disabling poll mode as long as possible

2022-07-10 Thread Chao Gao
When we measure FIO read performance (cache=writethrough, bs=4k,
iodepth=64) in VMs, ~80K/s notifications (e.g., EPT_MISCONFIG) are observed
from guest to qemu.

It turns out those frequent notifications are caused by interference from
worker threads. Worker threads queue bottom halves after completing IO
requests. Pending bottom halves lead to either aio_compute_timeout()
zeroing the timeout and passing it to try_poll_mode(), or
run_poll_handlers() returning no progress after noticing pending
aio_notify() events. Both cases cause try_poll_mode() to call
poll_set_started(false) to disable poll mode. However, in both cases the
timeout is already zero, so the event loop (i.e., aio_poll()) just
processes bottom halves and then starts the next iteration. Disabling poll
mode thus has no value; it only leads to unnecessary notifications from
the guest.
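
For context, a condensed sketch of the interfering path (loosely based on
util/thread-pool.c and util/async.c; pop_request() is a hypothetical helper
standing in for the real request-queue handling):

    /* Condensed sketch, not the literal QEMU source. */
    static void *worker_thread(void *opaque)
    {
        ThreadPool *pool = opaque;

        while (!pool->stopping) {
            ThreadPoolElement *req = pop_request(pool); /* hypothetical */
            req->ret = req->func(req->arg);             /* the blocking IO */
            req->state = THREAD_DONE;
            /* Completion is reported through a bottom half; scheduling it
             * calls aio_notify(ctx), which run_poll_handlers() later sees
             * as a no-progress event. */
            qemu_bh_schedule(pool->completion_bh);
        }
        return NULL;
    }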

To minimize unnecessary notifications from the guest, defer disabling poll
mode until the event loop is about to block.

With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
cache=writethrough) in VMs increases from 330K to 413K IOPS.

Suggested-by: Stefan Hajnoczi 
Signed-off-by: Chao Gao 
---
 util/aio-posix.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..6cc6256d53 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -585,18 +585,16 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
 
 max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
 if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
+/*
+ * Enable poll mode. It pairs with the poll_set_started() in
+ * aio_poll() which disables poll mode.
+ */
 poll_set_started(ctx, ready_list, true);
 
 if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
 return true;
 }
 }
-
-if (poll_set_started(ctx, ready_list, false)) {
-*timeout = 0;
-return true;
-}
-
 return false;
 }
 
@@ -657,6 +655,17 @@ bool aio_poll(AioContext *ctx, bool blocking)
  * system call---a single round of run_poll_handlers_once suffices.
  */
 if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
+/*
+ * Disable poll mode. Poll mode should be disabled before the call
+ * of ctx->fdmon_ops->wait() so that a guest notification can wake
+ * up IO threads when some work becomes pending. This is essential
+ * to avoid hangs and unnecessary latency.
+ */
+if (poll_set_started(ctx, &ready_list, false)) {
+timeout = 0;
+progress = true;
+}
+
 ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
 }
 
-- 
2.25.1




Re: [RFC v1] util/aio: Keep notification disabled as much as possible

2022-07-07 Thread Chao Gao
On Thu, Jul 07, 2022 at 10:04:23AM +0100, Stefan Hajnoczi wrote:
>
>Does this patch solve the issue? The idea is to defer
>poll_set_started(false) for as long as possible.

Good idea! It is straightforward.

>
>diff --git a/util/aio-posix.c b/util/aio-posix.c
>index 731f3826c0..536f8b2534 100644
>--- a/util/aio-posix.c
>+++ b/util/aio-posix.c
>@@ -591,12 +591,6 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
> return true;
> }
> }
>-
>-if (poll_set_started(ctx, ready_list, false)) {
>-*timeout = 0;
>-return true;
>-}
>-
> return false;
> }
>
>@@ -657,6 +651,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
>  * system call---a single round of run_poll_handlers_once suffices.
>  */
> if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
>+if (poll_set_started(ctx, &ready_list, false)) {
>+timeout = 0;
>+progress = true;

In this case, is it OK to skip the call of ->wait() below? If yes, maybe put
the call in the "else" path.

>+}
>+
> ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
> }
>

Anyway,
Reviewed-by: Chao Gao 

And my tests show your change works well. The number of notifications is
significantly reduced from ~80K/s to tens.

Tested-by: Chao Gao 



Re: [RFC v1] util/aio: Keep notification disabled as much as possible

2022-07-06 Thread Chao Gao
On Wed, Jul 06, 2022 at 12:59:29PM +0100, Stefan Hajnoczi wrote:
>On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
>> When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
>> VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.
>
>It's not clear to me what caused the frequent poll_set_started(ctx,
>false) calls and whether this patch is correct. I have posted some

Me neither. That's why the patch was marked RFC.

>questions below about the nature of this issue.
>
>If ctx->fdmon_ops->wait() is called while polling is still enabled then
>hangs or unnecessary latency can occur. For example, consider an fd
>handler that temporarily suppresses fd activity between poll start/end.
>The thread would be blocked in ->wait() and the fd will never become
>readable. Even if a hang doesn't occur because there is a timeout, there
>would be extra latency because the fd doesn't become readable and we
>have to wait for the timeout to expire so we can poll again. So we must
>be sure it's safe to leave polling enabled across the event loop and I'm
>not sure if this patch is okay.

Thanks for the explanation.

in aio_poll(),

if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
}

if @timeout is zero, then ctx->fdmon_ops->wait() won't be called.

In case #2 and #3 below, it is guaranteed that @timeout is zero after
try_poll_mode() returns. So, I think it is safe to keep polling enabled
in these two cases.

>
>> 
>> Currently, poll_set_started(ctx,false) is called in try_poll_mode() to enable
>> virtqueue notification in below 4 cases:
>> 
>> 1. ctx->poll_ns is 0
>> 2. a zero timeout is passed to try_poll_mode()
>> 3. polling succeeded but reported as no progress
>> 4. polling failed and reported as no progress
>> 
>> To minimize unnecessary guest notifications, keep notification disabled when
>> it is possible, i.e., polling is enabled and last polling doesn't fail.
>
>What is the exact definition of polling success/failure?

Polling success: found some events pending.
Polling failure: timeout.

I used success/failure because I saw the comment below in
run_poll_handlers_once(), so I thought they were well-known terms.

/*
 * Polling was successful, exit try_poll_mode immediately
 * to adjust the next polling time.
 */

>
>> 
>> Keep notification disabled for case #2 and #3; handle case #2 simply by a call
>
>Did you see case #2 happening often? What was the cause?

I think so. I can add some tracepoints and collect statistics.

IMO (of course, I can be totally wrong), the cause is:
when a worker thread in the thread pool completes an IO request, a bottom
half is queued by worker_thread()->qemu_bh_schedule(). Pending bottom
halves lead to aio_compute_timeout() setting the timeout to zero, and then
a zero timeout is passed to try_poll_mode().
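
A simplified sketch of that logic (modeled on util/async.c's
aio_compute_timeout(); the real function also handles idle bottom halves,
BH list slices, and combines the result with timer deadlines):

    int64_t aio_compute_timeout(AioContext *ctx)
    {
        QEMUBH *bh;

        /* Any scheduled (non-idle) bottom half forces a zero timeout,
         * which is exactly case #2 described above. */
        QSLIST_FOREACH_RCU(bh, &ctx->bh_list, next) {
            if (bh->flags & BH_SCHEDULED) {
                return 0;
            }
        }

        /* Otherwise fall back to the earliest timer deadline. */
        return timerlistgroup_deadline_ns(&ctx->tlg);
    }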

>
>> of run_poll_handlers_once() and for case #3, differentiate successful/failed
>> polling and skip the call of poll_set_started(ctx,false) for successful ones.
>
>This is probably the most interesting case. When polling detects an
>event, that's considered "progress", except for aio_notify() events.
>aio_notify() is an internal event for restarting the event loop when
>something has changed (e.g. fd handlers have been added/removed). I
>wouldn't expect it to intefere polling frequently since that requires
>another QEMU thread doing something to the AioContext, which should be
>rare.
>
>Was aio_notify() intefering with polling in your case? Do you know why?

Yes, it was. The reason is the same: after finishing IO requests, worker
threads queue bottom halves, during which aio_notify() is called.



[RFC v1] util/aio: Keep notification disabled as much as possible

2022-07-01 Thread Chao Gao
When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.

Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
virtqueue notification in the 4 cases below:

1. ctx->poll_ns is 0
2. a zero timeout is passed to try_poll_mode()
3. polling succeeded but reported as no progress
4. polling failed and reported as no progress

To minimize unnecessary guest notifications, keep notification disabled
whenever possible, i.e., when polling is enabled and the last poll didn't fail.

Keep notification disabled for cases #2 and #3: handle case #2 simply by a call
to run_poll_handlers_once(); for case #3, differentiate successful from failed
polling and skip the call to poll_set_started(ctx, false) for successful polls.

With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
cache=writethrough) in VMs increases from 330K to 413K IOPS.

Below are changes in VM-exit statistics:
w/o this patch (duration: 10s):

           VM-EXIT    Samples  Samples%   Time%   Min Time   Max Time   Avg time

     EPT_MISCONFIG     797440    99.34%  98.58%     0.39us    57.92us   0.60us ( +-  0.05% )
         MSR_WRITE       3672     0.46%   1.15%     0.88us     4.97us   1.52us ( +-  0.53% )
EXTERNAL_INTERRUPT       1638     0.20%   0.27%     0.59us    11.04us   0.81us ( +-  1.71% )

w/ this patch (duration: 10s):

           VM-EXIT    Samples  Samples%   Time%   Min Time   Max Time   Avg time

         MSR_WRITE       3524    84.11%  87.17%     0.86us     4.43us   1.74us ( +-  0.60% )
EXTERNAL_INTERRUPT        515    12.29%  10.05%     0.64us     8.95us   1.37us ( +-  1.54% )
     EPT_MISCONFIG        151     3.60%   2.79%     0.40us    52.07us   1.30us ( +- 31.44% )

Signed-off-by: Chao Gao 
---
 util/aio-posix.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..bd2076145b 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -519,7 +519,7 @@ static bool remove_idle_poll_handlers(AioContext *ctx,
  * Returns: true if progress was made, false otherwise
  */
 static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
-  int64_t max_ns, int64_t *timeout)
+  int64_t max_ns, int64_t *timeout, bool *success)
 {
 bool progress;
 int64_t start_time, elapsed_time;
@@ -553,6 +553,8 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
 progress = true;
 }
 
+*success = !*timeout;
+
 /* If time has passed with no successful polling, adjust *timeout to
  * keep the same ending time.
  */
@@ -577,22 +579,28 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
 static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
   int64_t *timeout)
 {
-int64_t max_ns;
+int64_t max_ns, start_time;
+bool success = false;
 
 if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
 return false;
 }
 
+if (!*timeout) {
+start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+return run_poll_handlers_once(ctx, ready_list, start_time, timeout);
+}
+
 max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
 if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
 poll_set_started(ctx, ready_list, true);
 
-if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
+if (run_poll_handlers(ctx, ready_list, max_ns, timeout, &success)) {
 return true;
 }
 }
 
-if (poll_set_started(ctx, ready_list, false)) {
+if (!success && poll_set_started(ctx, ready_list, false)) {
 *timeout = 0;
 return true;
 }
-- 
2.25.1




Re: [QEMU PATCH] x86: Set maximum APIC ID to KVM prior to vCPU creation

2022-05-20 Thread Chao Gao
On Fri, May 20, 2022 at 02:39:28PM +0800, Zeng Guang wrote:
>Specify maximum possible APIC ID assigned for current VM session prior to
>the creation of vCPUs. KVM need set up VM-scoped data structure indexed by
>the APIC ID, e.g. Posted-Interrupt Descriptor table to support Intel IPI
>virtualization.
>
>It can be achieved by calling KVM_ENABLE_CAP for KVM_CAP_MAX_VCPU_ID
>capability once KVM has already enabled it. Otherwise, simply prompts
>that KVM doesn't support this capability yet.
>
>Signed-off-by: Zeng Guang 
>---
> hw/i386/x86.c | 9 -
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
>diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>index 4cf107baea..ff74492325 100644
>--- a/hw/i386/x86.c
>+++ b/hw/i386/x86.c
>@@ -106,7 +106,7 @@ out:
> 
> void x86_cpus_init(X86MachineState *x86ms, int default_cpu_version)
> {
>-int i;
>+int i, ret;
> const CPUArchIdList *possible_cpus;
> MachineState *ms = MACHINE(x86ms);
> MachineClass *mc = MACHINE_GET_CLASS(x86ms);
>@@ -123,6 +123,13 @@ void x86_cpus_init(X86MachineState *x86ms, int default_cpu_version)
>  */
> x86ms->apic_id_limit = x86_cpu_apic_id_from_index(x86ms,
>   ms->smp.max_cpus - 1) + 1;
>+
>+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPU_ID,
>+0, x86ms->apic_id_limit);
>+if (ret < 0) {
>+error_report("kvm: Set max vcpu id not supported: %s", strerror(-ret));
>+}

This piece of code is specific to KVM. Please move it to kvm-all.c and
invoke a wrapper function here. As the KVM accelerator isn't necessarily
enabled, the function call should be guarded by kvm_enabled().

And I think the error message can be omitted because the failure doesn't
impact functionality; just a few more pages will be allocated by KVM.
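
For illustration, the suggested wrapper could look roughly like this
(kvm_set_max_vcpu_id() is a hypothetical name, not an existing QEMU
function):

    /* accel/kvm/kvm-all.c (hypothetical wrapper) */
    int kvm_set_max_vcpu_id(KVMState *s, uint32_t max_vcpu_id)
    {
        return kvm_vm_enable_cap(s, KVM_CAP_MAX_VCPU_ID, 0, max_vcpu_id);
    }

    /* call site in hw/i386/x86.c, guarded as suggested */
    if (kvm_enabled()) {
        /* best effort: a failure only costs KVM a few extra pages */
        kvm_set_max_vcpu_id(kvm_state, x86ms->apic_id_limit);
    }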



Re: [RFC] KVM / QEMU: Introduce Interface for Querying APICv Info

2022-05-19 Thread Chao Gao
On Fri, May 20, 2022 at 10:30:40AM +0700, Suthikulpanit, Suravee wrote:
>Hi All,
>
>Currently, we don't have a good way to check whether APICV is active on a VM.
>Normally, For AMD SVM AVIC, users either have to check for trace point, or 
>using
>"perf kvm stat live" to catch AVIC-related #VMEXIT.
>
>For KVM, I would like to propose introducing a new IOCTL interface (i.e. 
>KVM_GET_APICV_INFO),
>where user-space tools (e.g. QEMU monitor) can query run-time information of 
>APICv for VM and vCPUs
>such as APICv inhibit reason flags.
>
>For QEMU, we can leverage the "info lapic" command, and append the APICV 
>information after
>all LAPIC register information:
>
>For example:
>
>- Begin Snippet -
>(qemu) info lapic 0
>dumping local APIC state for CPU 0
>
>LVT0 0x00010700 active-hi edge  masked  ExtINT (vec 0)
>LVT1 0x0400 active-hi edge  NMI
>LVTPC0x0001 active-hi edge  masked  Fixed  (vec 0)
>LVTERR   0x00fe active-hi edge  Fixed  (vec 
>254)
>LVTTHMR  0x0001 active-hi edge  masked  Fixed  (vec 0)
>LVTT 0x000400ee active-hi edge tsc-deadline Fixed  (vec 
>238)
>TimerDCR=0x0 (divide by 2) initial_count = 0 current_count = 0
>SPIV 0x01ff APIC enabled, focus=off, spurious vec 255
>ICR  0x00fd physical edge de-assert no-shorthand
>ICR2 0x0005 cpu 5 (X2APIC ID)
>ESR  0x
>ISR  (none)
>IRR  (none)
>
>APR 0x00 TPR 0x00 DFR 0x0f LDR 0x00PPR 0x00
>
>APICV   vm inhibit: 0x10 <-- HERE
>APICV vcpu inhibit: 0 <-- HERE
>
>-- End Snippet --
>
>Otherwise, we can have APICv-specific info command (e.g. info apicv).

I think this information can be added to KVM's per-VM/vCPU debugfs. Then no
QEMU change is needed.
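
For illustration, a rough kernel-side sketch of such a debugfs entry
(hypothetical; apicv_inhibit_reasons is the existing field in struct
kvm_arch):

    /* hypothetical per-VM debugfs file, e.g. <kvm-debugfs>/apicv_inhibit */
    static int apicv_inhibit_show(struct seq_file *m, void *v)
    {
        struct kvm *kvm = m->private;

        seq_printf(m, "0x%lx\n", kvm->arch.apicv_inhibit_reasons);
        return 0;
    }
    DEFINE_SHOW_ATTRIBUTE(apicv_inhibit);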



Re: [PATCH 0/3] Disable vhost device IOTLB if IOMMU is not enabled

2021-08-03 Thread Chao Gao
On Wed, Aug 04, 2021 at 11:48:00AM +0800, Jason Wang wrote:
>Hi:
>
>We currently try to enable device IOTLB when iommu_platform is
>set. This may lead to unnecessary transactions between qemu and vhost
>when vIOMMU is not used (which is the typical case for the encrypted
>VM).
>
>So this series tries to use a transport-specific method to detect the enabling
>of vIOMMU and enable the device IOTLB only if vIOMMU is enabled.
>
>Please review.

Tested-by: Chao Gao 

Tested with TDX; this series fixes the performance issue we saw in a TD
when vhost was enabled.

Thanks
Chao

>
>Thanks
>
>Jason Wang (3):
>  virtio-bus: introduce iommu_enabled()
>  virtio-pci: implement iommu_enabled()
>  vhost: correctly detect the enabling IOMMU
>
> hw/virtio/vhost.c  |  2 +-
> hw/virtio/virtio-bus.c | 14 ++
> hw/virtio/virtio-pci.c | 14 ++
> include/hw/virtio/virtio-bus.h |  4 +++-
> 4 files changed, 32 insertions(+), 2 deletions(-)
>
>-- 
>2.25.1
>



Re: [PATCH] vhost: use large iotlb entry if no IOMMU translation is needed

2021-08-02 Thread Chao Gao
On Tue, Aug 03, 2021 at 12:43:58PM +0800, Jason Wang wrote:
>
>On 2021/8/3 12:29 PM, Chao Gao wrote:
>> Ping. Could someone help to review this patch?
>> 
>> Thanks
>> Chao
>> 
>> On Wed, Jul 21, 2021 at 03:54:02PM +0800, Chao Gao wrote:
>> > If guest enables IOMMU_PLATFORM for virtio-net, severe network
>> > performance drop is observed even if there is no IOMMU.
>
>
>We see such reports internally and we're testing a patch series to disable
>vhost IOTLB in this case.
>
>Will post a patch soon.

OK, put me in the CC list. I would like to test with TDX to ensure your patch
fixes the performance issue I am facing.

>
>
>
>> >   And disabling
>> > vhost can mitigate the perf issue. Finally, we found the culprit is
>> > frequent iotlb misses: kernel vhost-net has 2048 entries and each
>> > entry is 4K (qemu uses 4K for i386 if no IOMMU); vhost-net can cache
>> > translations for up to 8M (i.e. 4K*2048) IOVAs. If guest uses >8M
>> > memory for DMA, there are some iotlb misses.
>> > 
>> > If there is no IOMMU or IOMMU is disabled or IOMMU works in pass-thru
>> > mode, we can optimistically use large, unaligned iotlb entries instead
>> > of 4K-aligned entries to reduce iotlb pressure.
>
>
>Instead of introducing a new general facility like the unaligned IOTLB entry, I
>wonder if we could optimize vtd_iommu_translate() to use e.g. 1G entries instead?

Using a 1G iotlb entry looks feasible.

>
>    } else {
>        /* DMAR disabled, passthrough, use 4k-page */
>        iotlb.iova = addr & VTD_PAGE_MASK_4K;
>        iotlb.translated_addr = addr & VTD_PAGE_MASK_4K;
>        iotlb.addr_mask = ~VTD_PAGE_MASK_4K;
>        iotlb.perm = IOMMU_RW;
>        success = true;
>    }
>
>
>> >   Actually, vhost-net
>> > in kernel supports unaligned iotlb entry. The alignment requirement is
>> > imposed by address_space_get_iotlb_entry() and flatview_do_translate().
>
>
>For the passthrough case, is there any way to detect it and then disable the
>device IOTLB in that case?

Yes, I guess so; qemu knows the presence and status of the IOMMU. Currently,
in flatview_do_translate(), memory_region_get_iommu() tells whether a memory
region is behind an IOMMU.
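
Condensed from flatview_do_translate() in softmmu/physmem.c, the check looks
roughly like this (surrounding arguments elided):

    section = address_space_translate_internal(flatview_to_dispatch(fv),
                                               addr, xlat, plen_out, is_mmio);
    iommu_mr = memory_region_get_iommu(section->mr);
    if (!iommu_mr) {
        /* Not behind an IOMMU: this is where a "whole section" iotlb
         * entry (or a magic value like PAGE_MASK_NOT_BEHIND_IOMMU from
         * the patch above) could be reported instead of a 4K mask. */
    }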

Thanks
Chao



Re: [PATCH] vhost: use large iotlb entry if no IOMMU translation is needed

2021-08-02 Thread Chao Gao
Ping. Could someone help to review this patch?

Thanks
Chao

On Wed, Jul 21, 2021 at 03:54:02PM +0800, Chao Gao wrote:
>If guest enables IOMMU_PLATFORM for virtio-net, severe network
>performance drop is observed even if there is no IOMMU. And disabling
>vhost can mitigate the perf issue. Finally, we found the culprit is
>frequent iotlb misses: kernel vhost-net has 2048 entries and each
>entry is 4K (qemu uses 4K for i386 if no IOMMU); vhost-net can cache
>translations for up to 8M (i.e. 4K*2048) IOVAs. If guest uses >8M
>memory for DMA, there are some iotlb misses.
>
>If there is no IOMMU or IOMMU is disabled or IOMMU works in pass-thru
>mode, we can optimistically use large, unaligned iotlb entries instead
>of 4K-aligned entries to reduce iotlb pressure. Actually, vhost-net
>in kernel supports unaligned iotlb entry. The alignment requirement is
>imposed by address_space_get_iotlb_entry() and flatview_do_translate().
>
>Introduce IOMMUTLBEntryUnaligned which has a @len field to specify the
>iotlb size to abstract a generic iotlb entry: aligned (original
>IOMMUTLBEntry) and unaligned entry. flatview_do_translate() now
>returns a magic value in @page_mask_out if no IOMMU translation is
>needed. Then, address_space_get_iotbl_entry() can accordingly return a
>page-aligned iotlb entry or the whole memory region section where the
>iova resides as a large iotlb entry.
>
>Signed-off-by: Chao Gao 
>---
> hw/virtio/vhost.c |  6 +++---
> include/exec/memory.h | 16 ++--
> softmmu/physmem.c | 37 +
> 3 files changed, 46 insertions(+), 13 deletions(-)
>
>diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>index e8f85a5d2d..6745caa129 100644
>--- a/hw/virtio/vhost.c
>+++ b/hw/virtio/vhost.c
>@@ -1010,7 +1010,7 @@ static int vhost_memory_region_lookup(struct vhost_dev *hdev,
> 
> int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
> {
>-IOMMUTLBEntry iotlb;
>+IOMMUTLBEntryUnaligned iotlb;
> uint64_t uaddr, len;
> int ret = -EFAULT;
> 
>@@ -1031,8 +1031,8 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
> goto out;
> }
> 
>-len = MIN(iotlb.addr_mask + 1, len);
>-iova = iova & ~iotlb.addr_mask;
>+len = MIN(iotlb.len, len);
>+iova = iotlb.iova;
> 
> ret = vhost_backend_update_device_iotlb(dev, iova, uaddr,
> len, iotlb.perm);
>diff --git a/include/exec/memory.h b/include/exec/memory.h
>index c3d417d317..3f04e8fe88 100644
>--- a/include/exec/memory.h
>+++ b/include/exec/memory.h
>@@ -94,6 +94,7 @@ struct MemoryRegionSection {
> };
> 
> typedef struct IOMMUTLBEntry IOMMUTLBEntry;
>+typedef struct IOMMUTLBEntryUnaligned IOMMUTLBEntryUnaligned;
> 
> /* See address_space_translate: bit 0 is read, bit 1 is write.  */
> typedef enum {
>@@ -113,6 +114,15 @@ struct IOMMUTLBEntry {
> IOMMUAccessFlags perm;
> };
> 
>+/* IOMMUTLBEntryUnaligned may be not page-aligned */
>+struct IOMMUTLBEntryUnaligned {
>+AddressSpace*target_as;
>+hwaddr   iova;
>+hwaddr   translated_addr;
>+hwaddr   len;
>+IOMMUAccessFlags perm;
>+};
>+
> /*
>  * Bitmap for different IOMMUNotifier capabilities. Each notifier can
>  * register with one or multiple IOMMU Notifier capability bit(s).
>@@ -2653,8 +2663,10 @@ void address_space_cache_destroy(MemoryRegionCache *cache);
> /* address_space_get_iotlb_entry: translate an address into an IOTLB
>  * entry. Should be called from an RCU critical section.
>  */
>-IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
>-bool is_write, MemTxAttrs attrs);
>+IOMMUTLBEntryUnaligned address_space_get_iotlb_entry(AddressSpace *as,
>+ hwaddr addr,
>+ bool is_write,
>+ MemTxAttrs attrs);
> 
> /* address_space_translate: translate an address range into an address space
>  * into a MemoryRegion and an address range into that section.  Should be
>diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>index 3c1912a1a0..469963f754 100644
>--- a/softmmu/physmem.c
>+++ b/softmmu/physmem.c
>@@ -143,6 +143,8 @@ typedef struct subpage_t {
> 
> #define PHYS_SECTION_UNASSIGNED 0
> 
>+#define PAGE_MASK_NOT_BEHIND_IOMMU ((hwaddr)-1)
>+
> static void io_mem_init(void);
> static void memory_map_init(void);
> static void tcg_log_global_after_sync(MemoryListener *listener);
>@@ -470,7 +472,9 @@ unassigned:

[PATCH] vhost: use large iotlb entry if no IOMMU translation is needed

2021-07-21 Thread Chao Gao
If the guest enables IOMMU_PLATFORM for virtio-net, a severe network
performance drop is observed even if there is no IOMMU. Disabling
vhost can mitigate the perf issue. Eventually, we found the culprit is
frequent iotlb misses: kernel vhost-net has 2048 iotlb entries and each
entry is 4K (qemu uses 4K for i386 if there is no IOMMU), so vhost-net can
cache translations for up to 8M (i.e., 4K*2048) of IOVA space. If the guest
uses >8M of memory for DMA, there are iotlb misses.

If there is no IOMMU, or the IOMMU is disabled or works in pass-thru
mode, we can optimistically use large, unaligned iotlb entries instead
of 4K-aligned entries to reduce iotlb pressure. Kernel vhost-net actually
supports unaligned iotlb entries; the alignment requirement is
imposed by address_space_get_iotlb_entry() and flatview_do_translate().
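
For reference, the alignment comes from how the entry is assembled at the
end of address_space_get_iotlb_entry() (condensed from softmmu/physmem.c
before this patch; page_mask is ~TARGET_PAGE_MASK, i.e. 4K - 1 on i386,
whenever no IOMMU is involved):

    return (IOMMUTLBEntry) {
        .target_as       = as,
        .iova            = addr & ~page_mask,   /* clipped to 4K alignment */
        .translated_addr = xlat & ~page_mask,
        .addr_mask       = page_mask,           /* 4K - 1 */
        .perm            = IOMMU_RW,
    };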

Introduce IOMMUTLBEntryUnaligned, which has a @len field to specify the
iotlb size, to abstract a generic iotlb entry covering both the aligned
(original IOMMUTLBEntry) and unaligned cases. flatview_do_translate() now
returns a magic value in @page_mask_out if no IOMMU translation is
needed. Then, address_space_get_iotlb_entry() can accordingly return either a
page-aligned iotlb entry or the whole memory region section where the
iova resides, as one large iotlb entry.

Signed-off-by: Chao Gao 
---
 hw/virtio/vhost.c |  6 +++---
 include/exec/memory.h | 16 ++--
 softmmu/physmem.c | 37 +
 3 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e8f85a5d2d..6745caa129 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1010,7 +1010,7 @@ static int vhost_memory_region_lookup(struct vhost_dev *hdev,
 
 int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 {
-IOMMUTLBEntry iotlb;
+IOMMUTLBEntryUnaligned iotlb;
 uint64_t uaddr, len;
 int ret = -EFAULT;
 
@@ -1031,8 +1031,8 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 goto out;
 }
 
-len = MIN(iotlb.addr_mask + 1, len);
-iova = iova & ~iotlb.addr_mask;
+len = MIN(iotlb.len, len);
+iova = iotlb.iova;
 
 ret = vhost_backend_update_device_iotlb(dev, iova, uaddr,
 len, iotlb.perm);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3d417d317..3f04e8fe88 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -94,6 +94,7 @@ struct MemoryRegionSection {
 };
 
 typedef struct IOMMUTLBEntry IOMMUTLBEntry;
+typedef struct IOMMUTLBEntryUnaligned IOMMUTLBEntryUnaligned;
 
 /* See address_space_translate: bit 0 is read, bit 1 is write.  */
 typedef enum {
@@ -113,6 +114,15 @@ struct IOMMUTLBEntry {
 IOMMUAccessFlags perm;
 };
 
+/* IOMMUTLBEntryUnaligned may be not page-aligned */
+struct IOMMUTLBEntryUnaligned {
+AddressSpace*target_as;
+hwaddr   iova;
+hwaddr   translated_addr;
+hwaddr   len;
+IOMMUAccessFlags perm;
+};
+
 /*
  * Bitmap for different IOMMUNotifier capabilities. Each notifier can
  * register with one or multiple IOMMU Notifier capability bit(s).
@@ -2653,8 +2663,10 @@ void address_space_cache_destroy(MemoryRegionCache *cache);
 /* address_space_get_iotlb_entry: translate an address into an IOTLB
  * entry. Should be called from an RCU critical section.
  */
-IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
-bool is_write, MemTxAttrs attrs);
+IOMMUTLBEntryUnaligned address_space_get_iotlb_entry(AddressSpace *as,
+ hwaddr addr,
+ bool is_write,
+ MemTxAttrs attrs);
 
 /* address_space_translate: translate an address range into an address space
  * into a MemoryRegion and an address range into that section.  Should be
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3c1912a1a0..469963f754 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -143,6 +143,8 @@ typedef struct subpage_t {
 
 #define PHYS_SECTION_UNASSIGNED 0
 
+#define PAGE_MASK_NOT_BEHIND_IOMMU ((hwaddr)-1)
+
 static void io_mem_init(void);
 static void memory_map_init(void);
 static void tcg_log_global_after_sync(MemoryListener *listener);
@@ -470,7 +472,9 @@ unassigned:
  * @page_mask_out: page mask for the translated address. This
  *should only be meaningful for IOMMU translated
  *addresses, since there may be huge pages that this bit
- *would tell. It can be @NULL if we don't care about it.
+ *would tell. If the returned memory region section isn't
+ *behind an IOMMU, PAGE_MASK_NOT_BEHIND_IOMMU is returned.
+ *It can be @NULL if we don't care about it.
  * @is_write: whether the translation operation is for write
  * @is_mmio: whether this can be MM

Re: [Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability

2019-09-06 Thread Chao Gao
On Thu, Aug 29, 2019 at 12:21:11PM +0200, Roger Pau Monné wrote:
>On Thu, Aug 29, 2019 at 05:02:27PM +0800, Chao Gao wrote:
>> Currently, for a HVM on Xen, no reset method is virtualized. So in a VM's
>> perspective, assigned devices cannot be reset. But some devices rely on PCI
>> reset to recover from hardware hangs. When being assigned to a VM, those
>> devices cannot be reset and won't work any longer if a hardware hang occurs.
>> We have to reboot VM to trigger PCI reset on host to recover the device.
>>
>> This patch exposes FLR capability to VMs if the assigned device can be reset 
>> on
>> host. When VM initiates an FLR to a device, qemu cleans up the device state,
>> (including disabling of intx and/or MSI and unmapping BARs from guest, 
>> deleting
>> emulated registers), then initiate PCI reset through 'reset' knob under the
>> device's sysfs, finally initialize the device again.
>
>I think you likely need to deassign the device from the VM, perform
>the reset, and then assign the device again, so that there's no Xen
>internal state carried over prior to the reset?

Yes, it is the safest way. But here I want to present the feature as FLR
(so that the device driver in the guest can issue a PCI reset whenever
needed, with no change needed to the device driver). Current device
deassignment notifies the guest that the device is going to be removed;
that is not the standard PCI reset. Is it possible to make the guest unaware
of the device deassignment, to emulate a standard PCI reset? In my mind,
we can expose do_pci_remove/add to qemu or rewrite them in qemu (but
without removing the device from the guest's PCI hierarchy). Do you think
this is the right direction?

Thanks
Chao



Re: [Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability

2019-08-29 Thread Chao Gao
On Thu, Aug 29, 2019 at 12:03:44PM +0200, Jan Beulich wrote:
>On 29.08.2019 11:02, Chao Gao wrote:
>> Currently, for a HVM on Xen, no reset method is virtualized. So in a VM's
>> perspective, assigned devices cannot be reset. But some devices rely on PCI
>> reset to recover from hardware hangs. When being assigned to a VM, those
>> devices cannot be reset and won't work any longer if a hardware hang occurs.
>> We have to reboot VM to trigger PCI reset on host to recover the device.
>
>Did you consider a hot-unplug, reset (by host), hot-plug cycle instead?

Yes, I considered that approach. But it needs the host to initiate the
action, while whether a device needs a reset is determined by the device
driver in the VM. So in practice, the VM still needs a way to notify the
host to do the unplug/reset/plug. As the standard FLR capability can meet
this requirement, I didn't try to invent a new one.

>
>> +static int xen_pt_devctl_reg_write(XenPCIPassthroughState *s,
>> +   XenPTReg *cfg_entry, uint16_t *val,
>> +   uint16_t dev_value, uint16_t valid_mask)
>> +{
>> +if (s->real_device.is_resetable && (*val & PCI_EXP_DEVCTL_BCR_FLR)) {
>> +xen_pt_reset(s);
>> +}
>> +return xen_pt_word_reg_write(s, cfg_entry, val, dev_value, valid_mask);
>
>I think you also need to clear the bit before handing on the request,
>such that reads will always observe it clear.

Will do.
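
For illustration, one way to fold this into the handler from the patch (a
sketch, not the final code):

    static int xen_pt_devctl_reg_write(XenPCIPassthroughState *s,
                                       XenPTReg *cfg_entry, uint16_t *val,
                                       uint16_t dev_value, uint16_t valid_mask)
    {
        if (s->real_device.is_resetable && (*val & PCI_EXP_DEVCTL_BCR_FLR)) {
            xen_pt_reset(s);
        }
        /* Initiate-FLR is write-only; clear it so reads always observe 0. */
        *val &= ~PCI_EXP_DEVCTL_BCR_FLR;
        return xen_pt_word_reg_write(s, cfg_entry, val, dev_value, valid_mask);
    }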

Thanks
Chao



[Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability

2019-08-29 Thread Chao Gao
Currently, for an HVM guest on Xen, no reset method is virtualized. So from a
VM's perspective, assigned devices cannot be reset. But some devices rely on
PCI reset to recover from hardware hangs. When assigned to a VM, those
devices cannot be reset and won't work any longer if a hardware hang occurs.
We have to reboot the VM to trigger a PCI reset on the host to recover the device.

This patch exposes the FLR capability to VMs if the assigned device can be
reset on the host. When a VM initiates an FLR to a device, qemu cleans up the
device state (including disabling INTx and/or MSI, unmapping BARs from the
guest, and deleting emulated registers), then initiates a PCI reset through the
'reset' knob under the device's sysfs node, and finally initializes the device again.

Signed-off-by: Chao Gao 
---
Do we need to introduce an attribute, like "permissive", to explicitly
enable FLR capability emulation? During a PCI reset, interrupts and BARs are
unmapped from the guest. It seems that the guest cannot interact with the device
directly except through the device's configuration space, which is emulated by
qemu. If a proper method can be used to prevent qemu from accessing the physical
device, there is no new security hole caused by the FLR emulation.

A VM's FLR may be backed by any reset function the host has for the physical
device, for example: FLR, D3 soft reset, or secondary bus reset. I am not sure it
is fine to mix them. Given that the Linux kernel just uses a unified API to reset a
device and the caller cannot choose a specific method, it might be OK.
---
 hw/xen/xen-host-pci-device.c | 30 ++
 hw/xen/xen-host-pci-device.h |  3 +++
 hw/xen/xen_pt.c  |  9 +
 hw/xen/xen_pt.h  |  1 +
 hw/xen/xen_pt_config_init.c  | 30 +++---
 5 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
index 1b44dcafaf..d549656f42 100644
--- a/hw/xen/xen-host-pci-device.c
+++ b/hw/xen/xen-host-pci-device.c
@@ -198,6 +198,35 @@ static bool xen_host_pci_dev_is_virtfn(XenHostPCIDevice *d)
 return !stat(path, &buf);
 }
 
+static bool xen_host_pci_resetable(XenHostPCIDevice *d)
+{
+char path[PATH_MAX];
+
+xen_host_pci_sysfs_path(d, "reset", path, sizeof(path));
+
+return !access(path, W_OK);
+}
+
+void xen_host_pci_reset(XenHostPCIDevice *d)
+{
+char path[PATH_MAX];
+int fd;
+
+xen_host_pci_sysfs_path(d, "reset", path, sizeof(path));
+
+fd = open(path, O_WRONLY);
+if (fd == -1) {
+XEN_HOST_PCI_LOG("Xen host pci reset: open error\n");
+return;
+}
+
+if (write(fd, "1", 1) != 1) {
+XEN_HOST_PCI_LOG("Xen host pci reset: write error\n");
+}
+
+return;
+}
+
 static void xen_host_pci_config_open(XenHostPCIDevice *d, Error **errp)
 {
 char path[PATH_MAX];
@@ -377,6 +406,7 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t domain,
 d->class_code = v;
 
 d->is_virtfn = xen_host_pci_dev_is_virtfn(d);
+d->is_resetable = xen_host_pci_resetable(d);
 
 return;
 
diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h
index 4d8d34ecb0..cacf9b3df8 100644
--- a/hw/xen/xen-host-pci-device.h
+++ b/hw/xen/xen-host-pci-device.h
@@ -32,6 +32,7 @@ typedef struct XenHostPCIDevice {
 XenHostPCIIORegion rom;
 
 bool is_virtfn;
+bool is_resetable;
 
 int config_fd;
 } XenHostPCIDevice;
@@ -55,4 +56,6 @@ int xen_host_pci_set_block(XenHostPCIDevice *d, int pos, uint8_t *buf,
 
 int xen_host_pci_find_ext_cap_offset(XenHostPCIDevice *s, uint32_t cap);
 
+void xen_host_pci_reset(XenHostPCIDevice *d);
+
 #endif /* XEN_HOST_PCI_DEVICE_H */
diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index 8fbaf2eae9..d750367c0a 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -938,6 +938,15 @@ static void xen_pt_unregister_device(PCIDevice *d)
 xen_pt_destroy(d);
 }
 
+void xen_pt_reset(XenPCIPassthroughState *s)
+{
+PCIDevice *d = PCI_DEVICE(s);
+
+xen_pt_unregister_device(d);
+xen_host_pci_reset(&s->real_device);
+xen_pt_realize(d, NULL);
+}
+
 static Property xen_pci_passthrough_properties[] = {
 DEFINE_PROP_PCI_HOST_DEVADDR("hostaddr", XenPCIPassthroughState, hostaddr),
 DEFINE_PROP_BOOL("permissive", XenPCIPassthroughState, permissive, false),
diff --git a/hw/xen/xen_pt.h b/hw/xen/xen_pt.h
index 9167bbaf6d..ed05bc0d39 100644
--- a/hw/xen/xen_pt.h
+++ b/hw/xen/xen_pt.h
@@ -332,4 +332,5 @@ int xen_pt_register_vga_regions(XenHostPCIDevice *dev);
 int xen_pt_unregister_vga_regions(XenHostPCIDevice *dev);
 void xen_pt_setup_vga(XenPCIPassthroughState *s, XenHostPCIDevice *dev,
  Error **errp);
+void xen_pt_reset(XenPCIPassthroughState *s);
 #endif /* XEN_PT_H */
diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index 31ec5add1d..435abd7286 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -852,6 +8

Re: [Qemu-devel] [PATCH v3 3/3] msi: Handle remappable format interrupt request

2017-12-11 Thread Chao Gao
On Mon, Dec 11, 2017 at 06:07:48PM +, Anthony PERARD wrote:
>On Fri, Nov 17, 2017 at 02:24:25PM +0800, Chao Gao wrote:
>> According to VT-d spec Interrupt Remapping and Interrupt Posting ->
>> Interrupt Remapping -> Interrupt Request Formats On Intel 64
>> Platforms, fields of MSI data register have changed. This patch
>> avoids wrongly regarding a remappable format interrupt request as
>> an interrupt binded with a pirq.
>> 
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>> v3:
>>  - clarify the interrupt format bit is Intel-specific, then it is
>>  improper to define MSI_ADDR_IF_MASK in a common header.
>> ---
>>  hw/i386/xen/xen-hvm.c | 10 +-
>>  hw/pci/msi.c  |  5 +++--
>>  hw/pci/msix.c |  4 +++-
>>  hw/xen/xen_pt_msi.c   |  2 +-
>>  include/hw/xen/xen.h  |  2 +-
>>  stubs/xen-hvm.c   |  2 +-
>>  6 files changed, 18 insertions(+), 7 deletions(-)
>> 
>> diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
>> index 8028bed..52dc8af 100644
>> --- a/hw/i386/xen/xen-hvm.c
>> +++ b/hw/i386/xen/xen-hvm.c
>> @@ -145,8 +145,16 @@ void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len)
>>  }
>>  }
>>  
>> -int xen_is_pirq_msi(uint32_t msi_data)
>> +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
>>  {
>> +/* If the MSI address is configured in remapping format, the MSI will not
>> + * be remapped into a pirq. This 'if' test excludes Intel-specific
>> + * remappable msi.
>> + */
>> +#define MSI_ADDR_IF_MASK 0x0010
>
>I don't think that is the right place for a define, they also exist
>outside of the context of the function.

yes.

>That define would be better at the top of this file, I think. (There is

Will do.

Thanks
Chao

>probably a better place in the common headers, but I'm not sure where.)




Re: [Qemu-devel] [PATCH v3 2/3] xen/pt: Pass the whole msi addr/data to Xen

2017-12-11 Thread Chao Gao
On Mon, Dec 11, 2017 at 05:59:08PM +, Anthony PERARD wrote:
>On Fri, Nov 17, 2017 at 02:24:24PM +0800, Chao Gao wrote:
>> Previously, some fields (reserved or unalterable) are filtered by
>> Qemu. This fields are useless for the legacy interrupt format.
>> However, these fields are may meaningful (for intel platform)
>> for the interrupt of remapping format. It is better to pass the whole
>> msi addr/data to Xen without any filtering.
>> 
>> The main reason why we want this is QEMU doesn't have the knowledge
>> to decide the interrupt format after we introduce vIOMMU inside Xen.
>> Passing the whole msi message down and let arch-specific vIOMMU to
>> decide the interrupt format.
>> 
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>> v3:
>>  - new
>> ---
>>  hw/xen/xen_pt_msi.c | 47 ---
>>  1 file changed, 12 insertions(+), 35 deletions(-)
>> 
>> diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
>> index 6d1e3bd..f7d6e76 100644
>> --- a/hw/xen/xen_pt_msi.c
>> +++ b/hw/xen/xen_pt_msi.c
>> @@ -47,25 +47,6 @@ static inline uint32_t msi_ext_dest_id(uint32_t addr_hi)
>>  return addr_hi & 0xff00;
>>  }
>>  
>> -static uint32_t msi_gflags(uint32_t data, uint64_t addr)
>> -{
>> -uint32_t result = 0;
>> -int rh, dm, dest_id, deliv_mode, trig_mode;
>> -
>> -rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
>> -dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
>> -dest_id = msi_dest_id(addr);
>> -deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
>> -trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
>> -
>> -result = dest_id | (rh << XEN_PT_GFLAGS_SHIFT_RH)
>> -| (dm << XEN_PT_GFLAGS_SHIFT_DM)
>> -| (deliv_mode << XEN_PT_GFLAGSSHIFT_DELIV_MODE)
>> -| (trig_mode << XEN_PT_GFLAGSSHIFT_TRG_MODE);
>> -
>> -return result;
>> -}
>> -
>>  static inline uint64_t msi_addr64(XenPTMSI *msi)
>>  {
>>  return (uint64_t)msi->addr_hi << 32 | msi->addr_lo;
>> @@ -160,23 +141,20 @@ static int msi_msix_update(XenPCIPassthroughState *s,
>> bool masked)
>>  {
>>  PCIDevice *d = >dev;
>> -uint8_t gvec = msi_vector(data);
>> -uint32_t gflags = msi_gflags(data, addr);
>> +uint32_t gflags = masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED);
>>  int rc = 0;
>>  uint64_t table_addr = 0;
>>  
>> -XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
>> -   " (entry: %#x)\n",
>> -   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
>> +XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x addr %"PRIx64
>> +   " data %#x gflags %#x (entry: %#x)\n",
>> +   is_msix ? "-X" : "", pirq, addr, data, gflags, msix_entry);
>>  
>>  if (is_msix) {
>>  table_addr = s->msix->mmio_base_addr;
>>  }
>>  
>> -gflags |= masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED);
>> -
>> -rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
>> -  pirq, gflags, table_addr);
>> +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, pirq, addr,
>> +  data, gflags, table_addr);
>
>Are you trying to modify an existing API? That is not going to work. We
>want to be able to build QEMU against older versions of Xen, and it
>should work as well.

Yes. I thought it didn't matter. And clearly, I was wrong. I will keep
compatibility by introducing a new API. A wrapper function, which calls
the old or new API according to the Xen version, would be used here.
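
A rough sketch of what such a wrapper could look like (the version cutoff
and the new libxenctrl signature are hypothetical here):

    static int xen_pt_msi_update_irq(xc_interface *xc, uint32_t domid,
                                     int pirq, uint64_t addr, uint32_t data,
                                     uint32_t gflags, uint64_t table_addr)
    {
    #if CONFIG_XEN_CTRL_INTERFACE_VERSION >= 41000 /* hypothetical cutoff */
        /* new API: hand the whole MSI address/data down to Xen */
        return xc_domain_update_msi_irq(xc, domid, pirq, addr, data,
                                        gflags, table_addr);
    #else
        /* old API: QEMU pre-digests the message into gvec/gflags */
        return xc_domain_update_msi_irq(xc, domid, msi_vector(data), pirq,
                                        msi_gflags(data, addr), table_addr);
    #endif
    }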

Thanks
Chao



Re: [Qemu-devel] [PATCH v3 1/3] i386/msi: Correct mask of destination ID in MSI address

2017-11-20 Thread Chao Gao
On Fri, Nov 17, 2017 at 03:24:21PM +0200, Michael S. Tsirkin wrote:
>On Fri, Nov 17, 2017 at 02:24:23PM +0800, Chao Gao wrote:
>> According to SDM 10.11.1, only [19:12] bits of MSI address are
>> Destination ID, change the mask to avoid ambiguity for VT-d spec
>> has used the bit 4 to indicate a remappable interrupt request.
>> 
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> Reviewed-by: Anthony PERARD <anthony.per...@citrix.com>
>> Reviewed-by: Peter Xu <pet...@redhat.com>
>
>Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
>
>Do you want me to merge this?

Of course. It is a simple patch and has no dependency on the other two in
this series.

Thanks
Chao



[Qemu-devel] [PATCH v3 3/3] msi: Handle remappable format interrupt request

2017-11-16 Thread Chao Gao
According to VT-d spec, Interrupt Remapping and Interrupt Posting ->
Interrupt Remapping -> Interrupt Request Formats On Intel 64
Platforms, the fields of the MSI data register have changed. This patch
avoids wrongly regarding a remappable format interrupt request as
an interrupt bound to a pirq.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
v3:
 - clarify the interrupt format bit is Intel-specific, then it is
 improper to define MSI_ADDR_IF_MASK in a common header.
---
 hw/i386/xen/xen-hvm.c | 10 +-
 hw/pci/msi.c  |  5 +++--
 hw/pci/msix.c |  4 +++-
 hw/xen/xen_pt_msi.c   |  2 +-
 include/hw/xen/xen.h  |  2 +-
 stubs/xen-hvm.c   |  2 +-
 6 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index 8028bed..52dc8af 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -145,8 +145,16 @@ void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len)
 }
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
+/* If the MSI address is configured in remapping format, the MSI will not
+ * be remapped into a pirq. This 'if' test excludes Intel-specific
+ * remappable msi.
+ */
+#define MSI_ADDR_IF_MASK 0x0010
+if (msi_addr_lo & MSI_ADDR_IF_MASK) {
+return 0;
+}
 /* If vector is 0, the msi is remapped into a pirq, passed as
  * dest_id.
  */
diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 5e05ce5..d05c876 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev)
 static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 {
 uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
-uint32_t mask, data;
+uint32_t mask, data, addr_lo;
 bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 assert(vector < PCI_MSI_VECTORS_MAX);
 
@@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 }
 
 data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
-if (xen_is_pirq_msi(data)) {
+addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev));
+if (xen_is_pirq_msi(addr_lo, data)) {
 return false;
 }
 
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index c944c02..4cb01db 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -83,9 +83,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int vector, bool fmask)
 {
 unsigned offset = vector * PCI_MSIX_ENTRY_SIZE;
 uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA];
+uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR];
 /* MSIs on Xen can be remapped into pirqs. In those cases, masking
  * and unmasking go through the PV evtchn path. */
-if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) {
+if (xen_enabled() && xen_is_pirq_msi(pci_get_long(addr_lo),
+ pci_get_long(data))) {
 return false;
 }
 return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] &
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index f7d6e76..0e5bf83 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -96,7 +96,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s,
 
 assert((!is_msix && msix_entry == 0) || is_msix);
 
-if (xen_is_pirq_msi(data)) {
+if (xen_is_pirq_msi(addr, data)) {
 *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr);
 if (!*ppirq) {
 /* this probably identifies an misconfiguration of the guest,
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 7efcdaa..0d6c83e 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -34,7 +34,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 void xen_piix3_set_irq(void *opaque, int irq_num, int level);
 void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len);
 void xen_hvm_inject_msi(uint64_t addr, uint32_t data);
-int xen_is_pirq_msi(uint32_t msi_data);
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data);
 
 qemu_irq *xen_interrupt_controller_init(void);
 
diff --git a/stubs/xen-hvm.c b/stubs/xen-hvm.c
index 3ca6c51..aeb1592 100644
--- a/stubs/xen-hvm.c
+++ b/stubs/xen-hvm.c
@@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data)
 {
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
 return 0;
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 0/3] Qemu: add Xen vIOMMU interrupt remapping function support

2017-11-16 Thread Chao Gao
This patchset deals with MSI interrupt remapping requests when the guest
updates MSI registers.

In case of conflicts, this series also can be found in my personal github:
Xen: https://github.com/gc1008/viommu_xen.git vIOMMU4
Qemu: https://github.com/gc1008/viommu_qemu.git vIOMMU3

Any comments would be highly appreciated. And below is the change history.

Changes from v2:
In the last version, a new interface was used for binding a guest remappable msi
with a physical interrupt, while the old interface was used for binding
non-remappable msi. But for AMD, the interrupt format cannot be inferred
from the MSI message alone. To address this, we decided to pass the
whole guest msi message to Xen and let the vIOMMUs in Xen determine whether
a given interrupt is remappable or not.
So the following changes are made:
- Instead of introducing a new interface for binding remapping-format msi,
the existing interface is modified to support msi of both formats.
- In patch 3, define MSI_ADDR_IF_MASK inside a function because
it is Intel-specific. It is improper to define it in a common header.

Chao Gao (3):
  i386/msi: Correct mask of destination ID in MSI address
  xen/pt: Pass the whole msi addr/data to Xen
  msi: Handle remappable format interrupt request

 hw/i386/xen/xen-hvm.c | 10 -
 hw/pci/msi.c  |  5 +++--
 hw/pci/msix.c |  4 +++-
 hw/xen/xen_pt_msi.c   | 49 ---
 include/hw/i386/apic-msidef.h |  2 +-
 include/hw/xen/xen.h  |  2 +-
 stubs/xen-hvm.c   |  2 +-
 7 files changed, 31 insertions(+), 43 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [PATCH v3 2/3] xen/pt: Pass the whole msi addr/data to Xen

2017-11-16 Thread Chao Gao
Previously, some fields (reserved or unalterable) were filtered out by
Qemu. These fields are useless for the legacy interrupt format.
However, they may be meaningful (on the Intel platform)
for interrupts of remapping format. It is better to pass the whole
msi addr/data to Xen without any filtering.

The main reason we want this is that QEMU doesn't have the knowledge
to decide the interrupt format after we introduce the vIOMMU inside Xen.
Pass the whole msi message down and let the arch-specific vIOMMU
decide the interrupt format.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
v3:
 - new
---
 hw/xen/xen_pt_msi.c | 47 ---
 1 file changed, 12 insertions(+), 35 deletions(-)

diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 6d1e3bd..f7d6e76 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -47,25 +47,6 @@ static inline uint32_t msi_ext_dest_id(uint32_t addr_hi)
 return addr_hi & 0xff00;
 }
 
-static uint32_t msi_gflags(uint32_t data, uint64_t addr)
-{
-uint32_t result = 0;
-int rh, dm, dest_id, deliv_mode, trig_mode;
-
-rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
-dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
-dest_id = msi_dest_id(addr);
-deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
-trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
-
-result = dest_id | (rh << XEN_PT_GFLAGS_SHIFT_RH)
-| (dm << XEN_PT_GFLAGS_SHIFT_DM)
-| (deliv_mode << XEN_PT_GFLAGSSHIFT_DELIV_MODE)
-| (trig_mode << XEN_PT_GFLAGSSHIFT_TRG_MODE);
-
-return result;
-}
-
 static inline uint64_t msi_addr64(XenPTMSI *msi)
 {
 return (uint64_t)msi->addr_hi << 32 | msi->addr_lo;
@@ -160,23 +141,20 @@ static int msi_msix_update(XenPCIPassthroughState *s,
bool masked)
 {
 PCIDevice *d = &s->dev;
-uint8_t gvec = msi_vector(data);
-uint32_t gflags = msi_gflags(data, addr);
+uint32_t gflags = masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED);
 int rc = 0;
 uint64_t table_addr = 0;
 
-XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
-   " (entry: %#x)\n",
-   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
+XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x addr %"PRIx64
+   " data %#x gflags %#x (entry: %#x)\n",
+   is_msix ? "-X" : "", pirq, addr, data, gflags, msix_entry);
 
 if (is_msix) {
 table_addr = s->msix->mmio_base_addr;
 }
 
-gflags |= masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED);
-
-rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
-  pirq, gflags, table_addr);
+rc = xc_domain_update_msi_irq(xen_xc, xen_domid, pirq, addr,
+  data, gflags, table_addr);
 
 if (rc) {
 XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
@@ -199,8 +177,6 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
 bool is_binded)
 {
 PCIDevice *d = &s->dev;
-uint8_t gvec = msi_vector(data);
-uint32_t gflags = msi_gflags(data, addr);
 int rc = 0;
 
 if (pirq == XEN_PT_UNASSIGNED_PIRQ) {
@@ -208,12 +184,13 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
 }
 
 if (is_binded) {
-XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
-   is_msix ? "-X" : "", pirq, gvec);
-rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags);
+XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, addr %"PRIx64", data %#x\n",
+   is_msix ? "-X" : "", pirq, addr, data);
+rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, pirq, addr, data);
 if (rc) {
-XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, gvec: %#x)\n",
-   is_msix ? "-X" : "", errno, pirq, gvec);
+XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, "
+   "addr: %"PRIx64", data: %#x)\n",
+   is_msix ? "-X" : "", errno, pirq, addr, data);
 return rc;
 }
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 1/3] i386/msi: Correct mask of destination ID in MSI address

2017-11-16 Thread Chao Gao
According to SDM 10.11.1, only bits [19:12] of the MSI address are the
Destination ID; change the mask to avoid ambiguity, as the VT-d spec
uses bit 4 to indicate a remappable interrupt request.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
Reviewed-by: Anthony PERARD <anthony.per...@citrix.com>
Reviewed-by: Peter Xu <pet...@redhat.com>
---
 include/hw/i386/apic-msidef.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 8b4d4cc..420b411 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -26,6 +26,6 @@
 
 #define MSI_ADDR_DEST_ID_SHIFT  12
 #define MSI_ADDR_DEST_IDX_SHIFT 4
-#define  MSI_ADDR_DEST_ID_MASK  0x00ffff0
+#define  MSI_ADDR_DEST_ID_MASK  0x000ff000
 
 #endif /* HW_APIC_MSIDEF_H */
-- 
1.8.3.1




Re: [Qemu-devel] [RFC PATCH 4/4] msi: taking interrupt format into consideration during judging a pirq is binded with a event channel

2017-03-30 Thread Chao Gao
On Thu, Mar 30, 2017 at 06:29:29PM +0100, Anthony PERARD wrote:
>On Fri, Mar 17, 2017 at 07:29:17PM +0800, Lan Tianyu wrote:
>> From: Chao Gao <chao@intel.com>
>> Subject: msi: taking interrupt format into consideration during
>> judging a pirq is binded with a event channel
>
>This is quite a long title, I think we can make it shorter. Maybe "msi:
>Handle MSI remapping format".

OK.

>
>>  data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
>> -if (xen_is_pirq_msi(data)) {
>> +addr_lo = pci_get_word(dev->config + msi_address_lo_off(dev));
>
>This could be get_long, so addr_lo will actually have the all low bits
>of the addr.

yes.

>
>> +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo)
>>  {
>> +/* If msi address is configurate to remapping format, the msi will not
>> + * remapped into a pirq.
>> + */
>> +if ( msi_addr_lo & 0x10 )
>
>That's a magic number, is 0x10 MSI_ADDR_IM_MASK?

Yes.

Really thanks for your comments.

>
>Thanks,
>
>-- 
>Anthony PERARD



Re: [Qemu-devel] [RFC PATCH 3/4] xen-pt: bind/unbind interrupt remapping format MSI

2017-03-30 Thread Chao Gao
On Thu, Mar 30, 2017 at 05:51:45PM +0100, Anthony PERARD wrote:
>On Fri, Mar 17, 2017 at 07:29:16PM +0800, Lan Tianyu wrote:
>> From: Chao Gao <chao@intel.com>
>> 
>> If a vIOMMU is exposed to guest, guest will configure the msi to remapping
>> format. The original code isn't suitable to the new format. A new pair
>> bind/unbind interfaces are added for this usage. This patch recognizes
>> this case and use new interfaces to bind/unbind msi.
>> 
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>>  hw/xen/xen_pt_msi.c   | 36 
>>  include/hw/i386/apic-msidef.h |  1 +
>>  2 files changed, 29 insertions(+), 8 deletions(-)
>> 
>> diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
>> index 62add06..8b0d7fc 100644
>> --- a/hw/xen/xen_pt_msi.c
>> +++ b/hw/xen/xen_pt_msi.c
>> @@ -161,6 +161,7 @@ static int msi_msix_update(XenPCIPassthroughState *s,
>>  uint8_t gvec = msi_vector(data);
>>  uint32_t gflags = msi_gflags(data, addr);
>>  int rc = 0;
>> +bool ir = !!(addr & MSI_ADDR_IM_MASK);
>>  uint64_t table_addr = 0;
>>  
>>  XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
>> @@ -171,8 +172,14 @@ static int msi_msix_update(XenPCIPassthroughState *s,
>>  table_addr = s->msix->mmio_base_addr;
>>  }
>>  
>> -rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
>> +if (ir) {
>
>You could maybe use addr & MSI_ADDR_IM_MASK instead of going through a
>variable.
>
>> +rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq,
>> +d->devfn, data, addr, table_addr);
>
>Do you also want to update the XEN_PT_LOG above? Since it does not
>always reflect the update_msi call anymore.

Yes. I will adjust the output.

>
>> +}
>> +else {
>> +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
>>pirq, gflags, table_addr);
>> +}
>>  
>>  if (rc) {
>>  XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
>> @@ -204,13 +211,26 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
>>  }
>>  
>>  if (is_binded) {
>> -XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
>> -   is_msix ? "-X" : "", pirq, gvec);
>> -rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, 
>> gflags);
>> -if (rc) {
>> -XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, 
>> gvec: %#x)\n",
>> -   is_msix ? "-X" : "", errno, pirq, gvec);
>> -return rc;
>> +if ( addr & MSI_ADDR_IM_MASK ) {
>> +XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, addr: 
>> %lx)\n",
>
>For addr, it should be PRIx64 instead of %lx.
>
>> +   is_msix ? "-X" : "", pirq, data, addr);
>> +rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq,
>> +d->devfn, data, addr);
>> +if (rc) {
>> +XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, 
>> data: %x, addr: %lx)\n",
>> +   is_msix ? "-X" : "", rc, pirq, data, addr);
>> +return rc;
>> +}
>> +
>> +} else {
>> +XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
>> +   is_msix ? "-X" : "", pirq, gvec);
>> +rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, 
>> gflags);
>> +if (rc) {
>> +XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: 
>> %d, gvec: %#x)\n",
>> +   is_msix ? "-X" : "", errno, pirq, gvec);
>> +return rc;
>> +}
>>  }
>>  }
>>  
>> diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
>> index 8b4d4cc..08b584f 100644
>> --- a/include/hw/i386/apic-msidef.h
>> +++ b/include/hw/i386/apic-msidef.h
>> @@ -27,5 +27,6 @@
>>  #define MSI_ADDR_DEST_ID_SHIFT  12
>>  #define MSI_ADDR_DEST_IDX_SHIFT 4
>>  #define  MSI_ADDR_DEST_ID_MASK  0x00ffff0
>
>Could you add a 0 to dest_id here? So there will be 8 digits and it does
>not look weird when compared to the next define.
>

Will do.

>> +#define  MSI_ADDR_IM_MASK   0x0010
>
>Is the definition of MSI_ADDR_IM_MASK available somewhere? In the Intel
>SDM I've only found this bit to be reserved.

Yes, it is defined in VT-d spec 5.1.5.2 MSI and MSI-X Register Programming.
I made a mistake here. I should use MSI_ADDR_IF_MASK. 

Thanks
Chao

>
>Thanks,
>
>-- 
>Anthony PERARD



Re: [Qemu-devel] [RFC PATCH 2/4] Xen: add a dummy vIOMMU to create/destroy vIOMMU in Xen

2017-03-30 Thread Chao Gao
On Thu, Mar 30, 2017 at 05:24:52PM +0100, Anthony PERARD wrote:
>Hi,
>
>On Fri, Mar 17, 2017 at 07:29:15PM +0800, Lan Tianyu wrote:
>> From: Chao Gao <chao@intel.com>
>> 
>> Since adding a dynamic sysbus device is forbidden, choose TYPE_DEVICE
>> as the parent class.
>> 
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>>  hw/xen/Makefile.objs |   1 +
>>  hw/xen/xen_viommu.c  | 116 +++
>>  2 files changed, 117 insertions(+)
>>  create mode 100644 hw/xen/xen_viommu.c
>> 
>> +static void xen_viommu_realize(DeviceState *dev, Error **errp)
>> +{
>> +int rc;
>> +uint64_t cap;
>> +char *dom;
>> +char viommu_path[1024];
>> +XenVIOMMUState *s = XEN_VIOMMU_DEVICE(dev);
>> +
>> +s->id = -1;
>> +
>> +/* Read vIOMMU attributes from Xenstore. */
>> +dom = xs_get_domain_path(xenstore, xen_domid);
>> +snprintf(viommu_path, sizeof(viommu_path), "%s/viommu", dom);
>> +rc = xenstore_read_uint64(viommu_path, "base_addr", &s->base_addr);
>
>Could this information (base_addr and cap) be read from the command
>line instead of via xenstore?
>Any reason for this information to be on xenstore?

Actually, we passed both via the command line at first. We were just
concerned whether it was OK to pass an address through the command line,
since we found no device doing anything similar.

>
>> +if (rc) {
>> +error_report("Can't get base address of vIOMMU");
>
>I think error_setg should be used instead of error_report.
>
>> +exit(1);
>
>Also, exit should be removed, and return used instead. error_setg would be
>enough to signal that the device can not work.
>
>> +}
>> +
>> +rc = xenstore_read_uint64(viommu_path, "cap", &s->cap);
>> +if (rc) {
>> +error_report("Can't get capabilities of vIOMMU");
>> +exit(1);
>> +}
>> +
>> +rc = xc_viommu_query_cap(xen_xc, xen_domid, &cap);
>
>Since xc_viommu_* seems to be new, you should use
>xendevicemodel_viommu_* instead, also you will need to define a stub for
>them to be able to compile QEMU against older versions of Xen.

Will follow your suggestions above.

Thanks
Chao

>
>
>The patch looks good otherwise.
>
>Thanks,
>
>-- 
>Anthony PERARD



Re: [Qemu-devel] [PATCH v5 0/3] IOMMU: intel_iommu support map and unmap notifications

2016-10-26 Thread Chao Gao
On Fri, Oct 21, 2016 at 12:07:18AM +0300, Aviv B.D wrote:
>From: "Aviv Ben-David" 
>
>* Advertize Cache Mode capability in iommu cap register. 
>  This capability is controlled by "cache-mode" property of intel-iommu device.
>  To enable this option call QEMU with "-device intel-iommu,cache-mode=true".
>
>* On page cache invalidation in the intel vIOMMU, check if the domain belongs
>  to a registered notifier, and notify accordingly.
>
>Currently this patch still doesn't enable VFIO device support with vIOMMU 
>present. Current problems:
>* vfio_iommu_map_notify is not aware of the memory range belonging to a 
>  specific VFIOGuestIOMMU.
>* memory_region_iommu_replay hangs QEMU on start up while it iterates over 
>  the 64bit address space. Commenting out the call to this function enables 
>  a workable VFIO device while vIOMMU is present.
>
After applying the patch series based on commit ede0cbeb, I ran the following
commands to prepare the NIC device for assignment:
modprobe vfio-pci
echo 8086 1528 > /sys/bus/pci/drivers/vfio-pci/new_id
echo "0000:03:00.1" > /sys/bus/pci/devices/0000\:03\:00.1/driver/unbind
echo 8086 1528 > /sys/bus/pci/drivers/vfio-pci/new_id
After that, I tried to boot a guest with
qemu-system-x86_64 -boot c -m 4096 -monitor pty -serial stdio --enable-kvm \
-M kernel-irqchip=split -bios bios.bin -smp cpus=288 -hda vdisk.img \
-machine q35 -device intel-iommu,intremap=on,eim=on,cache-mode=true \
-net none -device vfio-pci,host=03:00.1
but encountered these errors:
qemu-system-x86_64: -device vfio-pci,host=03:00.1: iommu map to non memory area e258e000
qemu-system-x86_64: -device vfio-pci,host=03:00.1: iommu map to non memory area e258f000
qemu-system-x86_64: -device vfio-pci,host=03:00.1: iommu map to non memory area e259

Did I make a mistake in this test? How can I correct it?

By the way, there is a build error:
qemu/hw/pci-host/apb.c:326:5: error: initialization from incompatible pointer type [-Werror]
 .translate = pbm_translate_iommu,
 ^
qemu/hw/pci-host/apb.c:326:5: error: (near initialization for 'pbm_iommu_ops.translate') [-Werror]
cc1: all warnings being treated as errors
make: *** [hw/pci-host/apb.o] Error 1
>
>-- 
>1.9.1
>
>



Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode

2016-09-21 Thread Chao Gao
Hi, we have 3 problems left here:
1. IRQ remapping can't work with x2apic cluster mode.
2. CPUs with apic_id > 255 can't receive device interrupts.
3. Windows crashes when the IRQ remapping capability is presented to it.

I tested with the latest kernel v4.8-rc7 and found all of them unsolved.
I also couldn't find patches in the qemu and kernel code bases that solve
them. Will you fix these problems or leave them to the community?

Thanks,
-Chao

On Tue, Aug 09, 2016 at 02:51:16PM +0200, Radim Krčmář wrote:
>2016-08-09 16:19+0800, Chao Gao:
>> On Tue, Aug 09, 2016 at 02:18:15PM +0800, Peter Xu wrote:
>>>I think the problem is with x2apic. Currently, x2apic is enabled in
>>>vIOMMU when kernel irqchip is used. This is problematic, since
>>>actually we are throwing away dest_id[31:8] without Radim's patches,
>>>meanwhile I see that by default x2apic is using cluster mode.
>>>
>>>In cluster mode, 8 bits will possibly not suffice (I think the reason
>>>why >17 vcpus will bring trouble is that each cluster has 16 vcpus,
>>>we'll have trouble if we have more than one cluster).
>>>
>>>To temporarily solve your issue, you should not only need "-global
>>>ioapic.version=0x20" in QEMU command line, but also add "x2apic_phys"
>>>to you guest kernel boot parameter, to force guest boot with x2apic
>>>physical mode (not cluster mode). Though this can only work for <255
>>>vcpus. IMHO we may still need to wait for Radim's patches to test >255
>>>case.
>> 
>> OK, after adding "x2apic_phys" to the guest kernel boot parameter, I 
>> boot up a 288 (yes, 288) vcpu guest successfully with the command
>> qemu-system-x86_64 -boot c -m 4096 -sdl --enable-kvm \
>> -M kernel-irqchip=split -bios bios.bin -smp cpus=288 -hda vdisk.img \
>> -device intel-iommu,intremap=on -machine q35
>> Also, I can see interrupts on those cpus with initial apicid > 255 in 
>> /proc/cpuinfo and /proc/interrupts. 
>
>Great, thanks for testing!
>Only IPIs will be correctly delivered to apic_id > 255 without few more
>patches on the QEMU side, though.
>



Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode

2016-08-09 Thread Chao Gao
On Tue, Aug 09, 2016 at 02:18:15PM +0800, Peter Xu wrote:
>On Tue, Aug 09, 2016 at 12:33:17PM +0800, Chao Gao wrote:
>> On Mon, Aug 08, 2016 at 04:57:14PM +0800, Peter Xu wrote:
>> >On Mon, Aug 08, 2016 at 03:41:23PM +0800, Chao Gao wrote:
>> >> HI, everyone.
>> >> 
>> >> We have done some tests after merging this patch set into the latest qemu
>> >> master. On the kvm side, we use the latest kvm linux-next branch. Here are
>> >> some problems we have met.
>> >> 
>> >> 1. We can't boot up a 288 vcpus linux guest with CLI:
>> >> qemu-system-x86_64 -boot c -m 4096 -sdl -monitor pty --enable-kvm \
>> >> -M kernel-irqchip=split -serial stdio -bios bios.bin -smp cpus=288 \
>> >> -hda vdisk.img -device intel-iommu,intremap=on -machine q35.
>> >> The problem exists, even after we only assign 32 vcpus to the linux guest.
>> >> Maybe the output "do_IRQ: 146.113 No irq handler for vector (irq -1)" is 
>> >> a clue.
>> >> The output of qemu and kernel is in attachments. Do you have any idea
>> >> about the problem and how to solve it?
>> >
>> >IIUC, we need to wait for Radim's QEMU patches to finally enable 288
>> >vcpus?
>> >
>> >Btw, could you please try adding this to the QEMU cmdline when testing
>> >with 32 vcpus:
>> >
>> >  -global ioapic.version=0x20
>> >
>> >I see that you were running RHEL 7.2 guest with a default e1000. In
>> >that case, we may need to boost ioapic version to 0x20.
>> 
>> It doesn't work. My host machine has 16 cpus. When I assign 4 or 8 vcpus
>> to the guest, or 255 vcpus but set "kernel-irqchip=off", the guest works
>> well. Maybe when the irqchip is in the kernel, intremap can only handle
>> cases where the number of vcpus is less than the number of physical cpus.
>> Do you think that's right?
>
>I don't think so. Vcpu number should not be related to host cpu
>numbers.
>
>I think the problem is with x2apic. Currently, x2apic is enabled in
>vIOMMU when kernel irqchip is used. This is problematic, since
>actually we are throwing away dest_id[31:8] without Radim's patches,
>meanwhile I see that by default x2apic is using cluster mode.
>
>In cluster mode, 8 bits will possibly not suffice (I think the reason
>why >17 vcpus will bring trouble is that each cluster has 16 vcpus,
>we'll have trouble if we have more than one cluster).
>
>To temporarily solve your issue, you should not only need "-global
>ioapic.version=0x20" in QEMU command line, but also add "x2apic_phys"
>to you guest kernel boot parameter, to force guest boot with x2apic
>physical mode (not cluster mode). Though this can only work for <255
>vcpus. IMHO we may still need to wait for Radim's patches to test >255
>case.

OK, after adding "x2apic_phys" to the guest kernel boot parameter, I 
boot up a 288 (yes, 288) vcpu guest successfully with the command
qemu-system-x86_64 -boot c -m 4096 -sdl --enable-kvm \
-M kernel-irqchip=split -bios bios.bin -smp cpus=288 -hda vdisk.img \
-device intel-iommu,intremap=on -machine q35
Also, I can see interrupts on those cpus with initial apicid > 255 in 
/proc/cpuinfo and /proc/interrupts. 

Thanks.
-Chao




Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode

2016-08-08 Thread Chao Gao
On Mon, Aug 08, 2016 at 04:57:14PM +0800, Peter Xu wrote:
>On Mon, Aug 08, 2016 at 03:41:23PM +0800, Chao Gao wrote:
>> HI, everyone.
>> 
>> We have done some tests after merging this patch set into the latest qemu
>> master. On the kvm side, we use the latest kvm linux-next branch. Here are
>> some problems we have met.
>> 
>> 1. We can't boot up a 288 vcpus linux guest with CLI:
>> qemu-system-x86_64 -boot c -m 4096 -sdl -monitor pty --enable-kvm \
>> -M kernel-irqchip=split -serial stdio -bios bios.bin -smp cpus=288 \
>> -hda vdisk.img -device intel-iommu,intremap=on -machine q35.
>> The problem exists, even after we only assign 32 vcpus to the linux guest.
>> Maybe the output "do_IRQ: 146.113 No irq handler for vector (irq -1)" is a 
>> clue.
>> The output of qemu and kernel is in attachments. Do you have any idea
>> about the problem and how to solve it?
>
>IIUC, we need to wait for Radim's QEMU patches to finally enable 288
>vcpus?
>
>Btw, could you please try adding this to the QEMU cmdline when testing
>with 32 vcpus:
>
>  -global ioapic.version=0x20
>
>I see that you were running RHEL 7.2 guest with a default e1000. In
>that case, we may need to boost ioapic version to 0x20.

It doesn't work. My host machine has 16 cpus. When I assign 4 or 8 vcpus to
the guest, or 255 vcpus but set "kernel-irqchip=off", the guest works well.
Maybe when the irqchip is in the kernel, intremap can only handle cases where
the number of vcpus is less than the number of physical cpus. Do you think
that's right?

Thanks,
-Chao



Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode

2016-08-08 Thread Chao Gao
On Mon, Aug 08, 2016 at 11:18:20AM +0200, Igor Mammedov wrote:
>On Mon, 8 Aug 2016 15:41:23 +0800
>Chao Gao <chao@intel.com> wrote:
>
>> HI, everyone.
>> 
>> We have done some tests after merging this patch set into the latest qemu
>> master. On the kvm side, we use the latest kvm linux-next branch. Here are
>> some problems we have met.
>> 
>> 1. We can't boot up a 288 vcpus linux guest with CLI:
>> qemu-system-x86_64 -boot c -m 4096 -sdl -monitor pty --enable-kvm \
>> -M kernel-irqchip=split -serial stdio -bios bios.bin -smp cpus=288 \
>> -hda vdisk.img -device intel-iommu,intremap=on -machine q35.
>> The problem exists, even after we only assign 32 vcpus to the linux guest.
>> Maybe the output "do_IRQ: 146.113 No irq handler for vector (irq -1)" is a 
>> clue.
>> The output of qemu and kernel is in attachments. Do you have any idea
>> about the problem and how to solve it?
>I don't think we ever looked at "kernel-irqchip=split"; only the in-kernel
>variant has been targeted so far.
>Radim probably knows better whether it should work or not.
>
>Have you tried with smaller amount of CPUs but with APIC IDs above 254,
>like in test below?
>
>[...]
>
>> >Tested with following CLI:
>> > QEMU -M q35 -enable-kvm -smp 1,sockets=9,cores=32,threads=1,maxcpus=288 \
>> >  -device qemu64-x86_64-cpu,socket-id=8,core-id=30,thread-id=0   \
>> >  -bios x2apic_bios.bin

I tested with this CLI:
qemu-system-x86_64 -M q35 \
-enable-kvm -smp 1,sockets=9,cores=32,threads=1,maxcpus=288 \
-device qemu64-x86_64-cpu,socket-id=8,core-id=30,thread-id=0 \
-bios bios.bin -hda vdisk.img -serial stdio -m 4096 2>>qemu_and_guest.log >&2

But I think there should be a cpu with initial apicid > 255
in /proc/cpuinfo. The log (in the attachments) shows that the guest kernel
treats the other cpu as a bad one. What do you think causes the problem?

# cat /proc/interrupts
           CPU0
   0:125   IO-APIC-edge  timer
   1:117   IO-APIC-edge  i8042
   4:382   IO-APIC-edge  serial
   7:  0   IO-APIC-edge  parport0
   8:  1   IO-APIC-edge  rtc0
   9:  0   IO-APIC-fasteoi   acpi
  12:   1661   IO-APIC-edge  i8042
  16:  0   IO-APIC-fasteoi   i801_smbus
  22: 27   IO-APIC-fasteoi   enp0s2
  24:   7310   PCI-MSI-edge  0000:00:1f.2
 NMI:  0   Non-maskable interrupts
 LOC:   6401   Local timer interrupts
 SPU:  0   Spurious interrupts
 PMI:  0   Performance monitoring interrupts
 IWI:   3870   IRQ work interrupts
 RTR:  0   APIC ICR read retries
 RES:  0   Rescheduling interrupts
 CAL:  0   Function call interrupts
 TLB:  0   TLB shootdowns
 TRM:  0   Thermal event interrupts
 THR:  0   Threshold APIC interrupts
 MCE:  0   Machine check exceptions
 MCP:  1   Machine check polls
 ERR:  0
 MIS:  0

# cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 6
model name  : QEMU Virtual CPU version 2.5+
stepping: 3
microcode   : 0x1
cpu MHz : 3591.682
cache size  : 4096 KB
physical id : 0
siblings: 1
core id : 0
cpu cores   : 1
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl xtopology pni 
cx16 x2apic hypervisor lahf_lm
bogomips: 7183.36
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
Warning: Number of hotpluggable cpus requested (288) exceeds the recommended cpus supported by KVM (240)
Changing serial settings was 0/0 now 3/0
SeaBIOS (version ?-20160808_104017-g.c)
BUILD: gcc: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) binutils: version 2.23.52.0.1-55.el7 20130226
No Xen hypervisor found.
vendor 8086 device 29c0
Running on QEMU (q35)
Running on KVM
RamSize: 0x8000 [cmos]
Relocating init from 0x000da4e0 to 0x7ffac9d0 (size 79264)
Found QEMU fw_cfg
QEMU fw_cfg DMA interface supported
RamBlock: addr 0x len 0x8000 [e820]
RamBlock: addr 0x0001 len 0x8000 [e820]
Moving pm_base to 0x600
=== PCI bus & bridge init ===
PCI: pci_bios_init_bus_rec bus = 0x0
=== PCI device probing ===
Found 6 PCI devices (max PCI bus is 00)
=== PCI new allocation pass #1 ===
PCI: check devices
=== PCI new allocation pass #2 ===
PCI: IO: c000 - c09f
PCI: 32: c000 - fec0
PCI: map device bdf=00:02.0  bar 1, addr c000, size 0040 [io]
PCI: map device bdf=00:1f.3  bar 4, addr 00