Re: Using virtual IOMMU in guest hypervisors other than KVM and Xen?

2019-10-19 Thread Jintack Lim
On Fri, Oct 18, 2019 at 8:37 PM Peter Xu  wrote:
>
> On Wed, Oct 16, 2019 at 03:01:22PM -0700, Jintack Lim wrote:
> > On Mon, Oct 14, 2019 at 7:50 PM Peter Xu  wrote:
> > >
> > > On Mon, Oct 14, 2019 at 01:28:49PM -0700, Jintack Lim wrote:
> > > > Hi,
> > >
> > > Hello, Jintack,
> > >
> > Hi Peter,
> >
> > > >
> > > > I'm trying to pass through a physical network device to a nested VM
> > > > using virtual IOMMU. While I was able to do it successfully using KVM
> > > > and Xen guest hypervisors running in a VM respectively, I couldn't do
> > > > it with Hyper-V as I described below. I wonder if anyone has
> > > > successfully used virtual IOMMU in hypervisors other than KVM
> > > > and Xen (like Hyper-V or VMware)?
> > > >
> > > > The issue I have with Hyper-V is that Hyper-V gives an error that the
> > > > underlying hardware is not capable of doing passthrough. The exact
> > > > error message is as follows.
> > > >
> > > > Windows PowerShell > (Get-VMHost).IovSupportReasons
> > > > The chipset on the system does not do DMA remapping, without which
> > > > SR-IOV cannot be supported.
> > > >
> > > > I'm pretty sure that Hyper-V recognizes virtual IOMMU, though; I have
> > > > enabled the IOMMU in the Windows boot loader[1], and I see differences when
> > > > booting a Windows VM with and without virtual IOMMU. I also checked
> > > > that virtual IOMMU traces are printed.
> > >
> > > What traces have you checked?  More explicitly, have you seen DMAR
> > > enabled and page table setup for that specific device to be
> > > passed through?
> >
> > Thanks for the pointers. I checked that DMAR is NOT enabled. The only
> > registers that the Windows guest accessed were the Version Register,
> > Capability Register, and Extended Capability Register. On the other
> > hand, a Linux guest accessed other registers and enabled DMAR.
> > Here's a link to the trace I got using QEMU 4.1.0. Do you see anything
> > interesting there?
> > http://paste.ubuntu.com/p/YcSyxG9Z3x/
>
> Then I feel like Windows is reluctant to enable DMAR due to the lack of
> some caps.
>
> >
> > >
> > > >
> > > > I have tried multiple KVM/QEMU versions including the latest ones
> > > > (kernel v5.3, QEMU 4.1.0) as well as two different Windows servers
> > > > (2016 and 2019), but I see the same result. [4]
> > > >
> > > > I'd love to hear if somebody is using virtual IOMMU in Hyper-V or
> > > > VMware successfully, especially for passthrough. I would also appreciate it if
> > > > somebody could point out any configuration errors I have.
> > > >
> > > > Here's the qemu command line I use, basically from the QEMU vt-d
> > > > page[2] and Hyper-v on KVM from kvmforum [3].
> > > >
> > > > ./qemu/x86_64-softmmu/qemu-system-x86_64 -device
> > > > intel-iommu,intremap=on,caching-mode=on -smp 6 -m 24G -M
> > >
> > > Have you tried to use 4-level IOMMU page table (aw-bits=48 on latest
> > > QEMU, or x-aw-bits=48 on some old ones)?  IIRC we've encountered
> > > issues when trying to pass the SVVP Windows test with this, in which
> > > 4-level is required.  I'm not sure whether that is required in
> > > general usages of vIOMMU in Windows.
> >
> > I just tried the option you mentioned, but it didn't change anything.
> > BTW, what version of Windows was it?
>
> Sorry, I don't remember that. I didn't do the test myself, but I was
> told that the test passed with it.  I assume you're using the
> latest QEMU here because I know Windows could require another
> capability (DMA draining) and it should be on by default in latest
> qemu master.

Thanks. Yes, I plan to use v2.11.0 eventually, but I'm trying to make
things work with the latest version first.

>
> At that time the complete cmdline to pass the test should be:
>
>   -device intel-iommu,intremap=on,aw-bits=48,caching-mode=off,eim=on
>
> I also don't remember why caching-mode needs to be off at that
> time (otherwise SVVP fails too).

Thanks for providing the cmdline. However, turning off the
caching-mode with an assigned device resulted in the following error
on VM boot.
"We need to set caching-mode=on for intel-iommu to enable device assignment."
Does this mean that we can't assign a physical device all the way to a
nested VM with a Windows L1 hypervisor as of now?

Without assigning a device, I was able to boot a Windows VM with the
cmdline above and I see that DMAR in vIOMMU is enabled. Windows still
complains about DMA remapping, though. I'll investigate further.
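
(For readers trying to reproduce this: a minimal sketch of the two intel-iommu configurations the thread converges on. The option spellings are for recent QEMU and may need adjusting for older versions.)

# With a vfio-pci assigned device, caching-mode has to stay on:
-device intel-iommu,intremap=on,caching-mode=on,aw-bits=48,eim=on
# Without device assignment (the SVVP-style setup Peter describes above):
-device intel-iommu,intremap=on,caching-mode=off,aw-bits=48,eim=on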

>
> --
> Peter Xu
>



Re: Using virtual IOMMU in guest hypervisors other than KVM and Xen?

2019-10-16 Thread Jintack Lim
On Mon, Oct 14, 2019 at 7:50 PM Peter Xu  wrote:
>
> On Mon, Oct 14, 2019 at 01:28:49PM -0700, Jintack Lim wrote:
> > Hi,
>
> Hello, Jintack,
>
Hi Peter,

> >
> > I'm trying to pass through a physical network device to a nested VM
> > using virtual IOMMU. While I was able to do it successfully using KVM
> > and Xen guest hypervisors running in a VM respectively, I couldn't do
> > it with Hyper-V as I described below. I wonder if anyone has
> > successfully used virtual IOMMU in hypervisors other than KVM
> > and Xen (like Hyper-V or VMware)?
> >
> > The issue I have with Hyper-V is that Hyper-V gives an error that the
> > underlying hardware is not capable of doing passthrough. The exact
> > error message is as follows.
> >
> > Windows PowerShell > (Get-VMHost).IovSupportReasons
> > The chipset on the system does not do DMA remapping, without which
> > SR-IOV cannot be supported.
> >
> > I'm pretty sure that Hyper-V recognizes virtual IOMMU, though; I have
> > enabled the IOMMU in the Windows boot loader[1], and I see differences when
> > booting a Windows VM with and without virtual IOMMU. I also checked
> > that virtual IOMMU traces are printed.
>
> What traces have you checked?  More explicitly, have you seen DMAR
> enabled and page table setup for that specific device to be
> passed through?

Thanks for the pointers. I checked that DMAR is NOT enabled. The only
registers that the Windows guest accessed were the Version Register,
Capability Register, and Extended Capability Register. On the other
hand, a Linux guest accessed other registers and enabled DMAR.
Here's a link to the trace I got using QEMU 4.1.0. Do you see anything
interesting there?
http://paste.ubuntu.com/p/YcSyxG9Z3x/
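
(For reference, a minimal sketch of how a trace like the one above can be captured, assuming a QEMU build with the default "log" trace backend; the intel_iommu trace points all use the vtd_ prefix.)

qemu-system-x86_64 \
    ... \                                      # usual guest options
    -trace "vtd_*" -D /tmp/viommu-trace.log    # enable the vIOMMU trace events and write them to a file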

>
> >
> > I have tried multiple KVM/QEMU versions including the latest ones
> > (kernel v5.3, QEMU 4.1.0) as well as two different Windows servers
> > (2016 and 2019), but I see the same result. [4]
> >
> > I'd love to hear if somebody is using virtual IOMMU in Hyper-V or
> > VMware successfully, especially for passthrough. I would also appreciate it if
> > somebody could point out any configuration errors I have.
> >
> > Here's the qemu command line I use, basically from the QEMU vt-d
> > page[2] and Hyper-v on KVM from kvmforum [3].
> >
> > ./qemu/x86_64-softmmu/qemu-system-x86_64 -device
> > intel-iommu,intremap=on,caching-mode=on -smp 6 -m 24G -M
>
> Have you tried to use 4-level IOMMU page table (aw-bits=48 on latest
> QEMU, or x-aw-bits=48 on some old ones)?  IIRC we've encountered
> issues when trying to pass the SVVP Windows test with this, in which
> 4-level is required.  I'm not sure whether that is required in
> general usages of vIOMMU in Windows.

I just tried the option you mentioned, but it didn't change anything.
BTW, what version of Windows was it?

>
> > q35,accel=kvm,kernel-irqchip=split -cpu
> > host,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time -drive
> > if=none,file=/vm/guest0.img,id=vda,cache=none,format=raw -device
> > virtio-blk-pci,drive=vda --nographic -qmp
> > unix:/var/run/qmp,server,nowait -serial
> > telnet:127.0.0.1:,server,nowait -netdev
> > user,id=net0,hostfwd=tcp::-:22 -device
> > virtio-net-pci,netdev=net0,mac=de:ad:be:ef:f2:12 -netdev
> > tap,id=net1,vhost=on,helper=/srv/vm/qemu/qemu-bridge-helper -device
> > virtio-net-pci,netdev=net1,disable-modern=off,disable-legacy=on,mac=de:ad:be:ef:f2:11
> > -device vfio-pci,host=:06:10.0,id=net2 -monitor stdio -usb -device
> > usb-tablet -rtc base=localtime,clock=host -vnc 127.0.0.1:4 --cdrom
> > win19.iso --drive file=virtio-win.iso,index=3,media=cdrom
>
> --
> Peter Xu
>



Using virtual IOMMU in guest hypervisors other than KVM and Xen?

2019-10-14 Thread Jintack Lim
Hi,

I'm trying to pass through a physical network device to a nested VM
using virtual IOMMU. While I was able to do it successfully using KVM
and Xen guest hypervisors running in a VM respectively, I couldn't do
it with Hyper-V as I described below. I wonder if anyone has
successfully used virtual IOMMU in hypervisors other than KVM
and Xen (like Hyper-V or VMware)?

The issue I have with Hyper-V is that Hyper-V gives an error that the
underlying hardware is not capable of doing passthrough. The exact
error message is as follows.

Windows PowerShell > (Get-VMHost).IovSupportReasons
The chipset on the system does not do DMA remapping, without which
SR-IOV cannot be supported.

I'm pretty sure that Hyper-V recognizes virtual IOMMU, though; I have
enabled the IOMMU in the Windows boot loader[1], and I see differences when
booting a Windows VM with and without virtual IOMMU. I also checked
that virtual IOMMU traces are printed.

I have tried multiple KVM/QEMU versions including the latest ones
(kernel v5.3, QEMU 4.1.0) as well as two different Windows servers
(2016 and 2019), but I see the same result. [4]

I'd love to hear if somebody is using virtual IOMMU in Hyper-V or
VMware successfully, especially for passthrough. I would also appreciate it if
somebody could point out any configuration errors I have.

Here's the qemu command line I use, basically from the QEMU vt-d
page[2] and Hyper-v on KVM from kvmforum [3].

./qemu/x86_64-softmmu/qemu-system-x86_64 -device
intel-iommu,intremap=on,caching-mode=on -smp 6 -m 24G -M
q35,accel=kvm,kernel-irqchip=split -cpu
host,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time -drive
if=none,file=/vm/guest0.img,id=vda,cache=none,format=raw -device
virtio-blk-pci,drive=vda --nographic -qmp
unix:/var/run/qmp,server,nowait -serial
telnet:127.0.0.1:,server,nowait -netdev
user,id=net0,hostfwd=tcp::-:22 -device
virtio-net-pci,netdev=net0,mac=de:ad:be:ef:f2:12 -netdev
tap,id=net1,vhost=on,helper=/srv/vm/qemu/qemu-bridge-helper -device
virtio-net-pci,netdev=net1,disable-modern=off,disable-legacy=on,mac=de:ad:be:ef:f2:11
-device vfio-pci,host=:06:10.0,id=net2 -monitor stdio -usb -device
usb-tablet -rtc base=localtime,clock=host -vnc 127.0.0.1:4 --cdrom
win19.iso --drive file=virtio-win.iso,index=3,media=cdrom

Thanks,
Jintack

[1] 
https://social.technet.microsoft.com/Forums/en-US/a7c2940a-af32-4dab-8b31-7a605e8cf075/a-hypervisor-feature-is-not-available-to-the-user?forum=WinServerPreview
[2] https://wiki.qemu.org/Features/VT-d
[3] https://www.linux-kvm.org/images/6/6a/HyperV-KVM.pdf
[4] https://www.mail-archive.com/qemu-devel@nongnu.org/msg568963.html



Re: Migration failure when running nested VMs

2019-09-23 Thread Jintack Lim
On Mon, Sep 23, 2019 at 3:42 AM Dr. David Alan Gilbert
 wrote:
>
> * Jintack Lim (incredible.t...@gmail.com) wrote:
> > Hi,
>
> Copying in Paolo, since he recently did work to fix nested migration -
> it was expected to be broken until pretty recently; but 4.1.0 qemu on
> 5.3 kernel is pretty new, so I think I'd expected it to work.
>

Thank you, Dave. What Paolo proposed makes migration work!

> > I'm seeing VM live migration failure when a VM is running a nested VM.
> > I'm using latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried
> > v5.2, but the result was the same. Kernel versions in L1 and L2 VM are
> > v4.18, but I don't think that matters.
> >
> > The symptom is that L2 VM kernel crashes in different places after
> > migration but the call stack is mostly related to memory management
> > like [1] and [2]. The kernel crash happens almost all the time. While
> > L2 VM gets kernel panic, L1 VM runs fine after the migration. Both L1
> > and L2 VM were doing nothing during migration.
> >
> > I found a few clues about this issue.
> > 1) It happens with a relatively large memory for L1 (24G), but it does
> > not with a smaller size (3G).
> >
> > 2) Dead migration worked; when I ran "stop" command in the qemu
> > monitor for L1 first and did migration, migration worked always. It
> > also worked when I only stopped L2 VM and kept L1 live during the
> > migration.
> >
> > With those two clues, I guess maybe some dirty pages made by L2 are
> > not transferred to the destination correctly, but I'm not really sure.
> >
> > 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> > Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
> >
> > This makes me confused because I thought migrating nested state
> > doesn't depend on the underlying hardware.. Anyways, L1-only migration
> > with the large memory size (24G) works on both CPUs without any
> > problem.
> >
> > I would appreciate any comments/suggestions to fix this problem.
>
> Can you share the qemu command lines you're using for both L1 and L2
> please ?

Sure. I use the same QEMU command line for L1 and L2 except for the CPU
and memory allocation.

This is the one for running L1; I use a smaller CPU count and memory size for L2.
./qemu/x86_64-softmmu/qemu-system-x86_64 -smp 6 -m 24G -M
q35,accel=kvm -cpu host -drive
if=none,file=/vm_nfs/guest0.img,id=vda,cache=none,format=raw -device
virtio-blk-pci,drive=vda --nographic -qmp
unix:/var/run/qmp,server,wait -serial mon:stdio -netdev
user,id=net0,hostfwd=tcp::-:22 -device
virtio-net-pci,netdev=net0,mac=de:ad:be:ef:f2:12 -netdev
tap,id=net1,vhost=on,helper=/srv/vm/qemu/qemu-bridge-helper -device
virtio-net-pci,netdev=net1,disable-modern=off,disable-legacy=on,mac=de:ad:be:ef:f2:11
-monitor telnet:127.0.0.1:,server,nowait

> Are there any dmesg entries around the time of the migration on either
> the hosts or the L1 VMs?

No, I didn't see anything special in L0 or L1 kernel log.

> What guest OS are you running in L1 and L2?
>

I'm using Linux v4.18 both in L1 and L2.

Thanks,
Jintack

> Dave
>
> > Thanks,
> > Jintack
> >
> >
> > [1]https://paste.ubuntu.com/p/XGDKH45yt4/
> > [2]https://paste.ubuntu.com/p/CpbVTXJCyc/
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: Migration failure when running nested VMs

2019-09-23 Thread Jintack Lim
On Mon, Sep 23, 2019 at 4:48 AM Paolo Bonzini  wrote:
>
> On 23/09/19 12:42, Dr. David Alan Gilbert wrote:
> >
> > With those two clues, I guess maybe some dirty pages made by L2 are
> > not transferred to the destination correctly, but I'm not really sure.
> >
> > 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> > Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
>
> Hmm, try disabling pml (kvm_intel.pml=0).  This would be the main
> difference, memory-management wise, between those two machines.
>

Thank you, Paolo.

This makes migration work successfully over 20 times in a row on the
Intel(R) Xeon(R) Silver 4114 CPU, where migration almost always failed
without disabling PML.
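
(For anyone hitting the same issue, a rough sketch of checking and disabling PML on the host; the kvm_intel parameter name is the standard one, but double-check it on your kernel.)

cat /sys/module/kvm_intel/parameters/pml            # is Page Modification Logging enabled right now?
modprobe -r kvm_intel && modprobe kvm_intel pml=0   # reload with PML off (requires that no VMs are running)
# or add kvm_intel.pml=0 to the host kernel command line to make it persistent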

I guess there's a problem in the KVM PML code? I'm fine with disabling
PML, but if you have patches to fix the issue, I'm willing to test them
on that CPU.

Thanks,
Jintack

> Paolo



Migration failure when running nested VMs

2019-09-20 Thread Jintack Lim
Hi,

I'm seeing VM live migration failure when a VM is running a nested VM.
I'm using the latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried
v5.2, but the result was the same. Kernel versions in L1 and L2 VM are
v4.18, but I don't think that matters.

The symptom is that the L2 VM kernel crashes in different places after
migration, but the call stack is mostly related to memory management,
like [1] and [2]. The kernel crash happens almost all the time. While
the L2 VM gets a kernel panic, the L1 VM runs fine after the migration.
Both the L1 and L2 VMs were doing nothing during migration.

I found a few clues about this issue.
1) It happens with a relatively large memory for L1 (24G), but it does
not with a smaller size (3G).

2) Migration of a stopped VM worked; when I ran the "stop" command in the
QEMU monitor for L1 first and then migrated, migration always worked. It
also worked when I only stopped the L2 VM and kept L1 live during the
migration.

With those two clues, I guess maybe some dirty pages made by L2 are
not transferred to the destination correctly, but I'm not really sure.

3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
Intel(R) Xeon(R) CPU E5-2630 v3 CPU.

This confuses me because I thought migrating nested state
doesn't depend on the underlying hardware. Anyway, L1-only migration
with the large memory size (24G) works on both CPUs without any
problem.

I would appreciate any comments/suggestions to fix this problem.

Thanks,
Jintack


[1]https://paste.ubuntu.com/p/XGDKH45yt4/
[2]https://paste.ubuntu.com/p/CpbVTXJCyc/



Re: [Qemu-devel] Why one virtio-pci device has two different DeviceState?

2019-01-06 Thread Jintack Lim
On Sat, Jan 5, 2019 at 10:42 AM Peter Maydell  wrote:
>
> On Fri, 4 Jan 2019 at 20:23, Jintack Lim  wrote:
> > I was wondering why one virtio-pci device has two different
> > DeviceStates: one directly from VirtIOPCIProxy and the other from
> > VirtIO, such as VirtIONet. As an example, they are denoted as
> > qdev and vdev respectively in virtio_net_pci_realize().
>
> It's been a while since I looked at this, but there are two
> basic issues underlying the weird way virtio devices are
> set up:
>  (1) PCI is not the only "transport" -- the VirtIONet etc
>  are shared with other transports like MMIO or the S390 ones
>  (2) retaining back-compatibility matters a lot here: we need
>  command lines to still work, and also the migration data
>  stream needs to stay compatible
> Some of the way the devices are structured reflects the way we started
> with a design where there was only a single device (e.g. the
> pci virtio-net device) and then refactored it to support
> multiple transports while retaining back compatibility.

Thanks for the insight, Peter. That makes sense!
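
(To make the parent/child relationship concrete, here is an abbreviated, illustrative view of what "info qtree" in the QEMU monitor prints for such a device; the exact output varies between QEMU versions.)

(qemu) info qtree
  ...
  dev: virtio-net-pci, id ""          <- the PCI transport DeviceState (the qdev proxy)
    ...
    bus: virtio-bus
      type virtio-pci-bus
      dev: virtio-net-device, id ""   <- the VirtIONet DeviceState (the vdev)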

Thanks,
Jintack

>
> > I thought that just one DeviceState is enough for any device in QEMU.
> > Maybe I'm missing something fundamental here.
>
> This isn't generally true, it's just that a lot of
> our devices are of the simple straightforward kind
> where that's true. It's also possible for an
> implementation of a device to be as a combination
> of other devices, which is what we have here.
> virtio-pci-net is-a PCIDevice (which in turn is-a Device),
> but it has-a VirtIONet device (which is-a Device) as
> part of its implementation.
> (It's also possible to manually create the pci
> transport and the virtio-net backend separately
> and connect them together without the virtio-pci-net
> device at all. That's more often used with non-pci
> transports but it works for pci too.)
>
> You can also see a similar thing with a lot of the
> "container" SoC objects like TYPE_ASPEED_SOC, which
> is a subclass of DeviceState, but is implemented
> using a dozen different objects all of which are
> themselves DeviceState subclasses.
>
> thanks
> -- PMM
>




[Qemu-devel] Why one virtio-pci device has two different DeviceState?

2019-01-04 Thread Jintack Lim
Hi,

I was wondering why one virtio-pci device has two different
DeviceStates: one directly from VirtIOPCIProxy and the other from
VirtIO, such as VirtIONet. As an example, they are denoted as
qdev and vdev respectively in virtio_net_pci_realize().

I thought that just one DeviceState is enough for any device in QEMU.
Maybe I'm missing something fundamental here.

*Just* for people who wonder why I'm asking this question, I'd like to
find a device in the list of SaveStateEntry on an MMIO operation to a
PCI device. For virtio devices, I only have qdev information in the
MMIO handler while I need to have vdev information to find the virtio
device in the SaveStateEntry list. I can possibly do this by
converting qdev to vdev knowing this is a virtio device as in
virtio_net_pci_realize(), but I'd like to find a way to do it without
knowing the device is a virtio device.

Thanks,
Jintack




Re: [Qemu-devel] Logging dirty pages from vhost-net in-kernel with vIOMMU

2018-12-09 Thread Jintack Lim
On Fri, Dec 7, 2018 at 7:37 AM Jason Wang  wrote:
>
>
> On 2018/12/6 8:44 PM, Jason Wang wrote:
> >
> > On 2018/12/6 8:11 PM, Jintack Lim wrote:
> >> On Thu, Dec 6, 2018 at 2:33 AM Jason Wang  wrote:
> >>>
> >>> On 2018/12/5 10:47 PM, Jintack Lim wrote:
> >>>> On Tue, Dec 4, 2018 at 8:30 PM Jason Wang  wrote:
> >>>>> On 2018/12/5 2:37 AM, Jintack Lim wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm wondering how the current implementation works when logging
> >>>>>> dirty
> >>>>>> pages during migration from vhost-net (in kernel) when used vIOMMU.
> >>>>>>
> >>>>>> I understand how vhost-net logs GPAs when not using vIOMMU. But when
> >>>>>> we use vhost with vIOMMU, then shouldn't vhost-net need to log the
> >>>>>> translated address (GPA) instead of the address written in the
> >>>>>> descriptor (IOVA) ? The current implementation looks like vhost-net
> >>>>>> just logs IOVA without translation in vhost_get_vq_desc() in
> >>>>>> drivers/vhost/net.c. It seems like QEMU doesn't do any further
> >>>>>> translation of the dirty log when syncing.
> >>>>>>
> >>>>>> I might be missing something. Could somebody shed some light on
> >>>>>> this?
> >>>>> Good catch. It looks like a bug to me. Want to post a patch for this?
> >>>> Thanks for the confirmation.
> >>>>
> >>>> What would be a good setup to catch this kind of migration bug? I
> >>>> tried to observe it in the VM expecting to see network applications
> >>>> not getting data correctly on the destination, but it was not
> >>>> successful (i.e. the VM on the destination just worked fine.) I didn't
> >>>> even see anything going wrong when I disabled the vhost logging
> >>>> completely without using vIOMMU.
> >>>>
> >>>> What I did is I ran multiple network benchmarks (e.g. netperf tcp
> >>>> stream and my own one to check correctness of received data) in a VM
> >>>> without vhost dirty page logging, and the benchmarks just ran fine in
> >>>> the destination. I checked the used ring at the time the VM is stopped
> >>>> in the source for migration, and it had multiple descriptors that is
> >>>> (probably) not processed in the VM yet. Do you have any insight how it
> >>>> could just work and what would be a good setup to catch this?
> >>>
> >>> According to past experience, it could be reproduced by doing scp from
> >>> host to guest during migration.
> >>>
> >> Thanks. I actually tried that, but didn't see any problem either - I
> >> copied a large file during migration from host to guest (the copy
> >> continued on the destination), and checked md5 hashes using md5sum,
> >> but the copied file had the same checksum as the one in the host.
> >>
> >> Do you recall what kind of symptom you observed when the dirty pages
> >> were not migrated correctly with scp?
> >
> >
> > Yes, the point is to make the migration converge before the end of
> > scp (e.g. set the migration speed to a very big value). If scp ends before
> > migration, we won't catch the bug. And it's better to do several
> > rounds of migration during scp.
> >
> > Anyway, let me try to reproduce it tomorrow.
> >
>
> Looks like I can reproduce this; scp gives me the following error:
>
> scp /home/file root@192.168.100.4:/home
> file                                 63% 1301MB  58.1MB/s   00:12 ETA
> Received disconnect from 192.168.100.4: 2: Packet corrupt
> lost connection

Thanks for sharing this.

I was able to reproduce the bug. I observed different md5sums in the
host and the guest after several tries. I didn't observe the
disconnect you saw, but the differing md5sums are enough to show the
bug, I guess.
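
(For completeness, a rough sketch of the reproduction recipe discussed here; the monitor command names are HMP ones from QEMU of this era, and the file name and guest IP are just the examples used above.)

md5sum /home/file                           # checksum on the source host first
scp /home/file root@192.168.100.4:/home &   # big copy, host -> guest, kept running across the migration
# In the QEMU monitor, make migration converge before scp finishes, e.g.:
#   (qemu) migrate_set_speed 1g             # newer QEMU: migrate_set_parameter max-bandwidth 1g
#   (qemu) migrate -d tcp:<dest-host>:<port>
# After the guest resumes on the destination:
ssh root@192.168.100.4 md5sum /home/file    # should match the source checksum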

Thanks,
Jintack

>
> FYI, I use the following cli:
>
> numactl --cpunodebind 0 --membind 0 $qemu_path $img_path \
> -netdev tap,id=hn0,vhost=on \
> -device ioh3420,id=root.1,chassis=1 \
> -device
> virtio-net-pci,bus=root.1,netdev=hn0,ats=on,disable-legacy=on,disable-modern=off,iommu_platform=on
> \
> -device intel-iommu,device-iotlb=on \
> -M q35 -m 4G -enable-kvm -cpu host -smp 2 $@
>
> Thanks
>
>
> > Thanks
>




Re: [Qemu-devel] Logging dirty pages from vhost-net in-kernel with vIOMMU

2018-12-06 Thread Jintack Lim
On Thu, Dec 6, 2018 at 2:33 AM Jason Wang  wrote:
>
>
> > On 2018/12/5 10:47 PM, Jintack Lim wrote:
> > On Tue, Dec 4, 2018 at 8:30 PM Jason Wang  wrote:
> >>
> >> On 2018/12/5 2:37 AM, Jintack Lim wrote:
> >>> Hi,
> >>>
> >>> I'm wondering how the current implementation works when logging dirty
> >>> pages during migration from vhost-net (in kernel) when used vIOMMU.
> >>>
> >>> I understand how vhost-net logs GPAs when not using vIOMMU. But when
> >>> we use vhost with vIOMMU, then shouldn't vhost-net need to log the
> >>> translated address (GPA) instead of the address written in the
> >>> descriptor (IOVA) ? The current implementation looks like vhost-net
> >>> just logs IOVA without translation in vhost_get_vq_desc() in
> >>> drivers/vhost/net.c. It seems like QEMU doesn't do any further
> >>> translation of the dirty log when syncing.
> >>>
> >>> I might be missing something. Could somebody shed some light on this?
> >>
> >> Good catch. It looks like a bug to me. Want to post a patch for this?
> > Thanks for the confirmation.
> >
> > What would be a good setup to catch this kind of migration bug? I
> > tried to observe it in the VM expecting to see network applications
> > not getting data correctly on the destination, but it was not
> > successful (i.e. the VM on the destination just worked fine.) I didn't
> > even see anything going wrong when I disabled the vhost logging
> > completely without using vIOMMU.
> >
> > What I did is I ran multiple network benchmarks (e.g. netperf tcp
> > stream and my own one to check correctness of received data) in a VM
> > without vhost dirty page logging, and the benchmarks just ran fine in
> > the destination. I checked the used ring at the time the VM is stopped
> > in the source for migration, and it had multiple descriptors that is
> > (probably) not processed in the VM yet. Do you have any insight how it
> > could just work and what would be a good setup to catch this?
>
>
> According to past experience, it could be reproduced by doing scp from
> host to guest during migration.
>

Thanks. I actually tried that, but didn't see any problem either - I
copied a large file during migration from host to guest (the copy
continued on the destination), and checked md5 hashes using md5sum,
but the copied file had the same checksum as the one in the host.

Do you recall what kind of symptom you observed when the dirty pages
were not migrated correctly with scp?

>
> >
> > About sending a patch, as Michael suggested, I think it's better for
> > you to handle this case - this is not my area of expertise, yet :-)
>
>
> No problem, I will fix this.
>
> Thanks for spotting this issue.
>
>
> >> Thanks
> >>
> >>
> >>> Thanks,
> >>> Jintack
> >>>
> >>>
>




Re: [Qemu-devel] Logging dirty pages from vhost-net in-kernel with vIOMMU

2018-12-05 Thread Jintack Lim
On Tue, Dec 4, 2018 at 8:30 PM Jason Wang  wrote:
>
>
> > On 2018/12/5 2:37 AM, Jintack Lim wrote:
> > Hi,
> >
> > I'm wondering how the current implementation works when logging dirty
> > pages during migration from vhost-net (in kernel) when used vIOMMU.
> >
> > I understand how vhost-net logs GPAs when not using vIOMMU. But when
> > we use vhost with vIOMMU, then shouldn't vhost-net need to log the
> > translated address (GPA) instead of the address written in the
> > descriptor (IOVA) ? The current implementation looks like vhost-net
> > just logs IOVA without translation in vhost_get_vq_desc() in
> > drivers/vhost/net.c. It seems like QEMU doesn't do any further
> > translation of the dirty log when syncing.
> >
> > I might be missing something. Could somebody shed some light on this?
>
>
> Good catch. It looks like a bug to me. Want to post a patch for this?

Thanks for the confirmation.

What would be a good setup to catch this kind of migration bug? I
tried to observe it in the VM expecting to see network applications
not getting data correctly on the destination, but it was not
successful (i.e. the VM on the destination just worked fine.) I didn't
even see anything going wrong when I disabled the vhost logging
completely without using vIOMMU.

What I did is I ran multiple network benchmarks (e.g. netperf TCP
stream and my own one to check the correctness of received data) in a VM
without vhost dirty page logging, and the benchmarks just ran fine on
the destination. I checked the used ring at the time the VM was stopped
on the source for migration, and it had multiple descriptors that were
(probably) not yet processed in the VM. Do you have any insight into how it
could just work, and what would be a good setup to catch this?

About sending a patch, as Michael suggested, I think it's better for
you to handle this case - this is not my area of expertise, yet :-)

>
> Thanks
>
>
> >
> > Thanks,
> > Jintack
> >
> >
>




[Qemu-devel] Logging dirty pages from vhost-net in-kernel with vIOMMU

2018-12-04 Thread Jintack Lim
Hi,

I'm wondering how the current implementation works when logging dirty
pages during migration from vhost-net (in kernel) when vIOMMU is used.

I understand how vhost-net logs GPAs when not using vIOMMU. But when
we use vhost with vIOMMU, shouldn't vhost-net log the
translated address (GPA) instead of the address written in the
descriptor (IOVA)? The current implementation looks like vhost-net
just logs the IOVA without translation in vhost_get_vq_desc() in
drivers/vhost/net.c. It seems like QEMU doesn't do any further
translation of the dirty log when syncing.

I might be missing something. Could somebody shed some light on this?

Thanks,
Jintack




Re: [Qemu-devel] Virtual IOMMU is working for Windows VM?

2018-10-23 Thread Jintack Lim
On Tue, Oct 23, 2018 at 9:25 AM Peter Xu  wrote:
>
> On Mon, Oct 22, 2018 at 10:43:32AM -0400, Jintack Lim wrote:
> > On Mon, Oct 22, 2018 at 5:27 AM Peter Xu  wrote:
> > >
> > > On Mon, Oct 22, 2018 at 12:22:02AM -0400, Jintack Lim wrote:
> > > > Hi,
> > > >
> > > > I wonder if vIOMMU is working for Windows VM?
> > > >
> > > > I tried it with v2.11.0, but it didn't seem to work. I assume that 
> > > > seaBIOS
> > > > sets IOMMU on by default as is the case when I launched a Linux VM. But 
> > > > I
> > > > might be missing something. Can somebody shed some light on it?
> > >
> > > Hi, Jintack,
> > >
> >
> > Thanks Peter,
> >
> > > I think at least the latest QEMU should work for Windows, but I don't
> > > really run Windows that frequently.
> > >
> > > What is the error you've encountered?  Have you tried the latest QEMU,
> > > or switching Windows versions to try?
> >
> > I ran Windows commands in Windows Powershell like below. Well, I guess
> > this is not the best way to check IOMMU presence, but couldn't find a
> > better way to do it.
> >
> > $ (Get-VMHost).IovSupport
> > false
> > $ (Get-VMHost).IovSupportReasons
> > The chipset on the system does not do DMA remapping, ...
> >
> > I just tried QEMU v3.0.0, but I see the same symptom. I'm using
> > Windows server 2016.  Unfortunately, trying another Windows version
> > would be hard for me at this point.
> >
> > I just wonder if there's way to check if Vt-d is on in SeaBIOS?
>
> I'm not sure whether SeaBIOS would enable VT-d or has any kind of
> support of it at all even if the translation unit is provided.
>

All right. I then assume the BIOS doesn't disable it, as is the case
for the Linux guest.

> >
> > >
> > > What I can remember about Windows is that Ladi had fixed a bug for
> > > windows-only (8991c460be, "intel_iommu: relax iq tail check on
> > > VTD_GCMD_QIE enable", 2017-07-03) but it should be even in 2.10 so I
> > > guess it's not the problem you've encountered.
> >
> > I'm CCing Ladi, just in case he has some idea :)
>
Good idea, though I'm afraid the RH email could be stale. :)

Ah.. :)

>
> I can try to install a Windows Server 2016 some day but I cannot
> really guarantee it.  Feel free to try to debug it on your own :).
> Basically I would consider enabling the IOMMU traces in intel_iommu.c,
> just like what you have done before with the vfio-pci devices, and
> checking out the log.  Normally we should see plenty of MMIOs setting up
> the device, and hopefully that will provide a hint on what's wrong there.

Thanks, I will. Now that the Linux guest recognizes the vIOMMU well, I hope
I can spot where things go wrong with the Windows guest!

Best,
Jintack

>
> Regards,
>
> --
> Peter Xu
>




Re: [Qemu-devel] Virtual IOMMU is working for Windows VM?

2018-10-22 Thread Jintack Lim
On Mon, Oct 22, 2018 at 5:27 AM Peter Xu  wrote:
>
> On Mon, Oct 22, 2018 at 12:22:02AM -0400, Jintack Lim wrote:
> > Hi,
> >
> > I wonder if vIOMMU is working for Windows VM?
> >
> > I tried it with v2.11.0, but it didn't seem to work. I assume that seaBIOS
> > sets IOMMU on by default as is the case when I launched a Linux VM. But I
> > might be missing something. Can somebody shed some light on it?
>
> Hi, Jintack,
>

Thanks Peter,

> I think at least the latest QEMU should work for Windows, but I don't
> really run Windows that frequently.
>
> What is the error you've encountered?  Have you tried the latest QEMU,
> or switching Windows versions to try?

I ran Windows commands in Windows PowerShell as below. Well, I guess
this is not the best way to check IOMMU presence, but I couldn't find a
better way to do it.

$ (Get-VMHost).IovSupport
false
$ (Get-VMHost).IovSupportReasons
The chipset on the system does not do DMA remapping, ...

I just tried QEMU v3.0.0, but I see the same symptom. I'm using
Windows Server 2016.  Unfortunately, trying another Windows version
would be hard for me at this point.

I just wonder if there's a way to check whether VT-d is on in SeaBIOS?

>
> What I can remember about Windows is that Ladi had fixed a bug for
> windows-only (8991c460be, "intel_iommu: relax iq tail check on
> VTD_GCMD_QIE enable", 2017-07-03) but it should be even in 2.10 so I
> guess it's not the problem you've encountered.

I'm CCing Ladi, just in case he has some idea :)

Thanks,
Jintack

>
> Regards,
>
> --
> Peter Xu
>




[Qemu-devel] Virtual IOMMU is working for Windows VM?

2018-10-21 Thread Jintack Lim
Hi,

I wonder if vIOMMU works for a Windows VM?

I tried it with v2.11.0, but it didn't seem to work. I assume that SeaBIOS
sets the IOMMU on by default, as was the case when I launched a Linux VM. But I
might be missing something. Can somebody shed some light on it?

Thanks,
Jintack


[Qemu-devel] Have multiple virtio-net devices, but only one of them receives all traffic

2018-10-01 Thread Jintack Lim
Hi,

I'm using QEMU 3.0.0 and Linux kernel 4.15.0 on x86 machines. I'm
observing pretty weird behavior when I have multiple virtio-net
devices. My KVM VM has two virtio-net devices (vhost=off) and I'm
using a Linux bridge in the host. The two devices have different
MAC/IP addresses.

When I tried to access the VM using the two different IP addresses (e.g.
ping or ssh), I found that only one device in the VM gets all incoming
network traffic, while I expected each device to get the traffic for
its own IP address.

I checked this in several ways.

1) I did ping with two IP addresses from the host/other physical
machines in the same subnet, and only one device's interrupt count is
increased.
2) I checked the ARP table from the ping sources, and two different IP
addresses have the same MAC address. In fact, I dumped ARP messages
using tcpdump, and the VM (or the bridge?) replied with the same MAC
address for two different IP addresses as attached below.
3) I monitored the host bridge (# bridge monitor) and found that only
one device's MAC address is registered.

It looks like one device's IP/MAC address is not advertised properly,
but I'm not really sure. When I turned off the device getting all the
traffic, the other device started getting incoming packets, and its
MAC address was registered in the host bridge. The active
device only gets traffic for its own IP address, of course.

Here's the tcpdump result. IPs 10.10.1.100 and 10.10.1.221 are the VM's
addresses. IP 10.10.1.221 is assigned to the device having MAC
52:54:00:12:34:58, but the log shows it being advertised as having
...:57.

23:24:10.983700 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
10.10.1.100 tell kvm-dest-link-1, length 46
23:24:10.983771 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.10.1.100
is-at 52:54:00:12:34:57 (oui Unknown), length 28
23:24:17.794811 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
10.10.1.211 tell kvm-dest-link-1, length 46
23:24:17.794869 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.10.1.211
is-at 52:54:00:12:34:57 (oui Unknown), length 28
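
(A quick sketch of the checks described above in command form; br0 and the addresses are examples.)

grep virtio /proc/interrupts     # in the guest: which virtio-net device's interrupt counts grow
bridge monitor                   # on the host: watch which MAC addresses get learned
bridge fdb show br br0           # host bridge forwarding table
ip neigh show                    # ARP table on the ping source
tcpdump -e -n -i br0 arp         # see which MAC answers the ARP requests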

I would appreciate any help!

Thanks,
Jintack




Re: [Qemu-devel] Virtual IOMMU + Virtio-net devices in a Windows VM doesn't work

2018-07-29 Thread Jintack Lim
Thanks, Yan.

On Sun, Jul 29, 2018 at 7:44 AM Yan Vugenfirer  wrote:
>
>
>
> > On 26 Jul 2018, at 05:53, Jintack Lim  wrote:
> >
> > Hi Peter,
> >
> > On Tue, Jul 24, 2018 at 1:55 AM Peter Xu  wrote:
> >>
> >> On Mon, Jul 23, 2018 at 04:13:18PM -0400, Jintack Lim wrote:
> >>> Hi,
> >>>
> >>> I'm running a Windows VM on top of KVM on x86, and one of virtio-net
> Hi,
>
> What Windows OS are you using? Keep in mind that IOMMU support in Windows 
> drivers will work in Windows 10 and Windows Server 2016.
> What version of the virtio-win drivers are you using (it should be build
> 150 and up, or include the following commit:
> https://github.com/virtio-win/kvm-guest-drivers-windows/commit/eac3270d10924903ff38a08fcdaa252604d2e4a9)?
>

I'm using Windows Server 2016, and the virtio-win driver version is
0.1.149, which is the latest binary from here [1]. I'll try to build
150, and test it again. Thanks for the confirmation!

[1] 
https://docs.fedoraproject.org/quick-docs/en-US/creating-windows-virtual-machines-using-virtio-drivers.html

> Best regards,
> Yan.
>
> >>> device in the Windows VM doesn't seem to work. I provided virtual
> >>> IOMMU and two virtio-net devices to the VM: one bypassing the virtual
> >>> IOMMU and the other one behind the virtual IOMMU[1]. It turned out
> >>> that the virtio-net device behind virtual IOMMU didn't work while the
> >>> one bypassing the virtual IOMMU worked well. In a linux VM with the
> >>> same configuration, both of virtio-net device worked well.
> >>>
> >>> I found that there is a subtle difference between virtio-net devices
> >>> bypassing and behind virtual IOMMU in a Linux VM. The lspci command in
> >>> the Linux VM shows different device names for them; the first line is
> >>> for the bypassing one, and the second line is for the one behind the
> >>> virtual IOMMU
> >>>
> >>> 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
> >>> 01:00.0 Ethernet controller: Red Hat, Inc Device 1041 (rev 01)
> >>>
> >>> I wonder if this difference somehow caused the problem in the Windows
> >>> VM. I've installed the latest virtio drivers (0.1.149) from the fedora
> >>> project [2]
> >>>
> >>> Any thoughts?
> >>>
> >>> I'm using v4.15 Linux kernel as a host, and QEMU 2.11.0.
> >>
> >> Have you tried the latest QEMU?
> >>
> >
> > I just tried the latest QEMU, but observed the same symptom.
> >
> >> Also CC Jason and Michael.
> >
> > Thanks!
> >
> >>
> >>>
> >>> Thanks,
> >>> Jintack
> >>>
> >>> [1] https://wiki.qemu.org/Features/VT-d
> >>> [2] 
> >>> https://docs.fedoraproject.org/quick-docs/en-US/creating-windows-virtual-machines-using-virtio-drivers.html
> >>>
> >>>
> >>
> >> Regards,
> >>
> >> --
> >> Peter Xu
> >>
> >
> >
>
>




Re: [Qemu-devel] Virtual IOMMU + Virtio-net devices in a Windows VM doesn't work

2018-07-25 Thread Jintack Lim
Hi Peter,

On Tue, Jul 24, 2018 at 1:55 AM Peter Xu  wrote:
>
> On Mon, Jul 23, 2018 at 04:13:18PM -0400, Jintack Lim wrote:
> > Hi,
> >
> > I'm running a Windows VM on top of KVM on x86, and one of virtio-net
> > device in the Windows VM doesn't seem to work. I provided virtual
> > IOMMU and two virtio-net devices to the VM: one bypassing the virtual
> > IOMMU and the other one behind the virtual IOMMU[1]. It turned out
> > that the virtio-net device behind virtual IOMMU didn't work while the
> > one bypassing the virtual IOMMU worked well. In a linux VM with the
> > same configuration, both of virtio-net device worked well.
> >
> > I found that there is a subtle difference between virtio-net devices
> > bypassing and behind virtual IOMMU in a Linux VM. The lspci command in
> > the Linux VM shows different device names for them; the first line is
> > for the bypassing one, and the second line is for the one behind the
> > virtual IOMMU
> >
> > 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
> > 01:00.0 Ethernet controller: Red Hat, Inc Device 1041 (rev 01)
> >
> > I wonder if this difference somehow caused the problem in the Windows
> > VM. I've installed the latest virtio drivers (0.1.149) from the fedora
> > project [2]
> >
> > Any thoughts?
> >
> > I'm using v4.15 Linux kernel as a host, and QEMU 2.11.0.
>
> Have you tried the latest QEMU?
>

I just tried the latest QEMU, but observed the same symptom.

> Also CC Jason and Michael.

Thanks!

>
> >
> > Thanks,
> > Jintack
> >
> > [1] https://wiki.qemu.org/Features/VT-d
> > [2] 
> > https://docs.fedoraproject.org/quick-docs/en-US/creating-windows-virtual-machines-using-virtio-drivers.html
> >
> >
>
> Regards,
>
> --
> Peter Xu
>




[Qemu-devel] Virtual IOMMU + Virtio-net devices in a Windows VM doesn't work

2018-07-23 Thread Jintack Lim
Hi,

I'm running a Windows VM on top of KVM on x86, and one of the virtio-net
devices in the Windows VM doesn't seem to work. I provided a virtual
IOMMU and two virtio-net devices to the VM: one bypassing the virtual
IOMMU and the other one behind the virtual IOMMU[1]. It turned out
that the virtio-net device behind the virtual IOMMU didn't work, while the
one bypassing the virtual IOMMU worked well. In a Linux VM with the
same configuration, both virtio-net devices worked well.

I found that there is a subtle difference between the virtio-net devices
bypassing and behind the virtual IOMMU in a Linux VM. The lspci command in
the Linux VM shows different device names for them; the first line is
for the bypassing one, and the second line is for the one behind the
virtual IOMMU:

00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
01:00.0 Ethernet controller: Red Hat, Inc Device 1041 (rev 01)

I wonder if this difference somehow caused the problem in the Windows
VM. I've installed the latest virtio drivers (0.1.149) from the Fedora
project [2].

Any thoughts?

I'm using v4.15 Linux kernel as a host, and QEMU 2.11.0.
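
(As a side note, a small sketch of how the two devices can be compared from a Linux guest with the same topology; the BDFs are the ones from the lspci output above.)

lspci -nn | grep -i ethernet               # list both virtio-net devices with vendor/device IDs
for d in 0000:00:03.0 0000:01:00.0; do     # show the IOMMU group (if any) each device ended up in
    printf '%s -> ' "$d"
    readlink "/sys/bus/pci/devices/$d/iommu_group" || echo "(no IOMMU group)"
done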

Thanks,
Jintack

[1] https://wiki.qemu.org/Features/VT-d
[2] 
https://docs.fedoraproject.org/quick-docs/en-US/creating-windows-virtual-machines-using-virtio-drivers.html




Re: [Qemu-devel] vIOMMU Posted-interrupt implementation - atomic operation?

2018-06-06 Thread Jintack Lim
On Wed, Jun 6, 2018 at 2:56 AM, Tian, Kevin  wrote:
>> From: Jintack Lim [mailto:jint...@cs.columbia.edu]
>> Sent: Tuesday, June 5, 2018 8:57 PM
>>
>> Thanks, Kevin.
>>
>> On Tue, Jun 5, 2018 at 2:54 AM, Tian, Kevin  wrote:
>> >> From: Jintack Lim
>> >> Sent: Friday, June 1, 2018 11:47 AM
>> >>
>> >> Hi,
>> >>
>> >> I'm implementing Posted-interrupt functionality in vIOMMU. According
>> >> to Vt-d spec 5.2.3, IOMMU performs a coherent atomic read-modify-
>> write
>> >> operation of the posted-interrupt descriptor. I wonder how can we
>> >> achieve this considering the guest can modify the same
>> >> posted-interrupt descriptor anytime. Is there any existing mechanism
>> >> that I can use in QEMU?
>> >>
>> >
>> > I don't think it's possible to emulate such operation in software, unless
>> > you want to change guest to be cooperative. Actually it is not necessary.
>> > VT-d does so due to some hardware implementation consideration.
>>
>> Would you mind expanding this? I'm curious what it would be. Is it
>> because IOMMU can't do something like cmpxchg instructions?
>
> I don't have further information. Above is what I was told by hardware
> team.

Ah, I see. Thanks!

>
>>
>> > Since you are emulating on CPU, could just follow how CPU posted
>> > interrupt is conducted. If you look at SDM (29.6 Posted-Interrupt
>> > Processing):
>> >
>> > "There is a requirement, however, that such modifications be
>> > done using locked read-modify-write instructions."
>> >
>> > [instructions] means you can do update multiple times when posting an
>> > interrupt, as long as each update is atomic.
>>
>> Ah, that's a good point. So the unit of atomic operation doesn't need
>> to be the whole PI descriptor, but it can be any subset (e.g. just one
>> bit) of the descriptor? By looking at Linux kernel code, that seems to
>> be the case.
>>
>
> Exactly. :-)

Cool. Thanks for the confirmation.

Thanks,
Jintack

>
> Thanks
> Kevin




Re: [Qemu-devel] vIOMMU Posted-interrupt implementation - atomic operation?

2018-06-05 Thread Jintack Lim
Thanks, Kevin.

On Tue, Jun 5, 2018 at 2:54 AM, Tian, Kevin  wrote:
>> From: Jintack Lim
>> Sent: Friday, June 1, 2018 11:47 AM
>>
>> Hi,
>>
>> I'm implementing Posted-interrupt functionality in vIOMMU. According
>> to Vt-d spec 5.2.3, IOMMU performs a coherent atomic read-modify-write
>> operation of the posted-interrupt descriptor. I wonder how can we
>> achieve this considering the guest can modify the same
>> posted-interrupt descriptor anytime. Is there any existing mechanism
>> that I can use in QEMU?
>>
>
> I don't think it's possible to emulate such operation in software, unless
> you want to change guest to be cooperative. Actually it is not necessary.
> VT-d does so due to some hardware implementation consideration.

Would you mind expanding this? I'm curious what it would be. Is it
because IOMMU can't do something like cmpxchg instructions?

> Since you are emulating on CPU, could just follow how CPU posted
> interrupt is conducted. If you look at SDM (29.6 Posted-Interrupt
> Processing):
>
> "There is a requirement, however, that such modifications be
> done using locked read-modify-write instructions."
>
> [instructions] means you can do update multiple times when posting an
> interrupt, as long as each update is atomic.

Ah, that's a good point. So the unit of atomic operation doesn't need
to be the whole PI descriptor, but it can be any subset (e.g. just one
bit) of the descriptor? By looking at Linux kernel code, that seems to
be the case.

Best,
Jintack

>
> Thanks
> Kevin
>




[Qemu-devel] vIOMMU Posted-interrupt implementation - atomic operation?

2018-05-31 Thread Jintack Lim
Hi,

I'm implementing posted-interrupt functionality in the vIOMMU. According
to VT-d spec 5.2.3, the IOMMU performs a coherent atomic read-modify-write
operation on the posted-interrupt descriptor. I wonder how we can
achieve this, considering the guest can modify the same
posted-interrupt descriptor at any time. Is there any existing mechanism
that I can use in QEMU?

Thanks,
Jintack




Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes

2018-05-17 Thread Jintack Lim
On Thu, May 17, 2018 at 4:59 AM, Peter Xu <pet...@redhat.com> wrote:
> (Hello, Jintack, Feel free to test this branch again against your scp
>  error case when you got free time)

Hi Peter,

>
> I rewrote some of the patches in V3.  Major changes:
>
> - Dropped mergable interval tree, instead introduced IOVA tree, which
>   is even simpler.
>
> - Fix the scp error issue that Jintack reported.  Please see patches
>   for detailed information.  That's the major reason to rewrite a few
>   of the patches.  We use replay for domain flushes are possibly
>   incorrect in the past.  The thing is that IOMMU replay has an
>   "definition" that "we should only send MAP when new page detected",
>   while for shadow page syncing we actually need something else than
>   that.  So in this version I started to use a new
>   vtd_sync_shadow_page_table() helper to do the page sync.

I checked that the scp problem I had (i.e. scp from the host to a
guest with a virtual IOMMU and an assigned network device) was gone
with this patch series. Cool!

Please feel free to move this tag if this is not the right place!
Tested-by: Jintack Lim <jint...@cs.columbia.edu>

Thanks,
Jintack

>
> - Some other refines after the refactoring.
>
> I'll add a unit test for the IOVA tree after this series is merged, to make
> sure we won't switch to another new tree implementation...
>
> The element size in the new IOVA tree should be around
> sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> worst case usage ratio would be 72/4K=2%, which still seems acceptable
> (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> mapping in QEMU).
>
> I did explicit test with scp this time, copying 1G sized file for >10
> times on each of the following case:
>
> - L1 guest, with vIOMMU and with assigned device
> - L2 guest, without vIOMMU and with assigned device
> - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
>
> Please review.  Thanks,
>
> (Below are old content from previous cover letter)
>
> ==
>
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]
>
> Online repo:
>
>   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
>
> This series fixes several major problems that current code has:
>
> - Issue 1: when getting very big PSI UNMAP invalidations, the current
>   code is buggy in that we might skip the notification while actually
>   we should always send that notification.
>
> - Issue 2: IOTLB is not thread safe, while block dataplane can be
>   accessing and updating it in parallel.
>
> - Issue 3: For devices that only registered with UNMAP-only notifiers,
>   we don't really need to do page walking for PSIs, we can directly
>   deliver the notification down.  For example, vhost.
>
> - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
>   in the future, vDPA as well).  The problem is that, now for domain
>   invalidations we do this to make sure the shadow page tables are
>   correctly synced:
>
>   1. unmap the whole address space
>   2. replay the whole address space, map existing pages
>
>   However during step 1 and 2 there will be a very tiny window (it can
>   be as big as 3ms) that the shadow page table is either invalid or
>   incomplete (since we're rebuilding it).  That's a fatal error since
>   devices never know that is happening and it's still possible to DMA to
>   memory.
>
> Patch 1 fixes issue 1.  I put it at the first since it's picked from
> an old post.
>
> Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
>
> Patch 3 fixes issue 2.
>
> Patch 4 fixes issue 3.
>
> Patch 5-9 fix issue 4.  Here a very simple interval tree is
> implemented based on Gtree.  It's different with general interval tree
> in that it does not allow user to pass in private data (e.g.,
> translated addresses).  However that benefits us that then we can
> merge adjacent interval leaves so that hopefully we won't consume much
> memory even if the mappings are a lot (that happens for nested virt -
> when mapping the whole L2 guest RAM range, it can be at least in GBs).
>
> Patch 10 is another big cleanup only can work after patch 9.
>
> Tests:
>
> - device assignments to L1, even L2 guests.  With this series applied
>   (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
>   we

Re: [Qemu-devel] How to check if Vt-d is capable of posted-interrupt?

2018-04-30 Thread Jintack Lim
On Mon, Apr 30, 2018 at 2:09 PM, Alex Williamson
<alex.william...@redhat.com> wrote:
> On Mon, 30 Apr 2018 13:44:23 -0400
> Jintack Lim <jint...@cs.columbia.edu> wrote:
>
>> Add iommu mailing list since this question might be more related to iommu.
>>
>> On Mon, Apr 30, 2018 at 10:11 AM, Jintack Lim <jint...@cs.columbia.edu> 
>> wrote:
>> > Hi,
>> >
>> > I wonder how to check if Vt-d is capable of posted-interrupt? I'm
>> > using Intel E5-2630 v3.
>> >
>> > I was once told that APICv and posted-interrupt capability always come
>> > together. But it seems like my cpu support APICv
>> > (/sys/module/kvm_intel/parameters/enable_apicv is Y), but
>> > posted-interrupt capability is only shipped with the next generation
>> > of the cpu (E5-2600 v4, which is Broadwell).
>> >
>> > What would be an easy way to check this?
>
> PI support is bit 59 in the capability register which is exposed
> through sysfs at /sys/class/iommu/dmar*/intel-iommu/cap so you could do
> something like:
>
> # for i in $(find /sys/class/iommu/dmar* -type l); do echo -n "$i: "; echo 
> $(( ( 0x$(cat $i/intel-iommu/cap) >> 59 ) & 1 )); done
>

Thanks for a nice solution, Alex.
It turns out my CPU doesn't have the PI capability.
/sys/class/iommu/dmar0: 0
/sys/class/iommu/dmar1: 0

> I think the relationship between APICv and PI goes the other direction,
> if you have PI, you probably have APICv.  Having APICv implies nothing
> about having PI.  Thanks,

Yeah, I think that makes sense.

Thanks,
Jintack

>
> Alex
>




Re: [Qemu-devel] How to check if Vt-d is capable of posted-interrupt?

2018-04-30 Thread Jintack Lim
Add iommu mailing list since this question might be more related to iommu.

On Mon, Apr 30, 2018 at 10:11 AM, Jintack Lim <jint...@cs.columbia.edu> wrote:
> Hi,
>
> I wonder how to check if Vt-d is capable of posted-interrupt? I'm
> using Intel E5-2630 v3.
>
> I was once told that APICv and posted-interrupt capability always come
> together. But it seems like my cpu support APICv
> (/sys/module/kvm_intel/parameters/enable_apicv is Y), but
> posted-interrupt capability is only shipped with the next generation
> of the cpu (E5-2600 v4, which is Broadwell).
>
> What would be an easy way to check this?
>
> Thanks,
> Jintack




[Qemu-devel] How to check if Vt-d is capable of posted-interrupt?

2018-04-30 Thread Jintack Lim
Hi,

I wonder how to check whether VT-d is capable of posted interrupts? I'm
using an Intel E5-2630 v3.

I was once told that APICv and posted-interrupt capability always come
together. But it seems like my CPU supports APICv
(/sys/module/kvm_intel/parameters/enable_apicv is Y), while
posted-interrupt capability is only shipped with the next generation
of the CPU (E5-2600 v4, which is Broadwell).

What would be an easy way to check this?

Thanks,
Jintack




Re: [Qemu-devel] intel-iommu and vhost: Do we need 'device-iotlb' and 'ats'?

2018-02-26 Thread Jintack Lim
Hi Eric,

On Mon, Feb 26, 2018 at 5:14 AM, Auger Eric <eric.au...@redhat.com> wrote:
> Hi Jintack,
>
> On 21/02/18 05:03, Jintack Lim wrote:
>> Hi,
>>
>> I'm using vhost with the virtual intel-iommu, and this page[1] shows
>> the QEMU command line example.
>>
>> qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
>>-device intel-iommu,intremap=on,device-iotlb=on \
>>-device ioh3420,id=pcie.1,chassis=1 \
>>-device
>> virtio-net-pci,bus=pcie.1,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on
>> \
>>-netdev tap,id=net0,vhostforce \
>>$IMAGE_PATH
>>
>> I wonder what's the impact of using the device-iotlb and ats options, as
>> they are described as necessary.
>>
>> In my understanding, vhost in the kernel only looks at
>> VIRTIO_F_IOMMU_PLATFORM, and when it is set, vhost uses a
>> device-iotlb. In addition, vhost and QEMU communicate using vhost_msg
>> basically to cache mappings correctly in the vhost, so I wonder what's
>> the role of ats in this case.
>>
>> A related question is that if we use SMMU emulation[2] on ARM without
>> those options, does vhost cache mappings as if it has a device-iotlb?
>> (I guess this is the case.)
> vsmmuv3 emulation code does not support ATS at the moment. vhost support
> is something different. As Peter explained it comes with the capability
> of the virtio device to register unmap notifiers. Those notifiers get
> called each time there are TLB invalidation commands. That way the
> in-kernel vhost cache can be invalidated. vhost support was there until
> vsmmuv3 v7. With latest versions, I removed it to help reviewers
> concentrate on the root functionality. However I will send it to you
> based on v9.

Thanks, Eric. I'm happy to take a look at those patches!

Thanks,
Jintack
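
(Side note: a hedged way to confirm from inside the guest that ats=on really exposed an ATS capability on the virtio-net device; 01:00.0 is the example BDF from the VT-d wiki setup.)

lspci -vvv -s 01:00.0 | grep -i "address translation service"   # should list the ATS extended capability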

>
> Thanks
>
> Eric
>>
>> I'm pretty new to QEMU code, so I might be missing something. Can
>> somebody shed some light on it?
>>
>> [1] https://wiki.qemu.org/Features/VT-d
>> [2] http://lists.nongnu.org/archive/html/qemu-devel/2018-02/msg04736.html
>>
>> Thanks,
>> Jintack
>>
>>
>




Re: [Qemu-devel] intel-iommu and vhost: Do we need 'device-iotlb' and 'ats'?

2018-02-23 Thread Jintack Lim
Hi Kevin,

On Fri, Feb 23, 2018 at 2:34 AM, Tian, Kevin  wrote:
>> From: Peter Xu
>> Sent: Friday, February 23, 2018 3:09 PM
>>
>> >
>> > Right. I think my question was not clear. My question was that why don’t
>> > IOMMU invalidate device-iotlb along with its mappings in one go. Then
>> IOMMU
>> > device driver doesn’t need to flush device-iotlb explicitly. Maybe the
>> > reason is that ATS and IOMMU are not always coupled.. but I guess it’s
>> time
>> > for me to get some more background :)
>>
>> Ah, I see your point.
>>
>> I don't know the answer.  My wild guess is that IOMMU is just trying
>> to be simple and only provide most basic functionalities, leaving
>> complex stuff to CPU.  For example, if IOMMU takes over the ownership
>> to deliever device-iotlb invalidations when receiving iotlb
>> invalidations, it possibly needs to traverse the device tree sometimes
>> (e.g., for domain invalidations) to know what device is under what
>> domain, which is really compliated.  While it'll be simpler for CPU to
>> do this since it's very possible that the OS keeps a list of devices
>> for a domain already.
>>
>> IMHO that follows the *nix philosophy too - Do One Thing And Do It
>> Well.  Though again, it's wild guess and I may be wrong. :)
>>
>> CCing Alex, in case he has quick answers.
>>
>
> IOMMU and devices are de-coupled. You need a protocol so IOMMU
> knows which device enables translation caches and thus requires
> explicit invalidation, which is how ATS comes to play. ATS is not
> mandatory for vhost, but doing so provides more flexibility e.g.
> to enable I/O page fault if further emulating PCI PRS cap.

Thanks for the explanation!

Thanks,
Jintack

>
> Thanks
> Kevin




Re: [Qemu-devel] intel-iommu and vhost: Do we need 'device-iotlb' and 'ats'?

2018-02-23 Thread Jintack Lim
On Fri, Feb 23, 2018 at 2:09 AM, Peter Xu <pet...@redhat.com> wrote:
> On Fri, Feb 23, 2018 at 06:34:04AM +, Jintack Lim wrote:
>> On Fri, Feb 23, 2018 at 1:10 AM Peter Xu <pet...@redhat.com> wrote:
>>
>> > On Fri, Feb 23, 2018 at 12:32:13AM -0500, Jintack Lim wrote:
>> > > Hi Peter,
>> > >
>> > > Hope you had great holidays!
>> > >
>> > > On Thu, Feb 22, 2018 at 10:55 PM, Peter Xu <pet...@redhat.com> wrote:
>> > > > On Tue, Feb 20, 2018 at 11:03:46PM -0500, Jintack Lim wrote:
>> > > >> Hi,
>> > > >>
>> > > >> I'm using vhost with the virtual intel-iommu, and this page[1] shows
>> > > >> the QEMU command line example.
>> > > >>
>> > > >> qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
>> > > >>-device intel-iommu,intremap=on,device-iotlb=on \
>> > > >>-device ioh3420,id=pcie.1,chassis=1 \
>> > > >>-device
>> > > >>
>> > virtio-net-pci,bus=pcie.1,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on
>> > > >> \
>> > > >>-netdev tap,id=net0,vhostforce \
>> > > >>$IMAGE_PATH
>> > > >>
>> > > >> I wonder what's the impact of using device-iotlb and ats options as
>> > > >> they are described necessary.
>> > > >>
>> > > >> In my understanding, vhost in the kernel only looks at
>> > > >> VIRTIO_F_IOMMU_PLATFORM, and when it is set, vhost uses a
>> > > >> device-iotlb. In addition, vhost and QEMU communicate using vhost_msg
>> > > >> basically to cache mappings correctly in the vhost, so I wonder what's
>> > > >> the role of ats in this case.
>> > > >
>> > > > The "ats" as virtio device parameter will add ATS capability to the
>> > > > PCI device.
>> > > >
>> > > > The "device-iotlb" as intel-iommu parameter will enable ATS in the
>> > > > IOMMU device (and also report that in ACPI field).
>> > > >
>> > > > If both parameters are provided IIUC it means guest will know virtio
>> > > > device has device-iotlb and it'll treat the device specially (e.g.,
>> > > > guest will need to send device-iotlb invalidations).
>> > >
>> > > Oh, I see. I was focusing on how QEMU and vhost work in the host, but
>> > > I think I missed the guest part! Thanks. I see that the Intel IOMMU
>> > > driver has has_iotlb_device flag for that purpose.
>> > >
>> > > >
>> > > > We'd better keep these parameters when running virtio devices with
>> > > > vIOMMU.  For the rest of vhost/arm specific questions, I'll leave to
>> > > > others.
>> > >
>> > > It seems like SMMU is not checking ATS capability - at least
>> > > ats_enabled flag - but I may miss something here as well :)
>> > >
>> > > >
>> > > > PS: Though IIUC the whole ATS thing may not really be necessary for
>> > > > current VT-d emulation, since even with ATS vhost is registering UNMAP
>> > > > IOMMU notifiers (see vhost_iommu_region_add()), and IIUC that means
>> > > > vhost will receive IOTLB invalidations even without ATS support, and
>> > > > it _might_ still work.
>> > >
>> > > Right. That's what I thought.
>> > >
>> > > Come to think of it, I'm not sure why we need to flush mappings in
>> > > IOMMU and devices separately in the first place... Any thoughts?
>> >
>> > I don't know ATS much, neither.
>> >
>> > You can have a look at chap 4 of vt-d spec:
>> >
>> > One approach to scaling IOTLBs is to enable I/O devices to
>> > participate in the DMA remapping with IOTLBs implemented at
>> > the devices. The Device-IOTLBs alleviate pressure for IOTLB
>> > resources in the core logic, and provide opportunities for
>> > devices to improve performance by pre-fetching address
>> > translations before issuing DMA requests. This may be useful
>> > for devices with strict DMA latency requirements (such as
>> > isochronous devices), and for devices that have large DMA
>

Re: [Qemu-devel] intel-iommu and vhost: Do we need 'device-iotlb' and 'ats'?

2018-02-22 Thread Jintack Lim
On Fri, Feb 23, 2018 at 1:10 AM Peter Xu <pet...@redhat.com> wrote:

> On Fri, Feb 23, 2018 at 12:32:13AM -0500, Jintack Lim wrote:
> > Hi Peter,
> >
> > Hope you had great holidays!
> >
> > On Thu, Feb 22, 2018 at 10:55 PM, Peter Xu <pet...@redhat.com> wrote:
> > > On Tue, Feb 20, 2018 at 11:03:46PM -0500, Jintack Lim wrote:
> > >> Hi,
> > >>
> > >> I'm using vhost with the virtual intel-iommu, and this page[1] shows
> > >> the QEMU command line example.
> > >>
> > >> qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
> > >>-device intel-iommu,intremap=on,device-iotlb=on \
> > >>-device ioh3420,id=pcie.1,chassis=1 \
> > >>-device
> > >>
> virtio-net-pci,bus=pcie.1,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on
> > >> \
> > >>-netdev tap,id=net0,vhostforce \
> > >>$IMAGE_PATH
> > >>
> > >> I wonder what's the impact of using device-iotlb and ats options as
> > >> they are described necessary.
> > >>
> > >> In my understanding, vhost in the kernel only looks at
> > >> VIRTIO_F_IOMMU_PLATFORM, and when it is set, vhost uses a
> > >> device-iotlb. In addition, vhost and QEMU communicate using vhost_msg
> > >> basically to cache mappings correctly in the vhost, so I wonder what's
> > >> the role of ats in this case.
> > >
> > > The "ats" as virtio device parameter will add ATS capability to the
> > > PCI device.
> > >
> > > The "device-iotlb" as intel-iommu parameter will enable ATS in the
> > > IOMMU device (and also report that in ACPI field).
> > >
> > > If both parameters are provided IIUC it means guest will know virtio
> > > device has device-iotlb and it'll treat the device specially (e.g.,
> > > guest will need to send device-iotlb invalidations).
> >
> > Oh, I see. I was focusing on how QEMU and vhost work in the host, but
> > I think I missed the guest part! Thanks. I see that the Intel IOMMU
> > driver has has_iotlb_device flag for that purpose.
> >
> > >
> > > We'd better keep these parameters when running virtio devices with
> > > vIOMMU.  For the rest of vhost/arm specific questions, I'll leave to
> > > others.
> >
> > It seems like SMMU is not checking ATS capability - at least
> > ats_enabled flag - but I may miss something here as well :)
> >
> > >
> > > PS: Though IIUC the whole ATS thing may not really be necessary for
> > > current VT-d emulation, since even with ATS vhost is registering UNMAP
> > > IOMMU notifiers (see vhost_iommu_region_add()), and IIUC that means
> > > vhost will receive IOTLB invalidations even without ATS support, and
> > > it _might_ still work.
> >
> > Right. That's what I thought.
> >
> > Come to think of it, I'm not sure why we need to flush mappings in
> > IOMMU and devices separately in the first place... Any thoughts?
>
> I don't know ATS much, neither.
>
> You can have a look at chap 4 of vt-d spec:
>
> One approach to scaling IOTLBs is to enable I/O devices to
> participate in the DMA remapping with IOTLBs implemented at
> the devices. The Device-IOTLBs alleviate pressure for IOTLB
> resources in the core logic, and provide opportunities for
> devices to improve performance by pre-fetching address
> translations before issuing DMA requests. This may be useful
> for devices with strict DMA latency requirements (such as
> isochronous devices), and for devices that have large DMA
> working set or multiple active DMA streams.
>
> So I think it's for performance's sake. For example, the DMA operation
> won't need to be translated at all if it's pre-translated, so it can
> have less latency.  And also, that'll offload some of the translation
> process so that workload can be more distributed.
>
> When with that (caches located both on IOMMU's and device's side), we
> need to invalidate all the cache when needed.
>

Right. I think my question was not clear. My question was why the IOMMU
doesn’t invalidate the device-iotlb along with its own mappings in one go;
then the IOMMU driver wouldn’t need to flush the device-iotlb explicitly.
Maybe the reason is that ATS and the IOMMU are not always coupled.. but I
guess it’s time for me to get some more background :)


> >
> > Your reply was really helpful 

Re: [Qemu-devel] intel-iommu and vhost: Do we need 'device-iotlb' and 'ats'?

2018-02-22 Thread Jintack Lim
Hi Peter,

Hope you had great holidays!

On Thu, Feb 22, 2018 at 10:55 PM, Peter Xu <pet...@redhat.com> wrote:
> On Tue, Feb 20, 2018 at 11:03:46PM -0500, Jintack Lim wrote:
>> Hi,
>>
>> I'm using vhost with the virtual intel-iommu, and this page[1] shows
>> the QEMU command line example.
>>
>> qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
>>-device intel-iommu,intremap=on,device-iotlb=on \
>>-device ioh3420,id=pcie.1,chassis=1 \
>>-device
>> virtio-net-pci,bus=pcie.1,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on
>> \
>>-netdev tap,id=net0,vhostforce \
>>$IMAGE_PATH
>>
>> I wonder what's the impact of using device-iotlb and ats options as
>> they are described necessary.
>>
>> In my understanding, vhost in the kernel only looks at
>> VIRTIO_F_IOMMU_PLATFORM, and when it is set, vhost uses a
>> device-iotlb. In addition, vhost and QEMU communicate using vhost_msg
>> basically to cache mappings correctly in the vhost, so I wonder what's
>> the role of ats in this case.
>
> The "ats" as virtio device parameter will add ATS capability to the
> PCI device.
>
> The "device-iotlb" as intel-iommu parameter will enable ATS in the
> IOMMU device (and also report that in ACPI field).
>
> If both parameters are provided IIUC it means guest will know virtio
> device has device-iotlb and it'll treat the device specially (e.g.,
> guest will need to send device-iotlb invalidations).

Oh, I see. I was focusing on how QEMU and vhost work in the host, but
I think I missed the guest part! Thanks. I see that the Intel IOMMU
driver has the has_iotlb_device flag for that purpose.

>
> We'd better keep these parameters when running virtio devices with
> vIOMMU.  For the rest of vhost/arm specific questions, I'll leave to
> others.

It seems like the SMMU code is not checking the ATS capability - at least
not the ats_enabled flag - but I may be missing something here as well :)

>
> PS: Though IIUC the whole ATS thing may not really be necessary for
> current VT-d emulation, since even with ATS vhost is registering UNMAP
> IOMMU notifiers (see vhost_iommu_region_add()), and IIUC that means
> vhost will receive IOTLB invalidations even without ATS support, and
> it _might_ still work.

Right. That's what I thought.

Come to think of it, I'm not sure why we need to flush mappings in
IOMMU and devices separately in the first place... Any thoughts?

Your reply was really helpful to me. I appreciate it.

Thanks,
Jintack

> But there can be other differences, like
> performance, etc.
>
>>
>> A related question is that if we use SMMU emulation[2] on ARM without
>> those options, does vhost cache mappings as if it has a device-iotlb?
>> (I guess this is the case.)
>>
>> I'm pretty new to QEMU code, so I might be missing something. Can
>> somebody shed some light on it?
>>
>> [1] https://wiki.qemu.org/Features/VT-d
>> [2] http://lists.nongnu.org/archive/html/qemu-devel/2018-02/msg04736.html
>>
>> Thanks,
>> Jintack
>>
>
> --
> Peter Xu
>
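
A side note on the ATS part of this thread: one way to confirm, from inside
the L1 guest, that the virtio (or assigned) device really ended up with an
ATS capability is to walk its PCIe extended capability list. A minimal
sketch is below; it assumes root access (so sysfs exposes the full 4 KB of
config space) and uses the standard ATS extended capability ID 0x000f, and
the default BDF is only a placeholder.

--8<---
/* ats_check.c - sketch: walk a device's PCIe extended capability chain
 * via sysfs and report whether an ATS capability (ID 0x000f) is present.
 * Run as root so the full config space is readable; pass the BDF of the
 * device you care about (the default below is only a placeholder). */
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    const char *bdf = argc > 1 ? argv[1] : "0000:00:02.0";
    char path[128];
    uint8_t cfg[4096];
    size_t len;
    unsigned off = 0x100;   /* extended capabilities start at offset 0x100 */
    FILE *f;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/config", bdf);
    f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }
    len = fread(cfg, 1, sizeof(cfg), f);
    fclose(f);
    if (len <= off) {
        fprintf(stderr, "only %zu config bytes readable (not PCIe, or not root?)\n", len);
        return 1;
    }

    while (off && off + 4 <= len) {
        /* header layout: ID in bits [15:0], version [19:16], next [31:20] */
        uint32_t hdr = cfg[off] | (cfg[off + 1] << 8) |
                       ((uint32_t)cfg[off + 2] << 16) |
                       ((uint32_t)cfg[off + 3] << 24);
        uint16_t id = hdr & 0xffff;

        if (id == 0x000f) {
            printf("ATS capability found at 0x%03x\n", off);
            return 0;
        }
        if (hdr == 0) {      /* empty chain: ID, version and next all zero */
            break;
        }
        off = (hdr >> 20) & 0xffc;   /* a zero next pointer ends the chain */
    }
    printf("no ATS capability found\n");
    return 0;
}
-->8---

Running lspci -vvv against the same device should list the capability as an
Address Translation Service entry as well.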




[Qemu-devel] intel-iommu and vhost: Do we need 'device-iotlb' and 'ats'?

2018-02-20 Thread Jintack Lim
Hi,

I'm using vhost with the virtual intel-iommu, and this page[1] shows
the QEMU command line example.

qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
   -device intel-iommu,intremap=on,device-iotlb=on \
   -device ioh3420,id=pcie.1,chassis=1 \
   -device
virtio-net-pci,bus=pcie.1,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on
\
   -netdev tap,id=net0,vhostforce \
   $IMAGE_PATH

I wonder what the impact of using the device-iotlb and ats options is,
since they are described as necessary.

In my understanding, vhost in the kernel only looks at
VIRTIO_F_IOMMU_PLATFORM, and when it is set, vhost uses a
device-iotlb. In addition, vhost and QEMU communicate using vhost_msg,
basically to keep the mappings cached in vhost correct, so I wonder
what the role of ats is in this case.
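
To make the vhost_msg part of that concrete, below is a rough sketch of the
direction and layout of those messages, using the structures and constants
from the vhost UAPI header (linux/vhost.h). It assumes fd is an already
set-up vhost device fd that negotiated VIRTIO_F_IOMMU_PLATFORM; it is only
an illustration of the protocol, not QEMU's actual code.

--8<---
/* vhost_iotlb_sketch.c - illustration of the QEMU<->vhost IOTLB messages:
 * the VMM pushes mapping updates and invalidations to the in-kernel vhost
 * device by writing struct vhost_msg to the vhost fd; vhost reports
 * device-iotlb misses back as VHOST_IOTLB_MISS messages read() from the
 * same fd.  Assumes `fd` is an already configured vhost device fd. */
#include <linux/vhost.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Install one IOVA -> user-address mapping in vhost's device-iotlb. */
static int vhost_iotlb_update(int fd, uint64_t iova, uint64_t size,
                              uint64_t uaddr)
{
    struct vhost_msg msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG;
    msg.iotlb.iova  = iova;
    msg.iotlb.size  = size;
    msg.iotlb.uaddr = uaddr;
    msg.iotlb.perm  = VHOST_ACCESS_RW;
    msg.iotlb.type  = VHOST_IOTLB_UPDATE;

    return write(fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}

/* Drop cached mappings, e.g. when the guest issues an IOTLB or
 * device-iotlb invalidation through the vIOMMU. */
static int vhost_iotlb_invalidate(int fd, uint64_t iova, uint64_t size)
{
    struct vhost_msg msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG;
    msg.iotlb.iova = iova;
    msg.iotlb.size = size;
    msg.iotlb.type = VHOST_IOTLB_INVALIDATE;

    return write(fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}
-->8---

Whether the guest additionally sends device-iotlb invalidations on top of
the IOMMU ones is exactly what the ats/device-iotlb options control, which
is what the replies in this thread discuss.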

A related question: if we use SMMU emulation[2] on ARM without those
options, does vhost still cache mappings as if it had a device-iotlb?
(I guess this is the case.)

I'm pretty new to QEMU code, so I might be missing something. Can
somebody shed some light on it?

[1] https://wiki.qemu.org/Features/VT-d
[2] http://lists.nongnu.org/archive/html/qemu-devel/2018-02/msg04736.html

Thanks,
Jintack




Re: [Qemu-devel] Assigning network devices to nested VMs results in driver errors in nested VMs

2018-02-14 Thread Jintack Lim
On Tue, Feb 13, 2018 at 11:44 PM, Jintack Lim <jint...@cs.columbia.edu> wrote:
> Hi,
>
> I'm trying to assign network devices to nested VMs on x86 using KVM,
> but I got network device driver errors in the nested VMs. (I've tried
> this about an year ago when vIOMMU patches were not upstreamed, and I
> got similar errors at that time.)
>
> This could be network driver issues, but I'd like to get some help if
> somebody encountered similar issues.
>
> I'm using v4.15.0 kernel and v2.11.0 QEMU, and I followed this [1]
> guide. I had no problem with assigning devices to the first level VMs
> (L1 VMs). And I also checked that the devices were assigned to nested
> VMs with the lspci command in the nested VMs. But network device
> drivers failed to initialize the device. I tried two network cards -
> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection and
> Mellanox Technologies MT27500 Family.
>
> Intel driver error in the nested VM looks like this.
> [1.939552] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
> version 5.1.0-k
> [1.949796] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
> [2.210024] ixgbe :00:04.0: HW Init failed: -12
> [2.218144] ixgbe: probe of :00:04.0 failed with error -12
>

I had been assigning the PF to the L1 and L2 VMs so far; I guess that is
not the right way. So I tried assigning a VF to the L1 VM and then
assigning that same VF to the L2 VM in turn. The device driver in the L2
VM then didn't show any error, and I was able to configure the network
interface, but the network still didn't work.

I have only tried the Intel network device so far.
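
For completeness, the VF itself has to be created on the host first; a
minimal sketch of doing that through the standard sriov_numvfs sysfs
attribute is below (run as root; the PF address and VF count are
placeholders for this setup, where the PF is 06:00.0 and the VF shows up
as 06:10.0):

--8<---
/* create_vf.c - sketch: enable one SR-IOV VF on a PF by writing to its
 * sriov_numvfs sysfs attribute.  Must run as root; the PF address below
 * is a placeholder for the 82599 PF in this thread. */
#include <stdio.h>

int main(void)
{
    const char *pf = "0000:06:00.0";
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/sriov_numvfs", pf);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    /* Writing N creates N VFs; write 0 first if a different non-zero
     * value is already set. */
    fprintf(f, "1\n");
    return fclose(f) == 0 ? 0 : 1;
}
-->8---

The resulting VF is then bound to vfio-pci on the host and assigned to L1
the same way the PF was.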

> and I saw lots of these messages in the host (L0) kernel log when
> booting the nested VM.
>
> [ 1557.404173] DMAR: DRHD: handling fault status reg 102
> [ 1557.409813] DMAR: [DMA Read] Request device [06:00.0] fault addr
> 9 [fault reason 06] PTE Read access is not set
> [ 1561.383957] DMAR: DRHD: handling fault status reg 202
> [ 1561.389598] DMAR: [DMA Read] Request device [06:00.0] fault addr
> 9 [fault reason 06] PTE Read access is not set
>

I still see similar error logs in the host kernel. The fault address
looks different, though.

[ 3228.636485] ixgbe :06:00.0 eth2: VF Reset msg received from vf 0
[ 3236.023683] DMAR: DRHD: handling fault status reg 2
[ 3236.029129] DMAR: [DMA Read] Request device [06:10.0] fault addr
354748000 [fault reason 06] PTE Read access is not set
[ 3236.371711] DMAR: DRHD: handling fault status reg 102
[ 3236.377353] DMAR: [DMA Read] Request device [06:10.0] fault addr
354748000 [fault reason 06] PTE Read access is not set
[ 3236.595667] DMAR: DRHD: handling fault status reg 202
[ 3236.601307] DMAR: [DMA Read] Request device [06:10.0] fault addr
354748000 [fault reason 06] PTE Read access is not set
[ 3236.831863] DMAR: DRHD: handling fault status reg 302
[ 3236.837503] DMAR: [DMA Read] Request device [06:10.0] fault addr
370b7c000 [fault reason 06] PTE Read access is not set
[ 3237.647806] vfio-pci :06:10.0: timed out waiting for pending
transaction; performing function level reset anyway

> This is Mellanox driver error in another nested VM.
> [2.481694] mlx4_core: Initializing :00:04.0
> [3.519422] mlx4_core :00:04.0: Installed FW has unsupported
> command interface revision 0
> [3.537769] mlx4_core :00:04.0: (Installed FW version is 0.0.000)
> [3.551733] mlx4_core :00:04.0: This driver version supports
> only revisions 2 to 3
> [3.568758] mlx4_core :00:04.0: QUERY_FW command failed, aborting
> [3.582789] mlx4_core :00:04.0: Failed to init fw, aborting.
>
> The host showed similar messages as above.
>
> I wonder what could be the cause of these errors. Please let me know
> if further information is needed.
>
> [1] https://wiki.qemu.org/Features/VT-d
>
> Thanks,
> Jintack




Re: [Qemu-devel] Assigning network devices to nested VMs results in driver errors in nested VMs

2018-02-14 Thread Jintack Lim
On Wed, Feb 14, 2018 at 12:36 AM, Peter Xu <pet...@redhat.com> wrote:
> On Tue, Feb 13, 2018 at 11:44:09PM -0500, Jintack Lim wrote:
>> Hi,
>>
>> I'm trying to assign network devices to nested VMs on x86 using KVM,
>> but I got network device driver errors in the nested VMs. (I've tried
>> this about an year ago when vIOMMU patches were not upstreamed, and I
>> got similar errors at that time.)
>>
>> This could be network driver issues, but I'd like to get some help if
>> somebody encountered similar issues.
>>
>> I'm using v4.15.0 kernel and v2.11.0 QEMU, and I followed this [1]
>> guide. I had no problem with assigning devices to the first level VMs
>> (L1 VMs). And I also checked that the devices were assigned to nested
>> VMs with the lspci command in the nested VMs. But network device
>> drivers failed to initialize the device. I tried two network cards -
>> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection and
>> Mellanox Technologies MT27500 Family.
>>
>> Intel driver error in the nested VM looks like this.
>> [1.939552] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
>> version 5.1.0-k
>> [1.949796] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
>> [2.210024] ixgbe :00:04.0: HW Init failed: -12
>> [2.218144] ixgbe: probe of :00:04.0 failed with error -12
>>
>> and I saw lots of these messages in the host (L0) kernel log when
>> booting the nested VM.
>>
>> [ 1557.404173] DMAR: DRHD: handling fault status reg 102
>> [ 1557.409813] DMAR: [DMA Read] Request device [06:00.0] fault addr
>> 9 [fault reason 06] PTE Read access is not set
>> [ 1561.383957] DMAR: DRHD: handling fault status reg 202
>> [ 1561.389598] DMAR: [DMA Read] Request device [06:00.0] fault addr
>> 9 [fault reason 06] PTE Read access is not set
>>
>> This is Mellanox driver error in another nested VM.
>> [2.481694] mlx4_core: Initializing :00:04.0
>> [3.519422] mlx4_core :00:04.0: Installed FW has unsupported
>> command interface revision 0
>> [3.537769] mlx4_core :00:04.0: (Installed FW version is 0.0.000)
>> [3.551733] mlx4_core :00:04.0: This driver version supports
>> only revisions 2 to 3
>> [3.568758] mlx4_core :00:04.0: QUERY_FW command failed, aborting
>> [3.582789] mlx4_core :00:04.0: Failed to init fw, aborting.
>>
>> The host showed similar messages as above.
>>
>> I wonder what could be the cause of these errors. Please let me know
>> if further information is needed.
>>
>> [1] https://wiki.qemu.org/Features/VT-d
>
> Hi, Jintack,

Hi Peter,

>
> Thanks for reporting the problem.
>
> I haven't been playing with nested assignment much recently (and even
> before), but I think I encountered similar problem too in the past.

Oh, that's good to hear that :)

>
> Will let you know if I had any progress, but it's possibly not gonna
> happen in a few days since there'll be a whole week holiday starting
> from tomorrow (which is Chinese Spring Festival).

Thanks a lot. Enjoy Lunar New Year!

>
> --
> Peter Xu
>




[Qemu-devel] Assigning network devices to nested VMs results in driver errors in nested VMs

2018-02-13 Thread Jintack Lim
Hi,

I'm trying to assign network devices to nested VMs on x86 using KVM,
but I got network device driver errors in the nested VMs. (I tried
this about a year ago, when the vIOMMU patches were not yet upstreamed, and I
got similar errors at that time.)

This could be network driver issues, but I'd like to get some help if
somebody encountered similar issues.

I'm using v4.15.0 kernel and v2.11.0 QEMU, and I followed this [1]
guide. I had no problem with assigning devices to the first level VMs
(L1 VMs). And I also checked that the devices were assigned to nested
VMs with the lspci command in the nested VMs. But network device
drivers failed to initialize the device. I tried two network cards -
Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection and
Mellanox Technologies MT27500 Family.

Intel driver error in the nested VM looks like this.
[1.939552] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
version 5.1.0-k
[1.949796] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
[2.210024] ixgbe :00:04.0: HW Init failed: -12
[2.218144] ixgbe: probe of :00:04.0 failed with error -12

and I saw lots of these messages in the host (L0) kernel log when
booting the nested VM.

[ 1557.404173] DMAR: DRHD: handling fault status reg 102
[ 1557.409813] DMAR: [DMA Read] Request device [06:00.0] fault addr
9 [fault reason 06] PTE Read access is not set
[ 1561.383957] DMAR: DRHD: handling fault status reg 202
[ 1561.389598] DMAR: [DMA Read] Request device [06:00.0] fault addr
9 [fault reason 06] PTE Read access is not set

This is Mellanox driver error in another nested VM.
[2.481694] mlx4_core: Initializing :00:04.0
[3.519422] mlx4_core :00:04.0: Installed FW has unsupported
command interface revision 0
[3.537769] mlx4_core :00:04.0: (Installed FW version is 0.0.000)
[3.551733] mlx4_core :00:04.0: This driver version supports
only revisions 2 to 3
[3.568758] mlx4_core :00:04.0: QUERY_FW command failed, aborting
[3.582789] mlx4_core :00:04.0: Failed to init fw, aborting.

The host showed similar messages as above.

I wonder what could be the cause of these errors. Please let me know
if further information is needed.

[1] https://wiki.qemu.org/Features/VT-d

Thanks,
Jintack




Re: [Qemu-devel] iommu emulation

2017-03-02 Thread Jintack Lim
On Thu, Mar 2, 2017 at 5:20 PM, Bandan Das <b...@redhat.com> wrote:
> Jintack Lim <jint...@cs.columbia.edu> writes:
>
>> [cc Bandan]
>>
>> On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jint...@cs.columbia.edu>
>> wrote:
>>
>>>
>>>
>>> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
>>> alex.william...@redhat.com> wrote:
> ...
>>>
>>
>> I've tried another network device on a different machine. It has "Intel
>> Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
>> controller. I got the same problem of getting the network device
>> initialization failure in L2. I think I'm missing something since I heard
>> from Bandan that he had no problem to assign a device to L2 with ixgbe.
>>
>> This is the error message from dmesg in L2.
>>
>> [3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
>> version 4.2.1-k
>> [3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
>> [3.964875] ixgbe :00:02.0: HW Init failed: -12
>> [3.972362] ixgbe: probe of :00:02.0 failed with error -12
>>
>> I checked that L2 indeed had that device.
>> root@guest0:~# lspci
>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
>> Controller
>> 00:01.0 VGA compatible controller: Device 1234: (rev 02)
>> 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
>> Network Connection (rev 01)
>
> Jintack, any progress with this ?

Not much, unfortunately.

>
> I am testing on a X540-AT2 and I see a different behavior. It appears
> config succeeds but the driver keeps resetting the device due to a Tx
> hang:

Thanks for your effort!

>
> [ 568.612391 ] ixgbe :00:03.0 enp0s3: tx hang 38 detected on queue 0,
> resetting adapter
> [ 568.612393 ]  ixgbe :00:03.0 enp0s3: initiating reset due to tx
> timeout
> [ 568.612397 ]  ixgbe :00:03.0 enp0s3: Reset adapter
>
> This may be device specific but I think the actual behavior you see is
> also dependent on the ixgbe driver in the guest. Are you on a recent
> kernel ? Also, can you point me to the hack (by Peter) that you have
> mentioned above ?

I was using 4.6.0-rc on the machine with Mellanox device, and
4.10.0-rc on the machine with Intel device. L0, L1 and L2 had the same
version.

This is the initial hack from Peter,
--8<---
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..bacd302 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)

 }

-/* Cleanup chain head ID if necessary */
-if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xffff) {
-pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-}
-
 g_free(config);
 return;
 }
-->8---

and I believe this is the commit merged into QEMU repo.

commit d0d1cd70d10639273e2a23870e7e7d80b2bc4e21
Author: Alex Williamson <alex.william...@redhat.com>
Date:   Wed Feb 22 13:19:58 2017 -0700

vfio/pci: Improve extended capability comments, skip masked caps


Thanks,
Jintack

>
> Thanks,
> Bandan
>
>> I'm describing steps I took, so if you notice something wrong, PLEASE let
>> me know.
>>
>> 1. [L0] Check the device with lspci. Result is [1]
>> 2. [L0] Unbind from the original driver and bind to vfio-pci driver
>> following [2][3]
>> 3. [L0] Start L1 with this script. [4]
>> 4. [L1] L1 is able to use the network device.
>> 5. [L1] Unbind from the original driver and bind to vfio-pci driver same as
>> the step 2.
>> 6. [L1] Start L2 with this script. [5]
>> 7. [L2] Got the init failure error message above.
>>
>> [1] https://paste.ubuntu.com/24055745/
>> [2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
>> [3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
>> [4] https://paste.ubuntu.com/24055715/
>> [5] https://paste.ubuntu.com/24055720/
>>
>> Thanks,
>> Jintack
>>
>>
>>>
>>>
>>>> Alex
>>>>
>>>>
>>>
>




Re: [Qemu-devel] iommu emulation

2017-02-23 Thread Jintack Lim
[cc Bandan]

On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jint...@cs.columbia.edu>
wrote:

>
>
> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
> alex.william...@redhat.com> wrote:
>
>> On Thu, 16 Feb 2017 10:28:39 +0800
>> Peter Xu <pet...@redhat.com> wrote:
>>
>> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
>> >
>> > [...]
>> >
>> > > > Alex, do you like something like below to fix above issue that
>> Jintack
>> > > > has encountered?
>> > > >
>> > > > (note: this code is not for compile, only trying show what I
>> mean...)
>> > > >
>> > > > --8<---
>> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> > > > index 332f41d..4dca631 100644
>> > > > --- a/hw/vfio/pci.c
>> > > > +++ b/hw/vfio/pci.c
>> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >   */
>> > > >  config = g_memdup(pdev->config, vdev->config_size);
>> > > >
>> > > > -/*
>> > > > - * Extended capabilities are chained with each pointing to the
>> next, so we
>> > > > - * can drop anything other than the head of the chain simply
>> by modifying
>> > > > - * the previous next pointer.  For the head of the chain, we
>> can modify the
>> > > > - * capability ID to something that cannot match a valid
>> capability.  ID
>> > > > - * 0 is reserved for this since absence of capabilities is
>> indicated by
>> > > > - * 0 for the ID, version, AND next pointer.  However,
>> pcie_add_capability()
>> > > > - * uses ID 0 as reserved for list management and will
>> incorrectly match and
>> > > > - * assert if we attempt to pre-load the head of the chain with
>> this ID.
>> > > > - * Use ID 0xffff temporarily since it is also seems to be
>> reserved in
>> > > > - * part for identifying absence of capabilities in a root
>> complex register
>> > > > - * block.  If the ID still exists after adding capabilities,
>> switch back to
>> > > > - * zero.  We'll mark this entire first dword as emulated for
>> this purpose.
>> > > > - */
>> > > > -pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
>> > > > - PCI_EXT_CAP(0xffff, 0, 0));
>> > > > -pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -pci_set_long(vdev->emulated_config_bits +
>> PCI_CONFIG_SPACE_SIZE, ~0);
>> > > > -
>> > > >  for (next = PCI_CONFIG_SPACE_SIZE; next;
>> > > >   next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
>> > > >  header = pci_get_long(config + next);
>> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >  switch (cap_id) {
>> > > >  case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse
>> OVMF */
>> > > >  case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function
>> virtualization */
>> > > > +/* keep this ecap header (4 bytes), but mask cap_id to
>> 0xffff */
>> > > > +...
>> > > >  trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name,
>> cap_id, next);
>> > > >  break;
>> > > >  default:
>> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >
>> > > >  }
>> > > >
>> > > > -/* Cleanup chain head ID if necessary */
>> > > > -if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) ==
>> 0xffff) {
>> > > > -pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -}
>> > > > -
>> > > >  g_free(config);
>> > > >  return;
>> > > >  }
>> > > > ->8-
>> > > >
>> > > > Since after all we need the assumption that 0xffff is reserved for
>> > > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
>> > > > which is imho error-prone and hacky.
>> > >
>> > > This doesn't fix the bug, whic

Re: [Qemu-devel] [PATCH] intel_iommu: make sure its init before PCI dev

2017-02-22 Thread Jintack Lim
On Wed, Feb 22, 2017 at 6:42 AM, Jintack Lim <jint...@cs.columbia.edu>
wrote:

>
>
> On Wed, Feb 22, 2017 at 12:49 AM, Peter Xu <pet...@redhat.com> wrote:
>
>> Intel vIOMMU devices are created with "-device" parameter, while here
>> actually we need to make sure this device will be created before some
>> other PCI devices (like vfio-pci devices) so that we know iommu_fn will
>> be setup correctly before realizations of those PCI devices.
>>
>> Here we do explicit check to make sure intel-iommu device will be inited
>> before all the rest of the PCI devices. This is done by checking against
>> the devices dangled under current root PCIe bus and we should see
>> nothing there besides integrated ICH9 ones.
>>
>> If the user violated this rule, we abort the program.
>>
>
> Hi Peter,
>
> After applying this patch, qemu gave the following error and was
> terminated, but I believe I passed parameters in a right order?
>

FYI, I've applied this patch to your vtd-vfio-enablement-v7 branch.


>
> [kvm-node ~]$sudo qemu-system-x86_64 \
> > -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > -M q35,accel=kvm,kernel-irqchip=split \
> > -m 12G \
> > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> > -smp 4,sockets=4,cores=1,threads=1 \
> > -device vfio-pci,host=08:00.0
> qemu-system-x86_64: -device intel-iommu,intremap=on,eim=off,caching-mode=on:
> Please init intel-iommu before other PCI devices
>
>  Thanks,
> Jintack
>
>
>> Maybe one day we will be able to manage the ordering of device
>> initialization, and then we can grant VT-d devices a higher init
>> priority. But before that, let's have this explicit check to make sure
>> of it.
>>
>> Signed-off-by: Peter Xu <pet...@redhat.com>
>> ---
>>  hw/i386/intel_iommu.c | 40 
>>  1 file changed, 40 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 22d8226..db74124 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -31,6 +31,7 @@
>>  #include "hw/i386/apic-msidef.h"
>>  #include "hw/boards.h"
>>  #include "hw/i386/x86-iommu.h"
>> +#include "hw/i386/ich9.h"
>>  #include "hw/pci-host/q35.h"
>>  #include "sysemu/kvm.h"
>>  #include "hw/i386/apic_internal.h"
>> @@ -2560,6 +2561,41 @@ static bool vtd_decide_config(IntelIOMMUState *s,
>> Error **errp)
>>  return true;
>>  }
>>
>> +static bool vtd_has_inited_pci_devices(PCIBus *bus, Error **errp)
>> +{
>> +int i;
>> +uint8_t func;
>> +
>> +/* We check against root bus */
>> +assert(bus && pci_bus_is_root(bus));
>> +
>> +/*
>> + * We need to make sure vIOMMU device is created before other PCI
>> + * devices other than the integrated ICH9 ones, so that they can
>> + * get correct iommu_fn setup even during its realize(). Some
>> + * devices (e.g., vfio-pci) will need a correct iommu_fn to work.
>> + */
>> +for (i = 1; i < PCI_FUNC_MAX * PCI_SLOT_MAX; i++) {
>> +/* Skip the checking against ICH9 integrated devices */
>> +if (PCI_SLOT(i) == ICH9_LPC_DEV) {
>> +func = PCI_FUNC(i);
>> +if (func == ICH9_LPC_FUNC ||
>> +func == ICH9_SATA1_FUNC ||
>> +func == ICH9_SMB_FUNC) {
>> +continue;
>> +}
>> +}
>> +
>> +if (bus->devices[i]) {
>> +error_setg(errp, "Please init intel-iommu before "
>> +   "other PCI devices");
>> +return true;
>> +}
>> +}
>> +
>> +return false;
>> +}
>> +
>>  static void vtd_realize(DeviceState *dev, Error **errp)
>>  {
>>  PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
>> @@ -2567,6 +2603,10 @@ static void vtd_realize(DeviceState *dev, Error
>> **errp)
>>  IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>>  X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
>>
>> +if (vtd_has_inited_pci_devices(bus, errp)) {
>> +return;
>> +}
>> +
>>  VTD_DPRINTF(GENERAL, "");
>>  x86_iommu->type = TYPE_INTEL;
>>
>> --
>> 2.7.4
>>
>>
>>
>


Re: [Qemu-devel] [PATCH] vfio/pci: Improve extended capability comments, skip masked caps

2017-02-22 Thread Jintack Lim
On Tue, Feb 21, 2017 at 10:08 PM, Peter Xu <pet...@redhat.com> wrote:

> [cc Jintack]
>
> On Tue, Feb 21, 2017 at 02:43:03PM -0700, Alex Williamson wrote:
> > Since commit 4bb571d857d9 ("pci/pcie: don't assume cap id 0 is
> > reserved") removes the internal use of extended capability ID 0, the
> > comment here becomes invalid.  However, peeling back the onion, the
> > code is still correct and we still can't seed the capability chain
> > with ID 0, unless we want to muck with using the version number to
> > force the header to be non-zero, which is much uglier to deal with.
> > The comment also now covers some of the subtleties of using cap ID 0,
> > such as transparently indicating absence of capabilities if none are
> > added.  This doesn't detract from the correctness of the referenced
> > commit as vfio in the kernel also uses capability ID zero to mask
> > capabilties.  In fact, we should skip zero capabilities precisely
> > because the kernel might also expose such a capability at the head
> > position and re-introduce the problem.
> >
> > Signed-off-by: Alex Williamson <alex.william...@redhat.com>
> > Cc: Peter Xu <pet...@redhat.com>
> > Cc: Michael S. Tsirkin <m...@redhat.com>
> > ---
> >  hw/vfio/pci.c |   31 +--
> >  1 file changed, 21 insertions(+), 10 deletions(-)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index f2ba9b6cfafc..03a3d0154976 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1880,16 +1880,26 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >  /*
> >   * Extended capabilities are chained with each pointing to the
> next, so we
> >   * can drop anything other than the head of the chain simply by
> modifying
> > - * the previous next pointer.  For the head of the chain, we can
> modify the
> > - * capability ID to something that cannot match a valid
> capability.  ID
> > - * 0 is reserved for this since absence of capabilities is
> indicated by
> > - * 0 for the ID, version, AND next pointer.  However,
> pcie_add_capability()
> > - * uses ID 0 as reserved for list management and will incorrectly
> match and
> > - * assert if we attempt to pre-load the head of the chain with this
> ID.
> > - * Use ID 0xffff temporarily since it is also seems to be reserved
> in
> > - * part for identifying absence of capabilities in a root complex
> register
> > - * block.  If the ID still exists after adding capabilities, switch
> back to
> > - * zero.  We'll mark this entire first dword as emulated for this
> purpose.
> > + * the previous next pointer.  Seed the head of the chain here such
> that
> > + * we can simply skip any capabilities we want to drop below,
> regardless
> > + * of their position in the chain.  If this stub capability still
> exists
> > + * after we add the capabilities we want to expose, update the
> capability
> > + * ID to zero.  Note that we cannot seed with the capability header
> being
> > + * zero as this conflicts with definition of an absent capability
> chain
> > + * and prevents capabilities beyond the head of the list from being
> added.
> > + * By replacing the dummy capability ID with zero after walking the
> device
> > + * chain, we also transparently mark extended capabilities as
> absent if
> > + * no capabilities were added.  Note that the PCIe spec defines an
> absence
> > + * of extended capabilities to be determined by a value of zero for
> the
> > + * capability ID, version, AND next pointer.  A non-zero next
> pointer
> > + * should be sufficient to indicate additional capabilities are
> present,
> > + * which will occur if we call pcie_add_capability() below.  The
> entire
> > + * first dword is emulated to support this.
> > + *
> > + * NB. The kernel side does similar masking, so be prepared that our
> > + * view of the device may also contain a capability ID zero in the
> head
> > + * of the chain.  Skip it for the same reason that we cannot seed
> the
> > + * chain with a zero capability.
> >   */
> >  pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> >   PCI_EXT_CAP(0xffff, 0, 0));
> > @@ -1915,6 +1925,7 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >     PCI_EXT_CAP_NEXT_MASK);
> >
> >  switch (cap_id) {
> > +case 0: /* kernel masked capability */
>

Re: [Qemu-devel] [PATCH] intel_iommu: make sure its init before PCI dev

2017-02-22 Thread Jintack Lim
On Wed, Feb 22, 2017 at 12:49 AM, Peter Xu  wrote:

> Intel vIOMMU devices are created with "-device" parameter, while here
> actually we need to make sure this device will be created before some
> other PCI devices (like vfio-pci devices) so that we know iommu_fn will
> be setup correctly before realizations of those PCI devices.
>
> Here we do explicit check to make sure intel-iommu device will be inited
> before all the rest of the PCI devices. This is done by checking against
> the devices dangled under current root PCIe bus and we should see
> nothing there besides integrated ICH9 ones.
>
> If the user violated this rule, we abort the program.
>

Hi Peter,

After applying this patch, QEMU gave the following error and terminated,
but I believe I passed the parameters in the right order?

[kvm-node ~]$sudo qemu-system-x86_64 \
> -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> -M q35,accel=kvm,kernel-irqchip=split \
> -m 12G \
> -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> -smp 4,sockets=4,cores=1,threads=1 \
> -device vfio-pci,host=08:00.0
qemu-system-x86_64: -device
intel-iommu,intremap=on,eim=off,caching-mode=on: Please init intel-iommu
before other PCI devices

 Thanks,
Jintack


> Maybe one day we will be able to manage the ordering of device
> initialization, and then we can grant VT-d devices a higher init
> priority. But before that, let's have this explicit check to make sure
> of it.
>
> Signed-off-by: Peter Xu 
> ---
>  hw/i386/intel_iommu.c | 40 
>  1 file changed, 40 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 22d8226..db74124 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -31,6 +31,7 @@
>  #include "hw/i386/apic-msidef.h"
>  #include "hw/boards.h"
>  #include "hw/i386/x86-iommu.h"
> +#include "hw/i386/ich9.h"
>  #include "hw/pci-host/q35.h"
>  #include "sysemu/kvm.h"
>  #include "hw/i386/apic_internal.h"
> @@ -2560,6 +2561,41 @@ static bool vtd_decide_config(IntelIOMMUState *s,
> Error **errp)
>  return true;
>  }
>
> +static bool vtd_has_inited_pci_devices(PCIBus *bus, Error **errp)
> +{
> +int i;
> +uint8_t func;
> +
> +/* We check against root bus */
> +assert(bus && pci_bus_is_root(bus));
> +
> +/*
> + * We need to make sure vIOMMU device is created before other PCI
> + * devices other than the integrated ICH9 ones, so that they can
> + * get correct iommu_fn setup even during its realize(). Some
> + * devices (e.g., vfio-pci) will need a correct iommu_fn to work.
> + */
> +for (i = 1; i < PCI_FUNC_MAX * PCI_SLOT_MAX; i++) {
> +/* Skip the checking against ICH9 integrated devices */
> +if (PCI_SLOT(i) == ICH9_LPC_DEV) {
> +func = PCI_FUNC(i);
> +if (func == ICH9_LPC_FUNC ||
> +func == ICH9_SATA1_FUNC ||
> +func == ICH9_SMB_FUNC) {
> +continue;
> +}
> +}
> +
> +if (bus->devices[i]) {
> +error_setg(errp, "Please init intel-iommu before "
> +   "other PCI devices");
> +return true;
> +}
> +}
> +
> +return false;
> +}
> +
>  static void vtd_realize(DeviceState *dev, Error **errp)
>  {
>  PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
> @@ -2567,6 +2603,10 @@ static void vtd_realize(DeviceState *dev, Error
> **errp)
>  IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>  X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
>
> +if (vtd_has_inited_pci_devices(bus, errp)) {
> +return;
> +}
> +
>  VTD_DPRINTF(GENERAL, "");
>  x86_iommu->type = TYPE_INTEL;
>
> --
> 2.7.4
>
>
>


Re: [Qemu-devel] iommu emulation

2017-02-21 Thread Jintack Lim
On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson  wrote:

> On Thu, 16 Feb 2017 10:28:39 +0800
> Peter Xu  wrote:
>
> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
> >
> > [...]
> >
> > > > Alex, do you like something like below to fix above issue that
> Jintack
> > > > has encountered?
> > > >
> > > > (note: this code is not for compile, only trying show what I mean...)
> > > >
> > > > --8<---
> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > index 332f41d..4dca631 100644
> > > > --- a/hw/vfio/pci.c
> > > > +++ b/hw/vfio/pci.c
> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >   */
> > > >  config = g_memdup(pdev->config, vdev->config_size);
> > > >
> > > > -/*
> > > > - * Extended capabilities are chained with each pointing to the
> next, so we
> > > > - * can drop anything other than the head of the chain simply by
> modifying
> > > > - * the previous next pointer.  For the head of the chain, we
> can modify the
> > > > - * capability ID to something that cannot match a valid
> capability.  ID
> > > > - * 0 is reserved for this since absence of capabilities is
> indicated by
> > > > - * 0 for the ID, version, AND next pointer.  However,
> pcie_add_capability()
> > > > - * uses ID 0 as reserved for list management and will
> incorrectly match and
> > > > - * assert if we attempt to pre-load the head of the chain with
> this ID.
> > > > - * Use ID 0xffff temporarily since it is also seems to be
> reserved in
> > > > - * part for identifying absence of capabilities in a root
> complex register
> > > > - * block.  If the ID still exists after adding capabilities,
> switch back to
> > > > - * zero.  We'll mark this entire first dword as emulated for
> this purpose.
> > > > - */
> > > > -pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > > > - PCI_EXT_CAP(0xffff, 0, 0));
> > > > -pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -pci_set_long(vdev->emulated_config_bits +
> PCI_CONFIG_SPACE_SIZE, ~0);
> > > > -
> > > >  for (next = PCI_CONFIG_SPACE_SIZE; next;
> > > >   next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> > > >  header = pci_get_long(config + next);
> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >  switch (cap_id) {
> > > >  case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse
> OVMF */
> > > >  case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function
> virtualization */
> > > > +/* keep this ecap header (4 bytes), but mask cap_id to
> 0xffff */
> > > > +...
> > > >  trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name,
> cap_id, next);
> > > >  break;
> > > >  default:
> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >
> > > >  }
> > > >
> > > > -/* Cleanup chain head ID if necessary */
> > > > -if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) ==
> 0xffff) {
> > > > -pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -}
> > > > -
> > > >  g_free(config);
> > > >  return;
> > > >  }
> > > > ->8-
> > > >
> > > > Since after all we need the assumption that 0xffff is reserved for
> > > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> > > > which is imho error-prone and hacky.
> > >
> > > This doesn't fix the bug, which is that pcie_add_capability() uses a
> > > valid capability ID for it's own internal tracking.  It's only doing
> > > this to find the end of the capability chain, which we could do in a
> > > spec complaint way by looking for a zero next pointer.  Fix that and
> > > then vfio doesn't need to do this set to 0xffff then back to zero
> > > nonsense at all.  Capability ID zero is valid.  Thanks,
> >
> > Yeah I see Michael's fix on the capability list stuff. However, imho
> > these are two different issues? Or say, even if with that patch, we
> > should still need this hack (first 0x0, then 0xffff) right? Since
> > looks like that patch didn't solve the problem if the first pcie ecap
> > is masked at 0x100.
>
> I thought the problem was that QEMU in the host exposes a device with a
> capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
> capability ID of 0 because that's how it finds the end of the chain.
> Therefore if we make QEMU not use capability ID 0 for internal
> purposes, things work.  vfio using 0xffff and swapping back to 0x0
> becomes unnecessary, but doesn't hurt anything.  Thanks,
>

I've applied Peter's hack and Michael's patch below, but still can't use
the assigned device in L2.
 commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4
Author: Michael S. Tsirkin 
Date:   Wed Feb 15 22:37:45 2017 +0200

pci/pcie: don't assume cap id 0 is reserved
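
To make the chain-walking point in this thread concrete: a spec-level
walker only needs the next pointer to find the end of the extended
capability list, and, per the kernel-masking note above, it should skip
entries whose ID is 0 rather than treat them as terminators. A small
sketch over a config-space snapshot follows (not QEMU's actual code);
cfg is assumed to be a little-endian copy of the device's 4 KB config
space.

--8<---
/* Sketch: find the tail of the extended capability chain in a 4 KB
 * config-space snapshot, terminating on a zero next pointer and skipping
 * ID-0 (masked) entries, as discussed above.  Not QEMU's actual code. */
#include <stdint.h>

#define PCIE_ECAP_START  0x100
#define ECAP_ID(h)       ((h) & 0xffff)
#define ECAP_NEXT(h)     (((h) >> 20) & 0xffc)

static uint32_t ecap_hdr(const uint8_t *cfg, unsigned off)
{
    return cfg[off] | (cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Returns the offset of the last (non-masked) capability in the chain,
 * or 0 if the device exposes no extended capabilities at all. */
static unsigned ecap_chain_tail(const uint8_t *cfg)
{
    unsigned off = PCIE_ECAP_START, last = 0;

    while (off && off + 4 <= 4096) {
        uint32_t hdr = ecap_hdr(cfg, off);

        if (hdr == 0 && off == PCIE_ECAP_START) {
            return 0;            /* empty chain: ID, version and next all 0 */
        }
        if (ECAP_ID(hdr) != 0) {
            last = off;          /* ID 0 = masked entry, keep walking */
        }
        off = ECAP_NEXT(hdr);    /* a zero next pointer ends the chain */
    }
    return last;
}
-->8---

With end-of-chain detection along these lines, pcie_add_capability() would
not need to reserve ID 0 for list management, which is what makes the
0xffff seeding in vfio unnecessary, as discussed above.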


Re: [Qemu-devel] iommu emulation

2017-02-15 Thread Jintack Lim
On Wed, Feb 15, 2017 at 5:50 PM, Alex Williamson <alex.william...@redhat.com
> wrote:

> On Wed, 15 Feb 2017 17:05:35 -0500
> Jintack Lim <jint...@cs.columbia.edu> wrote:
>
> > On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu <pet...@redhat.com> wrote:
> >
> > > On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:
> > >
> > > [...]
> > >
> > > > > > >> > I misunderstood what you said?
> > > > > > >
> > > > > > > I failed to understand why an vIOMMU could help boost
> performance.
> > > :(
> > > > > > > Could you provide your command line here so that I can try to
> > > > > > > reproduce?
> > > > > >
> > > > > > Sure. This is the command line to launch L1 VM
> > > > > >
> > > > > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > > > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > > > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host
> \
> > > > > > -smp 4,sockets=4,cores=1,threads=1 \
> > > > > > -device vfio-pci,host=08:00.0,id=net0
> > > > > >
> > > > > > And this is for L2 VM.
> > > > > >
> > > > > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > > > > -m 8G \
> > > > > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > > > > -device vfio-pci,host=00:03.0,id=net0
> > > > >
> > > > > ... here looks like these are command lines for L1/L2 guest, rather
> > > > > than L1 guest with/without vIOMMU?
> > > > >
> > > >
> > > > That's right. I thought you were asking about command lines for L1/L2
> > > guest
> > > > :(.
> > > > I think I made the confusion, and as I said above, I didn't mean to
> talk
> > > > about the performance of L1 guest with/without vIOMMO.
> > > > We can move on!
> > >
> > > I see. Sure! :-)
> > >
> > > [...]
> > >
> > > > >
> > > > > Then, I *think* above assertion you encountered would fail only if
> > > > > prev == 0 here, but I still don't quite sure why was that
> happening.
> > > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in
> your L1
> > > > > guest?
> > > > >
> > > >
> > > > Sure. This is from my L1 guest.
> > >
> > > Hmm... I think I found the problem...
> > >
> > > >
> > > > root@guest0:~# lspci -vvv -s 00:03.0
> > > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > > > [ConnectX-3]
> > > > Subsystem: Mellanox Technologies Device 0050
> > > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > Stepping- SERR+ FastB2B- DisINTx+
> > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > Latency: 0, Cache Line Size: 64 bytes
> > > > Interrupt: pin A routed to IRQ 23
> > > > Region 0: Memory at fe90 (64-bit, non-prefetchable) [size=1M]
> > > > Region 2: Memory at fe00 (64-bit, prefetchable) [size=8M]
> > > > Expansion ROM at fea0 [disabled] [size=1M]
> > > > Capabilities: [40] Power Management version 3
> > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-
> > > )
> > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > > Capabilities: [48] Vital Product Data
> > > > Product Name: CX354A - ConnectX-3 QSFP
> > > > Read-only fields:
> > > > [PN] Part number: MCX354A-FCBT
> > > > [EC] Engineering changes: A4
> > > > [SN] Serial number: MT1346X00791
> > > > [V0] Vendor specific: PCIe Gen3 x8
> > > > [RV] Reserved: checksum good, 0 byte(s) reserved
> > > > Read/write fields:
> > > > [V1] Vendor specific: N/A
> > > > [YA] Asset tag: N/A
> > > > [RW] Read-write area: 105 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-writ

Re: [Qemu-devel] iommu emulation

2017-02-09 Thread Jintack Lim
On Wed, Feb 8, 2017 at 10:52 PM, Peter Xu <pet...@redhat.com> wrote:
> (cc qemu-devel and Alex)
>
> On Wed, Feb 08, 2017 at 09:14:03PM -0500, Jintack Lim wrote:
>> On Wed, Feb 8, 2017 at 10:49 AM, Jintack Lim <jint...@cs.columbia.edu> wrote:
>> > Hi Peter,
>> >
>> > On Tue, Feb 7, 2017 at 10:12 PM, Peter Xu <pet...@redhat.com> wrote:
>> >> On Tue, Feb 07, 2017 at 02:16:29PM -0500, Jintack Lim wrote:
>> >>> Hi Peter and Michael,
>> >>
>> >> Hi, Jintack,
>> >>
>> >>>
>> >>> I would like to get some help to run a VM with the emulated iommu. I
>> >>> have tried for a few days to make it work, but I couldn't.
>> >>>
>> >>> What I want to do eventually is to assign a network device to the
>> >>> nested VM so that I can measure the performance of applications
>> >>> running in the nested VM.
>> >>
>> >> Good to know that you are going to use [4] to do something useful. :-)
>> >>
>> >> However, could I ask why you want to measure the performance of
>> >> application inside nested VM rather than host? That's something I am
>> >> just curious about, considering that virtualization stack will
>> >> definitely introduce overhead along the way, and I don't know whether
>> >> that'll affect your measurement to the application.
>> >
>> > I have added nested virtualization support to KVM/ARM, which is under
>> > review now. I found that application performance running inside the
>> > nested VM is really bad both on ARM and x86, and I'm trying to figure
>> > out what's the real overhead. I think one way to figure that out is to
>> > see if the direct device assignment to L2 helps to reduce the overhead
>> > or not.
>
> I see. IIUC you are trying to use an assigned device to replace your
> old emulated device in L2 guest to see whether performance will drop
> as well, right? Then at least I can know that you won't need a nested
> VT-d here (so we should not need a vIOMMU in L2 guest).

That's right.

>
> In that case, I think we can give it a shot, considering that L1 guest
> will use vfio-pci for that assigned device as well, and when L2 guest
> QEMU uses this assigned device, it'll use a static mapping (just to
> map the whole GPA for L2 guest) there, so even if you are using a
> kernel driver in L2 guest with your to-be-tested application, we
> should still be having a static mapping in vIOMMU in L1 guest, which
> is IMHO fine from performance POV.
>
> I cced Alex in case I missed anything here.
>
>> >
>> >>
>> >> Another thing to mention is that (in case you don't know that), device
>> >> assignment with VT-d protection would be even slower than generic VMs
>> >> (without Intel IOMMU protection) if you are using generic kernel
>> >> drivers in the guest, since we may need real-time DMA translation on
>> >> data path.
>> >>
>> >
>> > So, this is the comparison between using virtio and using the device
>> > assignment for L1? I have tested application performance running
>> > inside L1 with and without iommu, and I found that the performance is
>> > better with iommu. I thought whether the device is assigned to L1 or
>> > L2, the DMA translation is done by iommu, which is pretty fast? Maybe
>> > I misunderstood what you said?
>
> I failed to understand why an vIOMMU could help boost performance. :(
> Could you provide your command line here so that I can try to
> reproduce?

Sure. This is the command line to launch L1 VM

qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
-m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
-drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
-smp 4,sockets=4,cores=1,threads=1 \
-device vfio-pci,host=08:00.0,id=net0

And this is for L2 VM.

./qemu-system-x86_64 -M q35,accel=kvm \
-m 8G \
-drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
-device vfio-pci,host=00:03.0,id=net0

>
> Besides, what I mentioned above is just in case you don't know that
> vIOMMU will drag down the performance in most cases.
>
> I think here to be more explicit, the overhead of vIOMMU is different
> for assigned devices and emulated ones.
>
>   (1) For emulated devices, the overhead is when we do the
>   translation, or say when we do the DMA operation. We need
>   real-time translation which should drag down the performance.
>
>   (2) For assigned devices (our case), the overhead is when we setup
>   the pages (si