Re: [Qemu-devel] QEMU crashed when reconnecting over iscsi protocol
BTW, the iscsi server I used is scsi-target-utils (https://github.com/fujita/tgt).

Bob Chen wrote on Wed, Dec 19, 2018 at 7:34 PM:
> I looked into the source code, and found some reconnect methods in
> libiscsi. Are they able to work?
>
> QEMU: 2.12.1
> libiscsi: 1.18.0 (https://github.com/sahlberg/libiscsi)
>
> (gdb) f
> #0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at connect.c:461
> 461         memcpy(tmp_iscsi->old_iscsi, iscsi, sizeof(struct iscsi_context));
> (gdb) bt
> #0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at connect.c:461
> #1  0x7fcd956a4ccd in iscsi_service_reconnect_if_loggedin (iscsi=0x7fcd97f206d0) at socket.c:879
> #2  0x7fcd956a50d2 in iscsi_tcp_service (iscsi=0x7fcd97f206d0, revents=) at socket.c:989
> #3  0x7fcd9674b737 in iscsi_process_read (arg=0x7fcd97f1e680) at block/iscsi.c:371
> #4  0x7fcd967e5b47 in aio_dispatch_handlers (ctx=0x7fcd97ee48d0) at util/aio-posix.c:406
> #5  0x7fcd967e5ce9 in aio_dispatch (ctx=0x7fcd97ee48d0) at util/aio-posix.c:437
> #6  0x7fcd967e038c in aio_ctx_dispatch (source=0x7fcd97ee48d0, callback=0, user_data=0x0) at util/async.c:261
> #7  0x7fcd93b1135e in g_main_context_dispatch () from /usr/lib64/libglib-2.0.so.0
> #8  0x7fcd967e424e in glib_pollfds_poll () at util/main-loop.c:215
> #9  0x7fcd967e4368 in os_host_main_loop_wait (timeout=98700) at util/main-loop.c:263
> #10 0x7fcd967e4438 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
> #11 0x7fcd963c4b29 in main_loop () at vl.c:1943
> #12 0x7fcd963cca8d in main (argc=62, argv=0x789f4208, envp=0x789f4400) at vl.c:4734
[Qemu-devel] QEMU crashed when reconnecting over iscsi protocol
I looked into the source code, and found some reconnect methods in libiscsi. Are they able to work?

QEMU: 2.12.1
libiscsi: 1.18.0 (https://github.com/sahlberg/libiscsi)

(gdb) f
#0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at connect.c:461
461         memcpy(tmp_iscsi->old_iscsi, iscsi, sizeof(struct iscsi_context));
(gdb) bt
#0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at connect.c:461
#1  0x7fcd956a4ccd in iscsi_service_reconnect_if_loggedin (iscsi=0x7fcd97f206d0) at socket.c:879
#2  0x7fcd956a50d2 in iscsi_tcp_service (iscsi=0x7fcd97f206d0, revents=) at socket.c:989
#3  0x7fcd9674b737 in iscsi_process_read (arg=0x7fcd97f1e680) at block/iscsi.c:371
#4  0x7fcd967e5b47 in aio_dispatch_handlers (ctx=0x7fcd97ee48d0) at util/aio-posix.c:406
#5  0x7fcd967e5ce9 in aio_dispatch (ctx=0x7fcd97ee48d0) at util/aio-posix.c:437
#6  0x7fcd967e038c in aio_ctx_dispatch (source=0x7fcd97ee48d0, callback=0, user_data=0x0) at util/async.c:261
#7  0x7fcd93b1135e in g_main_context_dispatch () from /usr/lib64/libglib-2.0.so.0
#8  0x7fcd967e424e in glib_pollfds_poll () at util/main-loop.c:215
#9  0x7fcd967e4368 in os_host_main_loop_wait (timeout=98700) at util/main-loop.c:263
#10 0x7fcd967e4438 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
#11 0x7fcd963c4b29 in main_loop () at vl.c:1943
#12 0x7fcd963cca8d in main (argc=62, argv=0x789f4208, envp=0x789f4400) at vl.c:4734
Re: [Qemu-devel] [QEMU + SPDK] The demo in the official document is not working
Problem solved. Got a reply from Intel just now.

-- Forwarded message --
From: Liu, Changpeng <changpeng@intel.com>
Date: 2018-04-23 18:06 GMT+08:00
Subject: [Qemu-devel] [QEMU + SPDK] The demo in the official document is not working
To: "a175818...@gmail.com" <a175818...@gmail.com>

Hi Bob,

The issue was introduced by the following commit: fb20fbb764 "vhost: avoid to start/stop virtqueue which is not ready". When you start with SeaBIOS, the virtio-scsi driver in SeaBIOS will only enumerate 1 I/O queue, but doesn't include the task management and event queues, while the SPDK vhost target must get all 3 queues when starting, so the process gets blocked in SeaBIOS. As a workaround for now, you can start with only the chardev; after booting into the OS, you can use device_add to add the virtio-scsi controller. We are developing the right fix, either in SeaBIOS or SPDK; this will be fixed very soon.

Best Regards,
Changpeng Liu

2018-04-23 16:19 GMT+08:00 Bob Chen <a175818...@gmail.com>:
> Hi,
>
> I was trying to run qemu with spdk, referring to
> http://www.spdk.io/doc/vhost.html#vhost_qemu_config . Steps were strictly
> followed.
>
> # Environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
> # dpdk 17.11.1, qemu 2.11.1
>
> cd spdk
> sudo su
> ulimit -l unlimited
> HUGEMEM=2048 ./scripts/setup.sh
> ./app/vhost/vhost -S /var/tmp -s 1024 -m 0x1 &
> ./scripts/rpc.py construct_nvme_bdev -b Nvme0 -t pcie -a :03:00.0
> ./scripts/rpc.py construct_malloc_bdev 128 4096 -b Malloc0
> ./scripts/rpc.py construct_vhost_scsi_controller --cpumask 0x1 vhost.0
> ./scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Nvme0n1
> ./scripts/rpc.py add_vhost_scsi_lun vhost.0 1 Malloc0
>
> qemu-system-x86_64 -enable-kvm -cpu host -machine pc,accel=kvm -daemonize -vnc :1 \
>     -smp 1 -m 1G -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem0 \
>     -drive file=,if=none,id=disk -device ide-hd,drive=disk,bootindex=0 \
>     -chardev socket,id=spdk_vhost_scsi0,path=/var/tmp/vhost.0 \
>     -device vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0
>
> Unfortunately the demo was not able to work; it stopped at the boot-loader
> screen saying no bootable device. But if I removed the last two lines
> (vhost-user-scsi-pci), the guest could start successfully. So there must
> be something wrong with the spdk device.
>
> Is there anyone in this community who happens to be familiar with spdk?
> Or should I seek help from Intel? I don't know who is responsible for
> maintaining this document.
>
> Thanks,
> Bob
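The workaround in Changpeng's reply can be sketched from the QEMU monitor once the guest has booted — assuming the same chardev id `spdk_vhost_scsi0` from the command line above; this is an illustrative fragment, not a verified recipe:

```
(qemu) device_add vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0
```

With the `-device vhost-user-scsi-pci,...` line dropped from the command line, SeaBIOS never touches the vhost-user queues, and the controller only appears after the guest virtio driver (which sets up all three queues) is running.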
Re: [Qemu-devel] [SPDK] qemu process hung at boot-up, no explicit errors or warnings
2018-04-21 1:34 GMT+08:00 John Snow <js...@redhat.com>:
>
> On 04/20/2018 07:13 AM, Bob Chen wrote:
> > 2.11.1 could work, qemu is no longer occupying 100% CPU. That's
> > interesting...
>
> Does 2.12 use 100% even at the firmware menu? Maybe we're not giving
> this VM long enough to hit the spot that causes it to use 100%.
>
> > Now I can see the starting screen via vnc, says that not a bootable disk.
>
> If it's stopping at the firmware I think the guest is not being taken
> into account, and the firmware is either trying to boot the wrong disk
> or your disk is corrupted. (Or it can't see your disk at all -- I'm
> personally not very familiar with using SPDK.)
> Can you share your command line with us?

I opened another thread to discuss this issue; it looks like there's something wrong with the spdk official doc.

However, 2.12 is still very likely to be buggy. I'm sure I waited long enough, but the CPU never dropped below 100%. The vnc screen stopped at "Loading SeaBIOS ..."

Again, 2.10 and 2.11 are fine. They also reported "No bootable device", but the CPU dropped anyway.

> > Is that because the guest OS's virtio vhost driver is not up to date
> > enough?
> >
> > Thanks,
> > Bob
>
> On the QEMU-devel list and many other technical lists, can you please
> reply in-line instead of replying above? ("top posting")
>
> Thank you,
> --John
>
> > 2018-04-20 1:17 GMT+08:00 John Snow <js...@redhat.com>:
> >
> >     Forwarding to qemu-block.
> >
> >     On 04/19/2018 06:13 AM, Bob Chen wrote:
> >     > Hi,
> >     >
> >     > I was trying to run qemu with spdk, referring to
> >     > http://www.spdk.io/doc/vhost.html#vhost_qemu_config
> >     >
> >     > Everything went well since I had already set up hugepages, vfio, vhost
> >     > targets, vhost-scsi device (vhost-block was also tested), etc., without
> >     > errors or warnings reported.
> >     >
> >     > But at the last step to run qemu, the process would somehow hang, and its
> >     > cpu busy remained at 100%.
> >     >
> >     > I used `perf top -p` to monitor the process activity:
> >     >   18.28% [kernel] [k] vmx_vcpu_run
> >     >    3.06% [kernel] [k] vcpu_enter_guest
> >     >    3.05% [kernel] [k] system_call_after_swapgs
> >     >
> >     > Do you have any ideas about what happened here?
> >     >
> >     > Test environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
> >     > dpdk 17.11.1, qemu 2.12.0-rc2. Detail log is attached within this mail.
> >
> >     Have you tried any other versions? (2.11.1, or 2.12-rc4?)
> >
> >     > Thanks,
> >     > Bob
[Qemu-devel] [QEMU + SPDK] The demo in the official document is not working
Hi,

I was trying to run qemu with spdk, referring to http://www.spdk.io/doc/vhost.html#vhost_qemu_config . Steps were strictly followed.

# Environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
# dpdk 17.11.1, qemu 2.11.1

cd spdk
sudo su
ulimit -l unlimited
HUGEMEM=2048 ./scripts/setup.sh
./app/vhost/vhost -S /var/tmp -s 1024 -m 0x1 &
./scripts/rpc.py construct_nvme_bdev -b Nvme0 -t pcie -a :03:00.0
./scripts/rpc.py construct_malloc_bdev 128 4096 -b Malloc0
./scripts/rpc.py construct_vhost_scsi_controller --cpumask 0x1 vhost.0
./scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Nvme0n1
./scripts/rpc.py add_vhost_scsi_lun vhost.0 1 Malloc0

qemu-system-x86_64 -enable-kvm -cpu host -machine pc,accel=kvm -daemonize -vnc :1 \
    -smp 1 -m 1G -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem0 \
    -drive file=,if=none,id=disk -device ide-hd,drive=disk,bootindex=0 \
    -chardev socket,id=spdk_vhost_scsi0,path=/var/tmp/vhost.0 \
    -device vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0

Unfortunately the demo was not able to work; it stopped at the boot-loader screen saying no bootable device. But if I removed the last two lines (vhost-user-scsi-pci), the guest could start successfully. So there must be something wrong with the spdk device.

Is there anyone in this community who happens to be familiar with spdk? Or should I seek help from Intel? I don't know who is responsible for maintaining this document.

Thanks,
Bob
Re: [Qemu-devel] [SPDK] qemu process hung at boot-up, no explicit errors or warnings
2.11.1 could work; qemu is no longer occupying 100% CPU. That's interesting...

Now I can see the starting screen via vnc; it says not a bootable disk.

Is that because the guest OS's virtio vhost driver is not up to date enough?

Thanks,
Bob

2018-04-20 1:17 GMT+08:00 John Snow <js...@redhat.com>:
> Forwarding to qemu-block.
>
> On 04/19/2018 06:13 AM, Bob Chen wrote:
> > Hi,
> >
> > I was trying to run qemu with spdk, referring to
> > http://www.spdk.io/doc/vhost.html#vhost_qemu_config
> >
> > Everything went well since I had already set up hugepages, vfio, vhost
> > targets, vhost-scsi device (vhost-block was also tested), etc., without
> > errors or warnings reported.
> >
> > But at the last step to run qemu, the process would somehow hang, and its
> > cpu busy remained at 100%.
> >
> > I used `perf top -p` to monitor the process activity:
> >   18.28% [kernel] [k] vmx_vcpu_run
> >    3.06% [kernel] [k] vcpu_enter_guest
> >    3.05% [kernel] [k] system_call_after_swapgs
> >
> > Do you have any ideas about what happened here?
> >
> > Test environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
> > dpdk 17.11.1, qemu 2.12.0-rc2. Detail log is attached within this mail.
>
> Have you tried any other versions? (2.11.1, or 2.12-rc4?)
>
> > Thanks,
> > Bob
[Qemu-devel] [SPDK] qemu process hung at boot-up, no explicit errors or warnings
Hi,

I was trying to run qemu with spdk, referring to http://www.spdk.io/doc/vhost.html#vhost_qemu_config

Everything went well since I had already set up hugepages, vfio, vhost targets, vhost-scsi device (vhost-block was also tested), etc., without errors or warnings reported.

But at the last step to run qemu, the process would somehow hang, and its cpu busy remained at 100%.

I used `perf top -p` to monitor the process activity:
  18.28% [kernel] [k] vmx_vcpu_run
   3.06% [kernel] [k] vcpu_enter_guest
   3.05% [kernel] [k] system_call_after_swapgs

Do you have any ideas about what happened here?

Test environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x, dpdk 17.11.1, qemu 2.12.0-rc2. Detail log is attached within this mail.

Thanks,
Bob

(Attachment: qemu-spdk.log, binary data)
Re: [Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK
I think you can edit the github repo description to tell people not to download releases from that site.

2018-04-18 17:29 GMT+08:00 Peter Maydell <peter.mayd...@linaro.org>:
> On 18 April 2018 at 09:09, Bob Chen <a175818...@gmail.com> wrote:
> > I found that it has nothing to do with the release version, but the github
> > one is just not able to work...
> >
> > So what github and qemu.org provide are totally different things?
>
> The "tarballs" from github don't work and can't work, but
> github provides no mechanism for the project to say "don't
> create this page". You can complain to github about this
> if you like.
>
> You're best off starting with the project's own web
> page: https://www.qemu.org/ -- we use github only as
> a sort of backup git server in case our own is down.
>
> thanks
> -- PMM
Re: [Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK
I found that it has nothing to do with the release version, but the github one is just not able to work...

So what github and qemu.org provide are totally different things?

2018-04-18 14:52 GMT+08:00 Bob Chen <a175818...@gmail.com>:
> No, I downloaded the release tarball from github.
>
> 2018-04-18 14:25 GMT+08:00 Stefan Weil <s...@weilnetz.de>:
>> On 18.04.2018 at 08:19, Bob Chen wrote:
>> > fatal error: ui/input-keymap-atset1-to-qcode.c: No such file or directory
>> >
>> > Built on my CentOS 7.
>>
>> That file is generated during the build. rc4 compiles in my test on
>> Debian. Did you start your build from a fresh git clone?
>>
>> Stefan
Re: [Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK
No, I downloaded the release tarball from github.

2018-04-18 14:25 GMT+08:00 Stefan Weil <s...@weilnetz.de>:
> On 18.04.2018 at 08:19, Bob Chen wrote:
> > fatal error: ui/input-keymap-atset1-to-qcode.c: No such file or directory
> >
> > Built on my CentOS 7.
>
> That file is generated during the build. rc4 compiles in my test on
> Debian. Did you start your build from a fresh git clone?
>
> Stefan
[Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK
fatal error: ui/input-keymap-atset1-to-qcode.c: No such file or directory

Built on my CentOS 7.
Re: [Qemu-devel] [GPU and VFIO] qemu hang at startup, VFIO_IOMMU_MAP_DMA is extremely slow
Ping...

Was it because VFIO_IOMMU_MAP_DMA needs contiguous memory and my host was not able to provide it immediately?

2017-12-26 19:37 GMT+08:00 Bob Chen <a175818...@gmail.com>:
>
> 2017-12-26 18:51 GMT+08:00 Liu, Yi L <yi.l@intel.com>:
>> > -----Original Message-----
>> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel@nongnu.org]
>> > On Behalf Of Bob Chen
>> > Sent: Tuesday, December 26, 2017 6:30 PM
>> > To: qemu-devel@nongnu.org
>> > Subject: [Qemu-devel] [GPU and VFIO] qemu hang at startup,
>> > VFIO_IOMMU_MAP_DMA is extremely slow
>> >
>> > Hi,
>> >
>> > I have a host server with multiple GPU cards, and was assigning them to qemu
>> > with VFIO.
>> >
>> > I found that when setting up the last free GPU, the qemu process would hang
>>
>> Are all the GPUs in the same iommu group?
>
> Each of them is in a single group.
>
>> > there and took almost 10 minutes before finishing startup. I did some digging with
>> > gdb, and found the slowest part occurred at the
>> > hw/vfio/common.c:vfio_dma_map function call.
>>
>> This is to set up the mapping and it takes time. This function would be called multiple
>> times and it will take some time. The slowest part -- do you mean it takes
>> a long time for a single vfio_dma_map() call, or does the whole passthru spend a lot
>> of time on creating mappings? If a single call takes a lot of time, then it may be
>> a problem.
>
> Each vfio_dma_map() takes 3 to 10 mins accordingly.
>
>> You may paste your Qemu command which might help. And the dmesg on the host
>> would also help.
>
> cmd line:
> After adding -device vfio-pci,host=09:00.0,multifunction=on,addr=0x15, qemu would hang.
> Otherwise, it could start immediately without this option.
>
> dmesg:
> [Tue Dec 26 18:39:50 2017] vfio-pci :09:00.0: enabling device (0400 -> 0402)
> [Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap 0x1e@0x258
> [Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap 0x19@0x900
> [Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation wraparound
> [Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation wraparound
> [Tue Dec 26 18:40:03 2017] kvm [74663]: vcpu0 ignored rdmsr: 0x345
>
> Kernel:
> 3.10.0-514.16.1 CentOS 7.3
>
>> > static int vfio_dma_map(VFIOContainer *container, hwaddr iova, ram_addr_t
>> > size, void *vaddr, bool readonly)
>> > {
>> >     ...
>> >     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
>> >         (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
>> >          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>> >         return 0;
>> >     }
>> >     ...
>> > }
>> >
>> > The hang was able to be reproduced on one of my hosts. I was setting up a 4GB
>> > memory VM, while the host still had 16GB free. GPU physical mem is 8G.
>>
>> Does it happen when you only assign a single GPU?
>
> Not sure. Didn't try multiple GPUs.
>
>> > Also, this phenomenon was observed on other hosts occasionally, and the
>> > similarity is that they always happened on the last free GPU.
>> >
>> > Full stack trace file is attached. Looking forward to your help, thanks
>> >
>> > - Bob
>>
>> Regards,
>> Yi L
Re: [Qemu-devel] [GPU and VFIO] qemu hang at startup, VFIO_IOMMU_MAP_DMA is extremely slow
2017-12-26 18:51 GMT+08:00 Liu, Yi L <yi.l@intel.com>:
> > -----Original Message-----
> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel@nongnu.org]
> > On Behalf Of Bob Chen
> > Sent: Tuesday, December 26, 2017 6:30 PM
> > To: qemu-devel@nongnu.org
> > Subject: [Qemu-devel] [GPU and VFIO] qemu hang at startup,
> > VFIO_IOMMU_MAP_DMA is extremely slow
> >
> > Hi,
> >
> > I have a host server with multiple GPU cards, and was assigning them to qemu
> > with VFIO.
> >
> > I found that when setting up the last free GPU, the qemu process would hang
>
> Are all the GPUs in the same iommu group?

Each of them is in a single group.

> > there and took almost 10 minutes before finishing startup. I did some digging with
> > gdb, and found the slowest part occurred at the
> > hw/vfio/common.c:vfio_dma_map function call.
>
> This is to set up the mapping and it takes time. This function would be called multiple
> times and it will take some time. The slowest part -- do you mean it takes
> a long time for a single vfio_dma_map() call, or does the whole passthru spend a lot
> of time on creating mappings? If a single call takes a lot of time, then it may be
> a problem.

Each vfio_dma_map() takes 3 to 10 mins accordingly.

> You may paste your Qemu command which might help. And the dmesg on the host
> would also help.

cmd line:
After adding -device vfio-pci,host=09:00.0,multifunction=on,addr=0x15, qemu would hang. Otherwise, it could start immediately without this option.

dmesg:
[Tue Dec 26 18:39:50 2017] vfio-pci :09:00.0: enabling device (0400 -> 0402)
[Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap 0x1e@0x258
[Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap 0x19@0x900
[Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation wraparound
[Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation wraparound
[Tue Dec 26 18:40:03 2017] kvm [74663]: vcpu0 ignored rdmsr: 0x345

Kernel:
3.10.0-514.16.1 CentOS 7.3

> > static int vfio_dma_map(VFIOContainer *container, hwaddr iova, ram_addr_t
> > size, void *vaddr, bool readonly)
> > {
> >     ...
> >     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> >         (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
> >          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
> >         return 0;
> >     }
> >     ...
> > }
> >
> > The hang was able to be reproduced on one of my hosts. I was setting up a 4GB
> > memory VM, while the host still had 16GB free. GPU physical mem is 8G.
>
> Does it happen when you only assign a single GPU?

Not sure. Didn't try multiple GPUs.

> > Also, this phenomenon was observed on other hosts occasionally, and the
> > similarity is that they always happened on the last free GPU.
> >
> > Full stack trace file is attached. Looking forward to your help, thanks
> >
> > - Bob
>
> Regards,
> Yi L
[Qemu-devel] [GPU and VFIO] qemu hang at startup, VFIO_IOMMU_MAP_DMA is extremely slow
Hi,

I have a host server with multiple GPU cards, and was assigning them to qemu with VFIO.

I found that when setting up the last free GPU, the qemu process would hang there and took almost 10 minutes before finishing startup. I did some digging with gdb, and found the slowest part occurred at the hw/vfio/common.c:vfio_dma_map function call.

static int vfio_dma_map(VFIOContainer *container, hwaddr iova, ram_addr_t size,
                        void *vaddr, bool readonly)
{
    ...
    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
        return 0;
    }
    ...
}

The hang was able to be reproduced on one of my hosts. I was setting up a 4GB memory VM, while the host still had 16GB free. GPU physical mem is 8G.

Also, this phenomenon was observed on other hosts occasionally, and the similarity is that they always happened on the last free GPU.

Full stack trace file is attached. Looking forward to your help, thanks

- Bob

#0  vfio_dma_map (container=0x56b1f880, iova=0, size=655360, vaddr=0x7ffecfe0, readonly=false) at /usr/src/debug/qemu-2.6.2/hw/vfio/common.c:227
        map = {argsz = 655359, flags = 0, vaddr = 0, iova = 140737488339248, size = 93824994141691}
#1  0x557712dc in vfio_listener_region_add (listener=0x56b1f890, section=0x7fffc1f0) at /usr/src/debug/qemu-2.6.2/hw/vfio/common.c:419
        container = 0x56b1f880
        iova = 0
        end = 655359
        llend = {lo = 655360, hi = 0}
        llsize = {lo = 655360, hi = 0}
        vaddr = 0x7ffecfe0
        ret = 7
        __func__ = "vfio_listener_region_add"
#2  0x55728465 in listener_add_address_space (listener=0x56b1f890, as=0x560823e0) at /usr/src/debug/qemu-2.6.2/memory.c:2179
        section = {mr = 0x566ec570, address_space = 0x560823e0, offset_within_region = 0, size = {lo = 655360, hi = 0}, offset_within_address_space = 0, readonly = false}
        view = 0x57ae3bd0
        fr = 0x566f0c00
#3  0x5572860d in memory_listener_register (listener=0x56b1f890, filter=0x560823e0) at /usr/src/debug/qemu-2.6.2/memory.c:2208
        other = 0x565b4910
        as = 0x560823e0
#4  0x55772811 in vfio_connect_container (group=0x5784bce0, as=0x560823e0) at /usr/src/debug/qemu-2.6.2/hw/vfio/common.c:900
        container = 0x56b1f880
        ret = 0
        fd = 35
        space = 0x5784bd20
#5  0x55772cbc in vfio_get_group (groupid=25, as=0x560823e0) at /usr/src/debug/qemu-2.6.2/hw/vfio/common.c:1008
        group = 0x5784bce0
        path = "/dev/vfio/25\000U\000\000P\303\377\377\377\177\000\000\332Q\224UUU\000"
        status = {argsz = 8, flags = 1}
#6  0x5577af5c in vfio_initfn (pdev=0x581672b0) at /usr/src/debug/qemu-2.6.2/hw/vfio/pci.c:2447
        vdev = 0x581672b0
        vbasedev_iter = 0x40b
        group = 0x55bbc65d
        tmp = 0x57640b60 ""
        group_path = "../../../../../../kernel/iommu_groups/25\000\000\000\000\343\003\000\000\031ĻUUU\000\000\000\000\000\000\000\000\000\000\220\304\377\377\377\177\000\000]ƻU\a\000\000\000\320ɻUUU\000\000\360\304\377\377\v\004\000\000\300\305\377\377\377\177\000\000I\252\260UUU\000\000\360\304\377\377\377\1- 77\000\000\000\000\000\000\000\000\320\304\377\377\377\177\000\000]ƻUUU\000\000\260ɻUUU\000\000f˲U\343\003\000\000\241:\000\000\000\200\377\377\002", '\000' , "\060\000\000\000[\000\000\000`\305\377\377\377\177"...
        group_name = 0x7fffc466 "25"
        len = 40
        st = {st_dev = 17, st_ino = 39127, st_nlink = 3, st_mode = 16877, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1513939417, tv_nsec = 943657386}, st_mtim = {tv_sec = 1510113186, tv_nsec = 59601}, st_ctim = {tv_sec = 1510113186, tv_nsec = 59601}, __unused = {0, 0, 0}}
        groupid = 25
        ret = 21845
#7  0x55943b65 in pci_default_realize (dev=0x581672b0, errp=0x7fffd4b8) at hw/pci/pci.c:1895
        pc = 0x56568e70
        __func__ = "pci_default_realize"
#8  0x55943a08 in pci_qdev_realize (qdev=0x581672b0, errp=0x7fffd520) at hw/pci/pci.c:1867
        pci_dev = 0x581672b0
        pc = 0x56568e70
        __func__ = "pci_qdev_realize"
        local_err = 0x0
        bus = 0x569baea0
        is_default_rom = false
#9  0x558af8da in device_set_realized (obj=0x581672b0, value=true, errp=0x7fffd6e0) at hw/core/qdev.c:1066
        dev = 0x581672b0
        __func__ = "device_set_realized"
        dc = 0x56568e70
        hotplug_ctrl = 0x55af83cf
        bus = 0x7fffd5c7
        local_err = 0x0
#10 0x55a3754d in property_set_bool (obj=0x581672b0, v=0x565a9140, name=0x55b494e9
Re: [Qemu-devel] About virtio device hotplug in Q35! [External mail, read with caution]
Hi,

After 3 months of work and investigation, and tedious mail discussions with Nvidia, I think some progress has been made in terms of GPUDirect (p2p) in a virtual environment.

The only remaining issue, then, is the low bidirectional bandwidth between two sibling GPUs under the same PCIe switch. We expanded the tests to run on even more GPU cards, so the results seem conclusive now.

P40 is OK, and its hardware topology on the host is:

\-[:00]-+-00.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
        +-01.0-[03]00.0 LSI Logic / Symbios Logic MegaRAID SAS-3 3008 [Fury]
        +-02.0-[04]00.0 NVIDIA Corporation GP102GL [Tesla P40]
        +-03.0-[02]00.0 NVIDIA Corporation GP102GL [Tesla P40]

M60, not OK, low bandwidth:

\-[:00]-+-00.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
        +-01.0-[06]00.0 LSI Logic / Symbios Logic MegaRAID SAS-3 3008 [Fury]
        +-02.0-[07-0a]00.0-[08-0a]--+-08.0-[09]00.0 NVIDIA Corporation GM204GL [Tesla M60]
        |                          \-10.0-[0a]00.0 NVIDIA Corporation GM204GL [Tesla M60]

V100, not OK, low bandwidth:

\-[:00]-+-00.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
        +-01.0-[01]--+-00.0 Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
        |            \-00.1 Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
        +-02.0-[02-05]00.0-[03-05]--+-08.0-[04]00.0 NVIDIA Corporation GV100 [Tesla V100 PCIe]
        |                          \-10.0-[05]00.0 NVIDIA Corporation GV100 [Tesla V100 PCIe]

So what might be the actual effect of the PLX switch hardware on GPU data flow, even though it is not visible in the guest OS?

Nvidia tech-support guys are not familiar with virtualization. They asked us to consult the community first.
Re: [Qemu-devel] [PATCH 0/3] vfio/pci: Add NVIDIA GPUDirect P2P clique support
It's a mistake, please ignore. This patch is able to work.

2017-10-26 18:45 GMT+08:00 Bob Chen <a175818...@gmail.com>:
> There seem to be some bugs in these patches, causing my VM to fail to boot.
>
> Test case:
>
> 0. Merge these 3 patches into release 2.10.1
>
> 1. qemu-system-x86_64_2.10.1 ... \
>      -device vfio-pci,host=04:00.0 \
>      -device vfio-pci,host=05:00.0 \
>      -device vfio-pci,host=08:00.0 \
>      -device vfio-pci,host=09:00.0 \
>      -device vfio-pci,host=85:00.0 \
>      -device vfio-pci,host=86:00.0 \
>      -device vfio-pci,host=89:00.0 \
>      -device vfio-pci,host=8a:00.0 ...
>
> The guest was able to boot up.
>
> 2. qemu-system-x86_64_2.10.1 ... \
>      -device vfio-pci,host=04:00.0,x-nv-gpudirect-clique=0 \
>      -device vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0 \
>      -device vfio-pci,host=08:00.0,x-nv-gpudirect-clique=0 \
>      -device vfio-pci,host=09:00.0,x-nv-gpudirect-clique=0 \
>      -device vfio-pci,host=85:00.0,x-nv-gpudirect-clique=8 \
>      -device vfio-pci,host=86:00.0,x-nv-gpudirect-clique=8 \
>      -device vfio-pci,host=89:00.0,x-nv-gpudirect-clique=8 \
>      -device vfio-pci,host=8a:00.0,x-nv-gpudirect-clique=8 \
>
> Hang. VNC couldn't connect.
>
> My personal patch used to work, although it was done by straightforward
> hacking and is not that friendly to read.
>
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ static int vfio_initfn(PCIDevice *pdev)
>
>     vfio_add_emulated_long(vdev, 0xc8, 0x50080009, ~0);
>     if (count < 4) {
>         vfio_add_emulated_long(vdev, 0xcc, 0x5032, ~0);
>     } else {
>         vfio_add_emulated_long(vdev, 0xcc, 0x00085032, ~0);
>     }
>     vfio_add_emulated_word(vdev, 0x78, 0xc810, ~0);
>
> 2017-08-30 6:05 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
>> NVIDIA has a specification for exposing a virtual vendor capability
>> which provides a hint to guest drivers as to which sets of GPUs can
>> support direct peer-to-peer DMA. Devices with the same clique ID are
>> expected to support this. The user can specify a clique ID for an
>> NVIDIA graphics device using the new vfio-pci x-nv-gpudirect-clique=
>> option, where valid clique IDs are a 4-bit integer. It's entirely the
>> user's responsibility to specify sets of devices for which P2P works
>> correctly and provides some benefit. This is only useful for DMA
>> between NVIDIA GPUs, therefore it's only useful to specify cliques
>> comprised of more than one GPU. Furthermore, this does not enable DMA
>> between VMs, there is no change to VM DMA mapping, this only exposes
>> hints about existing DMA paths to the guest driver. Thanks,
>>
>> Alex
>>
>> ---
>>
>> Alex Williamson (3):
>>   vfio/pci: Do not unwind on error
>>   vfio/pci: Add virtual capabilities quirk infrastructure
>>   vfio/pci: Add NVIDIA GPUDirect Cliques support
>>
>>  hw/vfio/pci-quirks.c | 114 ++
>>  hw/vfio/pci.c        |  17 +++
>>  hw/vfio/pci.h        |   4 ++
>>  3 files changed, 133 insertions(+), 2 deletions(-)
Re: [Qemu-devel] [PATCH 0/3] vfio/pci: Add NVIDIA GPUDirect P2P clique support
There seem to be some bugs in these patches, causing my VM to fail to boot.

Test case:

0. Merge these 3 patches into release 2.10.1

1. qemu-system-x86_64_2.10.1 ... \
     -device vfio-pci,host=04:00.0 \
     -device vfio-pci,host=05:00.0 \
     -device vfio-pci,host=08:00.0 \
     -device vfio-pci,host=09:00.0 \
     -device vfio-pci,host=85:00.0 \
     -device vfio-pci,host=86:00.0 \
     -device vfio-pci,host=89:00.0 \
     -device vfio-pci,host=8a:00.0 ...

The guest was able to boot up.

2. qemu-system-x86_64_2.10.1 ... \
     -device vfio-pci,host=04:00.0,x-nv-gpudirect-clique=0 \
     -device vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0 \
     -device vfio-pci,host=08:00.0,x-nv-gpudirect-clique=0 \
     -device vfio-pci,host=09:00.0,x-nv-gpudirect-clique=0 \
     -device vfio-pci,host=85:00.0,x-nv-gpudirect-clique=8 \
     -device vfio-pci,host=86:00.0,x-nv-gpudirect-clique=8 \
     -device vfio-pci,host=89:00.0,x-nv-gpudirect-clique=8 \
     -device vfio-pci,host=8a:00.0,x-nv-gpudirect-clique=8 \

Hang. VNC couldn't connect.

My personal patch used to work, although it was done by straightforward hacking and is not that friendly to read.

--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ static int vfio_initfn(PCIDevice *pdev)

    vfio_add_emulated_long(vdev, 0xc8, 0x50080009, ~0);
    if (count < 4) {
        vfio_add_emulated_long(vdev, 0xcc, 0x5032, ~0);
    } else {
        vfio_add_emulated_long(vdev, 0xcc, 0x00085032, ~0);
    }
    vfio_add_emulated_word(vdev, 0x78, 0xc810, ~0);

2017-08-30 6:05 GMT+08:00 Alex Williamson:
> NVIDIA has a specification for exposing a virtual vendor capability
> which provides a hint to guest drivers as to which sets of GPUs can
> support direct peer-to-peer DMA. Devices with the same clique ID are
> expected to support this. The user can specify a clique ID for an
> NVIDIA graphics device using the new vfio-pci x-nv-gpudirect-clique=
> option, where valid clique IDs are a 4-bit integer. It's entirely the
> user's responsibility to specify sets of devices for which P2P works
> correctly and provides some benefit. This is only useful for DMA
> between NVIDIA GPUs, therefore it's only useful to specify cliques
> comprised of more than one GPU. Furthermore, this does not enable DMA
> between VMs, there is no change to VM DMA mapping, this only exposes
> hints about existing DMA paths to the guest driver. Thanks,
>
> Alex
>
> ---
>
> Alex Williamson (3):
>   vfio/pci: Do not unwind on error
>   vfio/pci: Add virtual capabilities quirk infrastructure
>   vfio/pci: Add NVIDIA GPUDirect Cliques support
>
>  hw/vfio/pci-quirks.c | 114 ++
>  hw/vfio/pci.c        |  17 +++
>  hw/vfio/pci.h        |   4 ++
>  3 files changed, 133 insertions(+), 2 deletions(-)
Re: [Qemu-devel] About virtio device hotplug in Q35!
More updates:

1. This behavior was found not only on the M60, but also on the 1080 Ti and TITAN Xp.

2. When the p2p capability is not set up, i.e. running the original QEMU with GPUs attached to the root PCIe bus, the LnkSta on the host always remains at 8 GT/s. I don't know why the new p2p change causes the GPU driver in the guest to re-negotiate its speed.

I think this has gone beyond the community's responsibility to debug. So I have contacted NVIDIA for technical support, and they are expected to send me a reply in the next few weeks. Will keep you guys updated.

Bob

2017-08-31 0:43 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
> On Wed, 30 Aug 2017 17:41:20 +0800
> Bob Chen <a175818...@gmail.com> wrote:
>
> > I think I have observed what you said...
> >
> > The link speed on host remained 8GT/s until I finished running
> > p2pBandwidthLatencyTest for the first time. Then it became 2.5GT/s...
>
> # lspci -s 09:00.0 -vvv
> ...
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt-
>
> So long as the device renegotiates to 8GT/s under load rather than
> getting stuck at 2.5GT/s, I think this is the expected behavior. This
> is a power saving measure by the driver. Thanks,
>
> Alex
Re: [Qemu-devel] About virtio device hotplug in Q35!
I think I have observed what you said...

The link speed on the host remained at 8 GT/s until I finished running p2pBandwidthLatencyTest for the first time. Then it became 2.5 GT/s...

# lspci -s 09:00.0 -vvv
09:00.0 3D controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
        Subsystem: NVIDIA Corporation Device 115e
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024
        Capabilities: [900 v1] #19
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau
Re: [Qemu-devel] About virtio device hotplug in Q35!
The topology already has all GPUs directly attached to root bus 0. In this configuration, the LnkSta attribute is not visible in any capability. The alternative, using emulated switches, does show this attribute at 8 GT/s, although the real bandwidth is as low as before.

2017-08-23 2:06 GMT+08:00 Michael S. Tsirkin <m...@redhat.com>:
> On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote:
> > On Tue, 22 Aug 2017 15:04:55 +0800
> > Bob Chen <a175818...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I got a spec from Nvidia which illustrates how to enable GPU p2p in
> > > virtualization environment. (See attached)
> >
> > Neat, looks like we should implement a new QEMU vfio-pci option,
> > something like nvidia-gpudirect-p2p-id=. I don't think I'd want to
> > code the policy of where to enable it into QEMU or the kernel, so we'd
> > push it up to management layers or users to decide.
> >
> > > The key is to append the legacy pci capabilities list when setting up the
> > > hypervisor, with a Nvidia customized capability config.
> > >
> > > I added some hack in hw/vfio/pci.c and managed to implement that.
> > >
> > > Then I found the GPU was able to recognize its peer, and the latency has
> > > dropped. ✅
> > >
> > > However the bandwidth didn't improve, but decreased instead. ❌
> > >
> > > Any suggestions?
> >
> > What's the VM topology? I've found that in a Q35 configuration with
> > GPUs downstream of an emulated root port, the NVIDIA driver in the
> > guest will downshift the physical link rate to 2.5GT/s and never
> > increase it back to 8GT/s. I believe this is because the virtual
> > downstream port only advertises Gen1 link speeds.
>
> Fixing that would be nice, and it's great that you now actually have a
> reproducer that can be used to test it properly.
>
> Exposing higher link speeds is a bit of work since there are now all
> kind of corner cases to cover as guests may play with link speeds and we
> must pretend we change it accordingly. An especially interesting
> question is what to do with the assigned device when guest tries to play
> with port link speed. It's kind of similar to AER in that respect.
>
> I guess we can just ignore it for starters.
>
> > If the GPUs are on
> > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s
> > when the GPU is idle and upshift to 8GT/s under load. This also
> > happens if the GPU is exposed in a conventional PCI topology to the
> > VM. Another interesting data point is that an older Kepler GRID card
> > does not have this issue, dynamically shifting the link speed under
> > load regardless of the VM PCI/e topology, while a new M60 using the
> > same driver experiences this problem. I've filed a bug with NVIDIA as
> > this seems to be a regression, but it appears (untested) that the
> > hypervisor should take the approach of exposing full, up-to-date PCIe
> > link capabilities and report a link status matching the downstream
> > devices.
> >
> > I'd suggest during your testing, watch lspci info for the GPU from the
> > host, noting the behavior of LnkSta (Link Status) to check if the
> > devices gets stuck at 2.5GT/s in your VM configuration and adjust the
> > topology until it works, likely placing the GPUs on pcie.0 for a Q35
> > based machine. Thanks,
> >
> > Alex
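Alex's suggestion to watch LnkSta from the host can be scripted. Below is a small, unofficial helper that pulls the negotiated speed and width out of `lspci -vvv` output; the regex is an assumption about lspci's formatting, based only on the sample line quoted earlier in this thread, and the `query_link` wrapper is a hypothetical convenience that needs root to see the full capability dump.

```python
import re
import subprocess

# Matches e.g. "LnkSta: Speed 8GT/s, Width x16, TrErr- Train- ..."
LNKSTA_RE = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s,\s*Width\s+x(\d+)")

def parse_lnksta(lspci_output: str):
    """Return (speed_gts, width) parsed from 'lspci -vvv' text, or None."""
    m = LNKSTA_RE.search(lspci_output)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))

def query_link(bdf: str):
    """Query one device by bus:dev.fn, e.g. query_link("09:00.0")."""
    out = subprocess.run(["lspci", "-s", bdf, "-vvv"],
                         capture_output=True, text=True).stdout
    return parse_lnksta(out)

sample = "LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+"
print(parse_lnksta(sample))  # → (8.0, 16)
```

Polling this in a loop while running p2pBandwidthLatencyTest in the guest would show whether the physical link upshifts under load or stays stuck at 2.5 GT/s.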
Re: [Qemu-devel] About virtio device hotplug in Q35!
Plus: 1 GB hugepages neither improved bandwidth nor latency. Results remained the same. 2017-08-08 9:44 GMT+08:00 Bob Chen <a175818...@gmail.com>: > 1. How to test the KVM exit rate? > > 2. The switches are separate devices of PLX Technology > > # lspci -s 07:08.0 -nn > 07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port > PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca) > > # This is one of the Root Ports in the system. > [:00]-+-00.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon > D DMI2 > +-01.0-[01]00.0 LSI Logic / Symbios Logic MegaRAID SAS > 2208 [Thunderbolt] > +-02.0-[02-05]-- > +-03.0-[06-09]00.0-[07-09]--+-08.0-[08]--+-00.0 NVIDIA > Corporation GP102 [TITAN Xp] > | |\-00.1 NVIDIA > Corporation GP102 HDMI Audio Controller > | \-10.0-[09]--+-00.0 NVIDIA > Corporation GP102 [TITAN Xp] > |\-00.1 NVIDIA > Corporation GP102 HDMI Audio Controller > > > > > 3. ACS > > It seemed that I had misunderstood your point? I finally found ACS > information on switches, not on GPUs. > > Capabilities: [f24 v1] Access Control Services > ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ > EgressCtrl+ DirectTrans+ > ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ > EgressCtrl- DirectTrans- > > > > 2017-08-07 23:52 GMT+08:00 Alex Williamson <alex.william...@redhat.com>: > >> On Mon, 7 Aug 2017 21:00:04 +0800 >> Bob Chen <a175818...@gmail.com> wrote: >> >> > Bad news... The performance had dropped dramatically when using emulated >> > switches. 
>> > >> > I was referring to the PCIe doc at >> > https://github.com/qemu/qemu/blob/master/docs/pcie.txt >> > >> > # qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine >> > q35,accel=kvm -nodefaults -nodefconfig \ >> > -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \ >> > -device x3130-upstream,id=upstream_port1,bus=root_port1 \ >> > -device >> > xio3130-downstream,id=downstream_port1,bus=upstream_port1, >> chassis=11,slot=11 >> > \ >> > -device >> > xio3130-downstream,id=downstream_port2,bus=upstream_port1, >> chassis=12,slot=12 >> > \ >> > -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \ >> > -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \ >> > -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \ >> > -device x3130-upstream,id=upstream_port2,bus=root_port2 \ >> > -device >> > xio3130-downstream,id=downstream_port3,bus=upstream_port2, >> chassis=21,slot=21 >> > \ >> > -device >> > xio3130-downstream,id=downstream_port4,bus=upstream_port2, >> chassis=22,slot=22 >> > \ >> > -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \ >> > -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \ >> > ... >> > >> > >> > Not 8 GPUs this time, only 4. >> > >> > *1. 
Attached to pcie bus directly (former situation):* >> > >> > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s) >> >D\D 0 1 2 3 >> > 0 420.93 10.03 11.07 11.09 >> > 1 10.04 425.05 11.08 10.97 >> > 2 11.17 11.17 425.07 10.07 >> > 3 11.25 11.25 10.07 423.64 >> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s) >> >D\D 0 1 2 3 >> > 0 425.98 10.03 11.07 11.09 >> > 1 9.99 426.43 11.07 11.07 >> > 2 11.04 11.20 425.98 9.89 >> > 3 11.21 11.21 10.06 425.97 >> > Bidirectional P2P=Disabled Bandwidth Matrix (GB/s) >> >D\D 0 1 2 3 >> > 0 430.67 10.45 19.59 19.58 >> > 1 10.44 428.81 19.49 19.53 >> > 2 19.62 19.62 429.52 10.57 >> > 3 19.60 19.66 10.43 427.38 >> > Bidirectional P2P=Enabled Bandwidth Matrix (GB/s) >> >D\D 0 1 2 3 >> > 0 429.47 10.47 19.52 19.39 >> > 1 10.48 427.15 19.64 19.52 >> > 2 19.64 19.59 429.02 10.42 >> > 3 19.60 19.64 10.47 427.81 >> > P2P=Disabled Latency Matrix (us) >> >D\D 0 1 2 3 >> > 0 4.50 13.72 14.49 14.44 >> > 1 13.65 4.53 14.52 14.3
Re: [Qemu-devel] About virtio device hotplug in Q35!
1. How to test the KVM exit rate? 2. The switches are separate devices of PLX Technology # lspci -s 07:08.0 -nn 07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca) # This is one of the Root Ports in the system. [:00]-+-00.0 Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2 +-01.0-[01]00.0 LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] +-02.0-[02-05]-- +-03.0-[06-09]00.0-[07-09]--+-08.0-[08]--+-00.0 NVIDIA Corporation GP102 [TITAN Xp] | |\-00.1 NVIDIA Corporation GP102 HDMI Audio Controller | \-10.0-[09]--+-00.0 NVIDIA Corporation GP102 [TITAN Xp] |\-00.1 NVIDIA Corporation GP102 HDMI Audio Controller 3. ACS It seemed that I had misunderstood your point? I finally found ACS information on switches, not on GPUs. Capabilities: [f24 v1] Access Control Services ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+ ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- 2017-08-07 23:52 GMT+08:00 Alex Williamson <alex.william...@redhat.com>: > On Mon, 7 Aug 2017 21:00:04 +0800 > Bob Chen <a175818...@gmail.com> wrote: > > > Bad news... The performance had dropped dramatically when using emulated > > switches. 
> > > > I was referring to the PCIe doc at > > https://github.com/qemu/qemu/blob/master/docs/pcie.txt > > > > # qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine > > q35,accel=kvm -nodefaults -nodefconfig \ > > -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \ > > -device x3130-upstream,id=upstream_port1,bus=root_port1 \ > > -device > > xio3130-downstream,id=downstream_port1,bus=upstream_ > port1,chassis=11,slot=11 > > \ > > -device > > xio3130-downstream,id=downstream_port2,bus=upstream_ > port1,chassis=12,slot=12 > > \ > > -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \ > > -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \ > > -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \ > > -device x3130-upstream,id=upstream_port2,bus=root_port2 \ > > -device > > xio3130-downstream,id=downstream_port3,bus=upstream_ > port2,chassis=21,slot=21 > > \ > > -device > > xio3130-downstream,id=downstream_port4,bus=upstream_ > port2,chassis=22,slot=22 > > \ > > -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \ > > -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \ > > ... > > > > > > Not 8 GPUs this time, only 4. > > > > *1. 
Attached to pcie bus directly (former situation):* > > > > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s) > >D\D 0 1 2 3 > > 0 420.93 10.03 11.07 11.09 > > 1 10.04 425.05 11.08 10.97 > > 2 11.17 11.17 425.07 10.07 > > 3 11.25 11.25 10.07 423.64 > > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s) > >D\D 0 1 2 3 > > 0 425.98 10.03 11.07 11.09 > > 1 9.99 426.43 11.07 11.07 > > 2 11.04 11.20 425.98 9.89 > > 3 11.21 11.21 10.06 425.97 > > Bidirectional P2P=Disabled Bandwidth Matrix (GB/s) > >D\D 0 1 2 3 > > 0 430.67 10.45 19.59 19.58 > > 1 10.44 428.81 19.49 19.53 > > 2 19.62 19.62 429.52 10.57 > > 3 19.60 19.66 10.43 427.38 > > Bidirectional P2P=Enabled Bandwidth Matrix (GB/s) > >D\D 0 1 2 3 > > 0 429.47 10.47 19.52 19.39 > > 1 10.48 427.15 19.64 19.52 > > 2 19.64 19.59 429.02 10.42 > > 3 19.60 19.64 10.47 427.81 > > P2P=Disabled Latency Matrix (us) > >D\D 0 1 2 3 > > 0 4.50 13.72 14.49 14.44 > > 1 13.65 4.53 14.52 14.33 > > 2 14.22 13.82 4.52 14.50 > > 3 13.87 13.75 14.53 4.55 > > P2P=Enabled Latency Matrix (us) > >D\D 0 1 2 3 > > 0 4.44 13.56 14.58 14.45 > > 1 13.56 4.48 14.39 14.45 > > 2 13.85 13.93 4.86 14.80 > > 3 14.51 14.23 14.70 4.72 > > > > > > *2. Attached to emulated Root Port and Switches:* > > > > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s) > >D\D 0 1 2 3 > > 0 420.48 3.15 3.12 3.12 > > 1 3.13 422.
Re: [Qemu-devel] About virtio device hotplug in Q35!
Besides, I checked the lspci -vvv output, no capabilities of Access Control are seen. 2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.william...@redhat.com>: > On Tue, 1 Aug 2017 17:35:40 +0800 > Bob Chen <a175818...@gmail.com> wrote: > > > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.william...@redhat.com>: > > > > > On Tue, 1 Aug 2017 13:04:46 +0800 > > > Bob Chen <a175818...@gmail.com> wrote: > > > > > > > Hi, > > > > > > > > This is a sketch of my hardware topology. > > > > > > > > CPU0 <- QPI ->CPU1 > > > >| | > > > > Root Port(at PCIe.0)Root Port(at PCIe.1) > > > >/\ / \ > > > > > > Are each of these lines above separate root ports? ie. each root > > > complex hosts two root ports, each with a two-port switch downstream of > > > it? > > > > > > > Not quite sure if root complex is a concept or a real physical device ... > > > > But according to my observation by `lspci -vt`, there are indeed 4 Root > > Ports in the system. So the sketch might need a tiny update. > > > > > > CPU0 <- QPI ->CPU1 > > > >| | > > > > Root Complex(device?) Root Complex(device?) > > > > /\ /\ > > > > Root Port Root Port Root Port Root Port > > > >/\ /\ > > > > SwitchSwitch SwitchSwitch > > > > / \ / \ / \ / \ > > > >GPU GPU GPU GPU GPU GPU GPU GPU > > > Yes, that's what I expected. So the numbers make sense, the immediate > sibling GPU would share bandwidth between the root port and upstream > switch port, any other GPU should not double-up on any single link. > > > > > SwitchSwitch SwitchSwitch > > > > / \ / \ / \/\ > > > >GPU GPU GPU GPU GPU GPU GPU GPU > > > > > > > > > > > > And below are the p2p bandwidth test results. 
> > > > > > > > Host: > > > >D\D 0 1 2 3 4 5 6 7 > > > > 0 426.91 25.32 19.72 19.72 19.69 19.68 19.75 19.66 > > > > 1 25.31 427.61 19.74 19.72 19.66 19.68 19.74 19.73 > > > > 2 19.73 19.73 429.49 25.33 19.66 19.74 19.73 19.74 > > > > 3 19.72 19.71 25.36 426.68 19.70 19.71 19.77 19.74 > > > > 4 19.72 19.72 19.73 19.75 425.75 25.33 19.72 19.71 > > > > 5 19.71 19.75 19.76 19.75 25.35 428.11 19.69 19.70 > > > > 6 19.76 19.72 19.79 19.78 19.73 19.74 425.75 25.35 > > > > 7 19.69 19.75 19.79 19.75 19.72 19.72 25.39 427.15 > > > > > > > > VM: > > > >D\D 0 1 2 3 4 5 6 7 > > > > 0 427.38 10.52 18.99 19.11 19.75 19.62 19.75 19.71 > > > > 1 10.53 426.68 19.28 19.19 19.73 19.71 19.72 19.73 > > > > 2 18.88 19.30 426.92 10.48 19.66 19.71 19.67 19.68 > > > > 3 18.93 19.18 10.45 426.94 19.69 19.72 19.67 19.72 > > > > 4 19.60 19.66 19.69 19.70 428.13 10.49 19.40 19.57 > > > > 5 19.52 19.74 19.72 19.69 10.44 426.45 19.68 19.61 > > > > 6 19.63 19.50 19.72 19.64 19.59 19.66 426.91 10.47 > > > > 7 19.69 19.75 19.70 19.69 19.66 19.74 10.45 426.23 > > > > > > Interesting test, how do you get these numbers? What are the units, > > > GB/s? > > > > > > > > > > > A p2pBandwidthLatencyTest from Nvidia CUDA sample code. Units are > > GB/s. Asynchronous read and write. Bidirectional. > > > > However, the Unidirectional test had shown a different result. Didn't > fall > > down to a half. > > > > VM: > > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s) > >D\D 0 1 2 3 4 5 6 7 > > 0 424.07 10.02 11.33 11.30 11.09 11.05 11.06 11.10 > > 1 10.05 425.98 11.40 11.33 11.08 11.10 11.13 11.09 > > 2 11.31 11.28 423.67 10.10 11.14 11.13 11.13 11.11 > > 3 11.30 11.31 10.08 425.05 11.10 11.07
Re: [Qemu-devel] About virtio device hotplug in Q35!
Bad news... The performance dropped dramatically when using emulated switches.

I was referring to the PCIe doc at
https://github.com/qemu/qemu/blob/master/docs/pcie.txt

# qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine q35,accel=kvm -nodefaults -nodefconfig \
  -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
  -device x3130-upstream,id=upstream_port1,bus=root_port1 \
  -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=11,slot=11 \
  -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=12,slot=12 \
  -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
  -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
  -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
  -device x3130-upstream,id=upstream_port2,bus=root_port2 \
  -device xio3130-downstream,id=downstream_port3,bus=upstream_port2,chassis=21,slot=21 \
  -device xio3130-downstream,id=downstream_port4,bus=upstream_port2,chassis=22,slot=22 \
  -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
  -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
  ...

Not 8 GPUs this time, only 4.

1. Attached to the PCIe bus directly (former situation):

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 420.93  10.03  11.07  11.09
     1  10.04 425.05  11.08  10.97
     2  11.17  11.17 425.07  10.07
     3  11.25  11.25  10.07 423.64
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 425.98  10.03  11.07  11.09
     1   9.99 426.43  11.07  11.07
     2  11.04  11.20 425.98   9.89
     3  11.21  11.21  10.06 425.97
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 430.67  10.45  19.59  19.58
     1  10.44 428.81  19.49  19.53
     2  19.62  19.62 429.52  10.57
     3  19.60  19.66  10.43 427.38
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 429.47  10.47  19.52  19.39
     1  10.48 427.15  19.64  19.52
     2  19.64  19.59 429.02  10.42
     3  19.60  19.64  10.47 427.81
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.50  13.72  14.49  14.44
     1  13.65   4.53  14.52  14.33
     2  14.22  13.82   4.52  14.50
     3  13.87  13.75  14.53   4.55
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.44  13.56  14.58  14.45
     1  13.56   4.48  14.39  14.45
     2  13.85  13.93   4.86  14.80
     3  14.51  14.23  14.70   4.72

2. Attached to emulated Root Port and Switches:

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 420.48   3.15   3.12   3.12
     1   3.13 422.31   3.12   3.12
     2   3.08   3.09 421.40   3.13
     3   3.10   3.10   3.13 418.68
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 418.68   3.14   3.12   3.12
     1   3.15 420.03   3.12   3.12
     2   3.11   3.10 421.39   3.14
     3   3.11   3.08   3.13 419.13
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 424.36   5.36   5.35   5.34
     1   5.36 424.36   5.34   5.34
     2   5.35   5.36 425.52   5.35
     3   5.36   5.36   5.34 425.29
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 422.98   5.35   5.35   5.35
     1   5.35 423.44   5.34   5.33
     2   5.35   5.35 425.29   5.35
     3   5.35   5.34   5.34 423.21
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.79  16.59  16.38  16.22
     1  16.62   4.77  16.35  16.69
     2  16.77  16.66   4.03  16.68
     3  16.54  16.56  16.78   4.08
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.51  16.56  16.58  16.66
     1  15.65   3.87  16.74  16.61
     2  16.59  16.81   3.96  16.70
     3  16.47  16.28  16.68   4.03

Is it because the heavy load of CPU emulation has caused a bottleneck?

2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
> On Tue, 1 Aug 2017 17:35:40 +0800
> Bob Chen <a175818...@gmail.com> wrote:
>
> > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
> >
> > > On Tue, 1 Aug 2017 13:04:46 +0800
> > > Bob Chen <a175818...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a sketch of my hardware topology.
> > > >
> > > >          CPU0  <- QPI ->  CPU1
> > > >           |                |
> > > >  Root Port (at PCIe.0)   Root Port (at PCIe.1)
> > > >     /        \              /        \
> > >
> > > Are each of these lines above separate root ports? ie. each root
> > > complex hosts two root ports, each with a two-port switch downstream of
> > > it?
Re: [Qemu-devel] About virtio device hotplug in Q35!
2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.william...@redhat.com>: > On Tue, 1 Aug 2017 13:04:46 +0800 > Bob Chen <a175818...@gmail.com> wrote: > > > Hi, > > > > This is a sketch of my hardware topology. > > > > CPU0 <- QPI ->CPU1 > >| | > > Root Port(at PCIe.0)Root Port(at PCIe.1) > >/\ / \ > > Are each of these lines above separate root ports? ie. each root > complex hosts two root ports, each with a two-port switch downstream of > it? > Not quite sure if root complex is a concept or a real physical device ... But according to my observation by `lspci -vt`, there are indeed 4 Root Ports in the system. So the sketch might need a tiny update. CPU0 <- QPI ->CPU1 | | Root Complex(device?) Root Complex(device?) /\ /\ Root Port Root Port Root Port Root Port /\ /\ SwitchSwitch SwitchSwitch / \ / \ / \ / \ GPU GPU GPU GPU GPU GPU GPU GPU > > > SwitchSwitch SwitchSwitch > > / \ / \ / \/\ > >GPU GPU GPU GPU GPU GPU GPU GPU > > > > > > And below are the p2p bandwidth test results. > > > > Host: > >D\D 0 1 2 3 4 5 6 7 > > 0 426.91 25.32 19.72 19.72 19.69 19.68 19.75 19.66 > > 1 25.31 427.61 19.74 19.72 19.66 19.68 19.74 19.73 > > 2 19.73 19.73 429.49 25.33 19.66 19.74 19.73 19.74 > > 3 19.72 19.71 25.36 426.68 19.70 19.71 19.77 19.74 > > 4 19.72 19.72 19.73 19.75 425.75 25.33 19.72 19.71 > > 5 19.71 19.75 19.76 19.75 25.35 428.11 19.69 19.70 > > 6 19.76 19.72 19.79 19.78 19.73 19.74 425.75 25.35 > > 7 19.69 19.75 19.79 19.75 19.72 19.72 25.39 427.15 > > > > VM: > >D\D 0 1 2 3 4 5 6 7 > > 0 427.38 10.52 18.99 19.11 19.75 19.62 19.75 19.71 > > 1 10.53 426.68 19.28 19.19 19.73 19.71 19.72 19.73 > > 2 18.88 19.30 426.92 10.48 19.66 19.71 19.67 19.68 > > 3 18.93 19.18 10.45 426.94 19.69 19.72 19.67 19.72 > > 4 19.60 19.66 19.69 19.70 428.13 10.49 19.40 19.57 > > 5 19.52 19.74 19.72 19.69 10.44 426.45 19.68 19.61 > > 6 19.63 19.50 19.72 19.64 19.59 19.66 426.91 10.47 > > 7 19.69 19.75 19.70 19.69 19.66 19.74 10.45 426.23 > > Interesting test, how do you get these numbers? 
What are the units, > GB/s? > A p2pBandwidthLatencyTest from Nvidia CUDA sample code. Units are GB/s. Asynchronous read and write. Bidirectional. However, the Unidirectional test had shown a different result. Didn't fall down to a half. VM: Unidirectional P2P=Enabled Bandwidth Matrix (GB/s) D\D 0 1 2 3 4 5 6 7 0 424.07 10.02 11.33 11.30 11.09 11.05 11.06 11.10 1 10.05 425.98 11.40 11.33 11.08 11.10 11.13 11.09 2 11.31 11.28 423.67 10.10 11.14 11.13 11.13 11.11 3 11.30 11.31 10.08 425.05 11.10 11.07 11.09 11.06 4 11.16 11.17 11.21 11.17 423.67 10.08 11.25 11.28 5 10.97 11.01 11.07 11.02 10.09 425.52 11.23 11.27 6 11.09 11.13 11.16 11.10 11.28 11.33 422.71 10.10 7 11.13 11.09 11.15 11.11 11.36 11.33 10.02 422.75 Host: Unidirectional P2P=Enabled Bandwidth Matrix (GB/s) D\D 0 1 2 3 4 5 6 7 0 424.13 13.38 10.17 10.17 11.23 11.21 10.94 11.22 1 13.38 424.06 10.18 10.19 11.20 11.19 11.19 11.14 2 10.18 10.18 422.75 13.38 11.19 11.19 11.17 11.17 3 10.18 10.18 13.38 425.05 11.05 11.08 11.08 11.06 4 11.01 11.06 11.06 11.03 423.21 13.38 10.17 10.17 5 10.91 10.91 10.89 10.92 13.38 425.52 10.18 10.18 6 11.28 11.30 11.32 11.31 10.19 10.18 424.59 13.37 7 11.18 11.20 11.16 11.21 10.17 10.19 13.38 424.13 > > > In the VM, the bandwidth between two GPUs under the same physical switch > is > > obviously lower, as per the reasons you said in former threads. > > Hmm, I'm not sure I can explain why the number is lower than to more > remote GPUs though. Is the test simultaneously reading and writing and > therefore we overload the link to the upstream switch port? Otherwise > I'd expect the bidirectional support in PCIe to be able to handle the >
Re: [Qemu-devel] About virtio device hotplug in Q35!
Hi,

This is a sketch of my hardware topology.

         CPU0  <- QPI ->  CPU1
          |                |
 Root Port (at PCIe.0)   Root Port (at PCIe.1)
    /        \              /        \
 Switch    Switch        Switch    Switch
  /  \      /  \          /  \      /  \
GPU  GPU  GPU  GPU      GPU  GPU  GPU  GPU

And below are the p2p bandwidth test results.

Host:
   D\D     0      1      2      3      4      5      6      7
     0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
     1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
     2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
     3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
     4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
     5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
     6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
     7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15

VM:
   D\D     0      1      2      3      4      5      6      7
     0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
     1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
     2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
     3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
     4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
     5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
     6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
     7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23

In the VM, the bandwidth between two GPUs under the same physical switch is obviously lower, as per the reasons you said in former threads.

But what confused me most is that GPUs under different switches could achieve the same speed as on the host. Does that mean that after IOMMU address translation, the data traversal utilizes the QPI bus by default, even when the two devices do not belong to the same PCIe bus?

In a word, I'm trying to build a massive deep-learning/HPC infrastructure for the cloud environment. Nvidia itself released a solution based on dockers, and I believe QEMU/VMs could also do it. Hopefully I can get some help from the community.

The emulated switch you suggested looks like a good option to me; I will give it a try.

Thanks,
Bob

2017-07-27 1:32 GMT+08:00 Alex Williamson:
> On Wed, 26 Jul 2017 19:06:58 +0300
> "Michael S. Tsirkin" wrote:
>
> > On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> > > On Wed, 26 Jul 2017 09:21:38 +0300
> > > Marcel Apfelbaum wrote:
> > >
> > > > On 25/07/2017 11:53, 陈博 wrote:
> > > > > To accelerate data traversing between devices under the same PCIE Root
> > > > > Port or Switch.
> > > > >
> > > > > See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> > > >
> > > > Hi,
> > > >
> > > > It may be possible, but maybe PCIe Switch assignment is not
> > > > the only way to go.
> > > >
> > > > Adding Alex and Michael for their input on this matter.
> > > > More info at:
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> > >
> > > I think you need to look at where the IOMMU is in the topology and what
> > > address space the devices are working in when assigned to a VM to
> > > realize that it doesn't make any sense to assign switch ports to a VM.
> > > GPUs cannot do switch level peer to peer when assigned because they are
> > > operating in an I/O virtual address space. This is why we configure
> > > ACS on downstream ports to prevent peer to peer. Peer to peer
> > > transactions must be forwarded upstream by the switch ports in order to
> > > reach the IOMMU for translation. Note however that we do populate peer
> > > to peer mappings within the IOMMU, so if the hardware supports it, the
> > > IOMMU can reflect the transaction back out to the I/O bus to reach the
> > > other device without CPU involvement.
> > >
> > > Therefore I think the better solution, if it encourages the NVIDIA
> > > driver to do the right thing, is to use emulated switches. Assigning
> > > the physical switch would really do nothing more than make the PCIe link
> > > information more correct in the VM, everything else about the switch
> > > would be emulated. Even still, unless you have an I/O topology which
> > > integrates the IOMMU into the switch itself, the data flow still needs
> > > to go all the way to the root complex to hit the IOMMU before being
> > > reflected to the other device. Direct peer to peer between downstream
> > > switch ports operates in the wrong address space. Thanks,
> > >
> > > Alex
> >
> > That's true of course. What would make sense would be for
> > hardware vendors to add ATS support to their cards.
> >
> > Then peer to peer should be allowed by hypervisor for translated transactions.
> >
> > Gives you the performance benefit without the
Re: [Qemu-devel] [Device passthrough] Is there a way to passthrough PCIE switch/bridge ?
Details attached... I was trying to pass through multiple GPUs into a single VM; the topology looks like the output below. Without the help of PCIe switches, data traversal between GPUs is extremely slow (it depends on memcpy by the CPU). So my question is: can I pass through all the switches and the PCIe root, in order to make the Nvidia p2p feature work?

Host:
[root@localhost ~]# nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0     X    PIX   PHB   PHB   SOC   SOC   SOC   SOC   0-9,20-29
GPU1    PIX    X    PHB   PHB   SOC   SOC   SOC   SOC   0-9,20-29
GPU2    PHB   PHB    X    PIX   SOC   SOC   SOC   SOC   0-9,20-29
GPU3    PHB   PHB   PIX    X    SOC   SOC   SOC   SOC   0-9,20-29
GPU4    SOC   SOC   SOC   SOC    X    PIX   PHB   PHB   10-19,30-39
GPU5    SOC   SOC   SOC   SOC   PIX    X    PHB   PHB   10-19,30-39
GPU6    SOC   SOC   SOC   SOC   PHB   PHB    X    PIX   10-19,30-39
GPU7    SOC   SOC   SOC   SOC   PHB   PHB   PIX    X    10-19,30-39

VM:
[root@titan-xp-chenbo-2 ~]# nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0     X    PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-15
GPU1    PHB    X    PHB   PHB   PHB   PHB   PHB   PHB   0-15
GPU2    PHB   PHB    X    PHB   PHB   PHB   PHB   PHB   0-15
GPU3    PHB   PHB   PHB    X    PHB   PHB   PHB   PHB   0-15
GPU4    PHB   PHB   PHB   PHB    X    PHB   PHB   PHB   0-15
GPU5    PHB   PHB   PHB   PHB   PHB    X    PHB   PHB   0-15
GPU6    PHB   PHB   PHB   PHB   PHB   PHB    X    PHB   0-15
GPU7    PHB   PHB   PHB   PHB   PHB   PHB   PHB    X    0-15

Legend:
  X   = Self
  SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
  PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX = Connection traversing a single PCIe switch
  NV# = Connection traversing a bonded set of # NVLinks

2017-07-24 14:03 GMT+08:00 Bob Chen <a175818...@gmail.com>:
>
> - Bob
>
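The host and VM topology matrices above can be compared mechanically. This is a quick, unofficial parser for the matrix portion of `nvidia-smi topo -m` output; it assumes whitespace-separated columns and `GPUn` row labels, which matches the dumps in this thread but is not guaranteed by the tool.

```python
def parse_topo_matrix(text: str) -> dict:
    """Parse the matrix of `nvidia-smi topo -m` into a mapping
    {(gpu_row, gpu_col): link_type}. The trailing CPU-affinity
    column is ignored."""
    lines = [l.split() for l in text.strip().splitlines()]
    header = lines[0]                       # e.g. GPU0 GPU1 ... CPU Affinity
    ngpu = sum(1 for h in header if h.startswith("GPU"))
    links = {}
    for row in lines[1:]:
        if not row or not row[0].startswith("GPU"):
            continue                        # skip legend / blank lines
        for j in range(ngpu):
            links[(row[0], header[j])] = row[1 + j]
    return links

vm_sample = """\
     GPU0 GPU1 CPU Affinity
GPU0  X   PHB  0-15
GPU1 PHB   X   0-15
"""
links = parse_topo_matrix(vm_sample)
print(links[("GPU0", "GPU1")])  # → PHB
```

Diffing the host dict against the VM dict immediately shows every PIX/SOC link that got flattened to PHB inside the guest.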
[Qemu-devel] [Device passthrough] Is there a way to passthrough PCIE switch/bridge ?
- Bob
[Qemu-devel] Questions about GPU passthrough + multiple PCIE switches on host
Hi folks,

I have 8 GPU cards that need to be passed through to 1 VM. These cards sit behind 2 PCIe switches on the host server, in case there is a bandwidth limit within a single bus. So what are the correct QEMU bus parameters if I want to achieve the best performance? Can QEMU's pcie.0/1 parameter really reflect the actual physical device?

Thanks,
Bob
[Qemu-devel] How to upgrade QEMU?
Hi folks,

I am about to upgrade my QEMU version from an ancient 1.1.2 to the latest. My plan is to overwrite all the installed files with the new ones. Since I used to rename all the qemu-xxx binaries under /usr/local/bin with a version suffix, they are not my concern. I'm just wondering whether it is safe to replace the /usr/local/share files (ROMs and keymaps?) while some of my QEMU guests are still running in-flight.

Regards,
Bob
Re: [Qemu-devel] [Nbd] [Qemu-block] How to online resize qemu disk with nbd protocol?
Hi folks,

My time schedule doesn't allow me to wait for the community's solution, so I started to work on a quick fix, which is to add a 'bdrv_truncate' function to the current NBD BlockDriver. Basically it's an 'active resize' implementation.

I also realized that the 'bdrv_truncate' caller stack is not in a coroutine; it seemed to be the main thread. So I tried some synchronous code as below:

int nbd_truncate(BlockDriverState *bs, int64_t offset)
{
    // ...
    nbd_client_detach_aio_context(bs);
    qio_channel_set_blocking(client->ioc, true, NULL);

    /* step 1: send the custom NBD_CMD_RESIZE request */
    ret = nbd_send_request(client->ioc, &request);

    /* step 2: expect the confirmed new_size to come back as data */
    ret = nbd_receive_reply(client->ioc, &reply);
    read_sync(client->ioc, &new_size, sizeof(new_size));
    new_size = be64_to_cpu(new_size);

    qio_channel_set_blocking(client->ioc, false, NULL);
    nbd_client_attach_aio_context(bs, aio_context);
    // ...
}

However at step 2, the 'new_size' I read is not always correct. Sometimes the bytes are repeating; for instance 1073741824 (1 GB) became 1073741824073741824 ... Could you help me figure out what went wrong?

Regards,
Bob

2017-01-18 16:01 GMT+08:00 Wouter Verhelst:
> On Mon, Jan 16, 2017 at 01:36:21PM -0600, Eric Blake wrote:
> > Maybe the structured reply proposal can be extended into this (reserve a
> > "reply" header that can be issued as many times as desired by the server
> > without the client ever having issued the request first, and where the
> > reply never uses the end-of-reply marker), but I'm not sure I want to go
> > that direction just yet.
>
> It's not necessarily a bad idea, which could also be used for:
> - keepalive probes from server to client
> - unsolicited ESHUTDOWN messages
>
> both of which are currently not possible and might be useful for the
> protocol to have.
Re: [Qemu-devel] [Qemu-block] [Nbd] How to online resize qemu disk with nbd protocol?
There might be a time window between the NBD server's resize and the client's 're-read size' request. Is it safe? And what about an active 'resize' request from the client, considering some NBD servers might have the capability to do instant resizing? (Not applicable to LVM or host block devices, of course...)

Regards,
Bob

2017-01-13 0:54 GMT+08:00 Stefan Hajnoczi:
> On Thu, Jan 12, 2017 at 3:44 PM, Alex Bligh wrote:
> >> On 12 Jan 2017, at 14:43, Eric Blake wrote:
> >> That's because the NBD protocol lacks a resize command. You'd have to
> >> first get that proposed as an NBD extension before qemu could support it.
> >
> > Actually the NBD protocol lacks a 'make a disk with size X' command,
> > let alone a resize command. The size of an NBD disk is (currently)
> > entirely in the hands of the server. What I think we'd really need
> > would be a 'reread size' command, and have the server capable of
> > supporting resizing. That would then work for readonly images too.
>
> That would be fine for QEMU. Resizing LVM volumes or host block
> devices works exactly like this.
>
> Stefan
[Qemu-devel] How to online resize qemu disk with nbd protocol?
Hi,

My qemu runs on a 3rd-party distributed block storage, and the disk backend protocol is NBD. I noticed that there are differences between a default local qcow2 disk and my NBD disk, in terms of resizing the disk on the fly. A local qcow2 disk works with either qemu-img resize or the monitor's 'block_resize', but the NBD disk seemed to fail to detect the backend size change (I had resized the disk on EBS first). It said "this feature or command is not currently supported".

Is it possible to hack the qemu NBD code to make it work the same way as resizing a local qcow2 disk? I have an interface to resize the EBS disk at the backend.

Regards,
Bob
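For comparison, the two resize paths that already work for a local qcow2 disk, as described above (the drive id and image path here are illustrative):

```shell
# Offline resize of the image file:
qemu-img resize /var/lib/images/vm1.qcow2 +10G

# Online resize through the HMP monitor:
#   (qemu) block_resize drive_2 30G
```

The NBD case fails because the client keeps the export size it learned during the initial negotiation; as the follow-up thread notes, the protocol (at this time) has no command for re-reading or changing the size.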
Re: [Qemu-devel] Live migration + cpu/mem hotplug
Answer my own question: the corresponding command-line parameters for memory hot-add via the QEMU monitor are:

-object memory-backend-ram,id=mem0,size=1024M -device pc-dimm,id=dimm0,memdev=mem0

2017-01-05 18:12 GMT+08:00 Daniel P. Berrange <berra...@redhat.com>:
> On Thu, Jan 05, 2017 at 04:27:26PM +0800, Bob Chen wrote:
> > Hi,
> >
> > According to the docs, the destination Qemu must have the exactly same
> > parameters as the source one. So if the source has just finished cpu or
> > memory hotplug, what would the dest's parameters be like?
> >
> > Does DIMM device, or logically QOM object, have to be reflected on the new
> > command-line parameters?
>
> Yes, if you have hotplugged any type of device since the VM was started,
> the QEMU command line args on the target host must include all the original
> args from the source QEMU, and also any extra args to reflect the
> hotplugged devices too.
>
> A further complication is that on the target, you must also make sure you
> fully specify *all* device address information (PCI slots, SCSI luns, etc.),
> because the addresses QEMU assigns to a device after hotplug may
> not be the same as the addresses QEMU assigns to a device when coldplugged.
>
> e.g. if you boot a guest with 1 NIC + 1 disk, and then hotplug a 2nd NIC
> you might get
>
>   1st NIC  == PCI slot 2
>   1st disk == PCI slot 3
>   2nd NIC  == PCI slot 4
>
> if, however, you started QEMU with 2 NICs and 1 disk straight away, QEMU
> might assign addresses in the order
>
>   1st NIC  == PCI slot 2
>   2nd NIC  == PCI slot 3
>   1st disk == PCI slot 4
>
> this would totally kill a guest OS during live migration, as the slots
> for the devices it's using would change.
>
> So as a general rule when launching QEMU on a target host for migration,
> you must be explicit about all device addresses and not rely on QEMU to
> auto-assign addresses. This is quite a lot of work to get right, but if
> you're using libvirt it'll do pretty much all of this automatically for
> you.
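Putting the two halves together, a sketch of the full flow (the mem0/dimm0 ids are as in the mail above; the particular -m slots/maxmem values are illustrative, though some slots/maxmem setting is required for memory hotplug to work at all):

```shell
# On the source, hot-add memory at runtime via the HMP monitor:
#   (qemu) object_add memory-backend-ram,id=mem0,size=1024M
#   (qemu) device_add pc-dimm,id=dimm0,memdev=mem0

# The destination QEMU for migration must then be started with the
# equivalent cold-plug arguments:
qemu-system-x86_64 ... \
  -m 4G,slots=4,maxmem=32G \
  -object memory-backend-ram,id=mem0,size=1024M \
  -device pc-dimm,id=dimm0,memdev=mem0
```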
>
> Regards,
> Daniel
[Qemu-devel] Live migration + cpu/mem hotplug
Hi,

According to the docs, the destination QEMU must have exactly the same parameters as the source one. So if the source has just finished a CPU or memory hotplug, what should the destination's parameters look like?

Does the DIMM device, or logically the QOM object, have to be reflected in the new command-line parameters?

Regards,
Bob
[Qemu-devel] QEMU -smp parameter: Will multiple threads and cores improve performance?
Hi,

-smp 16
-smp cores=4,threads=4,sockets=1

Which one has better performance? The scenario is guest VMs running on a cloud server.

-Bob
[Qemu-devel] QEMU 1.1.2: block IO throttle might occasionally freeze running process's IO to zero
Test case:
1. QEMU 1.1.2
2. Run fio inside the VM and give it some pressure; watch the realtime throughput.
3. block_set_io_throttle drive_2 1 0 0 2000 0 0 # throttle bps and iops, any value
4. Observe that the IO is very likely to freeze to zero; the fio process gets stuck!
5. Kill the former fio process and start a new one; the IO turns back to normal.

I didn't reproduce it with QEMU 2.5. I'm not wishfully thinking the community would help fix this bug on such an ancient version; I just hope someone can tell me the root cause. Then I can evaluate whether I should move to a newer QEMU, or fix the bug on 1.1.2 in place (if it is a small one).
[Qemu-devel] cgroup blkio weight has no effect on qemu
Sorry for disturbing by replying; somehow I'm not able to send a new mail.

Hi folks,

Could you enlighten me on how to achieve proportional IO sharing by using cgroup, instead of qemu's io-throttling?

My qemu config is like:
-drive file=$DISKFILE,if=none,format=qcow2,cache=none,aio=native -device virtio-blk-pci...

The test command inside the VM is like:
dd if=/dev/vdc of=/dev/null iflag=direct

The cgroup blkio weight of the qemu process is properly configured as well. But no matter how I change the proportion, such as vm1=400 and vm2=100, I can only get equal IO speeds. Do blkio.weight and blkio.weight_device have no effect on qemu?

PS. cache=writethrough with aio=threads was also tested, with the same results.

- Bob
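In case it helps anyone comparing notes, this is roughly how the cgroup v1 side is usually set up (paths and PID variables are illustrative). One thing worth checking: blkio.weight proportions are enforced by the CFQ IO scheduler, so if the backing device is using deadline or noop, the weights are silently ignored, which would produce exactly the equal-speed result described above:

```shell
# Check / set the scheduler on the backing device (CFQ is required
# for blkio.weight to take effect on cgroup v1):
cat /sys/block/sda/queue/scheduler
echo cfq > /sys/block/sda/queue/scheduler

# Create one cgroup per VM and assign weights 4:1:
mkdir -p /sys/fs/cgroup/blkio/vm1 /sys/fs/cgroup/blkio/vm2
echo 400 > /sys/fs/cgroup/blkio/vm1/blkio.weight
echo 100 > /sys/fs/cgroup/blkio/vm2/blkio.weight

# Move each QEMU process (and its threads) into its group:
echo "$QEMU_PID_VM1" > /sys/fs/cgroup/blkio/vm1/cgroup.procs
echo "$QEMU_PID_VM2" > /sys/fs/cgroup/blkio/vm2/cgroup.procs
```

Also note that the weights only arbitrate when both groups are contending for the device at the same time; an unsaturated disk serves everyone at full speed.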