Re: [Qemu-devel] QEMU crashed when reconnecting over iscsi protocol

2018-12-19 Thread Bob Chen
BTW, the iscsi server I used is scsi-target-utils (
https://github.com/fujita/tgt).


Bob Chen  于2018年12月19日周三 下午7:34写道:

> I looked into the source code, and found some reconnect method from
> libiscsi. Are they able to work?
>
> QEMU: 2.12.1
> libiscsi: 1.18.0  (https://github.com/sahlberg/libiscsi)
>
>
> (gdb) f
> #0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at
> connect.c:461
> 461 memcpy(tmp_iscsi->old_iscsi, iscsi, sizeof(struct iscsi_context));
> (gdb) bt
> #0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at
> connect.c:461
> #1  0x7fcd956a4ccd in iscsi_service_reconnect_if_loggedin
> (iscsi=0x7fcd97f206d0) at socket.c:879
> #2  0x7fcd956a50d2 in iscsi_tcp_service (iscsi=0x7fcd97f206d0,
> revents=) at socket.c:989
> #3  0x7fcd9674b737 in iscsi_process_read (arg=0x7fcd97f1e680) at
> block/iscsi.c:371
> #4  0x7fcd967e5b47 in aio_dispatch_handlers (ctx=0x7fcd97ee48d0) at
> util/aio-posix.c:406
> #5  0x7fcd967e5ce9 in aio_dispatch (ctx=0x7fcd97ee48d0) at
> util/aio-posix.c:437
> #6  0x7fcd967e038c in aio_ctx_dispatch (source=0x7fcd97ee48d0,
> callback=0, user_data=0x0) at util/async.c:261
> #7  0x7fcd93b1135e in g_main_context_dispatch () from
> /usr/lib64/libglib-2.0.so.0
> #8  0x7fcd967e424e in glib_pollfds_poll () at util/main-loop.c:215
> #9  0x7fcd967e4368 in os_host_main_loop_wait (timeout=98700) at
> util/main-loop.c:263
> #10 0x7fcd967e4438 in main_loop_wait (nonblocking=0) at
> util/main-loop.c:522
> #11 0x7fcd963c4b29 in main_loop () at vl.c:1943
> #12 0x7fcd963cca8d in main (argc=62, argv=0x789f4208,
> envp=0x789f4400) at vl.c:4734
>


[Qemu-devel] QEMU crashed when reconnecting over iscsi protocol

2018-12-19 Thread Bob Chen
I looked into the source code and found some reconnect methods in
libiscsi. Are they supposed to work?

QEMU: 2.12.1
libiscsi: 1.18.0  (https://github.com/sahlberg/libiscsi)


(gdb) f
#0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at
connect.c:461
461 memcpy(tmp_iscsi->old_iscsi, iscsi, sizeof(struct iscsi_context));
(gdb) bt
#0  0x7fcd956933bd in iscsi_reconnect (iscsi=0x7fcd97f206d0) at
connect.c:461
#1  0x7fcd956a4ccd in iscsi_service_reconnect_if_loggedin
(iscsi=0x7fcd97f206d0) at socket.c:879
#2  0x7fcd956a50d2 in iscsi_tcp_service (iscsi=0x7fcd97f206d0,
revents=) at socket.c:989
#3  0x7fcd9674b737 in iscsi_process_read (arg=0x7fcd97f1e680) at
block/iscsi.c:371
#4  0x7fcd967e5b47 in aio_dispatch_handlers (ctx=0x7fcd97ee48d0) at
util/aio-posix.c:406
#5  0x7fcd967e5ce9 in aio_dispatch (ctx=0x7fcd97ee48d0) at
util/aio-posix.c:437
#6  0x7fcd967e038c in aio_ctx_dispatch (source=0x7fcd97ee48d0,
callback=0, user_data=0x0) at util/async.c:261
#7  0x7fcd93b1135e in g_main_context_dispatch () from
/usr/lib64/libglib-2.0.so.0
#8  0x7fcd967e424e in glib_pollfds_poll () at util/main-loop.c:215
#9  0x7fcd967e4368 in os_host_main_loop_wait (timeout=98700) at
util/main-loop.c:263
#10 0x7fcd967e4438 in main_loop_wait (nonblocking=0) at
util/main-loop.c:522
#11 0x7fcd963c4b29 in main_loop () at vl.c:1943
#12 0x7fcd963cca8d in main (argc=62, argv=0x789f4208,
envp=0x789f4400) at vl.c:4734
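
For reference, a quick hedged check from the same gdb session (the variable
and field names are taken from the crashing line above, not verified against
the libiscsi sources) would show whether the copy destination is simply a
NULL pointer:

(gdb) frame 0
(gdb) print tmp_iscsi
(gdb) print tmp_iscsi->old_iscsi

If either prints as 0x0, the memcpy() at connect.c:461 is writing through a
NULL pointer during the reconnect.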


Re: [Qemu-devel] [QEMU + SPDK] The demo in the official document is not working

2018-04-23 Thread Bob Chen
Problem solved. Got a reply from Intel just now.


-- Forwarded message --
From: Liu, Changpeng <changpeng@intel.com>
Date: 2018-04-23 18:06 GMT+08:00
Subject: [Qemu-devel] [QEMU + SPDK] The demo in the official document is
not working
To: "a175818...@gmail.com" <a175818...@gmail.com>


Hi Bob,

The issue was introduced by the following commit:
commit fb20fbb764 "vhost: avoid to start/stop virtqueue which is not ready".


When booting with SeaBIOS, the virtio-scsi driver in SeaBIOS will only
enumerate 1 I/O queue and does not include the task management and event
queues, while the SPDK vhost target must get all 3 queues when starting,
so the process blocks in SeaBIOS.

As a workaround for now, you can start with only the chardev and, after
booting into the OS, use device_add to hot-plug the virtio-scsi
controller. We are developing the proper fix, either in SeaBIOS or in
SPDK; this will be fixed very soon.
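
For reference, a hedged sketch of that workaround (device and chardev names
are copied from the command line quoted further down; the exact monitor
syntax is an assumption): launch the guest with only the
-chardev socket,id=spdk_vhost_scsi0,path=/var/tmp/vhost.0 option, and once
the guest OS is up, hot-plug the controller from the QEMU monitor:

(qemu) device_add vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0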


Best Regards,
Changpeng Liu


2018-04-23 16:19 GMT+08:00 Bob Chen <a175818...@gmail.com>:

> Hi,
>
> I was trying to run qemu with spdk, referring to
> http://www.spdk.io/doc/vhost.html#vhost_qemu_config . Steps were strictly
> followed.
>
> # Environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
>> dpdk 17.11.1, qemu 2.11.1
>
>
>
> cd spdk
>> sudo su
>> ulimit -l unlimited
>> HUGEMEM=2048 ./scripts/setup.sh
>> ./app/vhost/vhost -S /var/tmp -s 1024 -m 0x1 &
>> ./scripts/rpc.py construct_nvme_bdev -b Nvme0 -t pcie -a :03:00.0
>> ./scripts/rpc.py construct_malloc_bdev 128 4096 -b Malloc0
>> ./scripts/rpc.py construct_vhost_scsi_controller --cpumask 0x1 vhost.0
>> ./scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Nvme0n1
>> ./scripts/rpc.py add_vhost_scsi_lun vhost.0 1 Malloc0
>>
>> qemu-system-x86_64 -enable-kvm -cpu host -machine pc,accel=kvm -daemonize
>> -vnc :1 \
>> -smp 1 -m 1G -object 
>> memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on
>> -numa node,memdev=mem0 \
>> -drive file=,if=none,id=disk -device
>> ide-hd,drive=disk,bootindex=0 \
>> -chardev socket,id=spdk_vhost_scsi0,path=/var/tmp/vhost.0 \
>> -device vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0
>
>
>
> Unfortunately the demo was not able to work, stopped at the boot-loader
> screen saying that not bootable device. But if I removed the last two lines(
> vhost-user-scsi-pci), the guest could start successfully. So there must
> be something wrong with the spdk device.
>
> Is there anyone in this community who happened to be familiar with spdk?
> Or should I seek for help from Intel? Don't know who is responsible for
> maintaining this document.
>
>
> Thanks,
> Bob
>


Re: [Qemu-devel] [SPDK] qemu process hung at boot-up, no explicit errors or warnings

2018-04-23 Thread Bob Chen
2018-04-21 1:34 GMT+08:00 John Snow <js...@redhat.com>:

>
>
> On 04/20/2018 07:13 AM, Bob Chen wrote:
> > 2.11.1 could work, qemu is no longer occupying 100% CPU. That's
> > interesting...
> >
>
> Does 2.12 use 100% even at the firmware menu? Maybe we're not giving
> this VM long enough to hit the spot that causes it to use 100%.
>
> > Now I can see the starting screen via vnc, says that not a bootable disk.
> >
>
> If it's stopping at the firmware I think the guest is not being taken
> into account, and the firmware is either trying to boot the wrong disk
> or your disk is corrupted. (Or it can't see your disk at all -- I'm
> personally not very familiar with using SPDK.)


> Can you share your command line with us?
>

I opened another thread to discuss this issue; it looks like there's
something wrong with the official SPDK doc.

However, 2.12 still looks buggy. I'm sure I waited long enough, but the CPU
never dropped below 100%. The VNC screen stopped at "Loading SeaBIOS ...".

Again, 2.10 and 2.11 are fine: they still end up at "No bootable device",
but the CPU usage drops anyway.


>
> >
> > Is that because the guest OS's virtio vhost driver is not up to date
> enough?
> >
> >
> > Thanks,
> > Bob
> >
>
> On the QEMU-devel list and many other technical lists, can you please
> reply in-line instead of replying above? ("top posting")
>
> Thank you,
> --John
>
> > 2018-04-20 1:17 GMT+08:00 John Snow <js...@redhat.com
> > <mailto:js...@redhat.com>>:
> >
> > Forwarding to qemu-block.
> >
> > On 04/19/2018 06:13 AM, Bob Chen wrote:
> > > Hi,
> > >
> > > I was trying to run qemu with spdk, referring to
> > > http://www.spdk.io/doc/vhost.html#vhost_qemu_config
> > <http://www.spdk.io/doc/vhost.html#vhost_qemu_config>
> > >
> > > Everything went well since I had already set up hugepages, vfio,
> vhost
> > > targets, vhost-scsi device(vhost-block was also tested), etc,
> without
> > > errors or warnings reported.
> > >
> > > But at the last step to run qemu, the process would somehow hang.
> And its
> > > cpu busy remained 100%.
> > >
> > > I used `perf top -p` to monitor the process activity,
> > >   18.28%  [kernel]  [k] vmx_vcpu_run
> > >3.06%  [kernel]  [k] vcpu_enter_guest
> > >3.05%  [kernel]  [k]
> system_call_after_swapgs
> > >
> > > Do you have any ideas about what happened here?
> > >
> > > Test environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
> > > dpdk 17.11.1, qemu 2.12.0-rc2. Detail log is attached within this
> mail.
> > >
> >
> > Have you tried any other versions? (2.11.1, or 2.12-rc4?)
> >
> > >
> > >
> > > Thanks,
> > > Bob
> > >
> >
> >
>


[Qemu-devel] [QEMU + SPDK] The demo in the official document is not working

2018-04-23 Thread Bob Chen
Hi,

I was trying to run qemu with SPDK, referring to
http://www.spdk.io/doc/vhost.html#vhost_qemu_config . The steps were
strictly followed.

# Environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
> dpdk 17.11.1, qemu 2.11.1



cd spdk
> sudo su
> ulimit -l unlimited
> HUGEMEM=2048 ./scripts/setup.sh
> ./app/vhost/vhost -S /var/tmp -s 1024 -m 0x1 &
> ./scripts/rpc.py construct_nvme_bdev -b Nvme0 -t pcie -a :03:00.0
> ./scripts/rpc.py construct_malloc_bdev 128 4096 -b Malloc0
> ./scripts/rpc.py construct_vhost_scsi_controller --cpumask 0x1 vhost.0
> ./scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Nvme0n1
> ./scripts/rpc.py add_vhost_scsi_lun vhost.0 1 Malloc0
>
> qemu-system-x86_64 -enable-kvm -cpu host -machine pc,accel=kvm -daemonize
> -vnc :1 \
> -smp 1 -m 1G -object
> memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on -numa
> node,memdev=mem0 \
> -drive file=,if=none,id=disk -device
> ide-hd,drive=disk,bootindex=0 \
> -chardev socket,id=spdk_vhost_scsi0,path=/var/tmp/vhost.0 \
> -device vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0



Unfortunately the demo did not work: it stopped at the boot-loader screen
saying there was no bootable device. But if I removed the last two lines
(the vhost-user-scsi-pci device), the guest could start successfully. So
there must be something wrong with the SPDK device.

Is there anyone in this community who happens to be familiar with SPDK, or
should I seek help from Intel? I don't know who is responsible for
maintaining this document.


Thanks,
Bob


Re: [Qemu-devel] [SPDK] qemu process hung at boot-up, no explicit errors or warnings

2018-04-20 Thread Bob Chen
2.11.1 works; qemu is no longer occupying 100% CPU. That's interesting...

Now I can see the boot screen via VNC; it says there is no bootable disk.


Is that because the guest OS's virtio vhost driver is not up to date enough?

Thanks,
Bob

2018-04-20 1:17 GMT+08:00 John Snow <js...@redhat.com>:

> Forwarding to qemu-block.
>
> On 04/19/2018 06:13 AM, Bob Chen wrote:
> > Hi,
> >
> > I was trying to run qemu with spdk, referring to
> > http://www.spdk.io/doc/vhost.html#vhost_qemu_config
> >
> > Everything went well since I had already set up hugepages, vfio, vhost
> > targets, vhost-scsi device(vhost-block was also tested), etc, without
> > errors or warnings reported.
> >
> > But at the last step to run qemu, the process would somehow hang. And its
> > cpu busy remained 100%.
> >
> > I used `perf top -p` to monitor the process activity,
> >   18.28%  [kernel]  [k] vmx_vcpu_run
> >3.06%  [kernel]  [k] vcpu_enter_guest
> >3.05%  [kernel]  [k] system_call_after_swapgs
> >
> > Do you have any ideas about what happened here?
> >
> > Test environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
> > dpdk 17.11.1, qemu 2.12.0-rc2. Detail log is attached within this mail.
> >
>
> Have you tried any other versions? (2.11.1, or 2.12-rc4?)
>
> >
> >
> > Thanks,
> > Bob
> >
>
>


[Qemu-devel] [SPDK] qemu process hung at boot-up, no explicit errors or warnings

2018-04-19 Thread Bob Chen
Hi,

I was trying to run qemu with spdk, referring to
http://www.spdk.io/doc/vhost.html#vhost_qemu_config

Everything went well: I had already set up hugepages, VFIO, the vhost
targets, the vhost-scsi device (vhost-block was also tested), etc., without
any errors or warnings reported.

But at the last step, running qemu, the process would somehow hang, and its
CPU usage remained at 100%.

I used `perf top -p` to monitor the process activity,
  18.28%  [kernel]  [k] vmx_vcpu_run
   3.06%  [kernel]  [k] vcpu_enter_guest
   3.05%  [kernel]  [k] system_call_after_swapgs

Do you have any ideas about what happened here?

Test environment: latest CentOS 7 kernel, nvme ssd, spdk v18.01.x,
dpdk 17.11.1, qemu 2.12.0-rc2. Detail log is attached within this mail.



Thanks,
Bob


qemu-spdk.log
Description: Binary data


Re: [Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK

2018-04-18 Thread Bob Chen
I think you could edit the GitHub repo's description to tell people not to
download releases from that site.

2018-04-18 17:29 GMT+08:00 Peter Maydell <peter.mayd...@linaro.org>:

> On 18 April 2018 at 09:09, Bob Chen <a175818...@gmail.com> wrote:
> > I found that it has nothing to do with the release version, but the
> github
> > one is just not able to work...
> >
> > So what github and qemu.org provide are totally different things?
>
> The "tarballs" from github don't work and can't work, but
> github provides no mechanism for the project to say "don't
> create this page". You can complain to github about this
> if you like.
>
> You're best off starting with the project's own web
> page: https://www.qemu.org/ -- we use github only as
> a sort of backup git server in case our own is down.
>
> thanks
> -- PMM
>
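
For reference, a hedged sketch of the two working ways to get a buildable
2.12.0-rc4 tree (exact URLs are assumptions; the project's own download
page is the authoritative source):

# official tarball, generated with the submodules and build machinery included
wget https://download.qemu.org/qemu-2.12.0-rc4.tar.xz

# or a full git clone (cloning from the GitHub mirror is fine, only its
# auto-generated tarballs are broken)
git clone https://github.com/qemu/qemu.git
cd qemu && git checkout v2.12.0-rc4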


Re: [Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK

2018-04-18 Thread Bob Chen
I found that it has nothing to do with the release version; the GitHub
tarball is just not able to work...

So what GitHub and qemu.org provide are totally different things?




2018-04-18 14:52 GMT+08:00 Bob Chen <a175818...@gmail.com>:

> No, I downloaded the release tarball from github.
>
> 2018-04-18 14:25 GMT+08:00 Stefan Weil <s...@weilnetz.de>:
>
>> Am 18.04.2018 um 08:19 schrieb Bob Chen:
>> > fatal error: ui/input-keymap-atset1-to-qcode.c: No such file or
>> directory
>> >
>> > Build on my CentOS 7.
>> >
>>
>> That file is generated during the build. rc4 compiles in my test on
>> Debian. Did you start your build from a fresh git clone?
>>
>> Stefan
>>
>
>


Re: [Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK

2018-04-18 Thread Bob Chen
No, I downloaded the release tarball from github.

2018-04-18 14:25 GMT+08:00 Stefan Weil <s...@weilnetz.de>:

> Am 18.04.2018 um 08:19 schrieb Bob Chen:
> > fatal error: ui/input-keymap-atset1-to-qcode.c: No such file or
> directory
> >
> > Build on my CentOS 7.
> >
>
> That file is generated during the build. rc4 compiles in my test on
> Debian. Did you start your build from a fresh git clone?
>
> Stefan
>


[Qemu-devel] Latest v2.12.0-rc4 has compiling error, rc3 is OK

2018-04-18 Thread Bob Chen
fatal error: ui/input-keymap-atset1-to-qcode.c: No such file or directory

This was a build on my CentOS 7 host.


Re: [Qemu-devel] [GPU and VFIO] qemu hang at startup, VFIO_IOMMU_MAP_DMA is extremely slow

2018-01-01 Thread Bob Chen
Ping...

Could it be that VFIO_IOMMU_MAP_DMA needs contiguous memory and my host was
not able to provide it immediately?

2017-12-26 19:37 GMT+08:00 Bob Chen <a175818...@gmail.com>:

>
>
> 2017-12-26 18:51 GMT+08:00 Liu, Yi L <yi.l@intel.com>:
>
>> > -Original Message-
>> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=
>> intel@nongnu.org]
>> > On Behalf Of Bob Chen
>> > Sent: Tuesday, December 26, 2017 6:30 PM
>> > To: qemu-devel@nongnu.org
>> > Subject: [Qemu-devel] [GPU and VFIO] qemu hang at startup,
>> > VFIO_IOMMU_MAP_DMA is extremely slow
>> >
>> > Hi,
>> >
>> > I have a host server with multiple GPU cards, and was assigning them to
>> qemu
>> > with VFIO.
>> >
>> > I found that when setting up the last free GPU, the qemu process would
>> hang
>>
>> Are all the GPUs in the same iommu group?
>>
>
> Each of them is in a single group.
>
>
>>
>> > there and took almost 10 minutes before finishing startup. I made some
>> dig by
>> > gdb, and found the slowest part occurred at the
>> > hw/vfio/common.c:vfio_dma_map function call.
>>
>> This is to setup mapping and it takes time. This function would be called
>> multiple
>> times and it will take some time. The slowest part, do you mean it takes
>> a long time for a single vfio_dma_map() calling or the whole passthru
>> spends a lot
>> of time on creating mapping. If a single calling takes a lot of time,
>> then it may be
>> a problem.
>>
>
> Each vfio_dma_map() takes 3 to 10 mins accordingly.
>
>
>>
>> You may paste your Qemu command which might help. And the dmesg in host
>> would also help.
>>
>
> cmd line:
> After adding -device vfio-pci,host=09:00.0,multifunction=on,addr=0x15,
> qemu would hang.
> Otherwise, could start immediately without this option.
>
> dmesg:
> [Tue Dec 26 18:39:50 2017] vfio-pci :09:00.0: enabling device (0400 ->
> 0402)
> [Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap
> 0x1e@0x258
> [Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap
> 0x19@0x900
> [Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation
> wraparound
> [Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation
> wraparound
> [Tue Dec 26 18:40:03 2017] kvm [74663]: vcpu0 ignored rdmsr: 0x345
>
> Kernel:
> 3.10.0-514.16.1  CentOS 7.3
>
>
>>
>> >
>> >
>> > static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> ram_addr_t
>> > size, void *vaddr, bool readonly) { ...
>> > if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, ) == 0 ||
>> > (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
>> >  ioctl(container->fd, VFIO_IOMMU_MAP_DMA, ) == 0)) {
>> > return 0;
>> > }
>> > ...
>> > }
>> >
>> >
>> > The hang was enable to reproduce on one of my hosts, I was setting up a
>> 4GB
>> > memory VM, while the host still had 16GB free. GPU physical mem is 8G.
>>
>> Does it happen when you only assign a single GPU?
>>
>
> Not sure. Didn't try multiple GPUs.
>
>
>>
>> > Also, this phenomenon was observed on other hosts occasionally, and the
>> > similarity is that they always happened on the last free GPU.
>> >
>> >
>> > Full stack trace file is attached. Looking forward for you help, thanks
>> >
>> >
>> > - Bob
>>
>> Regards,
>> Yi L
>>
>
>


Re: [Qemu-devel] [GPU and VFIO] qemu hang at startup, VFIO_IOMMU_MAP_DMA is extremely slow

2017-12-26 Thread Bob Chen
2017-12-26 18:51 GMT+08:00 Liu, Yi L <yi.l@intel.com>:

> > -Original Message-
> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=
> intel@nongnu.org]
> > On Behalf Of Bob Chen
> > Sent: Tuesday, December 26, 2017 6:30 PM
> > To: qemu-devel@nongnu.org
> > Subject: [Qemu-devel] [GPU and VFIO] qemu hang at startup,
> > VFIO_IOMMU_MAP_DMA is extremely slow
> >
> > Hi,
> >
> > I have a host server with multiple GPU cards, and was assigning them to
> qemu
> > with VFIO.
> >
> > I found that when setting up the last free GPU, the qemu process would
> hang
>
> Are all the GPUs in the same iommu group?
>

Each of them is in its own group.
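
For reference, a hedged way to double-check the grouping from the host via
sysfs (standard kernel paths; nothing VFIO-specific assumed):

for g in /sys/kernel/iommu_groups/*; do
    echo "group ${g##*/}: $(ls $g/devices)"
done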


>
> > there and took almost 10 minutes before finishing startup. I made some
> dig by
> > gdb, and found the slowest part occurred at the
> > hw/vfio/common.c:vfio_dma_map function call.
>
> This is to setup mapping and it takes time. This function would be called
> multiple
> times and it will take some time. The slowest part, do you mean it takes
> a long time for a single vfio_dma_map() calling or the whole passthru
> spends a lot
> of time on creating mapping. If a single calling takes a lot of time, then
> it may be
> a problem.
>

Each vfio_dma_map() call takes 3 to 10 minutes.


>
> You may paste your Qemu command which might help. And the dmesg in host
> would also help.
>

Command line:
After adding -device vfio-pci,host=09:00.0,multifunction=on,addr=0x15, qemu
would hang.
Without this option, it could start immediately.

dmesg:
[Tue Dec 26 18:39:50 2017] vfio-pci :09:00.0: enabling device (0400 ->
0402)
[Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap
0x1e@0x258
[Tue Dec 26 18:39:51 2017] vfio_ecap_init: :09:00.0 hiding ecap
0x19@0x900
[Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation
wraparound
[Tue Dec 26 18:39:55 2017] kvm: zapping shadow pages for mmio generation
wraparound
[Tue Dec 26 18:40:03 2017] kvm [74663]: vcpu0 ignored rdmsr: 0x345

Kernel:
3.10.0-514.16.1  CentOS 7.3


>
> >
> >
> > static int vfio_dma_map(VFIOContainer *container, hwaddr iova, ram_addr_t
> > size, void *vaddr, bool readonly) { ...
> > if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, ) == 0 ||
> > (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
> >  ioctl(container->fd, VFIO_IOMMU_MAP_DMA, ) == 0)) {
> > return 0;
> > }
> > ...
> > }
> >
> >
> > The hang was enable to reproduce on one of my hosts, I was setting up a
> 4GB
> > memory VM, while the host still had 16GB free. GPU physical mem is 8G.
>
> Does it happen when you only assign a single GPU?
>

Not sure. Didn't try multiple GPUs.


>
> > Also, this phenomenon was observed on other hosts occasionally, and the
> > similarity is that they always happened on the last free GPU.
> >
> >
> > Full stack trace file is attached. Looking forward for you help, thanks
> >
> >
> > - Bob
>
> Regards,
> Yi L
>


[Qemu-devel] [GPU and VFIO] qemu hang at startup, VFIO_IOMMU_MAP_DMA is extremely slow

2017-12-26 Thread Bob Chen
Hi,

I have a host server with multiple GPU cards, and was assigning them to
qemu with VFIO.

I found that when setting up the last free GPU, the qemu process would hang
there and take almost 10 minutes to finish startup. I did some digging with
gdb and found that the slowest part was the hw/vfio/common.c:vfio_dma_map
function call.


static int vfio_dma_map(VFIOContainer *container, hwaddr iova, ram_addr_t
size, void *vaddr, bool readonly)
{
...
    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
        return 0;
    }
...
}
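
For what it's worth, a minimal hedged sketch (illustrative only, not a
proposed QEMU patch) of how one could confirm that a single mapping ioctl
really takes minutes, by timing it the same way the call above is made:

#include <stdio.h>
#include <time.h>
#include <sys/ioctl.h>

/* Wrap one VFIO_IOMMU_MAP_DMA ioctl and report how long it took.
 * 'fd' corresponds to container->fd and 'map' to the
 * struct vfio_iommu_type1_dma_map built in vfio_dma_map() above. */
static int timed_map_dma(int fd, unsigned long request, void *map)
{
    struct timespec t0, t1;
    int ret;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ret = ioctl(fd, request, map);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    fprintf(stderr, "VFIO map ioctl took %ld ms\n",
            (t1.tv_sec - t0.tv_sec) * 1000 +
            (t1.tv_nsec - t0.tv_nsec) / 1000000);
    return ret;
}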


The hang was reproducible on one of my hosts. I was setting up a VM with
4 GB of memory, while the host still had 16 GB free. The GPU's physical
memory is 8 GB.

Also, this phenomenon was observed on other hosts occasionally, and the
common point is that it always happened on the last free GPU.


The full stack trace file is attached. Looking forward to your help, thanks


- Bob
#0  vfio_dma_map (container=0x56b1f880, iova=0, size=655360, 
vaddr=0x7ffecfe0, readonly=false) at 
/usr/src/debug/qemu-2.6.2/hw/vfio/common.c:227
map = {argsz = 655359, flags = 0, vaddr = 0, iova = 140737488339248, 
size = 93824994141691}
#1  0x557712dc in vfio_listener_region_add (listener=0x56b1f890, 
section=0x7fffc1f0) at /usr/src/debug/qemu-2.6.2/hw/vfio/common.c:419
container = 0x56b1f880
iova = 0
end = 655359
llend = {lo = 655360, hi = 0}
llsize = {lo = 655360, hi = 0}
vaddr = 0x7ffecfe0
ret = 7
__func__ = "vfio_listener_region_add"
#2  0x55728465 in listener_add_address_space (listener=0x56b1f890, 
as=0x560823e0 )
at /usr/src/debug/qemu-2.6.2/memory.c:2179
section = {mr = 0x566ec570, address_space = 0x560823e0 
, offset_within_region = 0, size = {lo = 655360, hi = 0},
  offset_within_address_space = 0, readonly = false}
view = 0x57ae3bd0
fr = 0x566f0c00
#3  0x5572860d in memory_listener_register (listener=0x56b1f890, 
filter=0x560823e0 )
at /usr/src/debug/qemu-2.6.2/memory.c:2208
other = 0x565b4910
as = 0x560823e0 
#4  0x55772811 in vfio_connect_container (group=0x5784bce0, 
as=0x560823e0 )
at /usr/src/debug/qemu-2.6.2/hw/vfio/common.c:900
container = 0x56b1f880
ret = 0
fd = 35
space = 0x5784bd20
#5  0x55772cbc in vfio_get_group (groupid=25, as=0x560823e0 
) at /usr/src/debug/qemu-2.6.2/hw/vfio/common.c:1008
group = 0x5784bce0
path = 
"/dev/vfio/25\000U\000\000P\303\377\377\377\177\000\000\332Q\224UUU\000"
status = {argsz = 8, flags = 1}
#6  0x5577af5c in vfio_initfn (pdev=0x581672b0) at 
/usr/src/debug/qemu-2.6.2/hw/vfio/pci.c:2447
vdev = 0x581672b0
vbasedev_iter = 0x40b
group = 0x55bbc65d
tmp = 0x57640b60 ""
group_path = 
"../../../../../../kernel/iommu_groups/25\000\000\000\000\343\003\000\000\031ĻUUU\000\000\000\000\000\000\000\000\000\000\220\304\377\377\377\177\000\000]ƻU\a\000\000\000\320ɻUUU\000\000\360\304\377\377\v\004\000\000\300\305\377\377\377\177\000\000I\252\260UUU\000\000\360\304\377\377\377\1-
77\000\000\000\000\000\000\000\000\000\000\320\304\377\377\377\177\000\000]ƻUUU\000\000\260ɻUUU\000\000f˲U\343\003\000\000\241:\000\000\000\200\377\377\002",
 '\000' , 
"\060\000\000\000[\000\000\000`\305\377\377\377\177"...
group_name = 0x7fffc466 "25"
len = 40
st = {st_dev = 17, st_ino = 39127, st_nlink = 3, st_mode = 16877, 
st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 4096,
  st_blocks = 0, st_atim = {tv_sec = 1513939417, tv_nsec = 943657386}, 
st_mtim = {tv_sec = 1510113186, tv_nsec = 59601}, st_ctim = {
tv_sec = 1510113186, tv_nsec = 59601}, __unused = {0, 0, 0}}
groupid = 25
ret = 21845
#7  0x55943b65 in pci_default_realize (dev=0x581672b0, 
errp=0x7fffd4b8) at hw/pci/pci.c:1895
pc = 0x56568e70
__func__ = "pci_default_realize"
#8  0x55943a08 in pci_qdev_realize (qdev=0x581672b0, 
errp=0x7fffd520) at hw/pci/pci.c:1867
pci_dev = 0x581672b0
pc = 0x56568e70
__func__ = "pci_qdev_realize"
local_err = 0x0
bus = 0x569baea0
is_default_rom = false
#9  0x558af8da in device_set_realized (obj=0x581672b0, value=true, 
errp=0x7fffd6e0) at hw/core/qdev.c:1066
dev = 0x581672b0
__func__ = "device_set_realized"
dc = 0x56568e70
hotplug_ctrl = 0x55af83cf 
bus = 0x7fffd5c7
local_err = 0x0
#10 0x55a3754d in property_set_bool (obj=0x581672b0, 
v=0x565a9140, name=0x55b494e9 

Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-11-30 Thread Bob Chen
Hi,

After 3 months of work and investigation, and tedious mail discussions with
Nvidia, I think some progress has been made on GPUDirect (p2p) in a virtual
environment.

The only remaining issue is the low bidirectional bandwidth between two
sibling GPUs under the same PCIe switch.

We expanded the tests to run on even more GPU cards, so the results now
seem conclusive.


P40 is OK, and its hardware topology on host is:
 \-[:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3
v4/Xeon D DMI2
 +-01.0-[03]00.0  LSI Logic / Symbios Logic MegaRAID SAS-3
3008 [Fury]
 +-02.0-[04]00.0  NVIDIA Corporation GP102GL [Tesla P40]
 +-03.0-[02]00.0  NVIDIA Corporation GP102GL [Tesla P40]


M60, not OK, low bandwidth:
 \-[:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3
v4/Xeon D DMI2
 +-01.0-[06]00.0  LSI Logic / Symbios Logic MegaRAID SAS-3
3008 [Fury]
 +-02.0-[07-0a]00.0-[08-0a]--+-08.0-[09]00.0  NVIDIA
Corporation GM204GL [Tesla M60]
 |   \-10.0-[0a]00.0  NVIDIA
Corporation GM204GL [Tesla M60]


V100, not OK, low bandwidth:
\-[:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon
D DMI2
 +-01.0-[01]--+-00.0  Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]
 |\-00.1  Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]
 +-02.0-[02-05]00.0-[03-05]--+-08.0-[04]00.0  NVIDIA
Corporation GV100 [Tesla V100 PCIe]
 |   \-10.0-[05]00.0  NVIDIA
Corporation GV100 [Tesla V100 PCIe]



So what might be the actual effect of the PLX switch hardware on the GPU
data flow, even though it is not visible in the guest OS?
The Nvidia tech-support guys are not familiar with virtualization; they
asked us to consult the community first.


Re: [Qemu-devel] [PATCH 0/3] vfio/pci: Add NVIDIA GPUDirect P2P clique support

2017-11-20 Thread Bob Chen
That was my mistake, please ignore it. The patches do work.

2017-10-26 18:45 GMT+08:00 Bob Chen <a175818...@gmail.com>:

> There seem to be some bugs in these patches, causing my VM failed to boot.
>
> Test case:
>
> 0. Merge these 3 patches in to release 2.10.1
>
> 1. qemu-system-x86_64_2.10.1  ... \
> -device vfio-pci,host=04:00.0 \
> -device vfio-pci,host=05:00.0 \
> -device vfio-pci,host=08:00.0 \
> -device vfio-pci,host=09:00.0 \
> -device vfio-pci,host=85:00.0 \
> -device vfio-pci,host=86:00.0 \
> -device vfio-pci,host=89:00.0 \
> -device vfio-pci,host=8a:00.0 ...
>
> The guest was able to boot up.
>
> 2. qemu-system-x86_64_2.10.1  ... \
> -device vfio-pci,host=04:00.0,x-nv-gpudirect-clique=0 \
> -device vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0 \
> -device vfio-pci,host=08:00.0,x-nv-gpudirect-clique=0 \
> -device vfio-pci,host=09:00.0,x-nv-gpudirect-clique=0 \
> -device vfio-pci,host=85:00.0,x-nv-gpudirect-clique=8 \
> -device vfio-pci,host=86:00.0,x-nv-gpudirect-clique=8 \
> -device vfio-pci,host=89:00.0,x-nv-gpudirect-clique=8 \
> -device vfio-pci,host=8a:00.0,x-nv-gpudirect-clique=8 \
>
> Hang. VNC couldn't connect.
>
>
> My personal patch used to work, although it was done by straightforward
> hacking and not that friendly to read.
>
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
>
>  @@static int vfio_initfn(PCIDevice *pdev)
>
> vfio_add_emulated_long(vdev, 0xc8, 0x50080009, ~0);
> if (count < 4)
> {
> vfio_add_emulated_long(vdev, 0xcc, 0x5032, ~0);
> }
> else
> {
> vfio_add_emulated_long(vdev, 0xcc, 0x00085032, ~0);
> }
> vfio_add_emulated_word(vdev, 0x78, 0xc810, ~0);
>
> 2017-08-30 6:05 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
>
>> NVIDIA has a specification for exposing a virtual vendor capability
>> which provides a hint to guest drivers as to which sets of GPUs can
>> support direct peer-to-peer DMA.  Devices with the same clique ID are
>> expected to support this.  The user can specify a clique ID for an
>> NVIDIA graphics device using the new vfio-pci x-nv-gpudirect-clique=
>> option, where valid clique IDs are a 4-bit integer.  It's entirely the
>> user's responsibility to specify sets of devices for which P2P works
>> correctly and provides some benefit.  This is only useful for DMA
>> between NVIDIA GPUs, therefore it's only useful to specify cliques
>> comprised of more than one GPU.  Furthermore, this does not enable DMA
>> between VMs, there is no change to VM DMA mapping, this only exposes
>> hints about existing DMA paths to the guest driver.  Thanks,
>>
>> Alex
>>
>> ---
>>
>> Alex Williamson (3):
>>   vfio/pci: Do not unwind on error
>>   vfio/pci: Add virtual capabilities quirk infrastructure
>>   vfio/pci: Add NVIDIA GPUDirect Cliques support
>>
>>
>>  hw/vfio/pci-quirks.c |  114 ++
>> 
>>  hw/vfio/pci.c|   17 +++
>>  hw/vfio/pci.h|4 ++
>>  3 files changed, 133 insertions(+), 2 deletions(-)
>>
>
>


Re: [Qemu-devel] [PATCH 0/3] vfio/pci: Add NVIDIA GPUDirect P2P clique support

2017-10-26 Thread Bob Chen
There seem to be some bugs in these patches, causing my VM to fail to boot.

Test case:

0. Merge these 3 patches into release 2.10.1

1. qemu-system-x86_64_2.10.1  ... \
-device vfio-pci,host=04:00.0 \
-device vfio-pci,host=05:00.0 \
-device vfio-pci,host=08:00.0 \
-device vfio-pci,host=09:00.0 \
-device vfio-pci,host=85:00.0 \
-device vfio-pci,host=86:00.0 \
-device vfio-pci,host=89:00.0 \
-device vfio-pci,host=8a:00.0 ...

The guest was able to boot up.

2. qemu-system-x86_64_2.10.1  ... \
-device vfio-pci,host=04:00.0,x-nv-gpudirect-clique=0 \
-device vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0 \
-device vfio-pci,host=08:00.0,x-nv-gpudirect-clique=0 \
-device vfio-pci,host=09:00.0,x-nv-gpudirect-clique=0 \
-device vfio-pci,host=85:00.0,x-nv-gpudirect-clique=8 \
-device vfio-pci,host=86:00.0,x-nv-gpudirect-clique=8 \
-device vfio-pci,host=89:00.0,x-nv-gpudirect-clique=8 \
-device vfio-pci,host=8a:00.0,x-nv-gpudirect-clique=8 \

Hang. VNC couldn't connect.


My personal patch used to work, although it was done by straightforward
hacking and is not that friendly to read.

--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c

 @@static int vfio_initfn(PCIDevice *pdev)

vfio_add_emulated_long(vdev, 0xc8, 0x50080009, ~0);
if (count < 4)
{
vfio_add_emulated_long(vdev, 0xcc, 0x5032, ~0);
}
else
{
vfio_add_emulated_long(vdev, 0xcc, 0x00085032, ~0);
}
vfio_add_emulated_word(vdev, 0x78, 0xc810, ~0);

2017-08-30 6:05 GMT+08:00 Alex Williamson :

> NVIDIA has a specification for exposing a virtual vendor capability
> which provides a hint to guest drivers as to which sets of GPUs can
> support direct peer-to-peer DMA.  Devices with the same clique ID are
> expected to support this.  The user can specify a clique ID for an
> NVIDIA graphics device using the new vfio-pci x-nv-gpudirect-clique=
> option, where valid clique IDs are a 4-bit integer.  It's entirely the
> user's responsibility to specify sets of devices for which P2P works
> correctly and provides some benefit.  This is only useful for DMA
> between NVIDIA GPUs, therefore it's only useful to specify cliques
> comprised of more than one GPU.  Furthermore, this does not enable DMA
> between VMs, there is no change to VM DMA mapping, this only exposes
> hints about existing DMA paths to the guest driver.  Thanks,
>
> Alex
>
> ---
>
> Alex Williamson (3):
>   vfio/pci: Do not unwind on error
>   vfio/pci: Add virtual capabilities quirk infrastructure
>   vfio/pci: Add NVIDIA GPUDirect Cliques support
>
>
>  hw/vfio/pci-quirks.c |  114 ++
> 
>  hw/vfio/pci.c|   17 +++
>  hw/vfio/pci.h|4 ++
>  3 files changed, 133 insertions(+), 2 deletions(-)
>


Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-09-01 Thread Bob Chen
More updates:

1. This behavior was found not only on the M60, but also on the TITAN
1080Ti or Xp.

2. When not setting up the p2p capability, i.e. running the original qemu
with the GPUs attached to the root PCIe bus, the LnkSta on the host always
remains at 8 GT/s. I don't know why the new p2p change would cause the GPU
driver in the guest to re-negotiate its speed.

I think it has gone beyond the community's responsibility to debug this
tricky issue, so I have contacted Nvidia for technical support, and they
are expected to send me a reply in the next few weeks. I will keep you guys
updated.


Bob

2017-08-31 0:43 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:

> On Wed, 30 Aug 2017 17:41:20 +0800
> Bob Chen <a175818...@gmail.com> wrote:
>
> > I think I have observed what you said...
> >
> > The link speed on host remained 8GT/s until I finished running
> > p2pBandwidthLatencyTest
> > for the first time. Then it became 2.5GT/s...
> >
> >
> > # lspci -s 09:00.0 -vvv
> ...
> > LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt-
>
> So long as the device renegotiates to 8GT/s under load rather than
> getting stuck at 2.5GT/s, I think this is the expected behavior.  This
> is a power saving measure by the driver.  Thanks,
>
> Alex
>


Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-08-30 Thread Bob Chen
I think I have observed what you said...

The link speed on the host remained at 8 GT/s until I finished running
p2pBandwidthLatencyTest for the first time. Then it dropped to 2.5 GT/s...


# lspci -s 09:00.0 -vvv
09:00.0 3D controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
Subsystem: NVIDIA Corporation Device 115e
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- 
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024

Capabilities: [900 v1] #19
Kernel driver in use: vfio-pci
Kernel modules: nouveau
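
For reference, a hedged one-liner for keeping an eye on this from the host
while the guest exercises the GPUs (slot address taken from the output
above; lspci may need root to show the Express capability):

watch -n 1 'lspci -s 09:00.0 -vvv | grep -i LnkSta'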


Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-08-29 Thread Bob Chen
The topology already has all GPUs directly attached to root bus 0. In this
situation you can't see the LnkSta attribute in any of the capabilities.

The other approach, using emulated switches, does show this attribute, at
8 GT/s, although the real bandwidth is as low as usual.

2017-08-23 2:06 GMT+08:00 Michael S. Tsirkin <m...@redhat.com>:

> On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote:
> > On Tue, 22 Aug 2017 15:04:55 +0800
> > Bob Chen <a175818...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I got a spec from Nvidia which illustrates how to enable GPU p2p in
> > > virtualization environment. (See attached)
> >
> > Neat, looks like we should implement a new QEMU vfio-pci option,
> > something like nvidia-gpudirect-p2p-id=.  I don't think I'd want to
> > code the policy of where to enable it into QEMU or the kernel, so we'd
> > push it up to management layers or users to decide.
> >
> > > The key is to append the legacy pci capabilities list when setting up
> the
> > > hypervisor, with a Nvidia customized capability config.
> > >
> > > I added some hack in hw/vfio/pci.c and managed to implement that.
> > >
> > > Then I found the GPU was able to recognize its peer, and the latency
> has
> > > dropped. ✅
> > >
> > > However the bandwidth didn't improve, but decreased instead. ❌
> > >
> > > Any suggestions?
> >
> > What's the VM topology?  I've found that in a Q35 configuration with
> > GPUs downstream of an emulated root port, the NVIDIA driver in the
> > guest will downshift the physical link rate to 2.5GT/s and never
> > increase it back to 8GT/s.  I believe this is because the virtual
> > downstream port only advertises Gen1 link speeds.
>
>
> Fixing that would be nice, and it's great that you now actually have a
> reproducer that can be used to test it properly.
>
> Exposing higher link speeds is a bit of work since there are now all
> kind of corner cases to cover as guests may play with link speeds and we
> must pretend we change it accordingly.  An especially interesting
> question is what to do with the assigned device when guest tries to play
> with port link speed. It's kind of similar to AER in that respect.
>
> I guess we can just ignore it for starters.
>
> >  If the GPUs are on
> > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s
> > when the GPU is idle and upshift to 8GT/s under load.  This also
> > happens if the GPU is exposed in a conventional PCI topology to the
> > VM.  Another interesting data point is that an older Kepler GRID card
> > does not have this issue, dynamically shifting the link speed under
> > load regardless of the VM PCI/e topology, while a new M60 using the
> > same driver experiences this problem.  I've filed a bug with NVIDIA as
> > this seems to be a regression, but it appears (untested) that the
> > hypervisor should take the approach of exposing full, up-to-date PCIe
> > link capabilities and report a link status matching the downstream
> > devices.
>
>
> > I'd suggest during your testing, watch lspci info for the GPU from the
> > host, noting the behavior of LnkSta (Link Status) to check if the
> > devices gets stuck at 2.5GT/s in your VM configuration and adjust the
> > topology until it works, likely placing the GPUs on pcie.0 for a Q35
> > based machine.  Thanks,
> >
> > Alex
>


Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-08-08 Thread Bob Chen
Plus:

1 GB hugepages improved neither bandwidth nor latency; the results remained
the same.

2017-08-08 9:44 GMT+08:00 Bob Chen <a175818...@gmail.com>:

> 1. How to test the KVM exit rate?
>
> 2. The switches are separate devices of PLX Technology
>
> # lspci -s 07:08.0 -nn
> 07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port
> PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca)
>
> # This is one of the Root Ports in the system.
> [:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon
> D DMI2
>  +-01.0-[01]00.0  LSI Logic / Symbios Logic MegaRAID SAS
> 2208 [Thunderbolt]
>  +-02.0-[02-05]--
>  +-03.0-[06-09]00.0-[07-09]--+-08.0-[08]--+-00.0  NVIDIA
> Corporation GP102 [TITAN Xp]
>  |   |\-00.1  NVIDIA
> Corporation GP102 HDMI Audio Controller
>  |   \-10.0-[09]--+-00.0  NVIDIA
> Corporation GP102 [TITAN Xp]
>  |\-00.1  NVIDIA
> Corporation GP102 HDMI Audio Controller
>
>
>
>
> 3. ACS
>
> It seemed that I had misunderstood your point? I finally found ACS
> information on switches, not on GPUs.
>
> Capabilities: [f24 v1] Access Control Services
> ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+
> EgressCtrl+ DirectTrans+
> ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+
> EgressCtrl- DirectTrans-
>
>
>
> 2017-08-07 23:52 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
>
>> On Mon, 7 Aug 2017 21:00:04 +0800
>> Bob Chen <a175818...@gmail.com> wrote:
>>
>> > Bad news... The performance had dropped dramatically when using emulated
>> > switches.
>> >
>> > I was referring to the PCIe doc at
>> > https://github.com/qemu/qemu/blob/master/docs/pcie.txt
>> >
>> > # qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine
>> > q35,accel=kvm -nodefaults -nodefconfig \
>> > -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
>> > -device x3130-upstream,id=upstream_port1,bus=root_port1 \
>> > -device
>> > xio3130-downstream,id=downstream_port1,bus=upstream_port1,
>> chassis=11,slot=11
>> > \
>> > -device
>> > xio3130-downstream,id=downstream_port2,bus=upstream_port1,
>> chassis=12,slot=12
>> > \
>> > -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
>> > -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
>> > -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
>> > -device x3130-upstream,id=upstream_port2,bus=root_port2 \
>> > -device
>> > xio3130-downstream,id=downstream_port3,bus=upstream_port2,
>> chassis=21,slot=21
>> > \
>> > -device
>> > xio3130-downstream,id=downstream_port4,bus=upstream_port2,
>> chassis=22,slot=22
>> > \
>> > -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
>> > -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
>> > ...
>> >
>> >
>> > Not 8 GPUs this time, only 4.
>> >
>> > *1. Attached to pcie bus directly (former situation):*
>> >
>> > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>> >D\D 0  1  2  3
>> >  0 420.93  10.03  11.07  11.09
>> >  1  10.04 425.05  11.08  10.97
>> >  2  11.17  11.17 425.07  10.07
>> >  3  11.25  11.25  10.07 423.64
>> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>> >D\D 0  1  2  3
>> >  0 425.98  10.03  11.07  11.09
>> >  1   9.99 426.43  11.07  11.07
>> >  2  11.04  11.20 425.98   9.89
>> >  3  11.21  11.21  10.06 425.97
>> > Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>> >D\D 0  1  2  3
>> >  0 430.67  10.45  19.59  19.58
>> >  1  10.44 428.81  19.49  19.53
>> >  2  19.62  19.62 429.52  10.57
>> >  3  19.60  19.66  10.43 427.38
>> > Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>> >D\D 0  1  2  3
>> >  0 429.47  10.47  19.52  19.39
>> >  1  10.48 427.15  19.64  19.52
>> >  2  19.64  19.59 429.02  10.42
>> >  3  19.60  19.64  10.47 427.81
>> > P2P=Disabled Latency Matrix (us)
>> >D\D 0  1  2  3
>> >  0   4.50  13.72  14.49  14.44
>> >  1  13.65   4.53  14.52  14.3

Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-08-07 Thread Bob Chen
1. How to test the KVM exit rate?

2. The switches are separate devices of PLX Technology

# lspci -s 07:08.0 -nn
07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port
PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca)

# This is one of the Root Ports in the system.
[:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D
DMI2
 +-01.0-[01]00.0  LSI Logic / Symbios Logic MegaRAID SAS
2208 [Thunderbolt]
 +-02.0-[02-05]--
 +-03.0-[06-09]00.0-[07-09]--+-08.0-[08]--+-00.0  NVIDIA
Corporation GP102 [TITAN Xp]
 |   |\-00.1  NVIDIA
Corporation GP102 HDMI Audio Controller
 |   \-10.0-[09]--+-00.0  NVIDIA
Corporation GP102 [TITAN Xp]
 |\-00.1  NVIDIA
Corporation GP102 HDMI Audio Controller




3. ACS

It seems I had misunderstood your point. I finally found the ACS
information on the switches, not on the GPUs.

Capabilities: [f24 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+
DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl-
DirectTrans-



2017-08-07 23:52 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:

> On Mon, 7 Aug 2017 21:00:04 +0800
> Bob Chen <a175818...@gmail.com> wrote:
>
> > Bad news... The performance had dropped dramatically when using emulated
> > switches.
> >
> > I was referring to the PCIe doc at
> > https://github.com/qemu/qemu/blob/master/docs/pcie.txt
> >
> > # qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine
> > q35,accel=kvm -nodefaults -nodefconfig \
> > -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
> > -device x3130-upstream,id=upstream_port1,bus=root_port1 \
> > -device
> > xio3130-downstream,id=downstream_port1,bus=upstream_
> port1,chassis=11,slot=11
> > \
> > -device
> > xio3130-downstream,id=downstream_port2,bus=upstream_
> port1,chassis=12,slot=12
> > \
> > -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
> > -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
> > -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
> > -device x3130-upstream,id=upstream_port2,bus=root_port2 \
> > -device
> > xio3130-downstream,id=downstream_port3,bus=upstream_
> port2,chassis=21,slot=21
> > \
> > -device
> > xio3130-downstream,id=downstream_port4,bus=upstream_
> port2,chassis=22,slot=22
> > \
> > -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
> > -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
> > ...
> >
> >
> > Not 8 GPUs this time, only 4.
> >
> > *1. Attached to pcie bus directly (former situation):*
> >
> > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
> >D\D 0  1  2  3
> >  0 420.93  10.03  11.07  11.09
> >  1  10.04 425.05  11.08  10.97
> >  2  11.17  11.17 425.07  10.07
> >  3  11.25  11.25  10.07 423.64
> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
> >D\D 0  1  2  3
> >  0 425.98  10.03  11.07  11.09
> >  1   9.99 426.43  11.07  11.07
> >  2  11.04  11.20 425.98   9.89
> >  3  11.21  11.21  10.06 425.97
> > Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
> >D\D 0  1  2  3
> >  0 430.67  10.45  19.59  19.58
> >  1  10.44 428.81  19.49  19.53
> >  2  19.62  19.62 429.52  10.57
> >  3  19.60  19.66  10.43 427.38
> > Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
> >D\D 0  1  2  3
> >  0 429.47  10.47  19.52  19.39
> >  1  10.48 427.15  19.64  19.52
> >  2  19.64  19.59 429.02  10.42
> >  3  19.60  19.64  10.47 427.81
> > P2P=Disabled Latency Matrix (us)
> >D\D 0  1  2  3
> >  0   4.50  13.72  14.49  14.44
> >  1  13.65   4.53  14.52  14.33
> >  2  14.22  13.82   4.52  14.50
> >  3  13.87  13.75  14.53   4.55
> > P2P=Enabled Latency Matrix (us)
> >D\D 0  1  2  3
> >  0   4.44  13.56  14.58  14.45
> >  1  13.56   4.48  14.39  14.45
> >  2  13.85  13.93   4.86  14.80
> >  3  14.51  14.23  14.70   4.72
> >
> >
> > *2. Attached to emulated Root Port and Switches:*
> >
> > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
> >D\D 0  1  2  3
> >  0 420.48   3.15   3.12   3.12
> >  1   3.13 422.

Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-08-07 Thread Bob Chen
Besides, I checked the lspci -vvv output; no Access Control Services
capability is shown.

2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:

> On Tue, 1 Aug 2017 17:35:40 +0800
> Bob Chen <a175818...@gmail.com> wrote:
>
> > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
> >
> > > On Tue, 1 Aug 2017 13:04:46 +0800
> > > Bob Chen <a175818...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a sketch of my hardware topology.
> > > >
> > > >   CPU0 <- QPI ->CPU1
> > > >| |
> > > > Root Port(at PCIe.0)Root Port(at PCIe.1)
> > > >/\   /   \
> > >
> > > Are each of these lines above separate root ports?  ie. each root
> > > complex hosts two root ports, each with a two-port switch downstream of
> > > it?
> > >
> >
> > Not quite sure if root complex is a concept or a real physical device ...
> >
> > But according to my observation by `lspci -vt`, there are indeed 4 Root
> > Ports in the system. So the sketch might need a tiny update.
> >
> >
> >   CPU0 <- QPI ->CPU1
> >
> >| |
> >
> >   Root Complex(device?)  Root Complex(device?)
> >
> >  /\   /\
> >
> > Root Port  Root Port Root Port  Root Port
> >
> >/\   /\
> >
> > SwitchSwitch SwitchSwitch
> >
> >  /   \  /  \  /   \ /   \
> >
> >GPU   GPU  GPU  GPU  GPU   GPU  GPU   GPU
>
>
> Yes, that's what I expected.  So the numbers make sense, the immediate
> sibling GPU would share bandwidth between the root port and upstream
> switch port, any other GPU should not double-up on any single link.
>
> > > > SwitchSwitch SwitchSwitch
> > > >  /   \  /  \  /   \/\
> > > >GPU   GPU  GPU  GPU  GPU   GPU GPU   GPU
> > > >
> > > >
> > > > And below are the p2p bandwidth test results.
> > > >
> > > > Host:
> > > >D\D 0  1  2  3  4  5  6  7
> > > >  0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
> > > >  1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
> > > >  2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
> > > >  3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
> > > >  4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
> > > >  5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
> > > >  6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
> > > >  7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15
> > > >
> > > > VM:
> > > >D\D 0  1  2  3  4  5  6  7
> > > >  0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
> > > >  1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
> > > >  2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
> > > >  3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
> > > >  4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
> > > >  5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
> > > >  6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
> > > >  7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23
> > >
> > > Interesting test, how do you get these numbers?  What are the units,
> > > GB/s?
> > >
> >
> >
> >
> > A p2pBandwidthLatencyTest from Nvidia CUDA sample code. Units are
> > GB/s. Asynchronous read and write. Bidirectional.
> >
> > However, the Unidirectional test had shown a different result. Didn't
> fall
> > down to a half.
> >
> > VM:
> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
> >D\D 0  1  2  3  4  5  6  7
> >  0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
> >  1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
> >  2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
> >  3  11.30  11.31  10.08 425.05  11.10  11.07

Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-08-07 Thread Bob Chen
Bad news... The performance had dropped dramatically when using emulated
switches.

I was referring to the PCIe doc at
https://github.com/qemu/qemu/blob/master/docs/pcie.txt

# qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine
q35,accel=kvm -nodefaults -nodefconfig \
-device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
-device x3130-upstream,id=upstream_port1,bus=root_port1 \
-device
xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=11,slot=11
\
-device
xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=12,slot=12
\
-device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
-device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
-device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
-device x3130-upstream,id=upstream_port2,bus=root_port2 \
-device
xio3130-downstream,id=downstream_port3,bus=upstream_port2,chassis=21,slot=21
\
-device
xio3130-downstream,id=downstream_port4,bus=upstream_port2,chassis=22,slot=22
\
-device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
-device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
...


Not 8 GPUs this time, only 4.

*1. Attached to pcie bus directly (former situation):*

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 420.93  10.03  11.07  11.09
 1  10.04 425.05  11.08  10.97
 2  11.17  11.17 425.07  10.07
 3  11.25  11.25  10.07 423.64
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 425.98  10.03  11.07  11.09
 1   9.99 426.43  11.07  11.07
 2  11.04  11.20 425.98   9.89
 3  11.21  11.21  10.06 425.97
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 430.67  10.45  19.59  19.58
 1  10.44 428.81  19.49  19.53
 2  19.62  19.62 429.52  10.57
 3  19.60  19.66  10.43 427.38
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 429.47  10.47  19.52  19.39
 1  10.48 427.15  19.64  19.52
 2  19.64  19.59 429.02  10.42
 3  19.60  19.64  10.47 427.81
P2P=Disabled Latency Matrix (us)
   D\D 0  1  2  3
 0   4.50  13.72  14.49  14.44
 1  13.65   4.53  14.52  14.33
 2  14.22  13.82   4.52  14.50
 3  13.87  13.75  14.53   4.55
P2P=Enabled Latency Matrix (us)
   D\D 0  1  2  3
 0   4.44  13.56  14.58  14.45
 1  13.56   4.48  14.39  14.45
 2  13.85  13.93   4.86  14.80
 3  14.51  14.23  14.70   4.72


*2. Attached to emulated Root Port and Switches:*

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 420.48   3.15   3.12   3.12
 1   3.13 422.31   3.12   3.12
 2   3.08   3.09 421.40   3.13
 3   3.10   3.10   3.13 418.68
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 418.68   3.14   3.12   3.12
 1   3.15 420.03   3.12   3.12
 2   3.11   3.10 421.39   3.14
 3   3.11   3.08   3.13 419.13
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 424.36   5.36   5.35   5.34
 1   5.36 424.36   5.34   5.34
 2   5.35   5.36 425.52   5.35
 3   5.36   5.36   5.34 425.29
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3
 0 422.98   5.35   5.35   5.35
 1   5.35 423.44   5.34   5.33
 2   5.35   5.35 425.29   5.35
 3   5.35   5.34   5.34 423.21
P2P=Disabled Latency Matrix (us)
   D\D 0  1  2  3
 0   4.79  16.59  16.38  16.22
 1  16.62   4.77  16.35  16.69
 2  16.77  16.66   4.03  16.68
 3  16.54  16.56  16.78   4.08
P2P=Enabled Latency Matrix (us)
   D\D 0  1  2  3
 0   4.51  16.56  16.58  16.66
 1  15.65   3.87  16.74  16.61
 2  16.59  16.81   3.96  16.70
 3  16.47  16.28  16.68   4.03


Is it because the heavy load of CPU emulation had caused a bottleneck?



2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:

> On Tue, 1 Aug 2017 17:35:40 +0800
> Bob Chen <a175818...@gmail.com> wrote:
>
> > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
> >
> > > On Tue, 1 Aug 2017 13:04:46 +0800
> > > Bob Chen <a175818...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a sketch of my hardware topology.
> > > >
> > > >   CPU0 <- QPI ->CPU1
> > > >| |
> > > > Root Port(at PCIe.0)Root Port(at PCIe.1)
> > > >/\   /   \
> > >
> > > Are each of these lines above separate root ports?  ie. each root
> > > complex hosts two root ports, each with a two-port switch downstream of
> > > it?

Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】

2017-08-01 Thread Bob Chen
2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:

> On Tue, 1 Aug 2017 13:04:46 +0800
> Bob Chen <a175818...@gmail.com> wrote:
>
> > Hi,
> >
> > This is a sketch of my hardware topology.
> >
> >   CPU0 <- QPI ->CPU1
> >| |
> > Root Port(at PCIe.0)Root Port(at PCIe.1)
> >/\   /   \
>
> Are each of these lines above separate root ports?  ie. each root
> complex hosts two root ports, each with a two-port switch downstream of
> it?
>

I'm not quite sure whether a root complex is a concept or a real physical
device ...

But according to my observation with `lspci -vt`, there are indeed 4 Root
Ports in the system, so the sketch needs a tiny update.


  CPU0 <- QPI ->CPU1

   | |

  Root Complex(device?)  Root Complex(device?)

 /\   /\

Root Port  Root Port Root Port  Root Port

   /\   /\

SwitchSwitch SwitchSwitch

 /   \  /  \  /   \ /   \

   GPU   GPU  GPU  GPU  GPU   GPU  GPU   GPU



>
> > SwitchSwitch SwitchSwitch
> >  /   \  /  \  /   \/\
> >GPU   GPU  GPU  GPU  GPU   GPU GPU   GPU
> >
> >
> > And below are the p2p bandwidth test results.
> >
> > Host:
> >D\D 0  1  2  3  4  5  6  7
> >  0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
> >  1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
> >  2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
> >  3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
> >  4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
> >  5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
> >  6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
> >  7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15
> >
> > VM:
> >D\D 0  1  2  3  4  5  6  7
> >  0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
> >  1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
> >  2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
> >  3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
> >  4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
> >  5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
> >  6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
> >  7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23
>
> Interesting test, how do you get these numbers?  What are the units,
> GB/s?
>



It's the p2pBandwidthLatencyTest from the NVIDIA CUDA sample code. Units are
GB/s. Asynchronous reads and writes, bidirectional.
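
For reference, the test ships with the CUDA samples and is built and run
roughly like this (the samples path varies between CUDA versions, so treat
the path as an example):

cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest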

However, the unidirectional test showed a different result: the bandwidth
didn't fall to half.

VM:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3  4  5  6  7
 0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
 1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
 2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
 3  11.30  11.31  10.08 425.05  11.10  11.07  11.09  11.06
 4  11.16  11.17  11.21  11.17 423.67  10.08  11.25  11.28
 5  10.97  11.01  11.07  11.02  10.09 425.52  11.23  11.27
 6  11.09  11.13  11.16  11.10  11.28  11.33 422.71  10.10
 7  11.13  11.09  11.15  11.11  11.36  11.33  10.02 422.75

Host:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D 0  1  2  3  4  5  6  7
 0 424.13  13.38  10.17  10.17  11.23  11.21  10.94  11.22
 1  13.38 424.06  10.18  10.19  11.20  11.19  11.19  11.14
 2  10.18  10.18 422.75  13.38  11.19  11.19  11.17  11.17
 3  10.18  10.18  13.38 425.05  11.05  11.08  11.08  11.06
 4  11.01  11.06  11.06  11.03 423.21  13.38  10.17  10.17
 5  10.91  10.91  10.89  10.92  13.38 425.52  10.18  10.18
 6  11.28  11.30  11.32  11.31  10.19  10.18 424.59  13.37
 7  11.18  11.20  11.16  11.21  10.17  10.19  13.38 424.13



>
> > In the VM, the bandwidth between two GPUs under the same physical switch
> is
> > obviously lower, as per the reasons you said in former threads.
>
> Hmm, I'm not sure I can explain why the number is lower than to more
> remote GPUs though.  Is the test simultaneously reading and writing and
> therefore we overload the link to the upstream switch port?  Otherwise
> I'd expect the bidirectional support in PCIe to be able to handle the
>

Re: [Qemu-devel] About virtio device hotplug in Q35! [External email, handle with caution]

2017-07-31 Thread Bob Chen
Hi,

This is a sketch of my hardware topology.

            CPU0  <- QPI ->  CPU1
             |                 |
  Root Port (at PCIe.0)    Root Port (at PCIe.1)
      /          \             /          \
   Switch      Switch       Switch      Switch
   /    \      /    \       /    \      /    \
 GPU    GPU  GPU    GPU   GPU    GPU  GPU    GPU


And below are the p2p bandwidth test results.

Host:
   D\D 0  1  2  3  4  5  6  7
 0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
 1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
 2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
 3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
 4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
 5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
 6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
 7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15

VM:
   D\D 0  1  2  3  4  5  6  7
 0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
 1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
 2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
 3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
 4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
 5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
 6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
 7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23


In the VM, the bandwidth between two GPUs under the same physical switch is
obviously lower, as per the reasons you said in former threads.

But what confused me most is that GPUs under different switches could
achieve the same speed as in the host. Does that mean that after IOMMU
address translation, the data traversal utilizes the QPI bus by default,
even when the two devices do not belong to the same PCIe bus?

In short, I'm trying to build a massive deep-learning/HPC infrastructure
for a cloud environment. NVIDIA itself has released a solution based on
Docker containers, and I believe QEMU/VMs could do it as well. Hopefully I
can get some help from the community.

The emulated switch you suggested looks like a good option to me; I will
give it a try (a rough sketch of what I have in mind is below).
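
For the record, here is a minimal sketch of the emulated-switch topology I
plan to try, using the ioh3420 root port plus xio3130 upstream/downstream
ports. The IDs, chassis/slot numbers and host BDFs below are just
placeholders, not a tested command line:

qemu-system-x86_64 -machine q35,accel=kvm ... \
  -device ioh3420,id=rp1,bus=pcie.0,chassis=1,slot=1 \
  -device x3130-upstream,id=up1,bus=rp1 \
  -device xio3130-downstream,id=dn1,bus=up1,chassis=2,slot=0 \
  -device xio3130-downstream,id=dn2,bus=up1,chassis=2,slot=1 \
  -device vfio-pci,host=04:00.0,bus=dn1 \
  -device vfio-pci,host=05:00.0,bus=dn2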


Thanks,
Bob


2017-07-27 1:32 GMT+08:00 Alex Williamson :

> On Wed, 26 Jul 2017 19:06:58 +0300
> "Michael S. Tsirkin"  wrote:
>
> > On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> > > On Wed, 26 Jul 2017 09:21:38 +0300
> > > Marcel Apfelbaum  wrote:
> > >
> > > > On 25/07/2017 11:53, 陈博 wrote:
> > > > > To accelerate data traversing between devices under the same PCIE
> Root
> > > > > Port or Switch.
> > > > >
> > > > > See https://lists.nongnu.org/archive/html/qemu-devel/2017-
> 07/msg07209.html
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > It may be possible, but maybe PCIe Switch assignment is not
> > > > the only way to go.
> > > >
> > > > Adding Alex and Michael for their input on this matter.
> > > > More info at:
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2017-
> 07/msg07209.html
> > >
> > > I think you need to look at where the IOMMU is in the topology and what
> > > address space the devices are working in when assigned to a VM to
> > > realize that it doesn't make any sense to assign switch ports to a VM.
> > > GPUs cannot do switch level peer to peer when assigned because they are
> > > operating in an I/O virtual address space.  This is why we configure
> > > ACS on downstream ports to prevent peer to peer.  Peer to peer
> > > transactions must be forwarded upstream by the switch ports in order to
> > > reach the IOMMU for translation.  Note however that we do populate peer
> > > to peer mappings within the IOMMU, so if the hardware supports it, the
> > > IOMMU can reflect the transaction back out to the I/O bus to reach the
> > > other device without CPU involvement.
> > >
> > > Therefore I think the better solution, if it encourages the NVIDIA
> > > driver to do the right thing, is to use emulated switches.  Assigning
> > > the physical switch would really do nothing more than make the PCIe
> link
> > > information more correct in the VM, everything else about the switch
> > > would be emulated.  Even still, unless you have an I/O topology which
> > > integrates the IOMMU into the switch itself, the data flow still needs
> > > to go all the way to the root complex to hit the IOMMU before being
> > > reflected to the other device.  Direct peer to peer between downstream
> > > switch ports operates in the wrong address space.  Thanks,
> > >
> > > Alex
> >
> > That's true of course. What would make sense would be for
> > hardware vendors to add ATS support to their cards.
> >
> > Then peer to peer should be allowed by hypervisor for translated
> transactions.
> >
> > Gives you the performance benefit without the 

Re: [Qemu-devel] [Device passthrough] Is there a way to passthrough PCIE switch/bridge ?

2017-07-24 Thread Bob Chen
Details attached...

I was trying to pass through multiple GPUs into a single VM; the topology
looks like the one below.

Without the help of PCIe switches, data traversal between GPUs is extremely
slow (it falls back to memcpy by the CPU).

So my question is: can I pass through all the switches and the PCIe root,
in order to make the NVIDIA p2p feature work?



Host:
[root@localhost ~]# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X PIX PHB PHB SOC SOC SOC SOC 0-9,20-29
GPU1 PIX X PHB PHB SOC SOC SOC SOC 0-9,20-29
GPU2 PHB PHB X PIX SOC SOC SOC SOC 0-9,20-29
GPU3 PHB PHB PIX X SOC SOC SOC SOC 0-9,20-29
GPU4 SOC SOC SOC SOC X PIX PHB PHB 10-19,30-39
GPU5 SOC SOC SOC SOC PIX X PHB PHB 10-19,30-39
GPU6 SOC SOC SOC SOC PHB PHB X PIX 10-19,30-39
GPU7 SOC SOC SOC SOC PHB PHB PIX X 10-19,30-39


VM:
[root@titan-xp-chenbo-2 ~]# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X PHB PHB PHB PHB PHB PHB PHB 0-15
GPU1 PHB X PHB PHB PHB PHB PHB PHB 0-15
GPU2 PHB PHB X PHB PHB PHB PHB PHB 0-15
GPU3 PHB PHB PHB X PHB PHB PHB PHB 0-15
GPU4 PHB PHB PHB PHB X PHB PHB PHB 0-15
GPU5 PHB PHB PHB PHB PHB X PHB PHB 0-15
GPU6 PHB PHB PHB PHB PHB PHB X PHB 0-15
GPU7 PHB PHB PHB PHB PHB PHB PHB X 0-15


Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU
sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge
(typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing
the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks



2017-07-24 14:03 GMT+08:00 Bob Chen <a175818...@gmail.com>:

>
> - Bob
>


[Qemu-devel] [Device passthrough] Is there a way to passthrough PCIE switch/bridge ?

2017-07-24 Thread Bob Chen
- Bob


[Qemu-devel] Questions about GPU passthrough + multiple PCIE switches on host

2017-06-29 Thread Bob Chen
Hi folks,

I have 8 GPU cards that need to be passed through to 1 VM.

These cards are placed under 2 PCIe switches on the host server, in case
there might be a bandwidth limit within a single bus.

So what is the correct QEMU bus parameter if I want to achieve the best
performance? Could QEMU's pcie.0/1 parameters really reflect the actual
physical devices?



Thanks,
Bob


[Qemu-devel] How to upgrade QEMU?

2017-02-14 Thread Bob Chen
Hi folks,

I am about to upgrade my QEMU version from an ancient 1.1.2 to the latest.
My plan is to overwrite all the installed files with the new ones. Since I
rename all the qemu-xxx binaries under /usr/local/bin with a specific
version suffix, they are not my concern.

Just wondering if it is safe to replace those /usr/local/share files (ROMs
and keymaps?) while some of my QEMU guests are still running in-flight.

Regards,
Bob


Re: [Qemu-devel] [Nbd] [Qemu-block] How to online resize qemu disk with nbd protocol?

2017-01-22 Thread Bob Chen
Hi folks,

My time schedule doesn't allow me to wait for the community's solution, so
I started to work on a quick fix, which is to add a 'bdrv_truncate'
function to the current NBD BlockDriver. Basically it's an 'active resize'
implementation.

I also realized that the 'bdrv_truncate' call stack is not in a coroutine;
it seems to be the main thread. So I tried some synchronous code as below:

int nbd_truncate(BlockDriverState *bs, int64_t offset)
{
    //...
    nbd_client_detach_aio_context(bs);
    qio_channel_set_blocking(client->ioc, true, NULL);

    /* step 1: send the custom NBD_CMD_RESIZE request and read its reply */
    ret = nbd_send_request(client->ioc, &request);
    ret = nbd_receive_reply(client->ioc, &reply);

    /* step 2: expect the confirmed new_size back as 8 bytes of reply data */
    read_sync(client->ioc, &new_size, sizeof(new_size));
    new_size = be64_to_cpu(new_size);

    qio_channel_set_blocking(client->ioc, false, NULL);
    nbd_client_attach_aio_context(bs, aio_context);
    //...
}

However at step 2, the 'new_size' I read is not always correct. Sometimes
the digits repeat; for instance 1073741824 (1 GB) became
1073741824073741824 ...

Could you help me figure out what went wrong?
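
One thing I am going to double-check myself (just a guess, not a confirmed
diagnosis): whether the full 8 bytes are actually read before the
conversion. A sketch of a more defensive step 2, assuming read_sync()
returns the number of bytes read as it does in the QEMU tree I am on:

uint64_t new_size_be;
ssize_t n = read_sync(client->ioc, &new_size_be, sizeof(new_size_be));
if (n != sizeof(new_size_be)) {
    return -EIO;    /* short read: don't trust the buffer contents */
}
new_size = be64_to_cpu(new_size_be);    /* convert from big-endian exactly once */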


Regards,
Bob

2017-01-18 16:01 GMT+08:00 Wouter Verhelst :

> On Mon, Jan 16, 2017 at 01:36:21PM -0600, Eric Blake wrote:
> > Maybe the structured reply proposal can be extended into this (reserve a
> > "reply" header that can be issued as many times as desired by the server
> > without the client ever having issued the request first, and where the
> > reply never uses the end-of-reply marker), but I'm not sure I want to go
> > that direction just yet.
>
> It's not necessarily a bad idea, which could also be used for:
> - keepalive probes from server to client
> - unsolicited ESHUTDOWN messages
>
> both of which are currently not possible and might be useful for the
> protocol to have.
>
> --
> < ron> I mean, the main *practical* problem with C++, is there's like a
> dozen
>people in the world who think they really understand all of its
> rules,
>and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12
>


Re: [Qemu-devel] [Qemu-block] [Nbd] How to online resize qemu disk with nbd protocol?

2017-01-12 Thread Bob Chen
There might be a time window between the NBD server's resize and the
client's `re-read size` request. Is it safe?

What about an active `resize` request from the client? Considering some NBD
servers might have the capability to do instant resizing, not applying to
LVM or host block device, of course...


Regards,
Bob

2017-01-13 0:54 GMT+08:00 Stefan Hajnoczi :

> On Thu, Jan 12, 2017 at 3:44 PM, Alex Bligh  wrote:
> >> On 12 Jan 2017, at 14:43, Eric Blake  wrote:
> >> That's because the NBD protocol lacks a resize command.  You'd have to
> >> first get that proposed as an NBD extension before qemu could support
> it.
> >
> > Actually the NBD protocol lacks a 'make a disk with size X' command,
> > let alone a resize command. The size of an NBD disk is (currently)
> > entirely in the hands of the server. What I think we'd really need
> > would be a 'reread size' command, and have the server capable of
> > supporting resizing. That would then work for readonly images too.
>
> That would be fine for QEMU.  Resizing LVM volumes or host block
> devices works exactly like this.
>
> Stefan
>
>


[Qemu-devel] How to online resize qemu disk with nbd protocol?

2017-01-12 Thread Bob Chen
Hi,


My qemu runs on a 3rd-party distributed block storage, and the disk backend
protocol is nbd.

I noticed that there are differences between a default local qcow2 disk and
my nbd disk, in terms of resizing the disk on the fly.

A local qcow2 disk works whether I use qemu-img resize or the qemu
monitor's 'block_resize', but the nbd disk seemed to fail to detect the
backend size change (I had resized the disk on EBS first). It said "this
feature or command is not currently supported".

Is it possible to hack the qemu nbd code to make it work the same way as
resizing a local qcow2 disk? I have an interface to resize the EBS disk at
the backend.


Regards,
Bob


Re: [Qemu-devel] Live migration + cpu/mem hotplug

2017-01-09 Thread Bob Chen
Answering my own question:

The command-line equivalent of a memory hot-add done via the QEMU monitor is:
-object memory-backend-ram,id=mem0,size=1024M -device
pc-dimm,id=dimm0,memdev=mem0
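
For example, if the source guest was started with hotplug slots reserved,
the destination command line would need the original arguments plus the
hotplugged DIMM spelled out explicitly (the sizes and slot number here are
only illustrative):

-m 4096,slots=4,maxmem=16G \
-object memory-backend-ram,id=mem0,size=1024M \
-device pc-dimm,id=dimm0,memdev=mem0,slot=0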

2017-01-05 18:12 GMT+08:00 Daniel P. Berrange <berra...@redhat.com>:

> On Thu, Jan 05, 2017 at 04:27:26PM +0800, Bob Chen wrote:
> > Hi,
> >
> > According to the docs, the destination Qemu must have the exactly same
> > parameters as the source one. So if the source has just finished cpu or
> > memory hotplug, what would the dest's parameters be like?
> >
> > Does DIMM device, or logically QOM object, have to be reflected on the
> new
> > command-line parameters?
>
> Yes, if you have hotplugged any type of device since the VM was started,
> the QEMU command line args on the target host must include all the original
> args from the source QEMU, and also any args reflect to reflect the
> hotplugged devices too.
>
> A further complication is that on the target, you must also make sure you
> fully specify *all* device address information (PCI slots, SCSI luns, etc
> etc), because the addresses QEMU assigns to a device after hotplug may
> not be the same as the addresses QEMU assigns to a device whne coldplug.
>
> eg if you boot a guest with 1 NIC + 1 disk, and then hotplug a 2nd NIC
> you might get
>
>1st NIC  == PCI slot 2
>1st disk == PCI slot 3
>2nd NIC  == PCI slot 4
>
> if however, you started QEMU with 2 NICs and 1 disk straight away QEMU
> might assign addresses in the order
>
>1st NIC  == PCI slot 2
>2nd NIC  == PCI slot 3
>1st disk == PCI slot 4
>
> this would totally kill a guest OS during live migration as the slots
> for devices its using would change.
>
> So as a general rule when launching QEMU on a target host for migrating,
> you must be explicit about all device addresses and not rely on QEMU to
> auto-assign addresses. This is quite alot of work to get right, but if
> you're using libvirt it'll do pretty much all this automatically for
> you.
>
> Regards,
> Daniel
> --
> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/
> :|
> |: http://libvirt.org  -o- http://virt-manager.org
> :|
> |: http://entangle-photo.org   -o-http://search.cpan.org/~danberr/
> :|
>


[Qemu-devel] Live migration + cpu/mem hotplug

2017-01-05 Thread Bob Chen
Hi,

According to the docs, the destination QEMU must have exactly the same
parameters as the source. So if the source has just finished a CPU or
memory hotplug, what would the destination's parameters look like?

Does the DIMM device, or logically the QOM object, have to be reflected in
the new command-line parameters?



Regards, Bob


[Qemu-devel] QEMU -smp paramater: Will multiplue threads and cores improve performance?

2016-12-21 Thread Bob Chen
Hi,

-smp 16

-smp cores=4,threads=4,sockets=1


Which one has better performance? The scenario is guest VMs running on a
cloud server.
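
For what it's worth, the two forms can also be combined explicitly; my
understanding is that plain -smp 16 defaults to 16 sockets of 1 core each,
while the second form exposes a 1-socket / 4-core / 4-thread topology to
the guest (which lscpu inside the guest should confirm):

-smp 16,sockets=1,cores=4,threads=4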


-Bob


[Qemu-devel] QEMU 1.1.2: block IO throttle might occasionally freeze running process's IO to zero

2016-11-30 Thread Bob Chen
Test case:

1. QEMU 1.1.2
2. Run fio inside the VM and put it under some pressure. Watch the realtime
throughput.
3. block_set_io_throttle drive_2 1 0 0 2000 0 0 # throttle bps and iops,
any value (see the note on the argument order below)
4. Observe that the IO is very likely to freeze to zero. The fio process
gets stuck!
5. Kill the old fio process and start a new one. The IO returns to normal.

Didn't reproduce it with QEMU 2.5.
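
For reference, the HMP argument order (as far as I can tell, the same in
1.x and 2.x) is:

block_set_io_throttle <device> <total_bps> <bps_rd> <bps_wr> <total_iops> <iops_rd> <iops_wr>

so, for example, 'block_set_io_throttle drive_2 1048576 0 0 2000 0 0' would
cap the drive at 1 MB/s and 2000 IOPS.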


I'm not wishfully expecting the community to fix this bug on such an
ancient version. I just hope someone can tell me what the root cause is.
Then I can evaluate whether I should move to a higher QEMU version, or fix
this bug on 1.1.2 in place (if it is a small one).


[Qemu-devel] cgroup blkio weight has no effect on qemu

2016-01-20 Thread Bob Chen
Sorry for disturbing by replying; I don't know why I'm not able to send a
new mail.


Hi folks,

Could you enlighten me how to achieve proportional IO sharing by using
cgroup, instead of qemu's io-throttling?

My qemu config is like: -drive
file=$DISKFILe,if=none,format=qcow2,cache=none,aio=native -device
virtio-blk-pci...

Test command inside vm is like: dd if=/dev/vdc of=/dev/null iflag=direct

Cgroup blkio weight of the qemu process is properly configured as well.

But no matter how I change the proportions, such as vm1=400 and vm2=100, I
can only get equal IO speeds.

I'm wondering whether cgroup blkio.weight or blkio.weight_device simply has
no effect on qemu?


PS. cache=writethrough with aio=threads was also tested, with the same results.
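
For completeness, the knobs involved look roughly like this (device name
and group paths are placeholders; as far as I understand, the blkio.weight
policy is implemented by the CFQ scheduler, so it should have no effect if
the underlying device uses deadline or noop):

cat /sys/block/sda/queue/scheduler            # CFQ must be the active scheduler
echo cfq > /sys/block/sda/queue/scheduler
echo 400 > /sys/fs/cgroup/blkio/vm1/blkio.weight
echo 100 > /sys/fs/cgroup/blkio/vm2/blkio.weight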


- Bob



2016-01-20 18:04 GMT+08:00 Markus Armbruster :

>
>
>



[Qemu-devel] cgroup blkio.weight is not working for qemu

2016-01-20 Thread Bob Chen
Hi folks,

Could you enlighten me how to achieve proportional IO sharing by using
cgroup, instead of qemu's io-throttling?

My qemu config is like: -drive
file=$DISKFILe,if=none,format=qcow2,cache=none,aio=native -device
virtio-blk-pci...

Test command inside vm is like: dd if=/dev/vdc of=/dev/null iflag=direct

Cgroup blkio weight of the qemu process is properly configured as well.

But no matter how I change the proportions, such as vm1=400 and vm2=100, I
can only get equal IO speeds.

I'm wondering whether cgroup blkio.weight or blkio.weight_device simply has
no effect on qemu?


PS. cache=writethrough aio=threads is also tested, the same results.


- Bob