** Changed in: linux (Ubuntu)
Assignee: Colin Ian King (colin-king) => (unassigned)
https://bugs.launchpad.net/bugs/1838575
Title:
passthrough devices cause >17min guest startup delays
As outlined in the past, conceptually there is nothing QEMU can do.
The kernel could in theory make memory zeroing concurrent and thereby
scale with CPUs, but that effort has already been started twice and has
not landed in the kernel yet.
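To illustrate why concurrency would help, a userspace-only sketch (this is not the kernel work referenced above; thread count and region size are made-up values):

/* Userspace illustration only: zeroing a large region with N threads
 * scales with CPUs, which is the idea behind the kernel efforts
 * mentioned above. Not the actual kernel implementation. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4              /* made-up example value */
#define TOTAL    (1UL << 30)    /* 1 GiB, made-up example value */

static char *region;

static void *zero_chunk(void *arg)
{
    size_t idx = (size_t)arg;
    size_t chunk = TOTAL / NTHREADS;

    memset(region + idx * chunk, 0, chunk);  /* each thread zeroes its slice */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    region = malloc(TOTAL);
    if (!region)
        return 1;
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, zero_chunk, (void *)i);
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    puts("zeroed in parallel");
    free(region);
    return 0;
}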
Workarounds are known to shrink that cost (see the hugepage discussion below).
As QEMU seems unable to do much about this, I'll set it to Triaged (we
understand what is going on) and Low importance (we can't do much).
** Changed in: qemu (Ubuntu)
Status: Incomplete => Triaged
** Changed in: qemu (Ubuntu)
Importance: Medium => Low
This is a silly but useful distribution check over the log10 of the
allocation sizes:
Fast (count per log10(size) bucket):
108 3
1293 4
12133 5
113330 6
27794 7
1119 8
Slow (count per log10(size) bucket):
194 3
1738 4
17375 5
143411 6
55 7
3 8
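For reference, the bucketing itself is simple; a sketch that produces such a distribution from one allocation size per line on stdin (that input format is my assumption):

/* Sketch: read one allocation size per line from stdin and print
 * "count log10-bucket" pairs like the distributions above.
 * The input format (one decimal size per line) is an assumption. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    unsigned long counts[16] = { 0 };
    double size;

    while (scanf("%lf", &size) == 1) {
        if (size < 1)
            continue;
        int b = (int)log10(size);   /* bucket by order of magnitude */
        if (b > 15)
            b = 15;
        counts[b]++;
    }
    for (int b = 0; b < 16; b++)
        if (counts[b])
            printf("%lu %d\n", counts[b], b);
    return 0;
}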
I got no warnings about missed events.
I modified the kernel to make a few functions non-inlined so they are better traceable (see the sketch after this list):
vfio_dma_do_map
vfio_dma_do_unmap
mutex_lock
mutex_unlock
kzalloc
vfio_link_dma
vfio_pin_map_dma
vfio_pin_pages_remote
vfio_iommu_map
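To be clear on what "non-inlined" means here, a standalone sketch of the mechanism (the kernel's noinline macro expands to the same GCC attribute; the function below is a stand-in, not kernel code):

/* Marking a static function noinline keeps it as a distinct symbol in
 * the binary, so tracers like SystemTap/ftrace can probe it. The
 * kernel's "noinline" macro is this same GCC attribute. */
#include <stdio.h>

__attribute__((noinline)) static int pin_pages_stub(int n)
{
    return n * 2;   /* stand-in body, not the real function */
}

int main(void)
{
    printf("%d\n", pin_pages_stub(21));
    return 0;
}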
Then run tracing on this load, limited to the functions in my focus,
via sudo stap with a SystemTap script like:
probe module("vfio_iommu_type1").function("vfio_iommu_type1_ioctl") {
  printf("New vfio_iommu_type1_ioctl\n");
  start_stopwatch("vfioioctl");
}
probe module("vfio_iommu_type1").function("vfio_iommu_type1_ioctl").return {
  timer = read_stopwatch_ns("vfioioctl");
  printf("vfio_iommu_type1_ioctl took %d ns\n", timer);
  delete_stopwatch("vfioioctl");
}
The iommu lock is taken early in there, and the iommu element is what is
passed from userspace.
It represents the VFIO container for this device (container->fd):
qemu:
if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0) {
kernel:
static long vfio_iommu_type1_ioctl(void *iommu_data,
                                   unsigned int cmd, unsigned long arg)
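For context, a self-contained sketch of what that userspace side boils down to (the helper name and parameter values are mine, not QEMU's; the struct is from include/uapi/linux/vfio.h):

/* Sketch of the userspace side of VFIO_IOMMU_MAP_DMA; container_fd,
 * iova and length are placeholders, not values from this bug. */
#include <linux/vfio.h>
#include <stdio.h>
#include <sys/ioctl.h>

int map_guest_ram(int container_fd, void *vaddr, __u64 iova, __u64 length)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (__u64)(unsigned long)vaddr,
        .iova  = iova,
        .size  = length,
    };

    /* This single ioctl pins and maps the whole range, which is where
     * the multi-second stalls discussed in this bug show up. */
    if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) != 0) {
        perror("VFIO_IOMMU_MAP_DMA");
        return -1;
    }
    return 0;
}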
Each QEMU version takes a slightly different road to this point, but
from there on seems to behave the same.
This approach is slightly better for getting "in front" of the slow call
that maps all the memory:
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ gdb /usr/bin/qemu-system-x86_64
(gdb) b vfio_dma_map
(gdb)
I could next build a test kernel with some debugging around the vfio
iommu dma map to check how the time below that call is spent.
I'm sure that data is already hidden in some of my trace data, but to
eventually change/experiment I'd need to build one anyway.
I expect to summarize this and go into a kernel-side analysis anyway.
Reference:
the call from QEMU that I think we see above (on x86) is at [1].
If the assumption is correct this time, the kernel-side place would be
vfio_iommu_type1_ioctl.
For debugging (syscall 16 is ioctl on x86_64):
$ gdb qemu/x86_64-softmmu/qemu-system-x86_64
(gdb) catch syscall 16
(gdb) run -m 131072 -smp 1
Many ioctls (as expected) but they are all fast and match what we knew from
strace.
Thread 1 "qemu-system-x86" hit Catchpoint 1 (call to syscall ioctl),
0x772fae0b in ioctl () at ../sysdeps/unix/syscall-template.S:78
78 in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x772fae0b in ioctl () at ../sysdeps/unix/syscall-template.S:78
Just when I thought I understood the pattern.
Sixth run (again kill and restart):
6384 9.826097 <... ioctl resumed> , 0x7ffcc8ed6e20) = 0 <19.495688>
So for now let's summarize that it varies :-/
But it always seems slow.
The above was through libvirt; now doing it directly with QEMU to make
it easier to debug:
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ qemu/x86_64-softmmu/qemu-system-x86_64 -name guest=test-vfio-slowness
-m 131072 -smp 1 -no-user-config -drive
On x86 this looks pretty similar, and at the place we have seen before:
45397 0.73 readlink("/sys/bus/pci/devices/0000:21:00.1/iommu_group",
"../../../../kernel/iommu_groups/"..., 4096) = 34 <0.20>
45397 0.53 openat(AT_FDCWD, "/dev/vfio/45", O_RDWR|O_CLOEXEC) = 31
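Those lines match the documented VFIO setup sequence; a sketch of it (group number taken from the trace above, error handling mostly omitted; see the kernel's Documentation/vfio.txt for the canonical flow):

/* Sketch of the VFIO container/group setup that produces openat()
 * calls like the one in the strace above. */
#include <fcntl.h>
#include <linux/vfio.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/45", O_RDWR);   /* group id from the trace */
    struct vfio_group_status status = { .argsz = sizeof(status) };

    if (container < 0 || group < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
        return 1;                       /* unexpected API version */
    ioctl(group, VFIO_GROUP_GET_STATUS, &status);
    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        return 1;                       /* not all group devices bound to vfio */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
    printf("container ready; VFIO_IOMMU_MAP_DMA calls would follow\n");
    return 0;
}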
I built QEMU head from git:
$ export CFLAGS="-O0 -g"
$ ./configure --disable-user --disable-linux-user --disable-docs \
    --disable-guest-agent --disable-sdl --disable-gtk --disable-vnc --disable-xen \
    --disable-brlapi --enable-fdt --disable-bluez --disable-vde --disable-rbd \
    --disable-libiscsi
Hmm, with strace showing almost a hang on a single one of those ioctl
calls, you'd think it would be easy to spot :-/
But this isn't as clear as expected:
$ sudo trace-cmd record -p function_graph -l vfio_pci_ioctl -O graph-time
Disable all but one CPU to have less concurrency in the trace (a sketch
for offlining CPUs follows below).
=> Not much
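For reference, offlining CPUs for such a trace goes through sysfs; a sketch (the CPU number is an example, and cpu0 usually cannot be offlined):

/* Sketch: take a CPU offline via sysfs to reduce concurrency in a
 * trace. Run as root; write 1 to the same file to bring it back. */
#include <stdio.h>

static int set_cpu_online(int cpu, int online)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%d\n", online);
    return fclose(f);
}

int main(void)
{
    return set_cpu_online(1, 0) ? 1 : 0;   /* offline cpu1 as an example */
}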
On this platform strace still confirms the same paths, and perf as well
(slight arch differences, but still memory setup):
46.85% [kernel] [k] lruvec_lru_size
16.89% [kernel] [k] clear_user_page
5.74% [kernel] [k]
As assumed, this really seems to be cross-arch and applies to all sizes.
Here 16 CPUs, 128G on ppc64el:
#1: 54 seconds
#2: 7 seconds
#3: 23 seconds
Upped to 192GB, this gives:
#1: 75 seconds
#2: 5 seconds
#3: 23 seconds
As a note, in this case I checked: there are ~7 seconds before it even
goes into the slow mapping path.
You can do so even per-size via e.g.
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
As discussed, the later the allocation happens the higher the chance it
fails, so re-check the sysfs file after each change to verify it
actually got that much memory.
The default hugepage size, however, can only be set via a boot-time parameter.
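To illustrate the runtime knob, a sketch that requests hugepages and verifies what was actually allocated (the path and the requested count are example values, not from this bug):

/* Sketch: request N hugepages at runtime and read back how many the
 * kernel actually allocated (allocation can fail once memory is
 * fragmented, hence the re-check). */
#include <stdio.h>

int main(void)
{
    const char *path =
        "/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages";
    int requested = 128;   /* example value */
    int got = -1;
    FILE *f;

    f = fopen(path, "w");
    if (!f) {
        perror("open for write");
        return 1;
    }
    fprintf(f, "%d\n", requested);
    fclose(f);

    f = fopen(path, "r");
    if (!f) {
        perror("open for read");
        return 1;
    }
    if (fscanf(f, "%d", &got) != 1)
        got = -1;
    fclose(f);

    printf("requested %d hugepages, kernel allocated %d\n", requested, got);
    return got == requested ? 0 : 2;
}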
** Changed in: linux (Ubuntu)
Importance: Undecided => Medium
** Changed in: linux (Ubuntu)
Assignee: (unassigned) => Colin Ian King (colin-king)
Naive question: can we tweak the hugepage settings at run time via
/proc/sys/vm/nr_hugepages and not require the kernel parameters?
Summary:
As I mentioned before (on the other bug that I referred to):
The problem is that with a PT (passthrough) device it needs to reset and
map the VFIO devices.
So with >0 PT devices attached it needs an init that scales with the
memory size of the guest (see my fast results with PT but small guest memory).