Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-13 Thread Max Reitz

On 12.09.2014 14:38, Stefan Hajnoczi wrote:

Max: Unrelated to this performance issue but I notice that the qcow2
metadata overlap check is high in the host CPU profile.  Have you had
any thoughts about optimizing the check?

Stefan


In fact, I have given it some thought (albeit only briefly). Instead of 
gathering all the information in the overlap check function itself, we could 
either keep a generic list of typed ranges (e.g. cluster 0: header, clusters 
1 to 5: L1 table, etc.) or a not-really-bitmap with 4 bits per entry 
specifying the cluster type (header, L1 table, free or data cluster, etc.).
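
To make that concrete, here is a rough sketch of what the two representations 
could look like (all names and types below are invented for illustration; this 
is not actual QEMU code):

#include <stdint.h>

/* Possible cluster types; 4 bits suffice to distinguish them. */
typedef enum Qcow2ClusterType {
    QCOW2_CLUSTER_FREE_OR_DATA = 0,
    QCOW2_CLUSTER_HEADER,
    QCOW2_CLUSTER_ACTIVE_L1,
    QCOW2_CLUSTER_ACTIVE_L2,
    QCOW2_CLUSTER_REFCOUNT_TABLE,
    QCOW2_CLUSTER_REFCOUNT_BLOCK,
    QCOW2_CLUSTER_SNAPSHOT_TABLE,
    QCOW2_CLUSTER_INACTIVE_L1,
    QCOW2_CLUSTER_INACTIVE_L2,
} Qcow2ClusterType;

/* Option 1: a list of typed ranges, kept sorted by first_cluster. */
typedef struct Qcow2MetadataRange {
    uint64_t first_cluster;   /* index of the first cluster of the range */
    uint32_t nb_clusters;     /* number of clusters in the range */
    Qcow2ClusterType type;    /* which kind of metadata occupies it */
} Qcow2MetadataRange;

/* Option 2: the "not-really-bitmap", 4 bits per cluster. */
typedef struct Qcow2TypeMap {
    uint8_t *nibbles;         /* (nb_clusters + 1) / 2 bytes */
    uint64_t nb_clusters;
} Qcow2TypeMap;

static inline Qcow2ClusterType type_map_get(const Qcow2TypeMap *map,
                                            uint64_t cluster)
{
    uint8_t byte = map->nibbles[cluster / 2];
    return (cluster & 1) ? (byte >> 4) : (byte & 0x0f);
}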


The disadvantage of the former would be that in its simplest form we'd 
have to run through the whole list to find out whether a cluster is 
already reserved for metadata or not. We could easily optimize this by 
keeping the list sorted by start cluster and then performing a binary search.
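
With the list sorted by first_cluster, the lookup could be a plain binary 
search over ranges, e.g. (sketch, building on the structures above):

#include <stddef.h>

/* Return the metadata type covering 'cluster', or "free or data" if no
 * range in the sorted list contains it. */
static Qcow2ClusterType range_list_lookup(const Qcow2MetadataRange *ranges,
                                          size_t nb_ranges, uint64_t cluster)
{
    size_t lo = 0, hi = nb_ranges;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const Qcow2MetadataRange *r = &ranges[mid];

        if (cluster < r->first_cluster) {
            hi = mid;
        } else if (cluster >= r->first_cluster + r->nb_clusters) {
            lo = mid + 1;
        } else {
            return r->type;   /* cluster lies inside this metadata range */
        }
    }
    return QCOW2_CLUSTER_FREE_OR_DATA;
}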


The disadvantage of the latter would obviously be its memory size. For a 
1 TB image with 64 kB clusters (16 million clusters at 4 bits each), it 
would be 8 MB in size. That could be considered acceptable, but I deem it 
too large. The advantage would be constant access time, of course.


We could combine both approaches, that is, use the bitmap as a cache: 
whenever a cluster is checked for overlaps, the corresponding bitmap range 
(or bitmap window) is requested; if it is not available, it is generated 
from the range list and then put into the cache.
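
A sketch of that combined lookup path, reusing the range list lookup from 
above (the window size and the trivial direct-mapped cache are arbitrary 
choices for illustration):

#include <string.h>

#define WINDOW_CLUSTERS 2048   /* clusters covered by one bitmap window */

typedef struct TypeMapWindow {
    uint64_t window_index;     /* which window is cached here; UINT64_MAX if none */
    uint8_t nibbles[WINDOW_CLUSTERS / 2];   /* 4 bits per cluster */
} TypeMapWindow;

/* Rebuild one window from the sorted range list.  (Naive version: one
 * binary search per cluster; real code would walk the ranges linearly.) */
static void generate_window(TypeMapWindow *w, uint64_t window_index,
                            const Qcow2MetadataRange *ranges, size_t nb_ranges)
{
    uint64_t first = window_index * WINDOW_CLUSTERS;

    memset(w->nibbles, 0, sizeof(w->nibbles));
    for (uint64_t i = 0; i < WINDOW_CLUSTERS; i++) {
        uint8_t t = range_list_lookup(ranges, nb_ranges, first + i);
        w->nibbles[i / 2] |= (i & 1) ? (t << 4) : t;
    }
    w->window_index = window_index;
}

/* Overlap-check lookup through a trivial direct-mapped window cache. */
static Qcow2ClusterType cached_lookup(TypeMapWindow *cache, size_t nb_windows,
                                      const Qcow2MetadataRange *ranges,
                                      size_t nb_ranges, uint64_t cluster)
{
    uint64_t window_index = cluster / WINDOW_CLUSTERS;
    TypeMapWindow *w = &cache[window_index % nb_windows];

    if (w->window_index != window_index) {
        generate_window(w, window_index, ranges, nb_ranges);   /* cache miss */
    }

    uint64_t off = cluster % WINDOW_CLUSTERS;
    uint8_t byte = w->nibbles[off / 2];
    return (off & 1) ? (byte >> 4) : (byte & 0x0f);
}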


The remaining question is how large the range list would be in memory. 
Basically, its size would be comparable to a run-length encoded (RLE) 
version of the bitmap. In contrast to a plain RLE version, however, we'd 
have to add the start cluster to each entry in order to be able to perform 
a binary search, and we'd omit free and/or data clusters. So we'd have 
4 bits for the cluster type, let's say 12 bits for the cluster count, and 
of course 64 bits for the first cluster index. Or, for maximum efficiency, 
we'd have 64 - 9 - 1 = 54 bits for the cluster index, 4 bits for the type 
and then 6 bits for the cluster count. The first variant gives us 10 bytes 
per metadata range, the second 8. Considering that one refcount block can 
handle cluster_size / 2 entries and one L2 table can handle 
cluster_size / 8 entries, we have (for images with a cluster size of 
64 kB) a ratio of about 1/32768 refcount blocks per cluster and 1/8192 L2 
tables per cluster; I guess we therefore have a metadata ratio of about 
1/6000. In the worst case, each metadata cluster requires its own range 
list entry, which at 10 bytes per entry means less than 30 kB for the list 
of a 1 TB image with 64 kB clusters. I think that's acceptable.
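
For illustration, the two packings might look roughly like this (bitfield 
layout is compiler-dependent, so real code would probably pack manually; the 
back-of-the-envelope numbers are repeated in the comment):

/* Variant 1: 64 + 12 + 4 = 80 bits, i.e. 10 bytes per entry. */
typedef struct __attribute__((packed)) RangeEntry10 {
    uint64_t first_cluster;        /* full 64-bit cluster index */
    uint16_t nb_clusters : 12;     /* run length, up to 4095 clusters */
    uint16_t type        : 4;      /* cluster type */
} RangeEntry10;

/* Variant 2: everything in one 64-bit word, i.e. 8 bytes per entry. */
typedef struct RangeEntry8 {
    uint64_t first_cluster : 54;   /* 64 - 9 - 1 bits of cluster index */
    uint64_t type          : 4;
    uint64_t nb_clusters   : 6;    /* run length, up to 63 clusters */
} RangeEntry8;

/*
 * Size estimate for a 1 TB image with 64 kB clusters:
 *   clusters:           1 TB / 64 kB                    = 16,777,216
 *   metadata ratio:     1/8192 (L2) + 1/32768 (refblocks) ~ 1/6500
 *   metadata clusters:  ~2,560
 *   worst-case list:    ~2,560 entries * 10 bytes ~ 26 kB  (< 30 kB)
 */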


We could compress that list even more by making it a real RLE version of 
the bitmap, removing the cluster index from each entry; remember that for 
this mixed range list/bitmap approach we no longer need exact binary 
search, only the ability to quickly seek to the beginning of a bitmap 
window. This can be achieved by forcing breaks in the range list at every 
window border and keeping track of those offsets along with the 
corresponding bitmap window index. When we want to generate a bitmap 
window, we look up its start offset in the range list (constant time), 
generate the window (linear in the window size), and can then perform 
constant-time lookups for each overlap check in that window.
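
The bookkeeping for that could be as simple as the following (sketch with 
invented names, reusing the window structure from above):

/* RLE entry without a start cluster: just a run of identically typed
 * clusters.  Runs are forced to end at every window border, so a run
 * never crosses from one bitmap window into the next. */
typedef struct RleEntry {
    uint16_t type        : 4;
    uint16_t nb_clusters : 12;   /* <= WINDOW_CLUSTERS */
} RleEntry;

typedef struct RleIndex {
    /* window_start[i] is the index of the first RLE entry of bitmap
     * window i; looking it up is the constant-time "seek". */
    uint32_t *window_start;
    uint64_t nb_windows;
} RleIndex;

/* Decode one bitmap window from the RLE stream (linear in window size). */
static void decode_window(const RleEntry *rle, const RleIndex *idx,
                          uint64_t window_index, TypeMapWindow *w)
{
    uint32_t e = idx->window_start[window_index];
    uint64_t filled = 0;

    memset(w->nibbles, 0, sizeof(w->nibbles));
    while (filled < WINDOW_CLUSTERS) {
        for (uint16_t i = 0; i < rle[e].nb_clusters; i++, filled++) {
            w->nibbles[filled / 2] |= (filled & 1) ? (rle[e].type << 4)
                                                   : rle[e].type;
        }
        e++;
    }
    w->window_index = window_index;
}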


I think that could greatly speed things up and also allow us to always 
perform range checks on data structures not kept in memory (inactive L1 
and L2 tables). The only question remaining to me is whether that caching 
is actually feasible, or whether a binary search into the range list 
(which would then have to include the cluster index for each entry) would 
be faster than generating bitmap windows, which might suffer from 
ping-pong effects.


Max


Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-12 Thread Stefan Hajnoczi
On Fri, Sep 12, 2014 at 11:21:37AM +0800, Zhang Haoyu wrote:
If virtio-blk and virtio-serial share an IRQ, the guest operating 
system has to check each virtqueue for activity. Maybe there is some 
inefficiency doing that.
AFAIK virtio-serial registers 64 virtqueues (on 31 ports + console) 
even if everything is unused.
   
   That could be the case if MSI is disabled.
  
  Do the Windows virtio drivers enable MSIs in their INF file?
 
 It depends on the version of the drivers, but it is a reasonable guess
 at what differs between Linux and Windows.  Haoyu, can you give us the
 output of lspci from a Linux guest?
 
 I made a test with fio on a RHEL 6.5 guest and the same degradation happened;
 it can be reproduced on the RHEL 6.5 guest 100% of the time.
 virtio_console module installed:
 64K-write-sequence: 285 MB/s, 4380 IOPS
 virtio_console module uninstalled:
 64K-write-sequence: 370 MB/s, 5670 IOPS
 
 I used top -d 1 -H -p qemu-pid to monitor the CPU usage, and found that:
 virtio_console module installed:
 qemu main thread cpu usage: 98%
 virtio_console module uninstalled:
 qemu main thread cpu usage: 60%
 
 perf top -p qemu-pid results:
 virtio_console module installed:
 PerfTop:9868 irqs/sec  kernel:76.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
 --
 
 11.80%  [kernel] [k] _raw_spin_lock_irqsave
  8.42%  [kernel] [k] _raw_spin_unlock_irqrestore
  7.33%  [kernel] [k] fget_light
  6.28%  [kernel] [k] fput
  3.61%  [kernel] [k] do_sys_poll
  3.30%  qemu-system-x86_64   [.] qcow2_check_metadata_overlap
  3.10%  [kernel] [k] __pollwait
  2.15%  qemu-system-x86_64   [.] qemu_iohandler_poll
  1.44%  libglib-2.0.so.0.3200.4  [.] g_array_append_vals
  1.36%  libc-2.13.so [.] 0x0011fc2a
  1.31%  libpthread-2.13.so   [.] pthread_mutex_lock
  1.24%  libglib-2.0.so.0.3200.4  [.] 0x0001f961
  1.20%  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
  0.99%  [kernel] [k] eventfd_poll
  0.98%  [vdso]   [.] 0x0771
  0.97%  [kernel] [k] remove_wait_queue
  0.96%  qemu-system-x86_64   [.] qemu_iohandler_fill
  0.95%  [kernel] [k] add_wait_queue
  0.69%  [kernel] [k] __srcu_read_lock
  0.58%  [kernel] [k] poll_freewait
  0.57%  [kernel] [k] _raw_spin_lock_irq
  0.54%  [kernel] [k] __srcu_read_unlock
  0.47%  [kernel] [k] copy_user_enhanced_fast_string
  0.46%  [kvm_intel]  [k] vmx_vcpu_run
  0.46%  [kvm] [k] vcpu_enter_guest
  0.42%  [kernel] [k] tcp_poll
  0.41%  [kernel] [k] system_call_after_swapgs
  0.40%  libglib-2.0.so.0.3200.4  [.] g_slice_alloc
  0.40%  [kernel] [k] system_call
  0.38%  libpthread-2.13.so   [.] 0xe18d
  0.38%  libglib-2.0.so.0.3200.4  [.] g_slice_free1
  0.38%  qemu-system-x86_64   [.] address_space_translate_internal
  0.38%  [kernel] [k] _raw_spin_lock
  0.37%  qemu-system-x86_64   [.] phys_page_find
  0.36%  [kernel] [k] get_page_from_freelist
  0.35%  [kernel] [k] sock_poll
  0.34%  [kernel] [k] fsnotify
  0.31%  libglib-2.0.so.0.3200.4  [.] g_main_context_check
  0.30%  [kernel] [k] do_direct_IO
  0.29%  libpthread-2.13.so   [.] pthread_getspecific
 
 virtio_console module uninstalled:
 PerfTop:9138 irqs/sec  kernel:71.7%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
 --
 
  5.72%  qemu-system-x86_64   [.] qcow2_check_metadata_overlap
  4.51%  [kernel] [k] fget_light
  3.98%  [kernel] [k] _raw_spin_lock_irqsave
  2.55%  [kernel] [k] fput
  2.48%  libpthread-2.13.so   [.] pthread_mutex_lock
  2.46%  [kernel] [k] _raw_spin_unlock_irqrestore
  2.21%  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
  1.71%  [vdso]   [.] 0x060c
  1.68%  libc-2.13.so [.] 0x000e751f
  1.64%  libglib-2.0.so.0.3200.4  [.] 0x0004fca0
  1.20%  [kernel] [k] __srcu_read_lock
  1.14%  [kernel] [k] do_sys_poll
  0.96%  [kernel] [k] _raw_spin_lock_irq
  0.95%  [kernel] [k] __pollwait
  

Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-11 Thread Zhang Haoyu
   If virtio-blk and virtio-serial share an IRQ, the guest operating 
   system has to check each virtqueue for activity. Maybe there is some 
   inefficiency doing that.
   AFAIK virtio-serial registers 64 virtqueues (on 31 ports + console) 
   even if everything is unused.
  
  That could be the case if MSI is disabled.
 
 Do the Windows virtio drivers enable MSIs in their INF file?

It depends on the version of the drivers, but it is a reasonable guess
at what differs between Linux and Windows.  Haoyu, can you give us the
output of lspci from a Linux guest?

I made a test with fio on a RHEL 6.5 guest and the same degradation happened;
this degradation can be reproduced on the RHEL 6.5 guest 100% of the time.
virtio_console module installed:
64K-write-sequence: 285 MB/s, 4380 IOPS
virtio_console module uninstalled:
64K-write-sequence: 370 MB/s, 5670 IOPS

I used top -d 1 -H -p qemu-pid to monitor the CPU usage, and found that:
virtio_console module installed:
qemu main thread cpu usage: 98%
virtio_console module uninstalled:
qemu main thread cpu usage: 60%

perf top -p qemu-pid results:
virtio_console module installed:
   PerfTop:9868 irqs/sec  kernel:76.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
--

11.80%  [kernel] [k] _raw_spin_lock_irqsave
 8.42%  [kernel] [k] _raw_spin_unlock_irqrestore
 7.33%  [kernel] [k] fget_light
 6.28%  [kernel] [k] fput
 3.61%  [kernel] [k] do_sys_poll
 3.30%  qemu-system-x86_64   [.] qcow2_check_metadata_overlap
 3.10%  [kernel] [k] __pollwait
 2.15%  qemu-system-x86_64   [.] qemu_iohandler_poll
 1.44%  libglib-2.0.so.0.3200.4  [.] g_array_append_vals
 1.36%  libc-2.13.so [.] 0x0011fc2a
 1.31%  libpthread-2.13.so   [.] pthread_mutex_lock
 1.24%  libglib-2.0.so.0.3200.4  [.] 0x0001f961
 1.20%  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
 0.99%  [kernel] [k] eventfd_poll
 0.98%  [vdso]   [.] 0x0771
 0.97%  [kernel] [k] remove_wait_queue
 0.96%  qemu-system-x86_64   [.] qemu_iohandler_fill
 0.95%  [kernel] [k] add_wait_queue
 0.69%  [kernel] [k] __srcu_read_lock
 0.58%  [kernel] [k] poll_freewait
 0.57%  [kernel] [k] _raw_spin_lock_irq
 0.54%  [kernel] [k] __srcu_read_unlock
 0.47%  [kernel] [k] copy_user_enhanced_fast_string
 0.46%  [kvm_intel]  [k] vmx_vcpu_run
 0.46%  [kvm] [k] vcpu_enter_guest
 0.42%  [kernel] [k] tcp_poll
 0.41%  [kernel] [k] system_call_after_swapgs
 0.40%  libglib-2.0.so.0.3200.4  [.] g_slice_alloc
 0.40%  [kernel] [k] system_call
 0.38%  libpthread-2.13.so   [.] 0xe18d
 0.38%  libglib-2.0.so.0.3200.4  [.] g_slice_free1
 0.38%  qemu-system-x86_64   [.] address_space_translate_internal
 0.38%  [kernel] [k] _raw_spin_lock
 0.37%  qemu-system-x86_64   [.] phys_page_find
 0.36%  [kernel] [k] get_page_from_freelist
 0.35%  [kernel] [k] sock_poll
 0.34%  [kernel] [k] fsnotify
 0.31%  libglib-2.0.so.0.3200.4  [.] g_main_context_check
 0.30%  [kernel] [k] do_direct_IO
 0.29%  libpthread-2.13.so   [.] pthread_getspecific

virtio_console module uninstalled:
   PerfTop:9138 irqs/sec  kernel:71.7%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
--

 5.72%  qemu-system-x86_64   [.] qcow2_check_metadata_overlap
 4.51%  [kernel] [k] fget_light
 3.98%  [kernel] [k] _raw_spin_lock_irqsave
 2.55%  [kernel] [k] fput
 2.48%  libpthread-2.13.so   [.] pthread_mutex_lock
 2.46%  [kernel] [k] _raw_spin_unlock_irqrestore
 2.21%  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
 1.71%  [vdso]   [.] 0x060c
 1.68%  libc-2.13.so [.] 0x000e751f
 1.64%  libglib-2.0.so.0.3200.4  [.] 0x0004fca0
 1.20%  [kernel] [k] __srcu_read_lock
 1.14%  [kernel] [k] do_sys_poll
 0.96%  [kernel] [k] _raw_spin_lock_irq
 0.95%  [kernel] [k] __pollwait
 0.91%  [kernel] [k] __srcu_read_unlock
 0.78%  [kernel] [k] tcp_poll
 0.74%  [kvm][k] 

Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-08-29 Thread Amit Shah
On (Fri) 29 Aug 2014 [15:45:30], Zhang Haoyu wrote:
 Hi, all
 
 I started a VM with virtio-serial (default number of ports: 31) and found
 that virtio-blk performance degraded by about 25%; this problem can be
 reproduced 100% of the time.
 without virtio-serial:
 4k-read-random 1186 IOPS
 with virtio-serial:
 4k-read-random 871 IOPS
 
 But if I use the max_ports=2 option to limit the maximum number of
 virtio-serial ports, the I/O performance degradation is not as severe,
 about 5%.
 
 And IDE performance degradation does not happen with virtio-serial.

Pretty sure it's related to the MSI vectors in use.  It's possible that
the virtio-serial device takes up all the available vectors in the guest,
leaving old-style IRQs for the virtio-blk device.

If you restrict the number of vectors the virtio-serial device gets
(using the -device virtio-serial-pci,vectors= param), does that make
things better for you?
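
For example, something along these lines (the drive path, memory size and
the value 4 are placeholders, not a recommendation):

qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=test.qcow2,if=virtio,cache=none \
    -device virtio-serial-pci,vectors=4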


Amit