Re: Slow VM start/revert, when trying to start/revert dozens of VMs in parallel

2022-05-19 Thread Petr Beneš
Sure.
As for flushing to or reading from disk: as I said, all VMs reside on a ramdisk.
I'd also like to add that the VMs are "linked clones" of the same underlying
base qcow2 image, which is also on the ramdisk.

```xml
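<!-- Only the text-bearing elements of the domain XML survived; all other
     elements and attributes, including the disk configuration, are missing. -->
<name>win7-x86-1-101</name>
<uuid>dc7296c0-228a-44e8-bd47-db019a1f6344</uuid>
<metadata>
  <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
    <libosinfo:os id="http://microsoft.com/win/10"/>
  </libosinfo:libosinfo>
</metadata>
<memory>8388608</memory>
<currentMemory>8388608</currentMemory>
<vcpu>4</vcpu>
<type>hvm</type>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>preserve</on_crash>
<emulator>/usr/bin/qemu-system-x86_64</emulator>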
```
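
The "linked clone" layout described above corresponds to qcow2 overlays that all point at one shared backing file. Below is a minimal sketch of how such overlays are commonly created with `qemu-img`. The paths and helper names are hypothetical, and this is not necessarily how the VMs in this thread were built.

```python
import subprocess


def overlay_command(base_image, overlay_image):
    """Build the qemu-img argv for a qcow2 overlay backed by base_image."""
    return ["qemu-img", "create",
            "-f", "qcow2",       # format of the new overlay
            "-b", base_image,    # shared backing file (the common base image)
            "-F", "qcow2",       # format of the backing file
            overlay_image]


def create_overlays(base_image, overlay_images, execute=subprocess.check_call):
    """Create one overlay per VM; every overlay shares the same base image."""
    for overlay in overlay_images:
        execute(overlay_command(base_image, overlay))
```

For example, `create_overlays("/mnt/ramdisk/base.qcow2", ["/mnt/ramdisk/vm1.qcow2"])` would run one `qemu-img create` per overlay (both paths are made up).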





Re: Slow VM start/revert, when trying to start/revert dozens of VMs in parallel

2022-05-10 Thread Daniel P . Berrangé
On Mon, May 09, 2022 at 06:52:32PM +, Petr Beneš wrote:
> Hi,
> 
> my problem can be described simply: libvirt can't handle starting dozens of
> VMs at the same time.
>
> (Technically it can, but it's really slow.)
>
> We have an AMD machine with 256 logical cores and 1.5 TB of RAM.
> There are roughly 200 VMs on that machine.
> Each VM is the same: 8 GB of RAM and 4 vCPUs. Half of them run Win7 x86, the
> other half Win7 x64.
> The VMs use qcow2 disk images, which reside on a ramdisk (tmpfs).
>
> We use these machines for automated malware analysis, so our scenario
> consists of this cycle:
> - revert the VM to a running state
> - execute a sample inside the VM for ~1-2 minutes
> - shut down the VM
>
> Of course, this results in multiple VMs trying to start at the same time.
> At first, reverts/starts are really fast: a second or two.
> After about a minute, "revertToSnapshot" suddenly takes 10-15 seconds,
> which is unacceptable for us.
> For comparison, we're running the same scenario on Proxmox, where
> revertToSnapshot usually takes 2 seconds.

Can you share the XML configuration of one of your guests? I'm assuming
they all have the same basic configuration.

As a gut feeling it sounds to me like it could be initially fast due to
utilization of host I/O cache, but then slows down due to having to
flush data to disk / read fresh from disk. This could be the case if
the disk configuration cache mode is set to certain values, so the XML
config will show us this info.

With regards,
Daniel
-- 
|: https://berrange.com           -o-  https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org            -o-  https://fstop138.berrange.com           :|
|: https://entangle-photo.org     -o-  https://www.instagram.com/dberrange     :|
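
The cache mode mentioned above is the `cache` attribute on each disk's `<driver>` element in the domain XML. Below is a minimal sketch that lists it from XML such as the output of `virsh dumpxml`. The sample XML is made up for illustration and is not the guest configuration from this thread, and `disk_cache_modes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative domain XML only; NOT the configuration discussed in the thread.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <name>example-guest</name>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/mnt/ramdisk/example-guest.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='sda' bus='sata'/>
    </disk>
  </devices>
</domain>
"""


def disk_cache_modes(domain_xml):
    """Return {target_dev: cache_mode}; 'default' when no cache attribute is set."""
    root = ET.fromstring(domain_xml)
    modes = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        dev = target.get("dev") if target is not None else "?"
        cache = driver.get("cache", "default") if driver is not None else "default"
        modes[dev] = cache
    return modes


print(disk_cache_modes(SAMPLE_DOMAIN_XML))
```

A disk without an explicit `cache` attribute is reported as `default` here, meaning the hypervisor's default applies.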



Slow VM start/revert, when trying to start/revert dozens of VMs in parallel

2022-05-10 Thread Petr Beneš
Hi,

my problem can be described simply: libvirt can't handle starting dozens of VMs
at the same time.

(Technically it can, but it's really slow.)

We have an AMD machine with 256 logical cores and 1.5 TB of RAM.
There are roughly 200 VMs on that machine.
Each VM is the same: 8 GB of RAM and 4 vCPUs. Half of them run Win7 x86, the
other half Win7 x64.
The VMs use qcow2 disk images, which reside on a ramdisk (tmpfs).

We use these machines for automated malware analysis, so our scenario consists
of this cycle:
- revert the VM to a running state
- execute a sample inside the VM for ~1-2 minutes
- shut down the VM

Of course, this results in multiple VMs trying to start at the same time.
At first, reverts/starts are really fast: a second or two.
After about a minute, "revertToSnapshot" suddenly takes 10-15 seconds,
which is unacceptable for us.
For comparison, we're running the same scenario on Proxmox, where
revertToSnapshot usually takes 2 seconds.
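
To make the measurement concrete, here is a minimal sketch of timing parallel reverts with the libvirt Python bindings. It is not the framework used in this thread. The connection URI `qemu:///system`, the snapshot name `clean`, and all helper names are assumptions. The `libvirt` module is imported lazily so the timing logic can be tried without it.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(operation, *args):
    """Run operation(*args) and return (result, elapsed_seconds)."""
    started = time.monotonic()
    result = operation(*args)
    return result, time.monotonic() - started


def revert_domain(uri, domain_name, snapshot_name):
    """Revert one domain to a snapshot and leave it running."""
    import libvirt  # libvirt-python, imported lazily

    conn = libvirt.open(uri)  # one connection per call keeps the sketch simple
    try:
        dom = conn.lookupByName(domain_name)
        snap = dom.snapshotLookupByName(snapshot_name)
        dom.revertToSnapshot(snap, libvirt.VIR_DOMAIN_SNAPSHOT_REVERT_RUNNING)
    finally:
        conn.close()


def revert_many(domain_names, revert=revert_domain, uri="qemu:///system",
                snapshot_name="clean", workers=32):
    """Revert domains in parallel and return {domain_name: seconds}."""
    def one(name):
        _, elapsed = timed(revert, uri, name, snapshot_name)
        return name, elapsed

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, domain_names))
```

Logging the values returned by `revert_many` over a few minutes would show when the per-revert latency moves from a couple of seconds to the 10-15 seconds described above.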

A few notes:
- Because of this fast cycle (~2-3 minutes) and because the VMs take 10-15
  seconds to start, there are barely more than 25-30 VMs running at once.
  We would really love to use the full potential of this beast of a machine
  and have at least ~100 VMs running at any given time.
- While this is running, the average CPU load is no higher than 25%, and
  only about 280 GB of RAM is used. So we are not limited by hardware
  resources.
- When the framework is running and libvirt is doing its best to start our VMs,
  every libvirt operation suddenly becomes very slow.
  Even a simple "virsh list [--all]" takes a few seconds to complete, even
  though it finishes instantly when no VM is running or starting.
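
The numbers in these notes can be related through Little's law: VMs running at once = start rate × time each VM stays up. The following is a back-of-the-envelope sketch. Treating the 1-2 minute sample run as the time a VM stays up is an assumption, and the function names are illustrative.

```python
def required_start_rate(target_running, uptime_seconds):
    """Little's law: starts per second needed to keep target_running VMs up."""
    return target_running / uptime_seconds


def max_running(start_rate_per_second, uptime_seconds):
    """Little's law the other way round: VMs up at once for a given start rate."""
    return start_rate_per_second * uptime_seconds


for uptime in (60, 120):
    rate = required_start_rate(100, uptime)
    print(f"uptime {uptime}s: need {rate:.2f} starts/s for ~100 running VMs")
```

Under that assumption, keeping ~100 VMs up needs roughly 0.83 to 1.67 completed starts per second, sustained. At 10-15 seconds per revert, that corresponds to about 8 to 25 reverts in flight at any moment.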

I searched for this issue, but didn't really find anything besides this
presentation:
https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Scalability-and-Stability-of-libvirt-Experiences-with-Very-Large-Hosts-Marc-Hartmayer-IBM-1.pdf

However, I couldn't find the commits it mentions in your upstream.

Is this a known issue? Or is there some setting I don't know of that would
magically make the VMs start faster?

As for steps to reproduce, I don't think anything special is needed.
Just try to start/destroy several VMs in a loop.
The presentation above even provides a one-liner for that:

```
# For multiple domains, with $vm set to a domain name:
while virsh start "$vm" && virsh destroy "$vm"; do :; done
# → ~30s hang-ups of the libvirtd main loop
```
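
Here is a sketch of the same reproduction for several domains at once, which also times `virsh list --all` while the loops run. It assumes `virsh` is on `PATH` and that the domain names passed in exist. `run`, `churn`, `time_virsh_list`, and `reproduce` are hypothetical helper names. The command runner is injectable so the logic can be tried without a hypervisor.

```python
import subprocess
import threading
import time


def run(argv):
    """Run a command quietly and return True on exit status 0."""
    return subprocess.run(argv, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0


def churn(domain, stop, runner=run):
    """Start and destroy one domain in a loop until stop is set or a command fails."""
    cycles = 0
    while not stop.is_set():
        if not (runner(["virsh", "start", domain]) and
                runner(["virsh", "destroy", domain])):
            break
        cycles += 1
    return cycles


def time_virsh_list(runner=run):
    """Return how long one `virsh list --all` takes, in seconds."""
    started = time.monotonic()
    runner(["virsh", "list", "--all"])
    return time.monotonic() - started


def reproduce(domains, duration=60.0, interval=1.0, runner=run):
    """Churn all domains in parallel for `duration` seconds; return list latencies."""
    stop = threading.Event()
    threads = [threading.Thread(target=churn, args=(d, stop, runner))
               for d in domains]
    for t in threads:
        t.start()
    latencies = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        latencies.append(time_virsh_list(runner))
        time.sleep(interval)
    stop.set()
    for t in threads:
        t.join()
    return latencies
```

If the behaviour described above reproduces, the latencies returned by `reproduce([...])` should grow from near zero to several seconds once enough domains are being started and destroyed concurrently.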

Best Regards,
Petr



Re: Slow VM start/revert, when trying to start/revert dozens of VMs in parallel

2022-05-10 Thread Petr Beneš
I forgot to mention:

```
$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:        22.04
Codename:       jammy

$ virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0
```

