[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-11 Thread Gianluca Cecchi
On Fri, Oct 11, 2019 at 6:04 AM Strahil  wrote:

> You can always check the *queue_if_no_path*  multipath.conf option and
> give it a try.
>
This setting would be on the host side, where it is ok to put a timeout of
X seconds using entries such as:

devices {
    device {
        all_devs                yes
        # Set timeout of queuing of 5*28 = 140 seconds
        # similar to vSphere APD timeout
        # no_path_retry         fail
        no_path_retry           28
        polling_interval        5
    }
}
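(For what it's worth, a quick way to double-check which queueing policy the
maps actually got on the host; just a sketch using standard multipath tooling,
the WWID below is a placeholder:)

# dump the configuration multipathd actually loaded and look for the option
multipathd show config | grep no_path_retry
# per-map view: "features='1 queue_if_no_path'" means I/O is being queued,
# otherwise the map fails I/O after about no_path_retry * polling_interval seconds
multipath -ll 36001405abcdef0123456789abcdef012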


> Don't forget that the higher in the I/O chain you go, the higher the timeout
> needs to be, so your VM should also use multipath with that option, in
> addition to the host.
>
>
Yes, in fact on the guest side I put a udev rule:

ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="QEMU*",
ATTRS{model}=="QEMU HARDDISK*", ENV{DEVTYPE}=="disk", RUN+="/bin/sh -c
'echo 180 > /sys$DEVPATH/device/timeout'"

so that the guest disk timeout is higher than the host storage timeout.
Any other settings?
Gianluca


[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-10 Thread Strahil
You can always check the queue_if_no_path multipath.conf option and give it a
try.

But if the system that is doing the queueing gets rebooted, that will mean data
loss -> use at your own risk.

Don't forget that the higher in the I/O chain you go, the higher the timeout
needs to be, so your VM should also use multipath with that option, in addition
to the host.

Still, we can't help you if you use that feature and you lose data.

Best Regards,
Strahil Nikolov

On Oct 10, 2019 10:55, Francesco Romani wrote:
>
> On 10/10/19 9:07 AM, Gianluca Cecchi wrote:
>
>
>> > How is the timeout used to put the VM in pause mode determined?
>>
>>
>> The VM is paused immediately as soon as libvirt, through QEMU, reports
>> IOError, to avoid data corruption. Now, when libvirt reports this error
>> depends largely on the timeout set for the storage configuration, which
>> is done at host level, using system tools (e.g. it is not a Vdsm tunable).
>>
>
> For testing I have set this in the multipath.conf of the host:
>
> devices {
>     device {
>         all_devs                yes
>         # Set timeout of queuing of 5*28 = 140 seconds
>         # similar to vSphere APD timeout
>         # no_path_retry         fail
>         no_path_retry           28
>         polling_interval        5
>     }
> }
>
> So it should wait at least 140 seconds before passing the error to the upper
> layer, correct?
>
>
> AFAICT yes
>
>
>>
>>>
>>> > Sometimes I see after clearing the problems that the VM is
>>> > automatically un-paused, sometimes not: how is this managed?
>>>
>>
>> I noticed that if I set the disk as virtio-scsi (it seems virtio has no
>> definable timeout and passes the error immediately to the upper layer) and set
>> the timeout of the VM disk (through a udev rule) to 180 seconds, I can block
>> access to the storage for, say, 100 seconds and the host is able to reinstate
>> the paths, and then the VM is always unpaused.
>> But I would like to prevent the VM from pausing at all.
>> What else can I tweak?
>
>
> The only way Vdsm will not pause the VM is if libvirt+qemu never reports any 
> ioerror, which is something I'm not sure is possible and that I'd never 
> recommend anyway.
>
> Vdsm always tries hard to be super-careful with respect to possible data
> corruption.
>
>
> Bests,
>
>
> -- 
>
> Francesco Romani
>
> Senior SW Eng., Virtualization R&D
>
> Red Hat
>
> IRC: fromani github: @fromanirh


[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-10 Thread Gianluca Cecchi
On Thu, Oct 10, 2019 at 1:10 PM Francesco Romani  wrote:

> On 10/10/19 10:44 AM, Gianluca Cecchi wrote:
>
> On Thu, Oct 10, 2019 at 9:56 AM Francesco Romani wrote:
>
>>
>> The only way Vdsm will not pause the VM is if libvirt+qemu never reports
>> any ioerror, which is something I'm not sure is possible and that I'd never
>> recommend anyway.
>>
>> Vdsm always tries hard to be super-careful with respect to possible data
>> corruption.
>>
>>
>> OK.
> In the case of storage not being accessible for a bunch of seconds it is more
> a matter of blocked I/O than data corruption.
>
>
> True, but we can only know ex post that the storage was just temporarily
> unavailable, can't we?
>
>
Yes, but I would like to have an option to say: don't do anything for X
seconds, both at host level and guest level.
X could be 5 seconds, or 10 seconds, or 20 seconds according to the specific
needs.


> If no other host powers on the VM I think there is no risk of data
> corruption itself, or at least no more than when you have a physical server
> and for some reason the I/O operations to its physical disks (local or on a
> SAN) are blocked for some tens of seconds.
>
>
> IMO, a storage unresponsive for tens of seconds is something which should
> be uncommon and very alarming in any circumstances, especially for
> physical servers.
>
> What I'm trying to say is that yes, there probably are ways to sidestep
> this behaviour, but I think this is the wrong direction and adds fragility
> rather than convenience to the system.
>
In general I agree with you on this



>
> So I think that if I want to modify the behavior in any way I have to change
> the options so that I keep "report" for both write and read errors on
> virtual disks.
>
>
> Yep. I don't remember what Engine allows. Worst case you can use a hook,
> but once again this is making things a bit more fragile.
>
>
> I'm only experimenting to see possible different options for managing
> "temporary" problems at storage level, which often resolve without manual
> intervention in tens of seconds, sometimes being due to incorrect operations
> at levels managed by other teams (network, storage, etc.).
>
>
> I think the best option is to improve the current behaviour: learn why Vdsm
> fails to unpause the VM and improve there.
>
>
>
Yes, I'm just experimenting with possible options and their pros & cons.

I see that on my 4.3.6 environment with plain CentOS 7.7 hosts the qemu-kvm
process is spawned with "werror=stop,rerror=stop" for all virtual disks.
I didn't find any related option in the VM edit page.
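(For reference, one way to confirm this from the host without touching the VM;
just a sketch, "mydbsrv" is a placeholder domain name and it assumes a
read-only virsh session is available:)

# libvirt disk driver policies of the running domain (these end up as werror/rerror)
virsh -r dumpxml mydbsrv | grep -E "error_policy|rerror_policy"
# or inspect the qemu-kvm command line directly
ps -eo args | grep [q]emu-kvm | tr ',' '\n' | grep -E "^[wr]error"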

On my Fedora 30, when I start a VM (with virt-manager or "virsh start") I
see that the options are not present on the command line, and according to
the qemu-kvm manual page:
"
The default setting is werror=enospc and rerror=report
"

In the meantime I created a wrapper script for qemu-kvm that changes the
command line:

1)
from werror=stop to werror=report
and
from rerror=stop to rerror=report
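(Something along these lines, just as a sketch; /usr/libexec/qemu-kvm.real is a
hypothetical path for the renamed original binary:)

#!/bin/bash
# hypothetical wrapper installed in place of qemu-kvm: rewrite the error
# policies requested by the caller, then hand control to the real binary
args=()
for a in "$@"; do
    a="${a//werror=stop/werror=report}"
    a="${a//rerror=stop/rerror=report}"
    args+=("$a")
done
exec /usr/libexec/qemu-kvm.real "${args[@]}"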

This seems worse, in the sense that the VM is not paused at all, as
expected, but there is strange behavior inside it.
From the host's point of view:
[root@ov300 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 7     mydbsrv                        running

I suddenly get in the VM's /var/log/messages something like:

Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Sense Key : Aborted
Command [current]
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Add. Sense: I/O process
terminated
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] CDB: Write(10) 2a 00 03
07 a8 78 00 00 08 00
Oct 10 12:42:55 mydbsrv kernel: blk_update_request: I/O error, dev sdc,
sector 50833528
Oct 10 12:42:55 mydbsrv kernel: EXT4-fs warning (device dm-3):
ext4_end_bio:322: I/O error -5 writing to inode 1573304 (offset 0 size 0
starting block 6353935)
Oct 10 12:42:55 mydbsrv kernel: Buffer I/O error on device dm-3, logical
block 6353935
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Sense Key : Aborted
Command [current]
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Add. Sense: I/O process
terminated
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] CDB: Write(10) 2a 00 03
07 a8 98 00 00 08 00
Oct 10 12:42:55 mydbsrv kernel: blk_update_request: I/O error, dev sdc,
sector 50833560
Oct 10 12:42:55 mydbsrv kernel: EXT4-fs warning (device dm-3):
ext4_end_bio:322: I/O error -5 writing to inode 1573308 (offset 0 size 0
starting block 6353939)
...

and only shell builtin commands apparently keep working inside the VM, making
a power off (from the engine) and power on necessary anyway.

[root@mydbsrv ~]# uptime
-bash: uptime: command not found
[root@mydbsrv ~]# df -h
-bash: df: command not found
[root@mydbsrv ~]# id
-bash: id: command 

[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-10 Thread Francesco Romani

On 10/10/19 10:44 AM, Gianluca Cecchi wrote:
On Thu, Oct 10, 2019 at 9:56 AM Francesco Romani wrote:



The only way Vdsm will not pause the VM is if libvirt+qemu never
reports any ioerror, which is something I'm not sure is possible
and that I'd never recommend anyway.

Vdsm always tries hard to be super-careful with respect to possible
data corruption.


OK.
In the case of storage not being accessible for a bunch of seconds it is more
a matter of blocked I/O than data corruption.



True, but we can only know ex post that the storage was just
temporarily unavailable, can't we?



If no other host powers on the VM I think there is no risk of data 
corruption itself, or at least no more than when you have a physical 
server and for some reason the I/O operations to its physical disks 
(local or on a SAN) are blocked for some tens of seconds.



IMO, a storage unresponsive for tens of seconds is something which
should be uncommon and very alarming in any circumstances, especially
for physical servers.


What I'm trying to say is that yes, there probably are ways to sidestep
this behaviour, but I think this is the wrong direction and adds
fragility rather than convenience to the system.



The host could even do a poweroff of the VM itself, instead of leaving
control to the underlying libvirt+qemu.


I see that by default the qemu-kvm process in my oVirt 4.3.6 is 
spawned for every disk with the options:

...,werror=stop,rerror=stop,...

Only for the ide channel of the CD device I have:
...,werror=report,rerror=report,readonly=on

and the manual page for qemu-kvm tells:

    werror=action,rerror=action
        Specify which action to take on write and read errors. Valid actions
        are: "ignore" (ignore the error and try to continue), "stop" (pause
        QEMU), "report" (report the error to the guest), "enospc" (pause QEMU
        only if the host disk is full; report the error to the guest
        otherwise). The default setting is werror=enospc and rerror=report.
So I think that if I want to modify the behavior in any way I have to
change the options so that I keep "report" for both write and read
errors on virtual disks.



Yep. I don't remember what Engine allows. Worst case you can use a
hook, but once again this is making things a bit more fragile.



I'm only experimenting to see possible different options for managing
"temporary" problems at storage level, which often resolve without
manual intervention in tens of seconds, sometimes being due to incorrect
operations at levels managed by other teams (network, storage, etc.).



I think the best option is to improve the current behaviour: learn why Vdsm
fails to unpause the VM and improve there.





--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh



[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-10 Thread Gianluca Cecchi
On Thu, Oct 10, 2019 at 9:56 AM Francesco Romani  wrote:

>
> The only way Vdsm will not pause the VM is if libvirt+qemu never reports
> any ioerror, which is something I'm not sure is possible and that I'd never
> recommend anyway.
>
> Vdsm always tries hard to be super-careful with respect to possible data
> corruption.
>
>
OK.
In the case of storage not being accessible for a bunch of seconds it is more
a matter of blocked I/O than data corruption.
If no other host powers on the VM I think there is no risk of data
corruption itself, or at least no more than when you have a physical server
and for some reason the I/O operations to its physical disks (local or on a
SAN) are blocked for some tens of seconds.
The host could even do a poweroff of the VM itself, instead of leaving
control to the underlying libvirt+qemu.

I see that by default the qemu-kvm process in my oVirt 4.3.6 is spawned for
every disk with the options:
...,werror=stop,rerror=stop,...

Only for the ide channel of the CD device I have:
...,werror=report,rerror=report,readonly=on

and the manual page for qemu-kvm tells:

   werror=action,rerror=action
       Specify which action to take on write and read errors. Valid actions
       are: "ignore" (ignore the error and try to continue), "stop" (pause
       QEMU), "report" (report the error to the guest), "enospc" (pause QEMU
       only if the host disk is full; report the error to the guest
       otherwise). The default setting is werror=enospc and rerror=report.

So I think that if I want to modify the behavior in any way I have to change
the options so that I keep "report" for both write and read errors on
virtual disks.

I'm only experimenting to see possible different options for managing
"temporary" problems at storage level, which often resolve without manual
intervention in tens of seconds, sometimes being due to incorrect operations
at levels managed by other teams (network, storage, etc.).
In these circumstances experience has taught me it is better to "do nothing
and wait", instead of taking any action that will fail anyway until the
"external" problem has been solved (automatically, thanks to logic outside
oVirt's control, or manually).

It would be nice to "mimic" the behavior of vSphere in this sense, and I'm
investigating possible ways to achieve it...

Hope I clarified a bit the origin of my actions...
Thanks,
Gianluca


[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-10 Thread Francesco Romani

On 10/10/19 9:07 AM, Gianluca Cecchi wrote:


> How is the timeout used to put the VM in pause mode determined?


The VM is paused immediately as soon as libvirt, through QEMU, reports
IOError, to avoid data corruption. Now, when libvirt reports this error
depends largely on the timeout set for the storage configuration, which
is done at host level, using system tools (e.g. it is not a Vdsm tunable).


For testing I have set this in the multipath.conf of the host:

devices {
    device {
        all_devs                yes
        # Set timeout of queuing of 5*28 = 140 seconds
        # similar to vSphere APD timeout
        # no_path_retry         fail
        no_path_retry           28
        polling_interval        5
    }
}

So it should wait at least 140 seconds before passing the error to the upper
layer, correct?



AFAICT yes





> Sometimes I see after clearing the problems that the VM is
> automatically un-paused, sometimes not: how is this managed?


I noticed that if I set the disk as virtio-scsi (it seems virtio has no
definable timeout and passes the error immediately to the upper layer) and
set the timeout of the VM disk (through a udev rule) to 180 seconds, I can
block access to the storage for, say, 100 seconds and the host is able to
reinstate the paths, and then the VM is always unpaused.

But I would like to prevent the VM from pausing at all.
What else can I tweak?



The only way Vdsm will not pause the VM is if libvirt+qemu never reports 
any ioerror, which is something I'm not sure is possible and that I'd 
never recommend anyway.


Vdsm always tries hard to be super-careful with respect to possible data
corruption.



Bests,


--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh



[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-10 Thread Gianluca Cecchi
On Wed, Oct 9, 2019 at 2:55 PM Francesco Romani  wrote:

> On 10/8/19 4:06 PM, Gianluca Cecchi wrote:
>
> Hi Gianluca
>
> > Hello,
> > I'm doing some tests related to storage latency or problems manually
> > created to debug and manage reactions of hosts and VMs.
> > What is the subsystem/process/daemon responsible for pausing a VM when
> > problems arise on storage for the host where the VM is running?
>
>
> It's Vdsm itself.
>
ok


>
>
> > How is the timeout used to put the VM in pause mode determined?
>
>
> The VM is paused immediately as soon as libvirt, through QEMU, reports
> IOError, to avoid data corruption. Now, when libvirt reports this error
> depends largely on the timeout set for the storage configuration, which
> is done at host level, using system tools (e.g. it is not a Vdsm tunable).
>
>
For testing I have set this in the multipath.conf of the host:

devices {
    device {
        all_devs                yes
        # Set timeout of queuing of 5*28 = 140 seconds
        # similar to vSphere APD timeout
        # no_path_retry         fail
        no_path_retry           28
        polling_interval        5
    }
}

So it should wait at least 140 seconds before passing the error to the upper
layer, correct?


> > Sometimes I see after clearing the problems that the VM is
> > automatically un-paused, sometimes not: how is this managed?
>
>
I noticed that if I set the disk as virtio-scsi (it seems virtio has no
definable timeout and passes the error immediately to the upper layer) and
set the timeout of the VM disk (through a udev rule) to 180 seconds, I can
block access to the storage for, say, 100 seconds and the host is able to
reinstate the paths, and then the VM is always unpaused.
But I would like to prevent the VM from pausing at all.
What else can I tweak?

Thanks,
Gianluca


[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-09 Thread Francesco Romani

On 10/8/19 4:06 PM, Gianluca Cecchi wrote:

Hi Gianluca


Hello,
I'm doing some tests related to storage latency or problems created manually
to debug and manage the reactions of hosts and VMs.
What is the subsystem/process/daemon responsible for pausing a VM when
problems arise on storage for the host where the VM is running?



It's Vdsm itself.



How is the timeout used to put the VM in pause mode determined?



The VM is paused immediately as soon as libvirt, through QEMU, reports
IOError, to avoid data corruption. Now, when libvirt reports this error
depends largely on the timeout set for the storage configuration, which
is done at host level, using system tools (e.g. it is not a Vdsm tunable).



Sometimes I see after clearing the problems that the VM is
automatically un-paused, sometimes not: how is this managed?



It depends on the error condition that occurs. Vdsm tries to recover
automatically when it is safe to do so. When in doubt, Vdsm always plays
it safe with respect to user data.



Are there any counters so that if the VM has been paused and the
problems are not solved in a certain timeframe the unpause can be done
only manually by the sysadmin?



AFAIR no, because if Vdsm can't be sure, the only real option is to let 
the sysadmin check and decide.


Bests,


--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh


[ovirt-users] Re: owner of vm paused/unpaused operation

2019-10-09 Thread Gianluca Cecchi
On Tue, Oct 8, 2019 at 4:06 PM Gianluca Cecchi 
wrote:

> Hello,
> I'm doing some tests related to storage latency or problems manually
> created to debug and manage reactions of hosts and VMs.
> What is the subsystem/process/daemon responsible for pausing a VM when
> problems arise on storage for the host where the VM is running?
> How is the timeout used to put the VM in pause mode determined?
> Sometimes I see after clearing the problems that the VM is automatically
> un-paused, sometimes not: how is this managed? Are there any counters so
> that if the VM has been paused and the problems are not solved in a certain
> timeframe the unpause can be done only manually by the sysadmin?
>
> Thanks in advance,
> Gianluca
>
>
I have noticed that when the virtual disk is virtio, the VM is not able to be
unpaused when storage is unreachable for many seconds, while if I have
virtio-scsi and set a high virtual disk timeout (like vSphere does on VMs
when VMware Tools have been installed), then the VM is able to be resumed.

The udev rule I have put into a CentOS 7 VM
inside /etc/udev/rules.d/99-ovirt.rules is this one

# Set timeout of virtio-SCSI disks to 180 seconds like vSphere VMware Tools
#
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="QEMU*",
ATTRS{model}=="QEMU HARDDISK*", ENV{DEVTYPE}=="disk", RUN+="/bin/sh -c
'echo 180 > /sys$DEVPATH/device/timeout'"
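(To apply it without rebooting, something like the following should work; just
a sketch using standard udev tooling, and sdc is a placeholder device name:)

# reload the udev rules and re-trigger the "add" event for SCSI devices
udevadm control --reload-rules
udevadm trigger --subsystem-match=scsi --action=add
# verify the new timeout on one of the virtio-scsi disks
cat /sys/block/sdc/device/timeout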

What I have not understood is whether it is possible to prevent vdsm (is
it the responsible component?) from suddenly putting the VM into the paused
state at all.
E.g. as an experiment, I have iSCSI-based storage domains and put this in
multipath.conf:

devices {
    device {
        all_devs                yes
        # Set timeout of queuing of 5*28 = 140 seconds
        # similar to vSphere APD timeout
        # no_path_retry         fail
        no_path_retry           28
        polling_interval        5
    }
}

Then I create an iptables rule that for 100 seconds prevents the host from
reaching the storage, and a dd task that writes to disk inside the VM.
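(Roughly like this, just as a sketch; the portal address 10.10.10.10, port 3260
and the /mnt/testfile path are placeholders:)

# on the host: drop outgoing iSCSI traffic for 100 seconds, then restore it
iptables -I OUTPUT -p tcp -d 10.10.10.10 --dport 3260 -j DROP
sleep 100
iptables -D OUTPUT -p tcp -d 10.10.10.10 --dport 3260 -j DROP

# inside the VM: keep writing to the data disk during the outage
dd if=/dev/zero of=/mnt/testfile bs=1M count=2048 oflag=direct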
The effect is that the VM is paused and after about 100 seconds:

VM mydbsrv has recovered from paused back to up. 10/9/19 1:59:02 PM
VM mydbsrv has been paused due to storage I/O problem. 10/9/19 1:57:32 PM
VM mydbsrv has been paused. 10/9/19 1:57:32 PM

Any hint on how to prevent the VM from being paused at all?

Thanks,
Gianluca