[ovirt-users] Re: VMs paused due to IO issues - Dell Equallogic controller failover

2019-05-15 Thread Gary Lloyd
I asked on the Dell Storage Forum and they recommend the following:

*I recommend not using a numeric value for the "no_path_retry" variable
within /etc/multipath.conf, as once that numeric value is reached, if no
healthy LUNs were discovered during that defined time, multipath will
disable the I/O queue altogether.*

*I do recommend, however, changing the variable value from "12" (or even
"60") to "queue", which will then allow multipathd to continue queuing I/O
until a healthy LUN is discovered (the time of fail-over between controllers)
and I/O is allowed to flow once again.*
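
(For reference, the change Dell describes would amount to something like
this in /etc/multipath.conf -- a sketch of their suggestion, not a tested
configuration; "queue" makes multipathd queue io indefinitely while no
path is available:)

defaults {
    no_path_retry   queue
}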

Can you see any issues with this recommendation as far as Ovirt is
concerned?

Thanks again

*Gary Lloyd*

I.T. Systems:Keele University
Finance & IT Directorate
Keele:Staffs:IC1 Building:ST5 5NB:UK
+44 1782 733063


On 4 October 2016 at 19:11, Nir Soffer  wrote:

> On Tue, Oct 4, 2016 at 10:51 AM, Gary Lloyd  wrote:
>
>> Hi
>>
>> We have Ovirt 3.65 with a Dell Equallogic SAN and we use Direct Luns for
>> all our VMs.
>> At the weekend during early hours an Equallogic controller failed over to
>> its standby on one of our arrays and this caused about 20 of our VMs to be
>> paused due to IO problems.
>>
>> I have also noticed that this happens during Equallogic firmware upgrades
>> since we moved onto Ovirt 3.65.
>>
>> As recommended by Dell, disk timeouts within the VMs are set to 60 seconds
>> when they are hosted on an EqualLogic SAN.
>>
>> Is there any other timeout value that we can configure in vdsm.conf to
>> stop VMs from getting paused when a controller fails over?
>>
>
> You can set the timeout in multipath.conf.
>
> With the current multipath configuration (deployed by vdsm), when all
> paths to a device are lost (e.g. you take down all ports on the server
> during an upgrade), all io will fail immediately.
>
> If you want to allow 60 seconds of grace time in such a case, you can configure:
>
> no_path_retry 12
>
> This will continue to monitor the paths 12 times, every 5 seconds
> (assuming polling_interval=5). If some path recovers during this time,
> the io can complete and the vm will not be paused.
>
> If no path is available after these retries, io will fail and vms with
> pending io will pause.
>
> Note that this will also cause delays in vdsm in various flows,
> increasing the chance of timeouts on the engine side, or delays in
> storage domain monitoring.
>
> However, the 60 seconds delay is expected only the first time all paths
> become faulty. Once the timeout has expired, any access to the device
> will fail immediately.
>
> To configure this, you must add the # VDSM PRIVATE tag on the second line
> of multipath.conf, otherwise vdsm will override your configuration the
> next time you run vdsm-tool configure.
>
> multipath.conf should look like this:
>
> # VDSM REVISION 1.3
> # VDSM PRIVATE
>
> defaults {
>     polling_interval    5
>     no_path_retry       12
>     user_friendly_names no
>     flush_on_last_del   yes
>     fast_io_fail_tmo    5
>     dev_loss_tmo        30
>     max_fds             4096
> }
>
> devices {
>     device {
>         all_devs        yes
>         no_path_retry   12
>     }
> }
>
> This will use a 12-retry (60 second) timeout for any device. If you would
> like to configure only your specific device, you can add a device section
> for your specific server instead.
>
>
>>
>> Also is there anything that we can tweak to automatically unpause the VMs
>> once connectivity with the arrays is re-established?
>>
>
> Vdsm will resume the vms when the storage monitor detects that storage
> has become available again. However, we cannot guarantee that storage
> monitoring will detect that storage was down. This should be improved
> in 4.0.
>
>
>> At the moment we are running a customized version of storageServer.py, as
>> Ovirt has yet to include iscsi multipath support for Direct Luns out of the
>> box.
>>
>
> Would you like to share this code?
>
> Nir
>


[ovirt-users] Re: VMs paused due to IO issues - Dell Equallogic controller failover

2019-05-15 Thread Gary Lloyd
From the sounds of it, the best we can do then is to use a 60-second
timeout on paths in multipathd.
The main reason we use Direct Lun is that we replicate/snapshot the VMs'
associated Luns at SAN level as a means of disaster recovery.

I have read a bit of documentation on how to back up virtual machines in
storage domains, but the process of mounting snapshots for all our machines
within a dedicated VM doesn't seem as efficient when we have almost 300
virtual machines and only 1Gb networking.

Thanks for the advice.

*Gary Lloyd*

I.T. Systems:Keele University
Finance & IT Directorate
Keele:Staffs:IC1 Building:ST5 5NB:UK
+44 1782 733063


On 6 October 2016 at 11:07, Nir Soffer  wrote:

> On Thu, Oct 6, 2016 at 10:19 AM, Gary Lloyd  wrote:
>
>> I asked on the Dell Storage Forum and they recommend the following:
>>
>> *I recommend not using a numeric value for the "no_path_retry" variable
>> within /etc/multipath.conf, as once that numeric value is reached, if no
>> healthy LUNs were discovered during that defined time, multipath will
>> disable the I/O queue altogether.*
>>
>> *I do recommend, however, changing the variable value from "12" (or even
>> "60") to "queue", which will then allow multipathd to continue queuing I/O
>> until a healthy LUN is discovered (the time of fail-over between controllers)
>> and I/O is allowed to flow once again.*
>>
>> Can you see any issues with this recommendation as far as Ovirt is
>> concerned?
>>
> Yes, we cannot work with an unlimited queue. This will block vdsm for an
> unlimited time when the next command tries to access storage. Because we
> don't have good isolation between different storage domains, this may
> cause other storage domains to become faulty. Also, engine flows that
> have a timeout will fail with a timeout.
>
> If you are on 3.x this will be very painful; on 4.0 it should be better,
> but it is not recommended.
>
> Nir
>
>



[ovirt-users] Re: VMs paused due to IO issues - Dell Equallogic controller failover

2019-05-14 Thread InterNetX - Juergen Gotteswinter
You need the EQL HIT Kit to make it work at least somewhat better, but the
HIT Kit requires multipathd to be disabled, which is a dependency of oVirt.

So far, no real workaround seems to be known.

On 06.10.2016 at 09:19, Gary Lloyd wrote:
> I asked on the Dell Storage Forum and they recommend the following:
> 
> /I recommend not using a numeric value for the "no_path_retry" variable
> within /etc/multipath.conf, as once that numeric value is reached, if no
> healthy LUNs were discovered during that defined time, multipath will
> disable the I/O queue altogether./
>
> /I do recommend, however, changing the variable value from "12" (or even
> "60") to "queue", which will then allow multipathd to continue queuing I/O
> until a healthy LUN is discovered (the time of fail-over between
> controllers) and I/O is allowed to flow once again./
>
> Can you see any issues with this recommendation as far as Ovirt is
> concerned?

[ovirt-users] Re: VMs paused due to IO issues - Dell Equallogic controller failover

2019-05-14 Thread Nir Soffer
On Thu, Oct 6, 2016 at 10:19 AM, Gary Lloyd  wrote:

> I asked on the Dell Storage Forum and they recommend the following:
>
> *I recommend not using a numeric value for the "no_path_retry" variable
> within /etc/multipath.conf, as once that numeric value is reached, if no
> healthy LUNs were discovered during that defined time, multipath will
> disable the I/O queue altogether.*
>
> *I do recommend, however, changing the variable value from "12" (or even
> "60") to "queue", which will then allow multipathd to continue queuing I/O
> until a healthy LUN is discovered (the time of fail-over between controllers)
> and I/O is allowed to flow once again.*
>
> Can you see any issues with this recommendation as far as Ovirt is
> concerned?
>
Yes, we cannot work with an unlimited queue. This will block vdsm for an
unlimited time when the next command tries to access storage. Because we
don't have good isolation between different storage domains, this may cause
other storage domains to become faulty. Also, engine flows that have a
timeout will fail with a timeout.

If you are on 3.x this will be very painful; on 4.0 it should be better,
but it is not recommended.

Nir



[ovirt-users] Re: VMs paused due to IO issues - Dell Equallogic controller failover

2019-05-14 Thread Michal Skrivanek

> On 4 Oct 2016, at 09:51, Gary Lloyd  wrote:
> 
> Hi
> 
> We have Ovirt 3.65 with a Dell Equallogic SAN and we use Direct Luns for all 
> our VMs.
> At the weekend during early hours an Equallogic controller failed over to its 
> standby on one of our arrays and this caused about 20 of our VMs to be paused 
> due to IO problems.
> 
> I have also noticed that this happens during Equallogic firmware upgrades 
> since we moved onto Ovirt 3.65.
> 
> As recommended by Dell, disk timeouts within the VMs are set to 60 seconds
> when they are hosted on an EqualLogic SAN.
>
> Is there any other timeout value that we can configure in vdsm.conf to stop
> VMs from getting paused when a controller fails over?

Not really, but things are not so different when you look at it from the
guest perspective. If the intention is to hide the fact that there is a
problem, and the guest should just see a delay (instead of dealing with an
error), then pausing and unpausing is the right behavior. From the guest's
point of view it is just a delay.

> 
> Also is there anything that we can tweak to automatically unpause the VMs
> once connectivity with the arrays is re-established?

That should happen when the storage domain monitoring detects the error and
then reactivates (http://gerrit.ovirt.org/16244). It may be that since you
have direct luns it's not working with those... I don't know; storage
people should chime in, I guess.

Thanks,
michal

> 
> At the moment we are running a customized version of storageServer.py, as 
> Ovirt has yet to include iscsi multipath support for Direct Luns out of the 
> box.
> 
> Many Thanks
> 
> 
> Gary Lloyd
> 
> I.T. Systems:Keele University
> Finance & IT Directorate
> Keele:Staffs:IC1 Building:ST5 5NB:UK
> +44 1782 733063
> 


[ovirt-users] Re: VMs paused due to IO issues - Dell Equallogic controller failover

2019-05-14 Thread Nir Soffer
On Tue, Oct 4, 2016 at 10:51 AM, Gary Lloyd  wrote:

> Hi
>
> We have Ovirt 3.65 with a Dell Equallogic SAN and we use Direct Luns for
> all our VMs.
> At the weekend during early hours an Equallogic controller failed over to
> its standby on one of our arrays and this caused about 20 of our VMs to be
> paused due to IO problems.
>
> I have also noticed that this happens during Equallogic firmware upgrades
> since we moved onto Ovirt 3.65.
>
> As recommended by Dell, disk timeouts within the VMs are set to 60 seconds
> when they are hosted on an EqualLogic SAN.
>
> Is there any other timeout value that we can configure in vdsm.conf to
> stop VMs from getting paused when a controller fails over?
>

You can set the timeout in multipath.conf.

With the current multipath configuration (deployed by vdsm), when all paths
to a device are lost (e.g. you take down all ports on the server during an
upgrade), all io will fail immediately.

If you want to allow 60 seconds of grace time in such a case, you can configure:

no_path_retry 12

This will continue to monitor the paths 12 times, every 5 seconds
(assuming polling_interval=5). If some path recovers during this time, the
io can complete and the vm will not be paused.

If no path is available after these retries, io will fail and vms with
pending io will pause.

Note that this will also cause delays in vdsm in various flows, increasing
the chance of timeouts on the engine side, or delays in storage domain
monitoring.

However, the 60 seconds delay is expected only the first time all paths
become faulty. Once the timeout has expired, any access to the device will
fail immediately.

To configure this, you must add the # VDSM PRIVATE tag on the second line of
multipath.conf, otherwise vdsm will override your configuration the next
time you run vdsm-tool configure.

multipath.conf should look like this:

# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
    polling_interval    5
    no_path_retry       12
    user_friendly_names no
    flush_on_last_del   yes
    fast_io_fail_tmo    5
    dev_loss_tmo        30
    max_fds             4096
}

devices {
    device {
        all_devs        yes
        no_path_retry   12
    }
}

This will use a 12-retry (60 second) timeout for any device. If you would
like to configure only your specific device, you can add a device section
for your specific server instead.
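
(For example, a device section scoped to EqualLogic arrays might look like
this -- a sketch; the vendor/product strings below are the values
EqualLogic arrays commonly report, not something confirmed in this thread,
so verify them with "multipath -ll" on your host:)

devices {
    device {
        vendor          "EQLOGIC"
        product         "100E-00"
        no_path_retry   12
    }
}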


>
> Also is there anything that we can tweak to automatically unpause the VMs
> once connectivity with the arrays is re-established?
>

Vdsm will resume the vms when the storage monitor detects that storage has
become available again. However, we cannot guarantee that storage
monitoring will detect that storage was down. This should be improved in
4.0.


> At the moment we are running a customized version of storageServer.py, as
> Ovirt has yet to include iscsi multipath support for Direct Luns out of the
> box.
>

Would you like to share this code?

Nir



[ovirt-users] Re: VMs paused due to IO issues - Dell Equallogic controller failover

2019-05-14 Thread Nir Soffer
On Tue, Oct 4, 2016 at 7:03 PM, Michal Skrivanek <michal.skriva...@redhat.com> wrote:

>
> > On 4 Oct 2016, at 09:51, Gary Lloyd  wrote:
> >
> > Hi
> >
> > We have Ovirt 3.65 with a Dell Equallogic SAN and we use Direct Luns for
> > all our VMs.
> > At the weekend during early hours an Equallogic controller failed over
> > to its standby on one of our arrays and this caused about 20 of our VMs
> > to be paused due to IO problems.
> >
> > I have also noticed that this happens during Equallogic firmware
> > upgrades since we moved onto Ovirt 3.65.
> >
> > > As recommended by Dell, disk timeouts within the VMs are set to 60
> > > seconds when they are hosted on an EqualLogic SAN.
> > >
> > > Is there any other timeout value that we can configure in vdsm.conf to
> > > stop VMs from getting paused when a controller fails over?
>
> Not really, but things are not so different when you look at it from the
> guest perspective. If the intention is to hide the fact that there is a
> problem, and the guest should just see a delay (instead of dealing with an
> error), then pausing and unpausing is the right behavior. From the guest's
> point of view it is just a delay.
>
> >
> > Also is there anything that we can tweak to automatically unpause the
> > VMs once connectivity with the arrays is re-established?
>
> That should happen when the storage domain monitoring detects the error
> and then reactivates (http://gerrit.ovirt.org/16244). It may be that since
> you have direct luns it's not working with those... I don't know; storage
> people should chime in, I guess.
>


We don't monitor direct luns, only storage domains, so we do not support
resuming vms using direct luns.

multipath does monitor all devices, so we could monitor the device status
via multipath, and resume paused vms when a device moves from the faulty
state to the active state.

Maybe open an RFE for this?

Nir
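
(Until such monitoring exists, a host-side watchdog along the lines below
could approximate the idea. This is a rough sketch only -- it is not part
of vdsm, the script name and loop interval are invented for illustration,
and resuming VMs behind vdsm's back may confuse the engine. It assumes
virsh and multipathd are available and that the script runs as root:)

#!/usr/bin/env python
# resume-watchdog.py (hypothetical): resume paused VMs once multipath
# reports at least one healthy path again.
import subprocess
import time

def paths_healthy():
    # "multipathd show paths" prints one line per path; a healthy path
    # shows dm state "active" and checker state "ready".
    out = subprocess.check_output(["multipathd", "show", "paths"]).decode()
    return any("active ready" in line for line in out.splitlines())

def paused_vms():
    # Names of all paused libvirt domains on this host.
    out = subprocess.check_output(
        ["virsh", "list", "--state-paused", "--name"]).decode()
    return [name.strip() for name in out.splitlines() if name.strip()]

def main():
    while True:
        if paths_healthy():
            for vm in paused_vms():
                # Assumes the pause was caused by the io error; a real
                # tool would check the pause reason first.
                subprocess.call(["virsh", "resume", vm])
        time.sleep(5)

if __name__ == "__main__":
    main()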
