Re: [ovirt-users] NFS IO timeout configuration

2016-01-13 Thread Yaniv Kaul
On Tue, Jan 12, 2016 at 10:45 PM, Markus Stockhausen <
stockhau...@collogia.de> wrote:

> >> From: Yaniv Kaul [yk...@redhat.com]
> >> Sent: Tuesday, 12 January 2016 13:15
> >> To: Markus Stockhausen
> >> Cc: users@ovirt.org; Mike Hildebrandt
> >> Subject: Re: [ovirt-users] NFS IO timeout configuration
> >>
> >> On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen
> >> <stockhau...@collogia.de> wrote:
> >> Hi there,
> >>
> >> we got a nasty situation yesterday in our OVirt 3.5.6 environment.
> >> We ran a LSM that failed during the cleanup operation. To be precise
> >> when the process deleted an image on the source NFS storage.
> >
> > Can you share with us your NFS server details?
> > Is the NFS connection healthy (can be seen with nfsstat)?
> > Generally, delete on NFS should be a pretty quick operation.
> > Y.
>
> Hi Yaniv,
>
> we usually have no problems with our NFS server. From our observations we
> only have issues when deleting files with many extents. This applies to all
> oVirt image files. Several of them have more than 50,000 extents, a few
> even more than 300,000.
>
> > xfs_bmap 1cb5906f-65d8-4174-99b1-74f5b3cbc537
> ...
> 52976: [629122144..629130335]: 10986198592..10986206783
> 52977: [629130336..629130343]: 10986403456..10986403463
> 52978: [629130344..629138535]: 10986206792..10986214983
> 52979: [629138536..629138543]: 10986411656..10986411663
> 52980: [629138544..629145471]: 10986214992..10986221919
> 52981: [629145472..629145575]: 10809903560..10809903663
> 52982: [629145576..629145599]: 10737615056..10737615079
>
> Our XFS is mounted with:
>
> /dev/mapper/vg00-lvdata on /var/nas4 type xfs
> (rw,noatime,nodiratime,allocsize=16m)
>
> Why do we use allocsize=16m? We started out with allocsize=512m. That
> produced sparse files that saved hardly any space, because a single byte
> written resulted in a 512 MB allocation. Thin allocation of these files
> led to long runtimes when formatting disks inside the VMs. So we reduced
> it to 16 MB as a balanced configuration.
>
> This works quite well but not for remove operations.
>
> Better ideas?
>

Sounds like an XFS issue more than NFS.
I've consulted with one of our XFS gurus - here's his reply:

> For vm image files, users should set up extent size hints to define
> the minimum extent allocation size in a file - allocsize does
> nothing for random writes into sparse files. I typically use a hint
> of 1MB for all my vm images.
>
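
For reference, extent size hints are usually set with the extsize command
of xfs_io. The sketch below only builds that command line in Python and does
not run it. The directory path is a made-up example, and the 1m value is
taken from the advice quoted above. As far as I know, a hint set on a
directory is inherited by files created in it afterwards, and the hint cannot
be changed on a file that already has extents allocated, so please check the
xfs_io man page before relying on this.

```python
import shlex


def extsize_command(path, hint="1m"):
    """Build the argv for setting an XFS extent size hint with xfs_io."""
    # "extsize <value>" is the xfs_io command; the value comes from the caller.
    return ["xfs_io", "-c", "extsize {}".format(hint), path]


if __name__ == "__main__":
    # Hypothetical images directory on the NFS server, for illustration only.
    argv = extsize_command("/var/nas4/images")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Existing, already fragmented image files would presumably have to be copied
into a directory carrying the hint before they benefit from it.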

Y.


>
> Markus
>
>
>
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] NFS IO timeout configuration

2016-01-12 Thread Nir Soffer
On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen
 wrote:
> Hi there,
>
> we hit a nasty situation yesterday in our oVirt 3.5.6 environment.
> We ran an LSM (live storage migration) that failed during the cleanup
> operation. To be precise, when the process deleted an image on the
> source NFS storage.
>
> Engine log gives:
>
> 2016-01-11 20:49:45,120 INFO  
> [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] 
> (org.ovirt.thread.pool-8-thread-14) [77277f0] START, 
> DeleteImageGroupVDSCommand( storagePoolId = 
> 94ed7a19-fade-4bd6-83f2-2cbb2f730b95, ignoreFailoverLimit = false, 
> storageDomainId = 272ec473-6041-42ee-bd1a-732789dd18d4, imageGroupId = 
> aed132ef-703a-44d0-b875-db8c0d2c1a92, postZeros = false, forceDelete = 
> false), log id: b52d59c
> ...
> 2016-01-11 20:50:45,206 ERROR 
> [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] 
> (org.ovirt.thread.pool-8-thread-14) [77277f0] Failed in DeleteImageGroupVDS 
> method
>
> VDSM (SPM) log gives:
>
> Thread-97::DEBUG::2016-01-11 
> 20:49:45,737::fileSD::384::Storage.StorageDomain::(deleteImage) Removing 
> file: 
> /rhev/data-center/mnt/1.2.3.4:_var_nas2_OVirtIB/272ec473-6041-42ee-bd1a-732789dd18d4/images/_remojzBd1r/0d623afb-291e-4f4c-acba-caecb125c4ed
> ...
> Thread-97::ERROR::2016-01-11 
> 20:50:45,737::task::866::Storage.TaskManager.Task::(_setError) 
> Task=`cd477878-47b4-44b1-85a3-b5da19543a5e`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 873, in _run
> return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
> res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 1549, in deleteImage
> pool.deleteImage(dom, imgUUID, volsByImg)
>   File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
> return method(self, *args, **kwargs)
>   File "/usr/share/vdsm/storage/sp.py", line 1884, in deleteImage
> domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
>   File "/usr/share/vdsm/storage/fileSD.py", line 385, in deleteImage
> self.oop.os.remove(volPath)
>   File "/usr/share/vdsm/storage/outOfProcess.py", line 245, in remove
> self._iop.unlink(path)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in 
> unlink
> return self._sendCommand("unlink", {"path": path}, self.timeout)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in 
> _sendCommand
> raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out

You stumbled into  https://bugzilla.redhat.com/1270220
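
A side note on the error text: the last frame of the traceback raises
Timeout(os.strerror(errno.ETIMEDOUT)), so "Connection timed out" is just the
standard message for ETIMEDOUT and does not by itself mean that the NFS
connection dropped. The sketch below shows the general pattern of a caller
that stops waiting for a slow operation. It is not the ioprocess
implementation; the Timeout class and call_with_timeout are stand-ins made up
for illustration.

```python
import errno
import os
import threading
import time


class Timeout(OSError):
    """Stand-in for the Timeout exception seen in the traceback."""


def call_with_timeout(fn, timeout, *args):
    """Run fn(*args) in a worker thread; give up after `timeout` seconds."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = fn(*args)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The worker is still busy; the caller only stops waiting for it.
        raise Timeout(errno.ETIMEDOUT, os.strerror(errno.ETIMEDOUT))
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


if __name__ == "__main__":
    try:
        call_with_timeout(time.sleep, 0.1, 1.0)  # sleeps longer than we wait
    except Timeout as exc:
        print(exc.strerror)
```

Note that in this sketch the slow operation keeps running after the caller
has given up, which matches the picture of an unlink that is still in
progress on the server while the task has already been marked as failed.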

>
> Reading the docs I got the idea that vdsm default 60 second timeout
> for IO operations might be changed within /etc/vdsm/vdsm.conf
>
> [irs]
> process_pool_timeout = 180
>
> Can anyone confirm that this will solve the problem?

Yes, this is the correct option.
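
To double check what a host will actually pick up, something like the
following can read the option back. This is only a sketch using Python's
configparser, not the way vdsm itself loads its configuration; the sample
text mirrors the snippet quoted above, and the 60 second default is the value
mentioned in this thread.

```python
import configparser

# Sample contents mirroring the /etc/vdsm/vdsm.conf snippet quoted above.
SAMPLE_CONF = """
[irs]
process_pool_timeout = 180
"""

# Default mentioned in this thread for the IO timeout, in seconds.
DEFAULT_TIMEOUT = 60


def read_process_pool_timeout(conf_text, default=DEFAULT_TIMEOUT):
    """Return [irs] process_pool_timeout in seconds, or the default."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback covers both a missing [irs] section and a missing option.
    return parser.getint("irs", "process_pool_timeout", fallback=default)


if __name__ == "__main__":
    print(read_process_pool_timeout(SAMPLE_CONF))
```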

But note that deleting an image on NFS means 3 unlink operations per volume.
If you have an image with one snapshot, that means 2 volumes and 6
unlink calls.

If each unlink takes 70 seconds (timing out with the current 60-second
timeout), deleting the image with one snapshot will take 420 seconds.

On the engine side, the engine waits until deleteImage finishes or until
vdsTimeout expires (180 seconds by default), so you may need to increase the
engine timeout as well.
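
The arithmetic above, as a small sketch. It assumes the unlink calls run one
after another and that an image with N snapshots has N + 1 volumes; the
function names are made up for illustration.

```python
UNLINKS_PER_VOLUME = 3  # figure given in this message for NFS


def delete_image_seconds(snapshots, seconds_per_unlink):
    """Worst-case time to delete an image when unlinks run sequentially."""
    volumes = snapshots + 1  # assumption: base volume plus one per snapshot
    return volumes * UNLINKS_PER_VOLUME * seconds_per_unlink


def fits_engine_timeout(snapshots, seconds_per_unlink, vds_timeout=180):
    """True if the whole delete finishes within the engine's vdsTimeout."""
    return delete_image_seconds(snapshots, seconds_per_unlink) <= vds_timeout


if __name__ == "__main__":
    print(delete_image_seconds(1, 70))   # one snapshot, 70 s per unlink
    print(fits_engine_timeout(1, 70))    # compared with the 180 s default
```

With one snapshot and 70 seconds per unlink, the 420 seconds are well past
the default vdsTimeout, so raising process_pool_timeout alone would not be
enough in that case.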

While the engine waits for deleteImage to finish, no other SPM operation can run.

So increasing the timeout is not the correct solution. You should check why your
storage needs more than 60 seconds to perform an unlink operation, and change
your setup so that unlink works in a timely manner.

As a start, it would be useful to see the output of nfsstat on the host
experiencing the slow deletes.

In master we now perform the deleteImage operation in a background task, so
a slow unlink should not affect the engine side, and you can increase
process_pool_timeout as needed.
See
https://github.com/oVirt/vdsm/commit/3239e74d1a9087352fca454926224f47272da6c5
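
To illustrate the idea only (this is not the code from the commit above, and
the helper name is made up): the request returns as soon as the work has been
handed to a worker, and the slow unlink calls no longer hold up the caller.

```python
import os
import tempfile
import threading


def delete_in_background(paths):
    """Start unlinking `paths` in a worker thread and return it immediately."""
    def worker():
        for path in paths:
            try:
                os.unlink(path)  # may be slow on a fragmented file
            except FileNotFoundError:
                pass  # already gone; nothing left to do

    thread = threading.Thread(target=worker, name="delete-image", daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    task = delete_in_background([path])  # returns right away
    task.join()                          # wait here only for the demo
    print(os.path.exists(path))
```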

We don't plan to backport this change to 3.6, since it is risky and does not
fix the root cause, which is the slow NFS server, but if you want to test it,
I can make a patch for 3.6.

Nir


Re: [ovirt-users] NFS IO timeout configuration

2016-01-12 Thread Vinzenz Feenstra

> On Jan 12, 2016, at 8:32 AM, Markus Stockhausen  
> wrote:
> 
> Hi there,
> 
> we got a nasty situation yesterday in our OVirt 3.5.6 environment. 
> We ran a LSM that failed during the cleanup operation. To be precise 
> when the process deleted an image on the source NFS storage. 
> 
> Engine log gives:
> 
> 2016-01-11 20:49:45,120 INFO  
> [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] 
> (org.ovirt.thread.pool-8-thread-14) [77277f0] START, 
> DeleteImageGroupVDSCommand( storagePoolId = 
> 94ed7a19-fade-4bd6-83f2-2cbb2f730b95, ignoreFailoverLimit = false, 
> storageDomainId = 272ec473-6041-42ee-bd1a-732789dd18d4, imageGroupId = 
> aed132ef-703a-44d0-b875-db8c0d2c1a92, postZeros = false, forceDelete = 
> false), log id: b52d59c
> ...
> 2016-01-11 20:50:45,206 ERROR 
> [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] 
> (org.ovirt.thread.pool-8-thread-14) [77277f0] Failed in DeleteImageGroupVDS 
> method
> 
> VDSM (SPM) log gives:
> 
> Thread-97::DEBUG::2016-01-11 
> 20:49:45,737::fileSD::384::Storage.StorageDomain::(deleteImage) Removing 
> file: 
> /rhev/data-center/mnt/1.2.3.4:_var_nas2_OVirtIB/272ec473-6041-42ee-bd1a-732789dd18d4/images/_remojzBd1r/0d623afb-291e-4f4c-acba-caecb125c4ed
> ...
> Thread-97::ERROR::2016-01-11 
> 20:50:45,737::task::866::Storage.TaskManager.Task::(_setError) 
> Task=`cd477878-47b4-44b1-85a3-b5da19543a5e`::Unexpected error
> Traceback (most recent call last):
>  File "/usr/share/vdsm/storage/task.py", line 873, in _run
>return fn(*args, **kargs)
>  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>res = f(*args, **kwargs)
>  File "/usr/share/vdsm/storage/hsm.py", line 1549, in deleteImage
>pool.deleteImage(dom, imgUUID, volsByImg)
>  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
>return method(self, *args, **kwargs)
>  File "/usr/share/vdsm/storage/sp.py", line 1884, in deleteImage
>domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
>  File "/usr/share/vdsm/storage/fileSD.py", line 385, in deleteImage
>self.oop.os.remove(volPath)
>  File "/usr/share/vdsm/storage/outOfProcess.py", line 245, in remove
>self._iop.unlink(path)
>  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in 
> unlink
>return self._sendCommand("unlink", {"path": path}, self.timeout)
>  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in 
> _sendCommand
>raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out
> 
> Reading the docs I got the idea that vdsm default 60 second timeout
> for IO operations might be changed within /etc/vdsm/vdsm.conf
> 
> [irs]
> process_pool_timeout = 180
> 
> Can anyone confirm that this will solve the problem?

Well, it will increase the timeout to 3 minutes, and it takes effect after
restarting vdsm and supervdsm. Whether that is enough depends on your setup.

> 
> Markus
> 
> 
> 
> 
> 


Re: [ovirt-users] NFS IO timeout configuration

2016-01-12 Thread Vinzenz Feenstra

> On Jan 12, 2016, at 9:11 AM, Markus Stockhausen <stockhau...@collogia.de> 
> wrote:
> 
>> From: Vinzenz Feenstra [vfeen...@redhat.com]
>> Sent: Tuesday, 12 January 2016 09:00
>> To: Markus Stockhausen
>> Cc: users@ovirt.org; Mike Hildebrandt
>> Subject: Re: [ovirt-users] NFS IO timeout configuration
>>> Hi there,
>>> 
>>> we got a nasty situation yesterday in our OVirt 3.5.6 environment. 
>>> We ran a LSM that failed during the cleanup operation. To be precise 
>>> when the process deleted an image on the source NFS storage. 
>>> 
> ...
>>> 
>>> Reading the docs I got the idea that vdsm default 60 second timeout
>>> for IO operations might be changed within /etc/vdsm/vdsm.conf
>>> 
>>> [irs]
>>> process_pool_timeout = 180
>>> 
>>> Can anyone confirm that this will solve the problem?
>> 
>> Well it will increase the time to 3 minutes and takes effect after 
>> restarting vdsm and supervdsm - If that is enough that might depend on your 
>> setup.
>> 
> 
> Thanks Vinzenz,
> 
> maybe my question was not 100% correct. I need to know, if this parameter 
> really influences the described timeout behaviour. The best value of the 
> parameter must be checked of course.

Well, I might be wrong, but from what I can see that is the right option to
configure for this.

> 
> Markus



Re: [ovirt-users] NFS IO timeout configuration

2016-01-12 Thread Markus Stockhausen
> From: Vinzenz Feenstra [vfeen...@redhat.com]
> Sent: Tuesday, 12 January 2016 09:00
> To: Markus Stockhausen
> Cc: users@ovirt.org; Mike Hildebrandt
> Subject: Re: [ovirt-users] NFS IO timeout configuration
> > Hi there,
> > 
> > we got a nasty situation yesterday in our OVirt 3.5.6 environment. 
> > We ran a LSM that failed during the cleanup operation. To be precise 
> > when the process deleted an image on the source NFS storage. 
> > 
...
> >
> > Reading the docs I got the idea that vdsm default 60 second timeout
> > for IO operations might be changed within /etc/vdsm/vdsm.conf
> >
> > [irs]
> > process_pool_timeout = 180
> >
> > Can anyone confirm that this will solve the problem?
> 
> Well it will increase the time to 3 minutes and takes effect after restarting 
> vdsm and supervdsm - If that is enough that might depend on your setup.
> 

Thanks Vinzenz,

maybe my question was not precise enough. I need to know whether this
parameter really influences the timeout behaviour described above. The best
value for it will of course have to be determined by testing.

Markus

This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.

e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.

Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln

executive board:
Kadir Akin
Dr. Michael Höhnerbach

President of the supervisory board:
Hans Kristian Langva

Registry office: district court Cologne
Register number: HRB 52 497




Re: [ovirt-users] NFS IO timeout configuration

2016-01-12 Thread Yaniv Kaul
On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen  wrote:

> Hi there,
>
> we got a nasty situation yesterday in our OVirt 3.5.6 environment.
> We ran a LSM that failed during the cleanup operation. To be precise
> when the process deleted an image on the source NFS storage.
>

Can you share with us your NFS server details?
Is the NFS connection healthy (can be seen with nfsstat)?
Generally, delete on NFS should be a pretty quick operation.
Y.




Re: [ovirt-users] NFS IO timeout configuration

2016-01-12 Thread Markus Stockhausen
>> From: Yaniv Kaul [yk...@redhat.com]
>> Sent: Tuesday, 12 January 2016 13:15
>> To: Markus Stockhausen
>> Cc: users@ovirt.org; Mike Hildebrandt
>> Subject: Re: [ovirt-users] NFS IO timeout configuration
>>
>> On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen <stockhau...@collogia.de>
>> wrote:
>> Hi there,
>> 
>> we got a nasty situation yesterday in our OVirt 3.5.6 environment.
>> We ran a LSM that failed during the cleanup operation. To be precise
>> when the process deleted an image on the source NFS storage.
>
> Can you share with us your NFS server details?
> Is the NFS connection healthy (can be seen with nfsstat)?
> Generally, delete on NFS should be a pretty quick operation.
> Y.

Hi Yaniv,

we usually have no problems with our NFS server. From our observations we
only have issues when deleting files with many extents. This applies to all
oVirt image files. Several of them have more than 50,000 extents, a few
even more than 300,000.

> xfs_bmap 1cb5906f-65d8-4174-99b1-74f5b3cbc537
...
52976: [629122144..629130335]: 10986198592..10986206783
52977: [629130336..629130343]: 10986403456..10986403463
52978: [629130344..629138535]: 10986206792..10986214983
52979: [629138536..629138543]: 10986411656..10986411663
52980: [629138544..629145471]: 10986214992..10986221919
52981: [629145472..629145575]: 10809903560..10809903663
52982: [629145576..629145599]: 10737615056..10737615079
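
A small sketch for summarizing such output, assuming the plain xfs_bmap
format shown above (extent number, file offset range in brackets, block
range) with all values in 512-byte blocks. The sample lines are copied from
the listing; the function name is made up.

```python
import re

# One mapped extent per line: "N: [file_start..file_end]: blk_start..blk_end"
EXTENT_RE = re.compile(r"^\s*(\d+): \[(\d+)\.\.(\d+)\]: (\d+)\.\.(\d+)\s*$")

SAMPLE = """\
52976: [629122144..629130335]: 10986198592..10986206783
52977: [629130336..629130343]: 10986403456..10986403463
52978: [629130344..629138535]: 10986206792..10986214983
"""


def extent_lengths(bmap_output):
    """Return the length of each mapped extent, in 512-byte blocks."""
    lengths = []
    for line in bmap_output.splitlines():
        match = EXTENT_RE.match(line)
        if match:  # lines that do not match (file name, holes) are skipped
            start, end = int(match.group(2)), int(match.group(3))
            lengths.append(end - start + 1)
    return lengths


if __name__ == "__main__":
    for blocks in extent_lengths(SAMPLE):
        print(blocks, "blocks =", blocks * 512 // 1024, "KiB")
```

On the sample lines this gives extents of 8192 and 8 blocks, that is 4 MiB
and 4 KiB, which shows how small some of the extents in these files are.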

Our XFS is mounted with:

/dev/mapper/vg00-lvdata on /var/nas4 type xfs 
(rw,noatime,nodiratime,allocsize=16m)

Why do we use allocsize=16m? We started out with allocsize=512m. That
produced sparse files that saved hardly any space, because a single byte
written resulted in a 512 MB allocation. Thin allocation of these files
led to long runtimes when formatting disks inside the VMs. So we reduced
it to 16 MB as a balanced configuration.

This works quite well but not for remove operations.

Better ideas?

Markus



