Hi,

this is a somewhat complex issue, so I'll try to be as clear as possible. We are
running oVirt-3.1 in our production environment, based on minimal Fedora 17
installs. We have 4x HP 380s (Intel) running in one cluster, and 2x Sun 7310s
(AMD) in another cluster. They have shared storage over NFS to a FreeBSD-based
system that uses ZFS as its filesystem. The storage boots off of a mirrored ZFS
pool made up of two USB sticks that houses only /, while /var, /usr, etc. live
on a separate ZFS pool made up of the rest of the HDDs in the system. It looks
like this:

FS                                 MOUNTPOINT
pool1 (the mirrored USB sticks)    none
pool1/root                         / (mounted ro)
pool2 (the regular HDDs)           none
pool2/root                         none
pool2/root/usr                     /usr
pool2/root/usr/home                /usr/home
pool2/root/usr/local               /usr/local
pool2/root/var                     /var
tmpfs                              /tmp
pool2/export                       /export
pool2/export/ds1                   /export/ds1
pool2/export/ds1/data              /export/ds1/data
pool2/export/ds1/export            /export/ds1/export
pool2/export/ds1/iso               /export/ds1/iso
pool2/export/ds2                   /export/ds2
pool2/export/ds2/data              /export/ds2/data
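
For reference, the table above is essentially what the storage box itself
reports; something along these lines reproduces it (just a sketch, the tmpfs
line comes from the ordinary mount table rather than from zfs):

# list the ZFS datasets and their mountpoints
zfs list -o name,mountpoint
# tmpfs on /tmp shows up in the regular mount output instead
mount | grep tmpfs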

/etc/exports:
/export/ds1/data    -alldirs -maproot=root 10.0.0.(all of the HV's)
/export/ds1/export  -alldirs -maproot=root 10.0.0.(all of the HV's)
/export/ds1/iso     -alldirs -maproot=root 10.0.0.(all of the HV's)
/export/ds2/data    -alldirs -maproot=root 10.0.0.(all of the HV's)
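
Whenever /etc/exports is edited, this is roughly the drill on the FreeBSD side
to pick the change up and double-check it (shown only as a sketch):

# tell mountd to reread /etc/exports
service mountd reload
# verify what actually gets exported
showmount -e localhost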

To make those USB sticks last as long as possible, / is normally kept mounted
read-only. Whenever anything needs changing, you remount / read-write, do the
maintenance, and then remount back to read-only again. But when you issue that
mount command, the VMs in oVirt pause. At first we didn't understand that this
was actually the cause and tried to correlate the seemingly spontaneous pausing
with just about anything. Then I happened to be logged in to both oVirt's
webadmin and the storage at the same time, issued "mount -uw /", and *boom*,
random VMs started to pause :) Not all of them though, and not all the VMs in
one particular cluster either; it is completely random which VMs get paused
every time.
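
For clarity, the whole maintenance cycle on the storage box looks roughly like
this (the middle step is just a placeholder for whatever the maintenance
happens to be; it is the two remounts that trigger the pauses):

# make / writable for the duration of the maintenance
mount -uw /
# ...edit configs, apply patches, whatever needs doing...
# then flip / back to read-only to spare the USB sticks
mount -ur /

The remount itself only takes a couple of seconds: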

# time mount -ur /

real 0m2.198s
user 0m0.000s
sys 0m0.002s

And here's what vdsm on one of the HVs thought about that:
http://pastebin.com/MXjgpDfU

It begins with all VMs being "Up", then me issuing the remount on the storage
from read-write to read-only (which took 2 secs to complete), vdsm freaking out
when it briefly loses its connections, and lastly me at 14:34 making them all
run again from webadmin.

Two things:
1) Does anyone know of any improvements that could be made on the storage side,
apart from the obvious "stop remounting"? Patches eventually have to be applied,
configurations changed, and so on. A smarter way of configuring something?
Booting from another ordinary HDD is sadly out of the question because there
isn't room for any more drives, the chassis is full. I would really rather have
it boot from the HDDs that are already in there, but there are "other things"
preventing that.
2) Nothing was logged about it in engine, no "Events" were generated, and there
is nothing in engine.log to indicate that anything had gone wrong at all. If it
wasn't serious enough to issue a warning, why disrupt the service by pausing the
machines? Or at least start them back up automatically when the connection to
the storage came back, almost immediately, on its own. Saying nothing made it
really hard to troubleshoot, since we initially didn't know at all what could be
causing the pauses, or when.

Best Regards
/Karli Sjöberg