Hi,

I used the two links below to set up a test DC:

http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-two/

The only thing I did differently is that I did not use a hosted engine; instead I dedicated a physical server to it.
So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).

As in the docs above, my 3 hosts are publishing 300 GB of replicated Gluster storage, above which ctdb manages a floating virtual IP that serves as the NFS endpoint for the master storage domain.
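For context, the CTDB side of this boils down to two small files per node; the addresses and interface below are placeholders for illustration, not my real values:

```shell
# Placeholder CTDB config (addresses and interface names are made up).

# /etc/ctdb/nodes -- the fixed internal IP of each of the three hosts:
#   10.0.0.1
#   10.0.0.2
#   10.0.0.3

# /etc/ctdb/public_addresses -- the floating vIP that oVirt mounts over NFS:
#   10.0.0.100/24 eth0

# The master storage domain is then attached in oVirt through the vIP, e.g.:
#   10.0.0.100:/path/to/nfs/export
```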

The last point is that the manager is also presenting an NFS share that I'm using as an export domain.

It took me some time to put this setup together, as it is a bit more complicated than my other DC with a real SAN and no Gluster, but it is eventually working (I can run VMs, migrate them...).

I have run many severe tests (from a very dumb user's point of view: unplug/replug a server's power cable. Does ctdb float the vIP? Does Gluster self-heal? Does the VM restart?). When looking precisely at each layer one by one, all seems correct: ctdb is fast at moving the IP, NFS is OK, Gluster seems to reconstruct, fencing eventually worked with the lanplus workaround, and so on...
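For what it's worth, here is the kind of layer-by-layer checklist I run after each power-pull test. The volume name "data" comes from the brick paths mentioned below; everything else is standard gluster/ctdb/NFS tooling:

```shell
# Layer-by-layer health check after a power-pull test.
# "data" is the volume name; the vIP is whatever is listed in
# /etc/ctdb/public_addresses on the nodes.

# 1. CTDB: are all nodes OK, and which node currently holds the vIP?
ctdb status
ctdb ip

# 2. Gluster: are all peers connected and all brick processes online?
gluster peer status
gluster volume status data

# 3. Self-heal: any files still pending heal, or in split-brain?
gluster volume heal data info
gluster volume heal data info split-brain

# 4. NFS: is the master storage domain export visible through the vIP?
showmount -e <floating-vIP>
```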

But from time to time, a severe hiccup appears that I have great difficulty diagnosing.
The messages in the web GUI are not very precise, nor consistent:
- some tell of a host having network issues, but I can ping it from every place it needs to be reached (especially from the SPM and the manager): "On host serv-vm-al01, Error: Network error during communication with the Host"

- some tell that a volume is degraded, when it's not (the gluster commands show no issue, and even the oVirt volume tabs are all green)

- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center"
Just waiting a couple of seconds leads to this clearing itself with no action on my part.

- Repeated "Detected change in status of brick serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP."
even though absolutely no action is being made on this filesystem.
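Regarding those DOWN-to-UP brick flaps, one hedged suggestion: they are often glusterd losing contact with the brick process rather than a real filesystem event. The log paths below are the standard GlusterFS locations (exact names may differ slightly by version; the brick log name is derived from the brick path):

```shell
# glusterd's view of brick connectivity (standard path for GlusterFS 3.x):
tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# The brick process's own log; for brick /gluster/data/brick this is usually:
tail -f /var/log/glusterfs/bricks/gluster-data-brick.log

# Confirm the brick process PID is stable across the reported flaps
# (a changing PID would mean the brick process is actually restarting):
gluster volume status data
```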

At this time, zero VMs are running in this test datacenter, and no actions are being performed on the hosts. Yet I see some looping errors coming and going, and I can find no way to diagnose them.

Amongst the *actions* I have tried in order to solve these issues:
- Forcing the self-heal and playing with gluster commands had no effect.
- The gluster-advised "find /gluster -exec stat {} \; ..." also seemed to have no effect.
- Forcing ctdb to move the vIP ("ctdb stop", then "ctdb continue") DID SOLVE most of these issues. I believe that it's not what ctdb itself does that helps, but maybe one of its shell hooks is cleaning up some trouble?
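One hypothesis worth checking on that last point: it may be CTDB's event scripts, not the IP move itself, that clear the stale state. An IP takeover runs the hooks under /etc/ctdb/events.d/ (script names vary between CTDB versions), and the NFS-related ones typically restart lock services, while the takeover path sends gratuitous ARPs:

```shell
# See which CTDB hooks fire on an IP takeover/release;
# exact script names (e.g. 60.nfs) vary between CTDB versions.
ls /etc/ctdb/events.d/

# Reproduce the workaround by hand on the node currently holding the vIP:
ctdb stop        # disable this node; the vIP fails over elsewhere
ctdb continue    # re-enable it

# Watch the takeover from another node, and check the CTDB log
# (log location is version-dependent; /var/log/log.ctdb on older releases):
ctdb ip
tail -f /var/log/log.ctdb
```

If the "fix" reproduces this way, diffing what 60.nfs (or its equivalent) does against the broken state would narrow down which layer is actually stuck.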

As this setup is complex, I'm not asking anyone for a silver bullet, but maybe you know which layer is the most fragile, and which one I should look at more closely?

--
Nicolas ECARNOT
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
