Hi,

I used the two links below to set up a test DC:

http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-two/

The only thing I did differently is that I did not use a hosted engine; instead I dedicated a physical server to it.
So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).

As in the docs above, my 3 hosts are publishing 300 GB of replicated Gluster storage, above which ctdb manages a floating virtual IP that serves as the NFS endpoint for the master storage domain.
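For context, the CTDB side of this boils down to two small files per node; the addresses and interface below are placeholders for illustration, not my real values:

```shell
# Placeholder CTDB config (addresses and interface names are made up).

# /etc/ctdb/nodes -- the fixed internal IP of each of the three hosts:
#   10.0.0.1
#   10.0.0.2
#   10.0.0.3

# /etc/ctdb/public_addresses -- the floating vIP that oVirt mounts over NFS:
#   10.0.0.100/24 eth0

# The master storage domain is then attached in oVirt through the vIP, e.g.:
#   10.0.0.100:/path/to/nfs/export
```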

The last point is that the manager is also presenting an NFS share that I'm using as an export domain.

It took me some time to put this setup together, as it is a bit more complicated than my other DC with a real SAN and no Gluster, but it is eventually working (I can run VMs, migrate them...).

I have run many severe tests (from a very dumb user's point of view: unplug/replug a server's power cable. Does ctdb float the vIP? Does Gluster self-heal? Does the VM restart?). When looking precisely at each layer one by one, all seems correct: ctdb is fast at moving the IP, NFS is OK, Gluster seems to reconstruct, fencing eventually worked with the lanplus workaround, and so on...
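For what it's worth, here is the kind of layer-by-layer checklist I run after each power-pull test. The volume name "data" comes from the brick paths mentioned below; everything else is standard gluster/ctdb/NFS tooling:

```shell
# Layer-by-layer health check after a power-pull test.
# "data" is the volume name; the vIP is whatever is listed in
# /etc/ctdb/public_addresses on the nodes.

# 1. CTDB: are all nodes OK, and which node currently holds the vIP?
ctdb status
ctdb ip

# 2. Gluster: are all peers connected and all brick processes online?
gluster peer status
gluster volume status data

# 3. Self-heal: any files still pending heal, or in split-brain?
gluster volume heal data info
gluster volume heal data info split-brain

# 4. NFS: is the master storage domain export visible through the vIP?
showmount -e <floating-vIP>
```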

But from time to time, a severe hiccup appears that I have great difficulty diagnosing.
The messages in the web GUI are not very precise, nor consistent:
- some tell of a host having network issues, but I can ping it from every place it needs to be reached (especially from the SPM and the manager): "On host serv-vm-al01, Error: Network error during communication with the Host"

- some tell that a volume is degraded, when it's not (the gluster commands show no issue, and even the oVirt volume tabs are all green)

- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center"
Just waiting a couple of seconds leads to this clearing itself with no action on my part.

- Repeated "Detected change in status of brick serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP."
even though absolutely no action is being made on this filesystem.
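Regarding those DOWN-to-UP brick flaps, one hedged suggestion: they are often glusterd losing contact with the brick process rather than a real filesystem event. The log paths below are the standard GlusterFS locations (exact names may differ slightly by version; the brick log name is derived from the brick path):

```shell
# glusterd's view of brick connectivity (standard path for GlusterFS 3.x):
tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# The brick process's own log; for brick /gluster/data/brick this is usually:
tail -f /var/log/glusterfs/bricks/gluster-data-brick.log

# Confirm the brick process PID is stable across the reported flaps
# (a changing PID would mean the brick process is actually restarting):
gluster volume status data
```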

At this time, zero VMs are running in this test datacenter, and no actions are being performed on the hosts. Yet I see some looping errors coming and going, and I can find no way to diagnose them.

Amongst the *actions* I have tried in order to solve these issues:
- Forcing the self-heal and playing with gluster commands had no effect.
- The gluster-advised "find /gluster -exec stat {} \; ..." also seemed to have no effect.
- Forcing ctdb to move the vIP ("ctdb stop", then "ctdb continue") DID SOLVE most of these issues. I believe that it's not what ctdb itself does that helps, but maybe one of its shell hooks is cleaning up some trouble?
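One hypothesis worth checking on that last point: it may be CTDB's event scripts, not the IP move itself, that clear the stale state. An IP takeover runs the hooks under /etc/ctdb/events.d/ (script names vary between CTDB versions), and the NFS-related ones typically restart lock services, while the takeover path sends gratuitous ARPs:

```shell
# See which CTDB hooks fire on an IP takeover/release;
# exact script names (e.g. 60.nfs) vary between CTDB versions.
ls /etc/ctdb/events.d/

# Reproduce the workaround by hand on the node currently holding the vIP:
ctdb stop        # disable this node; the vIP fails over elsewhere
ctdb continue    # re-enable it

# Watch the takeover from another node, and check the CTDB log
# (log location is version-dependent; /var/log/log.ctdb on older releases):
ctdb ip
tail -f /var/log/log.ctdb
```

If the "fix" reproduces this way, diffing what 60.nfs (or its equivalent) does against the broken state would narrow down which layer is actually stuck.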

As this setup is complex, I'm not asking anyone for a silver bullet, but maybe you know which layer is the most fragile, and which one I should look at more closely?

--
Nicolas ECARNOT
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
