I just ran a few extra tests. I had a two-host hosted-engine setup running for a day, and both hosts had a score of 2400. I migrated the VM through the UI multiple times and it all worked fine. I then added the third host, and that's when it all fell to pieces: the other two hosts now have a score of 0.

I'm also curious: in the BZ there's a note about a scenario "where engine-vm block connection to storage domain (via iptables -I INPUT -s sd_ip -j DROP)". What's the purpose of that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <[email protected]> wrote:
> Ignore that, the issue came back after 10 minutes.
>
> I've even tried a gluster mount + nfs server on top of that, and the
> same issue has come back.
>
> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <[email protected]> wrote:
>> Interesting, I put it all into global maintenance. Shut it all down
>> for ~10 minutes, and it's regained its sanlock control and doesn't
>> seem to have that issue coming up in the log.
>>
>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <[email protected]> wrote:
>>> It was pure NFS on a NAS device. They all had different ids (had no
>>> redeployments of nodes before the problem occurred).
>>>
>>> Thanks Jirka.
>>>
>>>
>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>
>>>> I've seen that problem in other threads, the common denominator was
>>>> "nfs on top of gluster". So if you have this setup, then it's a
>>>> known problem. Or you should double check that your hosts have
>>>> different ids, otherwise they would be trying to acquire the same
>>>> lock.
>>>>
>>>> --Jirka
>>>>
>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> Thanks for the in depth reply.
>>>>>
>>>>> I've only seen this happen twice, and only after I added a third host
>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>
>>>>> Have you seen this happen on all your installs or only just after your
>>>>> manual migration? It's a little frustrating this is happening as I was
>>>>> hoping to get this into a production environment. It was all working
>>>>> except that log message :(
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <[email protected]> wrote:
>>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> this is something that I saw in my logs too, first on one node and
>>>>>> then on the other three. When that happened on all four of them,
>>>>>> engine was corrupted beyond repair.
>>>>>>
>>>>>> First of all, I think that message is saying that sanlock can't get
>>>>>> a lock on the shared storage that you defined for the hostedengine
>>>>>> during installation. I got this error when I tried to manually
>>>>>> migrate the hosted engine. There is an unresolved bug there and I
>>>>>> think it's related to this one:
>>>>>>
>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score
>>>>>> to zero]
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>
>>>>>> This is a blocker bug (or should be) for the selfhostedengine and,
>>>>>> from my own experience with it, it shouldn't be used in a production
>>>>>> environment (not until it's fixed).
>>>>>>
>>>>>> Nothing that I did could fix the fact that the score for the target
>>>>>> node was zero. I tried to reinstall the node, reboot the node,
>>>>>> restart several services, tail tons of logs, etc., but to no avail.
>>>>>> When only one node was left (the one that was actually running the
>>>>>> hosted engine), I brought the engine's vm down gracefully
>>>>>> (hosted-engine --vm-shutdown I believe) and after that, when I tried
>>>>>> to start the vm, it wouldn't load. Running VNC showed that the
>>>>>> filesystem inside the vm was corrupted, and when I ran fsck and
>>>>>> finally started it up, it was too badly damaged. I succeeded in
>>>>>> starting the engine itself (after repairing the postgresql service
>>>>>> that wouldn't start) but the database was damaged enough that it
>>>>>> acted pretty weird (it showed that storage domains were down but the
>>>>>> vm's were running fine etc). Lucky me, I had already exported all of
>>>>>> the VM's at the first sign of trouble, and then installed
>>>>>> ovirt-engine on a dedicated server and attached the export domain.
>>>>>>
>>>>>> So while it's a really useful feature, and it's working (for the
>>>>>> most part, i.e. automatic migration works), manually migrating the
>>>>>> VM with the hosted-engine will lead to trouble.
>>>>>>
>>>>>> I hope that my experience with it will be of use to you. It happened
>>>>>> to me two weeks ago, ovirt-engine was current (3.4.1) and there was
>>>>>> no fix available.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm seeing this weird message in my engine log
>>>>>>
>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>> 62a9d4c1
>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>
>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>> host which is running the hosted-engine. So right now, with 3 servers,
>>>>>> it shows up twice in the engine UI.
>>>>>>
>>>>>> The engine VM continues to run peacefully, without any issues on the
>>>>>> host which doesn't have that error.
>>>>>>
>>>>>> Any ideas?
>>>>>> _______________________________________________
>>>>>> Users mailing list
>>>>>> [email protected]
>>>>>> http://lists.ovirt.org/mailman/listinfo/users
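
For anyone else chasing the same "Failed to acquire lock" message, here is a rough sketch of how one might tally it per host from the engine log. It is only a heuristic and rests on assumptions not confirmed anywhere in this thread: that each log entry sits on a single line (not wrapped as in the mail above), and that the host named in the most recent entry on the same DefaultQuartzScheduler worker is the one the failure belongs to. The function name count_lock_failures is made up for illustration.

```python
import re
from collections import Counter

# Worker thread name as it appears in the quoted log entries.
WORKER_RE = re.compile(r"\((DefaultQuartzScheduler_Worker-\d+)\)")
# "HostName = ov-hv2-2a-08-23," as printed by DestroyVDSCommand.
HOST_RE = re.compile(r"HostName = ([^,\s]+)")
LOCK_ERR = "Failed to acquire lock"


def count_lock_failures(lines):
    """Count lock-acquire failures per host (hypothetical helper).

    Assumption: the audit message itself does not name the host, so the
    failure is attributed to the last HostName seen on the same worker.
    """
    last_host = {}        # worker name -> last host seen on that worker
    failures = Counter()  # host name -> number of lock failures
    for line in lines:
        worker_match = WORKER_RE.search(line)
        if not worker_match:
            continue
        worker = worker_match.group(1)
        host_match = HOST_RE.search(line)
        if host_match:
            last_host[worker] = host_match.group(1)
        if LOCK_ERR in line:
            failures[last_host.get(worker, "unknown")] += 1
    return failures
```

To try it, pass an open file handle for the engine log (on my setup that is /var/log/ovirt-engine/engine.log; adjust if yours differs) and print the resulting counter. If one host dominates the counts, that is probably the first place to check the sanlock state.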

