Interesting. I put everything into global maintenance and shut it all down for ~10 minutes. It has regained its sanlock control, and that message no longer seems to come up in the log.
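For anyone who wants to check the same thing on their own setup, here is a minimal sketch (not an oVirt tool) that counts the lock-failure message per hour in an engine log. The function name and the idea of passing the log path on the command line are my own assumptions; the only things taken from this thread are the message text and the timestamp format. It remembers the last timestamp seen, so it also works when a log entry is wrapped over several lines.

```python
import re
import sys
from collections import Counter

# Message text as it appears in the engine log quoted in this thread.
LOCK_ERROR = "Failed to acquire lock: error -243"
# Timestamp prefix such as "2014-06-06 03:06:12,652".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2},\d{3}")


def lock_errors_per_hour(lines):
    """Count lock-failure messages per hour, keyed by 'YYYY-MM-DD HH'.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    current_hour = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            # Remember the hour of the most recent timestamped line.
            current_hour = f"{match.group(1)} {match.group(2)}"
        if LOCK_ERROR in line and current_hour is not None:
            counts[current_hour] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller; nothing is assumed here.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for hour, count in sorted(lock_errors_per_hour(log).items()):
            print(hour, count)
```

If the maintenance window really cleared the problem, the hours after it should simply be missing from the output.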

On Fri, Jun 6, 2014 at 4:21 PM, combuster <[email protected]> wrote:
> It was pure NFS on a NAS device. They all had different ids (there were
> no redeployments of nodes before the problem occurred).
>
> Thanks Jirka.
>
>
> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>
>> I've seen that problem in other threads; the common denominator was
>> "nfs on top of gluster". So if you have this setup, then it's a known
>> problem. Otherwise, you should double-check that your hosts have
>> different ids; if they don't, they would be trying to acquire the
>> same lock.
>>
>> --Jirka
>>
>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>
>>> Hi Ivan,
>>>
>>> Thanks for the in-depth reply.
>>>
>>> I've only seen this happen twice, and only after I added a third host
>>> to the HA cluster. I wonder if that's the root problem.
>>>
>>> Have you seen this happen on all your installs or only just after your
>>> manual migration? It's a little frustrating that this is happening, as
>>> I was hoping to get this into a production environment. It was all
>>> working except for that log message :(
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <[email protected]> wrote:
>>>>
>>>> Hi Andrew,
>>>>
>>>> this is something that I saw in my logs too, first on one node and
>>>> then on the other three. When that happened on all four of them, the
>>>> engine was corrupted beyond repair.
>>>>
>>>> First of all, I think that message is saying that sanlock can't get
>>>> a lock on the shared storage that you defined for the hosted engine
>>>> during installation. I got this error when I tried to manually
>>>> migrate the hosted engine. There is an unresolved bug there and I
>>>> think it's related to this one:
>>>>
>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to
>>>> zero]
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>
>>>> This is a blocker bug (or should be) for the self-hosted engine and,
>>>> from my own experience with it, the self-hosted engine shouldn't be
>>>> used in a production environment (not until it's fixed).
>>>>
>>>> Nothing that I did could fix the fact that the score for the target
>>>> node was zero: I tried reinstalling the node, rebooting the node,
>>>> restarting several services, tailing tons of logs etc., but to no
>>>> avail. When only one node was left (the one that was actually running
>>>> the hosted engine), I brought the engine's vm down gracefully
>>>> (hosted-engine --vm-shutdown, I believe) and after that, when I tried
>>>> to start the vm, it wouldn't load. Running VNC showed that the
>>>> filesystem inside the vm was corrupted, and when I ran fsck and
>>>> finally started it up, it was too badly damaged. I succeeded in
>>>> starting the engine itself (after repairing the postgresql service,
>>>> which wouldn't start), but the database was damaged enough that it
>>>> acted pretty weird (it showed that storage domains were down while
>>>> the vms were running fine, etc). Lucky me, I had already exported all
>>>> of the VMs at the first sign of trouble, and then installed
>>>> ovirt-engine on a dedicated server and attached the export domain.
>>>>
>>>> So while it's a really useful feature, and it's working for the most
>>>> part (i.e. automatic migration works), manually migrating the VM with
>>>> the hosted engine will lead to trouble.
>>>>
>>>> I hope that my experience with it will be of use to you. It happened
>>>> to me two weeks ago; ovirt-engine was current (3.4.1) and there was
>>>> no fix available.
>>>>
>>>> Regards,
>>>>
>>>> Ivan
>>>>
>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm seeing this weird message in my engine log
>>>>
>>>> 2014-06-06 03:06:09,380 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>> 2014-06-06 03:06:12,494 INFO
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>> 2014-06-06 03:06:12,561 INFO
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>> 62a9d4c1
>>>> 2014-06-06 03:06:12,652 INFO
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>> message: internal error Failed to acquire lock: error -243.
>>>>
>>>> It also appears to occur on the other hosts in the cluster, except
>>>> the host which is running the hosted-engine. So right now, with 3
>>>> servers, it shows up twice in the engine UI.
>>>>
>>>> The engine VM continues to run peacefully, without any issues, on the
>>>> host which doesn't have that error.
>>>>
>>>> Any ideas?
_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users

