nvm, just as I hit send the error returned. Ignore this.
On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau <[email protected]> wrote:
> So after adding the L3 capabilities to my storage network, I'm no
> longer seeing this issue. So the engine needs to be able to
> access the storage domain it sits on? But that doesn't show up in the
> UI?
>
> Ivan, was this also the case with your setup? Engine couldn't access
> storage domain?
>
> On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau <[email protected]> wrote:
>> Interesting, my storage network is L2 only and doesn't run on the
>> ovirtmgmt (which is the only thing HostedEngine sees) but I've only
>> seen this issue when running ctdb in front of my NFS server. I
>> previously was using localhost as all my hosts had the nfs server on
>> them (gluster).
>>
>> On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <[email protected]> wrote:
>>> I just blocked the connection to storage for testing, and as a result I
>>> got this error: "Failed to acquire lock error -243", so I added it to
>>> the reproduce steps.
>>> If you know other steps to reproduce this error without blocking the
>>> connection to storage, it would be wonderful if you could provide them.
>>> Thanks
>>>
>>> ----- Original Message -----
>>> From: "Andrew Lau" <[email protected]>
>>> To: "combuster" <[email protected]>
>>> Cc: "users" <[email protected]>
>>> Sent: Monday, June 9, 2014 3:47:00 AM
>>> Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal
>>> error Failed to acquire lock error -243
>>>
>>> I just ran a few extra tests. I had a 2 host hosted-engine setup running
>>> for a day. They both had a score of 2400. Migrated the VM through the
>>> UI multiple times, all worked fine. I then added the third host, and
>>> that's when it all fell to pieces.
>>> The other two hosts have a score of 0 now.
>>>
>>> I'm also curious, in the BZ there's a note about:
>>>
>>> where engine-vm block connection to storage domain(via iptables -I
>>> INPUT -s sd_ip -j DROP)
>>>
>>> What's the purpose of that?
>>>
>>> On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <[email protected]> wrote:
>>>> Ignore that, the issue came back after 10 minutes.
>>>>
>>>> I've even tried a gluster mount + nfs server on top of that, and the
>>>> same issue has come back.
>>>>
>>>> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <[email protected]> wrote:
>>>>> Interesting, I put it all into global maintenance. Shut it all down
>>>>> for ~10 minutes, and it's regained its sanlock control and doesn't
>>>>> seem to have that issue coming up in the log.
>>>>>
>>>>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <[email protected]> wrote:
>>>>>> It was pure NFS on a NAS device. They all had different ids (there
>>>>>> were no redeployments of nodes before the problem occurred).
>>>>>>
>>>>>> Thanks Jirka.
>>>>>>
>>>>>>
>>>>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>>>>
>>>>>>> I've seen that problem in other threads, the common denominator was
>>>>>>> "nfs on top of gluster". So if you have this setup, then it's a known
>>>>>>> problem. Otherwise you should double check that your hosts have
>>>>>>> different ids, or else they would be trying to acquire the same lock.
>>>>>>>
>>>>>>> --Jirka
>>>>>>>
>>>>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>>>>
>>>>>>>> Hi Ivan,
>>>>>>>>
>>>>>>>> Thanks for the in-depth reply.
>>>>>>>>
>>>>>>>> I've only seen this happen twice, and only after I added a third host
>>>>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>>>>
>>>>>>>> Have you seen this happen on all your installs or only just after your
>>>>>>>> manual migration? It's a little frustrating this is happening as I was
>>>>>>>> hoping to get this into a production environment.
>>>>>>>> It was all working except that log message :(
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> this is something that I saw in my logs too, first on one node and
>>>>>>>>> then on the other three. When that happened on all four of them, the
>>>>>>>>> engine was corrupted beyond repair.
>>>>>>>>>
>>>>>>>>> First of all, I think that message is saying that sanlock can't get a
>>>>>>>>> lock on the shared storage that you defined for the hosted engine
>>>>>>>>> during installation. I got this error when I tried to manually migrate
>>>>>>>>> the hosted engine. There is an unresolved bug there and I think it's
>>>>>>>>> related to this one:
>>>>>>>>>
>>>>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to zero]
>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>>>>
>>>>>>>>> This is a blocker bug (or should be) for the self-hosted engine and,
>>>>>>>>> from my own experience with it, the self-hosted engine shouldn't be
>>>>>>>>> used in a production environment (not until it's fixed).
>>>>>>>>>
>>>>>>>>> Nothing that I did could fix the fact that the score for the target
>>>>>>>>> node was zero. I tried to reinstall the node, rebooted the node,
>>>>>>>>> restarted several services, tailed tons of logs etc. but to no avail.
>>>>>>>>> When only one node was left (the one that was actually running the
>>>>>>>>> hosted engine), I brought the engine's vm down gracefully
>>>>>>>>> (hosted-engine --vm-shutdown I believe) and after that, when I tried
>>>>>>>>> to start the vm, it wouldn't load. Running VNC showed that the
>>>>>>>>> filesystem inside the vm was corrupted, and when I ran fsck and
>>>>>>>>> finally started it up, it was too badly damaged.
>>>>>>>>> I succeeded in starting the engine itself (after repairing the
>>>>>>>>> postgresql service, which wouldn't start) but the database was
>>>>>>>>> damaged enough that it acted pretty weird (showed that storage
>>>>>>>>> domains were down but the vm's were running fine etc). Lucky me, I
>>>>>>>>> had already exported all of the VM's at the first sign of trouble
>>>>>>>>> and then installed ovirt-engine on a dedicated server and attached
>>>>>>>>> the export domain.
>>>>>>>>>
>>>>>>>>> So while it's a really useful feature, and it's working (for the most
>>>>>>>>> part, i.e. automatic migration works), manually migrating the
>>>>>>>>> hosted-engine VM will lead to trouble.
>>>>>>>>>
>>>>>>>>> I hope that my experience with it will be of use to you. It happened
>>>>>>>>> to me two weeks ago, ovirt-engine was current (3.4.1) and there was
>>>>>>>>> no fix available.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Ivan
>>>>>>>>>
>>>>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm seeing this weird message in my engine log
>>>>>>>>>
>>>>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>>>>> 62a9d4c1
>>>>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>>>>
>>>>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>>>>> host which is running the hosted-engine. So right now, with 3 servers,
>>>>>>>>> it shows up twice in the engine UI.
>>>>>>>>>
>>>>>>>>> The engine VM continues to run peacefully, without any issues, on the
>>>>>>>>> host which doesn't have that error.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users

