On 06/10/2014 01:02 AM, Andrew Lau wrote:
nvm, just as I hit send the error has returned. Ignore this.

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau <[email protected]> wrote:
So after adding the L3 capabilities to my storage network, I'm no longer seeing this issue. So the engine needs to be able to access the storage domain it sits on? But that doesn't show up in the UI?

Ivan, was this also the case with your setup? Engine couldn't access storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau <[email protected]> wrote:
Interesting, my storage network is L2 only and doesn't run on ovirtmgmt (which is the only thing HostedEngine sees), but I've only seen this issue when running ctdb in front of my NFS server. I was previously using localhost, as all my hosts had the NFS server on them (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <[email protected]> wrote:
I just blocked the connection to storage for testing, and as a result I got this error: "Failed to acquire lock error -243", so I added it to the reproduce steps. If you know other steps to reproduce this error without blocking the connection to storage, it would be wonderful if you could provide them.
Thanks

----- Original Message -----
From: "Andrew Lau" <[email protected]>
To: "combuster" <[email protected]>
Cc: "users" <[email protected]>
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

I just ran a few extra tests. I had a 2-host hosted-engine setup running for a day. Both hosts had a score of 2400. I migrated the VM through the UI multiple times, and it all worked fine. I then added the third host, and that's when it all fell to pieces. The other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain(via iptables -I INPUT -s sd_ip -j DROP)

What's the purpose of that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <[email protected]> wrote:
Ignore that, the issue came back after 10 minutes. I've even tried a gluster mount + nfs server on top of that, and the same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <[email protected]> wrote:
Interesting, I put it all into global maintenance and shut it all down for ~10 minutes, and it has regained its sanlock control and doesn't seem to have that issue coming up in the log.

On Fri, Jun 6, 2014 at 4:21 PM, combuster <[email protected]> wrote:
It was pure NFS on a NAS device. They all had different ids (there were no redeployments of nodes before the problem occurred).
Thanks Jirka.

On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
I've seen that problem in other threads, the common denominator was "nfs on top of gluster". So if you have this setup, then it's a known problem. Otherwise, you should double check that your hosts have different ids; if they don't, they would be trying to acquire the same lock.
--Jirka

On 06/06/2014 08:03 AM, Andrew Lau wrote:
Hi Ivan,

Thanks for the in-depth reply. I've only seen this happen twice, and only after I added a third host to the HA cluster. I wonder if that's the root problem. Have you seen this happen on all your installs, or only just after your manual migration? It's a little frustrating this is happening, as I was hoping to get this into a production environment. It was all working except that log message :(

Thanks,
Andrew

On Fri, Jun 6, 2014 at 3:20 PM, combuster <[email protected]> wrote:
Hi Andrew,

this is something that I saw in my logs too, first on one node and then on the other three. When that happened on all four of them, the engine was corrupted beyond repair. First of all, I think that message is saying that sanlock can't get a lock on the shared storage that you defined for the hosted engine during installation. I got this error when I tried to manually migrate the hosted engine.
There is an unresolved bug there, and I think it's related to this one:

[Bug 1093366 - Migration of hosted-engine vm put target host score to zero]
https://bugzilla.redhat.com/show_bug.cgi?id=1093366

This is a blocker bug (or should be) for the self-hosted engine and, from my own experience with it, the self-hosted engine shouldn't be used in a production environment (not until it's fixed).

Nothing that I did could fix the fact that the score for the target node was zero. I tried reinstalling the node, rebooting the node, restarting several services, tailing tons of logs etc., but to no avail. When only one node was left (the one that was actually running the hosted engine), I brought the engine's vm down gracefully (hosted-engine --vm-shutdown, I believe) and after that, when I tried to start the vm, it wouldn't load. Running VNC showed that the filesystem inside the vm was corrupted, and when I ran fsck and finally started it up, it was too badly damaged. I succeeded in starting the engine itself (after repairing the postgresql service, which wouldn't start), but the database was damaged enough that it acted pretty weird (it showed that storage domains were down while the vm's were running fine etc). Lucky me, I had already exported all of the VM's at the first sign of trouble, and then installed ovirt-engine on a dedicated server and attached the export domain.

So while it is a really useful feature, and it's working for the most part (i.e., automatic migration works), manually migrating the VM with the hosted-engine will lead to trouble.

I hope that my experience with it will be of use to you. It happened to me two weeks ago, ovirt-engine was current (3.4.1) and there was no fix available.

Regards,
Ivan

On 06/06/2014 05:12 AM, Andrew Lau wrote:
Hi,

I'm seeing this weird message in my engine log

2014-06-06 03:06:09,380 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
2014-06-06 03:06:12,494 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName = ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60, vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false, secondsToWait=0, gracefully=false), log id: 62a9d4c1
2014-06-06 03:06:12,561 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id: 62a9d4c1
2014-06-06 03:06:12,652 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243.

It also appears to occur on the other hosts in the cluster, except the host which is running the hosted-engine. So right now, with 3 servers, it shows up twice in the engine UI.

The engine VM continues to run peacefully, without any issues, on the host which doesn't have that error.

Any ideas?

_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
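
For anyone trying to tell how often this message recurs (it came back after about 10 minutes in one of the tests above), here is a minimal Python sketch that pulls the lock failures out of engine log lines. The regular expressions are assumptions based only on the excerpt quoted in this thread, and the function names are made up for illustration; this is not an oVirt tool.

```python
import re
from collections import Counter

# Assumption: the message looks like the one quoted in this thread,
# with or without the colon after "lock".
LOCK_ERROR = re.compile(r"Failed to acquire lock:? error (-?\d+)")
# Assumption: each log line starts with "YYYY-MM-DD HH:MM:SS,mmm".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")


def find_lock_errors(lines):
    """Return a (timestamp, error_code) pair for each line reporting a lock failure."""
    hits = []
    for line in lines:
        match = LOCK_ERROR.search(line)
        if not match:
            continue
        ts = TIMESTAMP.match(line)
        # The timestamp is None when the line does not start with one.
        hits.append((ts.group(1) if ts else None, int(match.group(1))))
    return hits


def count_by_code(hits):
    """Count how many times each error code was seen."""
    return Counter(code for _, code in hits)


# Two lines copied from the log excerpt in this thread.
sample = [
    "2014-06-06 03:06:12,561 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] "
    "(DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id: 62a9d4c1",
    "2014-06-06 03:06:12,652 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] "
    "(DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack: null, Custom Event ID: -1, "
    "Message: VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243.",
]

hits = find_lock_errors(sample)
print(hits)
print(count_by_code(hits))
```

To run it against a real log, pass an open file object instead of `sample`; the location of the engine log on your install is not something this sketch assumes.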
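
Jirka's suggestion above, to double check that the hosts have different ids, can be sketched in Python as follows. This is only an illustration: it assumes each host keeps its id in a key=value style config under a key called `host_id`, which is an assumption and not something stated in this thread, and the host names in the example are hypothetical.

```python
from collections import defaultdict


def parse_host_id(conf_text, key="host_id"):
    """Return the integer value of `key` from key=value text, or None if absent.

    Assumption: the key name "host_id" and the key=value layout are
    illustrative; adjust them to whatever your hosts actually use.
    """
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return int(value.strip())
    return None


def duplicate_host_ids(host_ids):
    """Map each id used by more than one host to the sorted list of those hosts."""
    by_id = defaultdict(list)
    for host, host_id in host_ids.items():
        by_id[host_id].append(host)
    return {hid: sorted(hosts) for hid, hosts in by_id.items() if len(hosts) > 1}


# Hypothetical config contents collected from three hosts.
configs = {
    "host-a": "host_id=1\n",
    "host-b": "# redeployed node\nhost_id=2\n",
    "host-c": "host_id=2\n",
}

ids = {host: parse_host_id(text) for host, text in configs.items()}
clashes = duplicate_host_ids(ids)
print(clashes)
```

An empty result means every host reported a distinct id, which matches what Ivan described for his setup; a non-empty result names the hosts that would be competing for the same lock.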

