On 04/23/2014 11:08 AM, Martin Sivak wrote:
Hi René,

libvirtError: Failed to acquire lock: No space left on device

2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid
lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82

Can you please check the contents of /rhev/data-center/<your nfs mount>/<nfs 
domain uuid>/ha_agent/?

This is how it should look:

[root@dev-03 ~]# ls -al 
/rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/
total 2036
drwxr-x---. 2 vdsm kvm    4096 Mar 19 18:46 .
drwxr-xr-x. 6 vdsm kvm    4096 Mar 19 18:46 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 hosted-engine.metadata

The errors seem to indicate that you somehow lost the lockspace file.

True :)
Isn't this file created when hosted engine is started? Or how can I create this file manually?
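Just thinking out loud - would something like this be a sane way to recreate
it by hand? Completely unverified, and I'm only assuming the lockspace is
named hosted-engine and that sanlock direct init is the right tool:

# cd /rhev/data-center/mnt/<your nfs mount>/<nfs domain uuid>/ha_agent/
# dd if=/dev/zero of=hosted-engine.lockspace bs=1M count=1
# chown vdsm:kvm hosted-engine.lockspace; chmod 0660 hosted-engine.lockspace
# sanlock direct init -s hosted-engine:0:$PWD/hosted-engine.lockspace:0

(size and ownership taken from your listing above)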


--
Martin Sivák
[email protected]
Red Hat Czech
RHEV-M SLA / Brno, CZ

----- Original Message -----
On 04/23/2014 12:28 AM, Doron Fediuck wrote:
Hi Rene,
any idea what took your ovirtmgmt bridge down?
As long as it is down, vdsm may have issues starting up properly,
and this is why you see the complaints about the rpc server.

Can you try manually fixing the network part first and then
restarting vdsm?
Once vdsm is happy, the hosted engine VM will start.
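I.e. once the bridge looks sane again, roughly (just the order of steps,
adjust to your setup):

# service vdsmd restart
# hosted-engine --vm-start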

Thanks for your feedback, Doron.

My ovirtmgmt bridge seems to be up, or isn't it:
# brctl show ovirtmgmt
bridge name     bridge id               STP enabled     interfaces
ovirtmgmt               8000.0025907587c2       no              eth0.200

# ip a s ovirtmgmt
7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
state UNKNOWN
      link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
      inet 10.0.200.102/24 brd 10.0.200.255 scope global ovirtmgmt
      inet6 fe80::225:90ff:fe75:87c2/64 scope link
         valid_lft forever preferred_lft forever

# ip a s eth0.200
6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP
      link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
      inet6 fe80::225:90ff:fe75:87c2/64 scope link
         valid_lft forever preferred_lft forever

I tried the following yesterday:
I copied the virtual disk from GlusterFS storage to the local disk of the
host and created a new vm with virt-manager that uses this disk and the
ovirtmgmt bridge. I could reach my engine over the ovirtmgmt bridge (so the
bridge must be working).

I also started libvirtd with option -v and saw the following in
libvirtd.log when trying to start the ovirt engine:
2014-04-22 14:18:25.432+0000: 8901: debug : virCommandRunAsync:2250 :
Command result 0, with PID 11491
2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 : Result
exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto 'FO-vnet0' is
not a chain

So it could be that something is broken in my hosted-engine network. Do
you have any clue how I can troubleshoot this?
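The only checks I could think of so far (no idea if this is the right
approach) are whether the chain from the error exists at all and whether
libvirt's network filters are defined:

# iptables-save | grep FO-vnet0
# virsh nwfilter-list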


Thanks,
René



----- Original Message -----
From: "René Koch" <[email protected]>
To: "Martin Sivak" <[email protected]>
Cc: [email protected]
Sent: Tuesday, April 22, 2014 1:46:38 PM
Subject: Re: [ovirt-users] hosted engine health check issues

Hi,

I rebooted one of my ovirt hosts today and the result is now that I
can't start hosted-engine anymore.

ovirt-ha-agent isn't running because the lockspace file is missing
(sanlock complains about it).
So I tried to start hosted-engine with --vm-start and I get the
following errors:

==> /var/log/sanlock.log <==
2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid
lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82

==> /var/log/messages <==
Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 12:38:17+0200 654
[3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 failed 0 name
2851af27-8744-445d-9fb1-a0d083c8dc82
Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering
disabled state
Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left promiscuous mode
Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering
disabled state

==> /var/log/vdsm/vdsm.log <==
Thread-21::DEBUG::2014-04-22
12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown
libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to acquire
lock: No space left on device
Thread-21::DEBUG::2014-04-22
12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm)
vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations released
Thread-21::ERROR::2014-04-22
12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm)
vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process failed
Traceback (most recent call last):
     File "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm
       self._run()
     File "/usr/share/vdsm/vm.py", line 3170, in _run
       self._connection.createXML(domxml, flags),
     File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py",
line 92, in wrapper
       ret = f(*args, **kwargs)
     File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in
createXML
       if ret is None:raise libvirtError('virDomainCreateXML() failed',
conn=self)
libvirtError: Failed to acquire lock: No space left on device

==> /var/log/messages <==
Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR
vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process
failed#012Traceback (most recent call last):#012  File
"/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012
self._run()#012  File "/usr/share/vdsm/vm.py", line 3170, in _run#012
    self._connection.createXML(domxml, flags),#012  File
"/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92,
in wrapper#012    ret = f(*args, **kwargs)#012  File
"/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in
createXML#012    if ret is None:raise libvirtError('virDomainCreateXML()
failed', conn=self)#012libvirtError: Failed to acquire lock: No space
left on device

==> /var/log/vdsm/vdsm.log <==
Thread-21::DEBUG::2014-04-22
12:38:17,569::vm::2731::vm.Vm::(setDownStatus)
vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to Down:
Failed to acquire lock: No space left on device


"No space left on device" is nonsense, as there is enough space (I had this
issue last time as well, where I had to patch machine.py, but this file
is now Python 2.6.6 compatible).

Any idea what prevents hosted-engine from starting?
ovirt-ha-broker, vdsmd and sanlock are running btw.

Btw, I can see in the log that the json rpc server module is missing - which
package is required for CentOS 6.5?
Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load the json
rpc server module. Please make sure it is installed.
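My blind guess would be to look for a vdsm json-rpc subpackage, e.g.:

# yum search vdsm | grep -i json

but I haven't verified which package actually provides the module.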


Thanks,
René



On 04/17/2014 10:02 AM, Martin Sivak wrote:
Hi,

How can I disable notifications?

The notification is configured in the notification section of
/etc/ovirt-hosted-engine-ha/broker.conf.
The email is sent when the key state_transition exists and the string
OldState-NewState contains a (case insensitive) match for the regexp given
as its value.
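For example, something like this (the section and key names are the ones
mentioned above, the value is just an illustration, so double-check against
your broker.conf):

[notification]
state_transition=EngineDown

would only mail you on transitions whose name contains EngineDown; a value
that never matches (or removing the key) should silence the mails entirely.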

Is it intended to send out these messages and detect that ovirt engine
is down (which is false anyway), but not to restart the vm?

Forget about emails for now and check the
/var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and attach them
as well btw).

oVirt hosts think that hosted engine is down because it seems that
hosts
can't write to hosted-engine.lockspace due to glusterfs issues (or at
least I think so).

The hosts think so, or can they really not write there? The lockspace is
managed by sanlock and our HA daemons do not touch it at all. We only ask
sanlock to make sure we have a unique server id.

Is it possible or planned to make the whole ha feature optional?

Well, the system won't perform any automatic actions if you put the hosted
engine into global maintenance and only start/stop/migrate the VM manually.
I would discourage you from stopping the agent/broker, because the engine
itself has some logic based on their reporting.
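In other words, roughly:

# hosted-engine --set-maintenance --mode=global   (stop automatic actions)
# hosted-engine --vm-start                        (then manage the VM by hand)
# hosted-engine --set-maintenance --mode=none     (re-enable HA afterwards)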

Regards

--
Martin Sivák
[email protected]
Red Hat Czech
RHEV-M SLA / Brno, CZ

----- Original Message -----
On 04/15/2014 04:53 PM, Jiri Moskovcak wrote:
On 04/14/2014 10:50 AM, René Koch wrote:
Hi,

I have some issues with hosted engine status.

oVirt hosts think that hosted engine is down because it seems that
hosts
can't write to hosted-engine.lockspace due to glusterfs issues (or at
least I think so).

Here's the output of vm-status:

# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : 10.0.200.102
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1397035677
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1397035677 (Wed Apr  9 11:27:57 2014)
        host-id=1
        score=2400
        maintenance=False
        state=EngineUp


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : 10.0.200.101
Host ID                            : 2
Engine status                      : {'reason': 'vm not running on this host',
                                      'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
Score                              : 0
Local maintenance                  : False
Host timestamp                     : 1397464031
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1397464031 (Mon Apr 14 10:27:11 2014)
        host-id=2
        score=0
        maintenance=False
        state=EngineUnexpectedlyDown
        timeout=Mon Apr 14 10:35:05 2014

oVirt engine is sending me 2 emails every 10 minutes with the
following
subjects:
- ovirt-hosted-engine state transition EngineDown-EngineStart
- ovirt-hosted-engine state transition EngineStart-EngineUp

In oVirt webadmin I can see the following message:
VM HostedEngine is down. Exit message: internal error Failed to
acquire
lock: error -243.

These messages are really annoying as oVirt isn't doing anything with
hosted engine - I have an uptime of 9 days in my engine vm.

So my questions are now:
Is it intended to send out these messages and detect that ovirt engine
is down (which is false anyway), but not to restart the vm?

How can I disable notifications? I'm planning to write a Nagios plugin
which parses the output of hosted-engine --vm-status and only Nagios
should notify me, not hosted-engine script.
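Very roughly I have something like this in mind (untested; I'm assuming the
health field reads 'good' when everything is fine, based on the --vm-status
output above):

#!/bin/sh
# crude check: OK if vm-status reports a good engine health, CRITICAL otherwise
OUT=$(hosted-engine --vm-status 2>&1)
if echo "$OUT" | grep -q "'health': 'good'"; then
    echo "OK - hosted engine is up"
    exit 0
else
    echo "CRITICAL - hosted engine reports a problem"
    exit 2
fi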

Is it possible or planned to make the whole ha feature optional? I
really really really hate cluster software as it causes more trouble
than standalone machines, and in my case the hosted-engine ha feature
really causes trouble (and I haven't had a hardware or network outage
yet, only issues with the hosted-engine ha agent). I don't need any ha
feature for hosted engine. I just want to run the engine virtualized on
oVirt, and if the engine vm fails (e.g. because of issues with a host) I'll
restart it on another node.

Hi, you can:
1. edit /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and tweak
the logger as you like
2. or kill ovirt-ha-broker & ovirt-ha-agent services

Thanks for the information.
So the engine is able to run when ovirt-ha-broker and ovirt-ha-agent aren't
running?


Regards,
René


--Jirka

Thanks,
René




_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
