Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-15 Thread Charles Kozler
>> Thread-482175::INFO::2016-06-14
>>
12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
>> Cleaning up stale LV link
'/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
>>
36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'

> This is also not normal; it means the storage disappeared.


> This seems to indicate there is some kind of issue with your network.
> Are you sure that your firewall allows connections over the lo interface
> and to the storage server?


Yes, very much so. The network is 10.0.16.0/24 - this is the ovirtmgmt +
storage network and it is 100% isolated and dedicated, with no firewall between
the oVirt nodes and storage. There is no firewall on the local server either.
Basically I have:

ovirtmgmt - bond0 in mode 2 (which appears to be the oVirt default when not
using LACP) - connects to dedicated storage switches. Nodes 1-3 are 10.0.16.5,
.6, and .7 respectively
VM NIC - bond1 - trunk port for VLAN tagging in an active/passive bond. This
is the VM network path. It connects to two different switches
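
For reference, the bond mode on each node can be double-checked with the
standard kernel bonding files (nothing oVirt-specific here):

grep -i "bonding mode" /proc/net/bonding/bond0   # active mode of the mgmt/storage bond
grep -i "bonding mode" /proc/net/bonding/bond1   # active mode of the VM network bond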

Storage is located at 10.0.16.100 (cluster IP; the hostname is storage-vip),
10.0.16.101 (storage node 1), 10.0.16.102 (storage node 2), and 10.0.16.103
(nas01, dedicated storage for the oVirt engine, outside the clustered storage
used for the other VMs)

Cluster IP of 10.0.16.100 is where VM storage goes
NAS IP of 10.0.16.103 is where oVirt engine storage is

All paths to the oVirt engine and other nodes are 100% clear with no
failures or firewalls between oVirt nodes and storage

[root@njsevcnp01 ~]# for i in $( seq 100 103 ); do ping -c 1 10.0.16.$i |
grep -i "\(rece\|time=\)"; echo "--"; done
64 bytes from 10.0.16.100: icmp_seq=1 ttl=64 time=0.071 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--
64 bytes from 10.0.16.101: icmp_seq=1 ttl=64 time=0.065 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--
64 bytes from 10.0.16.102: icmp_seq=1 ttl=64 time=0.099 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--
64 bytes from 10.0.16.103: icmp_seq=1 ttl=64 time=0.219 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--

This is the dedicated storage for the oVirt environment:

[root@njsevcnp01 ~]# df -h | grep -i rhev
nas01:/volume1/vm_os/ovirt36_engine  2.2T  295G  1.9T  14%
/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine
storage-vip:/fast_ha-gv0 792G  125G  668G  16%
/rhev/data-center/mnt/glusterSD/storage-vip:_fast__ha-gv0
storage-vip:/slow_nonha-gv0  1.8T  212G  1.6T  12%
/rhev/data-center/mnt/glusterSD/storage-vip:_slow__nonha-gv0


>> >
09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>> > Error: 'Failed to start monitor , options {'hostname':
>> > 'njsevcnp01'}: Connection timed out' - trying to restart agent
>> > MainThread::WARNING::2016-06-15

> and connection timeout between agent and broker.

Everything I am providing right now is from njsevcnp01, so why would it
time out between the agent and the broker on the same box? Because the broker
is not accepting connections? But the broker logs show it is accepting and
handling connections.
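
If it helps, this is a quick check I can run here to confirm the broker is
actually listening locally - assuming the default UNIX socket path for
ovirt-hosted-engine-ha (the exact path is a guess on my part and may differ
by version):

ls -l /var/run/ovirt-hosted-engine-ha/broker.socket   # assumed default broker socket location
ss -xlp | grep -i broker                              # listening UNIX sockets owned by the broker process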

Acknowledged on the SMTP errors. At this time I am just trying to get
clustering working again because, as of now, I cannot live migrate the hosted
engine since it appears to be a split-brain type of issue.

What do I need to do to resolve this stale-data issue and get the cluster
working again / the agents and brokers talking to each other again?

Should I shut down the platform, delete the lock files, and then bring it
back up again?
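
Roughly, this is the sequence I have in mind - a rough sketch only, and I am
assuming --clean-metadata is available and is the supported way to reset the
whiteboard on 3.6:

hosted-engine --set-maintenance --mode=global   # on one host, so nothing tries to manage the engine VM
systemctl stop ovirt-ha-agent ovirt-ha-broker   # on every host
hosted-engine --clean-metadata                  # per host, only if this version supports it
systemctl start ovirt-ha-broker ovirt-ha-agent  # on every host
hosted-engine --set-maintenance --mode=none     # once --vm-status looks sane again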

Thanks for your help Martin!

On Wed, Jun 15, 2016 at 10:38 AM, Martin Sivak  wrote:

> >
> "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py",
> > line 24, in send_email
> > server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])
> >   File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
> > (code, msg) = self.connect(host, port)
> >   File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
> > self.sock = self._get_socket(host, port, self.timeout)
> >   File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
> > return socket.create_connection((host, port), timeout)
> >   File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
> > raise err
> > error: [Errno 110] Connection timed out
>
> So you have a connection timeout here (it is trying to reach the
> localhost SMTP server).
>
> >> >
> 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> >> > Error: 'Failed to start monitor , options {'hostname':
> >> > 'njsevcnp01'}: Connection timed out' - trying to restart agent
> >> > MainThread::WARNING::2016-06-15
>
> and connection timeout between agent and broker.
>
> > Thread-482175::INFO::2016-06-14
> >
> 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
> > Cleaning up stale LV link
> 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-15 Thread Martin Sivak
> "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py",
> line 24, in send_email
> server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])
>   File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
> (code, msg) = self.connect(host, port)
>   File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
> self.sock = self._get_socket(host, port, self.timeout)
>   File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
> return socket.create_connection((host, port), timeout)
>   File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
> raise err
> error: [Errno 110] Connection timed out

So you have a connection timeout here (it is trying to reach the
localhost SMTP server).
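
You can check what the broker is actually configured to use and whether
anything answers there, for example (assuming the notification settings live
in /etc/ovirt-hosted-engine-ha/broker.conf on your version):

grep -i smtp /etc/ovirt-hosted-engine-ha/broker.conf        # show the configured smtp-server / smtp-port
timeout 5 bash -c 'exec 3<>/dev/tcp/localhost/25' \
  && echo "SMTP port reachable" || echo "nothing listening"  # adjust host/port to match broker.conf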

>> > 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>> > Error: 'Failed to start monitor , options {'hostname':
>> > 'njsevcnp01'}: Connection timed out' - trying to restart agent
>> > MainThread::WARNING::2016-06-15

and connection timeout between agent and broker.

> Thread-482175::INFO::2016-06-14
> 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
> Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
> 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'

This is also not normal; it means the storage disappeared.


This seems to indicate there is some kind of issue with your network.
Are you sure that your firewall allows connections over the lo interface
and to the storage server?
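
For example, something along these lines should show whether any filtering
rules could get in the way (adjust for iptables vs. firewalld as appropriate):

iptables -S | head -n 40   # dump the active rule set; look for anything that is not ACCEPT for lo or the storage network
ip addr show lo            # confirm the loopback interface is up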


Martin

On Wed, Jun 15, 2016 at 4:11 PM, Charles Kozler  wrote:
> Martin -
>
> Anything I should be looking for specifically? The only errors I see are
> SMTP errors when it tries to send a notification, but nothing indicating
> what the notification is / might be. I see this repeated about every minute:
>
> Thread-482115::INFO::2016-06-14
> 12:58:54,431::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
> Thread-482109::INFO::2016-06-14
> 12:58:54,491::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
> Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
> 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace'
> Thread-482109::INFO::2016-06-14
> 12:58:54,515::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
> Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
> 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'
>
> nas01 is the primary storage for the engine (as previously noted)
>
> Thread-482175::INFO::2016-06-14
> 12:59:30,398::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
> Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
> 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace'
> Thread-482175::INFO::2016-06-14
> 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
> Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
> 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'
>
>
> But otherwise the broker looks like it's accepting and handling connections:
>
> Thread-481980::INFO::2016-06-14
> 12:59:33,105::mem_free::53::mem_free.MemFree::(action) memFree: 26491
> Thread-482193::INFO::2016-06-14
> 12:59:33,977::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
> Thread-482193::INFO::2016-06-14
> 12:59:34,033::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
> Thread-482194::INFO::2016-06-14
> 12:59:34,034::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
> Thread-482194::INFO::2016-06-14
> 12:59:34,035::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
> Thread-482195::INFO::2016-06-14
> 12:59:34,035::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
> Thread-482195::INFO::2016-06-14
> 12:59:34,036::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
> Thread-482196::INFO::2016-06-14
> 12:59:34,037::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
> Thread-482196::INFO::2016-06-14
> 12:59:34,037::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
> Thread-482197::INFO::2016-06-14
> 12:59:38,544::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
> Thread-482197::INFO::2016-06-14
> 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-15 Thread Charles Kozler
Martin -

Anything I should be looking for specifically? The only errors I see are
SMTP errors when it tries to send a notification, but nothing indicating
what the notification is / might be. I see this repeated about every minute:

Thread-482115::INFO::2016-06-14
12:58:54,431::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482109::INFO::2016-06-14
12:58:54,491::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace'
Thread-482109::INFO::2016-06-14
12:58:54,515::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'

nas01 is the primary storage for the engine (as previously noted)

Thread-482175::INFO::2016-06-14
12:59:30,398::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace'
Thread-482175::INFO::2016-06-14
12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks)
Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt
36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'


But otherwise the broker looks like it's accepting and handling connections:

Thread-481980::INFO::2016-06-14
12:59:33,105::mem_free::53::mem_free.MemFree::(action) memFree: 26491
Thread-482193::INFO::2016-06-14
12:59:33,977::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482193::INFO::2016-06-14
12:59:34,033::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482194::INFO::2016-06-14
12:59:34,034::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482194::INFO::2016-06-14
12:59:34,035::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482195::INFO::2016-06-14
12:59:34,035::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482195::INFO::2016-06-14
12:59:34,036::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482196::INFO::2016-06-14
12:59:34,037::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482196::INFO::2016-06-14
12:59:34,037::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482197::INFO::2016-06-14
12:59:38,544::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482197::INFO::2016-06-14
12:59:38,598::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482198::INFO::2016-06-14
12:59:38,598::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482198::INFO::2016-06-14
12:59:38,599::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482199::INFO::2016-06-14
12:59:38,600::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482199::INFO::2016-06-14
12:59:38,600::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482200::INFO::2016-06-14
12:59:38,601::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-482200::INFO::2016-06-14
12:59:38,602::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-482179::INFO::2016-06-14
12:59:40,339::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load)
System load total=0.0078, engine=0., non-engine=0.0078


Thread-482178::INFO::2016-06-14
12:59:49,745::mem_free::53::mem_free.MemFree::(action) memFree: 26500
Thread-481977::ERROR::2016-06-14
12:59:50,263::notifications::35::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email)
[Errno 110] Connection timed out
Traceback (most recent call last):
  File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py",
line 24, in send_email
server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])
  File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
(code, msg) = self.connect(host, port)
  File "/usr/lib64/python2.7/smtplib.py", line 315, in 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-15 Thread Martin Sivak
Charles, please check the broker log too. It is possible that the
broker process is running but is not accepting connections, for
example.
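
For example (assuming the usual log location):

tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log   # look for listener errors or tracebacks around the agent restarts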

Martin

On Wed, Jun 15, 2016 at 3:32 PM, Charles Kozler  wrote:
> Actually, the broker is the only thing acting "right" between the broker and
> the agent. The broker is up when I bring the system up, but the agent is
> restarting all the time. Have a look:
>
> The 11th is when I restarted this node after doing 'reinstall' in the web UI
>
> ● ovirt-ha-broker.service - oVirt Hosted Engine High Availability
> Communications Broker
>Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled;
> vendor preset: disabled)
>Active: active (running) since Sat 2016-06-11 13:09:51 EDT; 3 days ago
>  Main PID: 1285 (ovirt-ha-broker)
>CGroup: /system.slice/ovirt-ha-broker.service
>└─1285 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
>
> Jun 15 09:23:56 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:mem_free.MemFree:memFree: 26408
>
> Uptime of the process:
>
> # ps -Aef | grep -i broker
> vdsm   1285  1  2 Jun11 ?02:27:50 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
>
> But the agent... is restarting all the time
>
> # ps -Aef | grep -i ovirt-ha-agent
> vdsm  76116  1  0 09:19 ?00:00:01 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>
> 9:19 AM ET is the last restart. Even the logs say it:
>
> [root@njsevcnp01 ovirt-hosted-engine-ha]# grep -i 'restarting agent'
> agent.log | wc -l
> 232719
>
> And the restarts come roughly every 35 seconds:
>
> [root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i
> 'restarting agent'
> MainThread::WARNING::2016-06-15
> 09:23:53,029::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '6'
> MainThread::WARNING::2016-06-15
> 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '7'
> MainThread::WARNING::2016-06-15
> 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '8'
> MainThread::WARNING::2016-06-15
> 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '9'
> MainThread::WARNING::2016-06-15
> 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '0'
> MainThread::WARNING::2016-06-15
> 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '1'
>
> The full log of a restart looks like this, saying "connection timed out", but
> it's not saying *what* is timing out, so I have nothing else to really go on here:
>
> [root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i
> restart
> MainThread::ERROR::2016-06-15
> 09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor , options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '7'
> MainThread::ERROR::2016-06-15
> 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor , options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '8'
> MainThread::ERROR::2016-06-15
> 09:25:35,785::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-15 Thread Charles Kozler
Actually, the broker is the only thing acting "right" between the broker and
the agent. The broker is up when I bring the system up, but the agent is
restarting all the time. Have a look:

The 11th is when I restarted this node after doing 'reinstall' in the web UI

● ovirt-ha-broker.service - oVirt Hosted Engine High Availability
Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service;
enabled; vendor preset: disabled)
   Active: active (running) since Sat 2016-06-11 13:09:51 EDT; 3 days ago
 Main PID: 1285 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
   └─1285 /usr/bin/python
/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

Jun 15 09:23:56 njsevcnp01 ovirt-ha-broker[1285]:
INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
INFO:mem_free.MemFree:memFree: 26408

Uptime of the process:

# ps -Aef | grep -i broker
vdsm   1285  1  2 Jun11 ?02:27:50 /usr/bin/python
/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

But the agent... is restarting all the time

# ps -Aef | grep -i ovirt-ha-agent
vdsm  76116  1  0 09:19 ?00:00:01 /usr/bin/python
/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

9:19 AM ET is the last restart. Even the logs say it:

[root@njsevcnp01 ovirt-hosted-engine-ha]# grep -i 'restarting agent'
agent.log | wc -l
232719

And the restarts come roughly every 35 seconds:

[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i
'restarting agent'
MainThread::WARNING::2016-06-15
09:23:53,029::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '6'
MainThread::WARNING::2016-06-15
09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '7'
MainThread::WARNING::2016-06-15
09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '8'
MainThread::WARNING::2016-06-15
09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '9'
MainThread::WARNING::2016-06-15
09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '0'
MainThread::WARNING::2016-06-15
09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '1'

The full log of a restart looks like this, saying "connection timed out", but
it's not saying *what* is timing out, so I have nothing else to really go on here:

[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i
restart
MainThread::ERROR::2016-06-15
09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Error: 'Failed to start monitor , options {'hostname':
'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15
09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '7'
MainThread::ERROR::2016-06-15
09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Error: 'Failed to start monitor , options {'hostname':
'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15
09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '8'
MainThread::ERROR::2016-06-15
09:25:35,785::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Error: 'Failed to start monitor , options {'hostname':
'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15
09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Restarting agent, attempt '9'
MainThread::ERROR::2016-06-15
09:26:12,131::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Error: 'Failed to start monitor , options {'hostname':
'njsevcnp01'}: Connection 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-15 Thread Martin Sivak
> Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed:
> Connection timed out
> Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]:
> ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed
> to start monitor , options {'hostname': 'njsevcnp01'}:
> Connection timed out' - trying to restart agent

The broker is broken or down. Check the status of the ovirt-ha-broker service.

> The other interesting thing is this log from node01. The odd thing is that
> it seems there is some split brain somewhere in oVirt because this log is
> from node02 but it is asking the engine and it's getting back 'vm not running
> on this host' rather than 'stale data'. But I don't know engine internals.

This is another piece that points to broker or storage issues. The agent
collects local data and then publishes it to the other nodes through the
broker. So it is possible for the agent to know the status of the VM
locally, but not be able to publish it.

The hosted-engine command line tool then reads the synchronization
whiteboard too, but it does not see anything that was not published
and ends up reporting stale data.
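
One way to see whether anything is actually being published is to watch the
shared metadata file each agent writes under ha_agent/ on the hosted-engine
storage domain (the path below is taken from your own log output); if its
modification time stops advancing while the agent is up, that agent is not
writing to the whiteboard:

ls -l --full-time /rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/
hosted-engine --vm-status   # then compare what each host reports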

>> What is the status of the hosted engine services? systemctl status
>> ovirt-ha-agent ovirt-ha-broker

Please check the services.

Best regards

Martin

On Tue, Jun 14, 2016 at 2:16 PM, Charles Kozler  wrote:
> Martin -
>
> One thing I noticed on all of the nodes is this:
>
> Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed:
> Connection timed out
> Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]:
> ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed
> to start monitor , options {'hostname': 'njsevcnp01'}:
> Connection timed out' - trying to restart agent
>
> Then the agent is restarted
>
> [root@njsevcnp01 ~]# ps -Aef | grep -i ovirt-ha-agent | grep -iv grep
> vdsm  15713  1  0 08:09 ?00:00:01 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>
> I don't know why the connection would time out because, as you can see, that
> log is from node01 and I can't figure out why it's timing out on the
> connection.
>
> The other interesting thing is this log from node01. The odd thing is that
> it seems there is some split brain somewhere in oVirt because this log is
> from node02 but it is asking the engine and it's getting back 'vm not running
> on this host' rather than 'stale data'. But I don't know engine internals.
>
> MainThread::INFO::2016-06-14
> 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp02 (id 2): {hostname: njsevcnp02, host-id: 2, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: 25da07df,
> host-ts: 3030}
> MainThread::INFO::2016-06-14
> 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb,
> host-ts: 10877406}
>
>
> And that same log on node02 where the engine is running
>
>
> MainThread::INFO::2016-06-14
> 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp01 (id 1): {hostname: njsevcnp01, host-id: 1, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: 260dbf06,
> host-ts: 327}
> MainThread::INFO::2016-06-14
> 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb,
> host-ts: 10877406}
> MainThread::INFO::2016-06-14
> 08:15:44,451::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Local (id 2): {engine-health: {health: good, vm: up, detail: up}, bridge:
> True, mem-free: 20702.0, maintenance: False, cpu-load: None, gateway: True}
> MainThread::INFO::2016-06-14
> 08:15:44,452::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1465906544.45 type=state_transition
> detail=StartState-ReinitializeFSM hostname=njsevcnp02
>
>
>
>
>
>
>
>
> On Tue, Jun 14, 2016 at 7:59 AM, Martin Sivak  wrote:
>>
>> Hi,
>>
>> is there anything interesting in the hosted engine log files?
>> /var/log/ovirt-hosted-engine-ha/agent.log
>>
>> There should be something appearing there every 10 seconds 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-14 Thread Sahina Bose
Are the ovirt-ha-agent and ovirt-ha-broker services running on all the 
nodes? If they are, check the agent.log and broker.log for errors.
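
For example (assuming the usual /var/log/ovirt-hosted-engine-ha location for
both logs):

grep -iE 'error|warning' /var/log/ovirt-hosted-engine-ha/agent.log  | tail -n 20
grep -iE 'error|warning' /var/log/ovirt-hosted-engine-ha/broker.log | tail -n 20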


On 06/14/2016 05:29 PM, Charles Kozler wrote:
Anyone have any other possible information on this? I've noticed this
issue before, and usually it just takes a bit of time for the cluster
to 'settle' after some node reboots, but it's been a few days and it's
still marked as stale.




--== Host 1 status ==--

Status up-to-date  : False
Hostname   : njsevcnp01
Host ID: 1
Engine status  : unknown stale-data
Score  : 0
stopped: True
Local maintenance  : False
crc32  : 260dbf06
Host timestamp : 327


--== Host 2 status ==--

Status up-to-date  : False
Hostname   : njsevcnp02
Host ID: 2
Engine status  : unknown stale-data
Score  : 0
stopped: True
Local maintenance  : False
crc32  : 25da07df
Host timestamp : 3030


--== Host 3 status ==--

Status up-to-date  : False
Hostname   : njsevcnp03
Host ID: 3
Engine status  : unknown stale-data
Score  : 0
stopped: True
Local maintenance  : False
crc32  : c67818cb
Host timestamp : 10877406


&& vdsClient on node2 showing hosted engine is up on node 2

48207078-8cb0-413c-8984-40aa772f4d94
Status = Up
nicModel = rtl8139,pv
statusTime = 4540044460
emulatedMachine = pc
pid = 30571
vmName = HostedEngine
devices = [{'device': 'memballoon', 'specParams': {'model': 'none'}, 
'type': 'balloon', 'alias': 'balloon0'}, {'alias': 'scsi0', 
'deviceId': '17f10db1-2e9e-4422-9ea5-61a628072e29', 'address': 
{'slot': '0x04', 'bus': '0x00', 'domain': '0x', 'type': 'pci', 
'function': '0x0'}, 'device': 'scsi', 'model': 'virtio-scsi', 'type': 
'controller'}, {'device': 'usb', 'alias': 'usb', 'type': 'controller', 
'deviceId': '9be34ac0-7d00-4a95-bdfe-5b328fc1355b', 'address': 
{'slot': '0x01', 'bus': '0x00', 'domain': '0x', 'type': 'pci', 
'function': '0x2'}}, {'device': 'ide', 'alias': 'ide', 'type': 
'controller', 'deviceId': '222629a8-0dd6-4e8e-9b42-43aac314c0c2', 
'address': {'slot': '0x01', 'bus': '0x00', 'domain': '0x', 'type': 
'pci', 'function': '0x1'}}, {'device': 'virtio-serial', 'alias': 
'virtio-serial0', 'type': 'controller', 'deviceId': 
'7cbccd04-853a-408f-94c2-5b10b641b7af', 'address': {'slot': '0x05', 
'bus': '0x00', 'domain': '0x', 'type': 'pci', 'function': '0x0'}}, 
{'device': 'vnc', 'specParams': {'spiceSecureChannels': 
'smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir', 
'displayIp': '0'}, 'type': 'graphics', 'port': '5900'}, {'nicModel': 
'pv', 'macAddr': '00:16:3e:16:83:91', 'linkActive': True, 'network': 
'ovirtmgmt', 'alias': 'net0', 'deviceId': 
'3f679659-142c-41f3-a69d-4264d7234fbc', 'address': {'slot': '0x03', 
'bus': '0x00', 'domain': '0x', 'type': 'pci', 'function': '0x0'}, 
'device': 'bridge', 'type': 'interface', 'name': 'vnet0'}, {'address': 
{'slot': '0x06', 'bus': '0x00', 'domain': '0x', 'type': 'pci', 
'function': '0x0'}, 'volumeInfo': {'domainID': 
'c6323975-2966-409d-b9e0-48370a513a98', 'volType': 'path', 
'leaseOffset': 0, 'volumeID': 'aa66d378-5a5f-490c-b0ab-993b79838d95', 
'leasePath': 
'/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95.lease', 
'imageID': '8518ef4a-7b17-4291-856c-81875ba4e264', 'path': 
'/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95'}, 
'index': '0', 'iface': 'virtio', 'apparentsize': '10737418240', 
'imageID': '8518ef4a-7b17-4291-856c-81875ba4e264', 'readonly': 
'False', 'shared': 'exclusive', 'truesize': '6899802112', 'type': 
'disk', 'domainID': 'c6323975-2966-409d-b9e0-48370a513a98', 'reqsize': 
'0', 'format': 'raw', 'deviceId': 
'8518ef4a-7b17-4291-856c-81875ba4e264', 'poolID': 
'----', 'device': 'disk', 'path': 
'/var/run/vdsm/storage/c6323975-2966-409d-b9e0-48370a513a98/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95', 
'propagateErrors': 'off', 'name': 'vda', 'bootOrder': '1', 'volumeID': 
'aa66d378-5a5f-490c-b0ab-993b79838d95', 'alias': 'virtio-disk0', 
'volumeChain': [{'domainID': 'c6323975-2966-409d-b9e0-48370a513a98', 
'volType': 'path', 'leaseOffset': 0, 'volumeID': 
'aa66d378-5a5f-490c-b0ab-993b79838d95', 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-14 Thread Charles Kozler
Anyone have any other possible information on this? I've noticed this issue
before, and usually it just takes a bit of time for the cluster to 'settle'
after some node reboots, but it's been a few days and it's still marked as
stale.



--== Host 1 status ==--

Status up-to-date  : False
Hostname   : njsevcnp01
Host ID: 1
Engine status  : unknown stale-data
Score  : 0
stopped: True
Local maintenance  : False
crc32  : 260dbf06
Host timestamp : 327


--== Host 2 status ==--

Status up-to-date  : False
Hostname   : njsevcnp02
Host ID: 2
Engine status  : unknown stale-data
Score  : 0
stopped: True
Local maintenance  : False
crc32  : 25da07df
Host timestamp : 3030


--== Host 3 status ==--

Status up-to-date  : False
Hostname   : njsevcnp03
Host ID: 3
Engine status  : unknown stale-data
Score  : 0
stopped: True
Local maintenance  : False
crc32  : c67818cb
Host timestamp : 10877406


&& vdsClient on node2 showing hosted engine is up on node 2

48207078-8cb0-413c-8984-40aa772f4d94
Status = Up
nicModel = rtl8139,pv
statusTime = 4540044460
emulatedMachine = pc
pid = 30571
vmName = HostedEngine
devices = [{'device': 'memballoon', 'specParams': {'model': 'none'},
'type': 'balloon', 'alias': 'balloon0'}, {'alias': 'scsi0', 'deviceId':
'17f10db1-2e9e-4422-9ea5-61a628072e29', 'address': {'slot': '0x04', 'bus':
'0x00', 'domain': '0x', 'type': 'pci', 'function': '0x0'}, 'device':
'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'device': 'usb',
'alias': 'usb', 'type': 'controller', 'deviceId':
'9be34ac0-7d00-4a95-bdfe-5b328fc1355b', 'address': {'slot': '0x01', 'bus':
'0x00', 'domain': '0x', 'type': 'pci', 'function': '0x2'}}, {'device':
'ide', 'alias': 'ide', 'type': 'controller', 'deviceId':
'222629a8-0dd6-4e8e-9b42-43aac314c0c2', 'address': {'slot': '0x01', 'bus':
'0x00', 'domain': '0x', 'type': 'pci', 'function': '0x1'}}, {'device':
'virtio-serial', 'alias': 'virtio-serial0', 'type': 'controller',
'deviceId': '7cbccd04-853a-408f-94c2-5b10b641b7af', 'address': {'slot':
'0x05', 'bus': '0x00', 'domain': '0x', 'type': 'pci', 'function':
'0x0'}}, {'device': 'vnc', 'specParams': {'spiceSecureChannels':
'smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir',
'displayIp': '0'}, 'type': 'graphics', 'port': '5900'}, {'nicModel': 'pv',
'macAddr': '00:16:3e:16:83:91', 'linkActive': True, 'network': 'ovirtmgmt',
'alias': 'net0', 'deviceId': '3f679659-142c-41f3-a69d-4264d7234fbc',
'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x', 'type':
'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface', 'name':
'vnet0'}, {'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x',
'type': 'pci', 'function': '0x0'}, 'volumeInfo': {'domainID':
'c6323975-2966-409d-b9e0-48370a513a98', 'volType': 'path', 'leaseOffset':
0, 'volumeID': 'aa66d378-5a5f-490c-b0ab-993b79838d95', 'leasePath':
'/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95.lease',
'imageID': '8518ef4a-7b17-4291-856c-81875ba4e264', 'path':
'/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95'},
'index': '0', 'iface': 'virtio', 'apparentsize': '10737418240', 'imageID':
'8518ef4a-7b17-4291-856c-81875ba4e264', 'readonly': 'False', 'shared':
'exclusive', 'truesize': '6899802112', 'type': 'disk', 'domainID':
'c6323975-2966-409d-b9e0-48370a513a98', 'reqsize': '0', 'format': 'raw',
'deviceId': '8518ef4a-7b17-4291-856c-81875ba4e264', 'poolID':
'----', 'device': 'disk', 'path':
'/var/run/vdsm/storage/c6323975-2966-409d-b9e0-48370a513a98/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95',
'propagateErrors': 'off', 'name': 'vda', 'bootOrder': '1', 'volumeID':
'aa66d378-5a5f-490c-b0ab-993b79838d95', 'alias': 'virtio-disk0',
'volumeChain': [{'domainID': 'c6323975-2966-409d-b9e0-48370a513a98',
'volType': 'path', 'leaseOffset': 0, 'volumeID':
'aa66d378-5a5f-490c-b0ab-993b79838d95', 'leasePath':
'/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95.lease',
'imageID': 

Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-13 Thread Charles Kozler
It is up. I can do "ps -Aef | grep -i qemu-kvm | grep -i hosted" and see it
running. I also forcefully shut it down with hosted-engine --vm-stop when
it was on node 1 and then did --vm-start on node 2, and it came up. Also, the
Web UI is reachable, so that's how I also know the hosted engine VM is running.

On Mon, Jun 13, 2016 at 8:24 AM, Alexis HAUSER <
alexis.hau...@telecom-bretagne.eu> wrote:

>
> > http://imgur.com/a/6xkaS
>
> I had similar errors with a single host and a hosted-engine VM.
> My case may well be totally different, but one thing you could try first is
> to check that the VM is really up.
> In my case, the VM was shown by the hosted-engine command as up, but it was
> actually down. With the vdsClient command, you can check its status in more detail.
>
> What is the result of the following command for you?
>
>  vdsClient -s 0 list
>



-- 

*Charles Kozler*
*Vice President, IT Operations*

FIX Flyer, LLC
225 Broadway | Suite 1600 | New York, NY 10007
1-888-349-3593
http://www.fixflyer.com 



Re: [ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-13 Thread Alexis HAUSER

> http://imgur.com/a/6xkaS 

I had similar errors with a single host and a hosted-engine VM.
My case may well be totally different, but one thing you could try first is to
check that the VM is really up.
In my case, the VM was shown by the hosted-engine command as up, but it was
actually down. With the vdsClient command, you can check its status in more detail.

What is the result of the following command for you?

 vdsClient -s 0 list


[ovirt-users] hosted-engine vm-status stale data and cluster seems "broken"

2016-06-11 Thread Charles Kozler
Please see the linked images. As you can see, all three nodes are reporting
stale data. The results of this are:

1. Not all VMs migrate seamlessly in the cluster. Sometimes I have to shut
them down to get them to be able to migrate again

2. The hosted engine refuses to move due to constraints (image). This part
doesn't make sense to me because I can forcefully shut it down and then go
directly onto a hosted engine node and bring it back up. Also, the Web UI
shows all nodes under the cluster, except then it thinks it is not a part of
the cluster

3. Time is in sync (image)

4. Storage is 100% fine. The Gluster back end reports mirroring and status
'started'. No split brain has occurred, and the oVirt nodes have never lost
connectivity to storage

5. I reinstalled all three nodes. For some reason only node 3 still shows
as having updates available (image). For clarity, I did not click
"upgrade"; I simply did 'reinstall' from the Web UI. Having looked at the
output and yum.log from /var/log, it almost looks like it did do an update.
All package versions across all three nodes are the same (with respect to
ovirt/vdsm) (image). For some reason,
though, ovirt-engine-appliance-3.6-20160126.1.el7.centos.noarch exists on
node 1 but not on node 2 or 3. Could this be relevant? I don't recall
installing that specifically on node 1, but I may have
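
If it helps, the installed package versions can be compared across the nodes
with something like this on each host:

rpm -qa | grep -iE 'ovirt|vdsm' | sort   # run on each node and diff the outputs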

I've been slamming my head on this, so I am hoping you can provide some assistance.

http://imgur.com/a/6xkaS

Thanks!

-- 

*Charles Kozler*
*Vice President, IT Operations*

FIX Flyer, LLC
225 Broadway | Suite 1600 | New York, NY 10007
1-888-349-3593
http://www.fixflyer.com 
