Hi Simone, and thanks for your help.

So far I have found out that there is some problem with the local copy of the HostedEngine config (see the attached part of vdsm.log). I found an older XML configuration (in an old vdsm.log) and defining the VM works, but powering it on reports:

[root@ovirt1 ~]# virsh define hosted-engine.xml
Domain HostedEngine defined from hosted-engine.xml

[root@ovirt1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     HostedEngine                   shut off

[root@ovirt1 ~]# virsh start HostedEngine
error: Failed to start domain HostedEngine
error: Network not found: no network with matching name 'vdsm-ovirtmgmt'

[root@ovirt1 ~]# virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
 ;vdsmdummy;          active     no            no
 default              inactive   no            yes

[root@ovirt1 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
ovirtmgmt       8000.bc5ff467f5b3       no              enp2s0

[root@ovirt1 ~]# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovirtmgmt state UP group default qlen 1000
    link/ether bc:5f:f4:67:f5:b3 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether f6:78:c7:2d:32:f9 brd ff:ff:ff:ff:ff:ff
4: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 66:36:dd:63:dc:48 brd ff:ff:ff:ff:ff:ff
20: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether bc:5f:f4:67:f5:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.90/24 brd 192.168.1.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
    inet 192.168.1.243/24 brd 192.168.1.255 scope global secondary ovirtmgmt
       valid_lft forever preferred_lft forever
    inet6 fe80::be5f:f4ff:fe67:f5b3/64 scope link
       valid_lft forever preferred_lft forever
21: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ce:36:8d:b7:64:bd brd ff:ff:ff:ff:ff:ff
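For context on the 'vdsm-ovirtmgmt' error: vdsm normally defines a transient libvirt network with that name on top of the ovirtmgmt bridge, which is why virsh cannot find it while vdsm's definitions are gone. Below is a minimal sketch of recreating it by hand; the vdsm-<bridge> naming and the plain bridge-forward definition are assumptions based on a standard setup, so compare with a working host first:

```shell
# Sketch (assumption: vdsm names its transient libvirt networks
# "vdsm-<bridge>" and forwards them straight to the host bridge).
cat > /tmp/vdsm-ovirtmgmt.xml <<'EOF'
<network>
  <name>vdsm-ovirtmgmt</name>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>
EOF

# net-create makes the network transient, matching vdsm's own behaviour.
# Guarded so the snippet is a no-op on machines without libvirt.
if command -v virsh >/dev/null 2>&1; then
    virsh net-create /tmp/vdsm-ovirtmgmt.xml
    virsh net-list --all   # vdsm-ovirtmgmt should now show as active
fi
```

Restarting vdsmd should recreate this network as well; the manual definition is only a stopgap so that "virsh start HostedEngine" can proceed.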
192.168.1.243/24 is one of the IPs in ctdb. So now comes the question: is there an XML in the logs that defines the network? My hope is to power up the HostedEngine properly, and that it will push all the configurations to the right places... maybe this is way too optimistic. At least I have learned a lot about oVirt.

Best Regards,
Strahil Nikolov

On Thursday, March 7, 2019, 17:55:12 GMT+2, Simone Tiraboschi <stira...@redhat.com> wrote:

On Thu, Mar 7, 2019 at 2:54 PM Strahil Nikolov <hunter86...@yahoo.com> wrote:

> The OVF_STORE volume is going to get periodically recreated by the engine, so at least you need a running engine.
> In order to avoid this kind of issue we have two OVF_STORE disks, in your case:
> MainThread::INFO::2019-03-06 06:50:02,391::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:441abdc8-6cb1-49a4-903f-a1ec0ed88429, volUUID:c3309fc0-8707-4de1-903d-8d4bbb024f81
> MainThread::INFO::2019-03-06 06:50:02,748::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:94ade632-6ecc-4901-8cec-8e39f3d69cb0, volUUID:9460fc4b-54f3-48e3-b7b6-da962321ecf4
> Can you please check if you have at least the second copy?

The second copy is empty too:

[root@ovirt1 ~]# ll /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429
total 66561
-rw-rw----. 1 vdsm kvm       0 Mar  4 05:23 c3309fc0-8707-4de1-903d-8d4bbb024f81
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.lease
-rw-r--r--. 1 vdsm kvm     435 Mar  4 05:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.meta

> And even in the case you lost both, we are storing the initial vm.conf on the shared storage:
> MainThread::ERROR::2019-03-06 06:50:02,971::config_ovf::70::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm::(_get_vm_conf_content_from_ovf_store) Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf
> Can you please check what you have in /var/run/ovirt-hosted-engine-ha/vm.conf?

It exists and has the following:

[root@ovirt1 ~]# cat /var/run/ovirt-hosted-engine-ha/vm.conf
# Editing the hosted engine VM is only possible via the manager UI\API
# This file was generated at Thu Mar  7 15:37:26 2019
vmId=8474ae07-f172-4a20-b516-375c73903df7
memSize=4096
display=vnc
devices={index:2,iface:ide,address:{ controller:0, target:0,unit:0, bus:1, type:drive},specParams:{},readonly:true,deviceId:,path:,device:cdrom,shared:false,type:disk}
devices={index:0,iface:virtio,format:raw,poolID:00000000-0000-0000-0000-000000000000,volumeID:a9ab832f-c4f2-4b9b-9d99-6393fd995979,imageID:8ec7a465-151e-4ac3-92a7-965ecf854501,specParams:{},readonly:false,domainID:808423f9-8a5c-40cd-bc9f-2568c85b8c74,optional:false,deviceId:a9ab832f-c4f2-4b9b-9d99-6393fd995979,address:{bus:0x00, slot:0x06, domain:0x0000, type:pci, function:0x0},device:disk,shared:exclusive,propagateErrors:off,type:disk,bootOrder:1}
devices={device:scsi,model:virtio-scsi,type:controller}
devices={nicModel:pv,macAddr:00:16:3e:62:72:c8,linkActive:true,network:ovirtmgmt,specParams:{},deviceId:,address:{bus:0x00, slot:0x03, domain:0x0000, type:pci, function:0x0},device:bridge,type:interface}
devices={device:console,type:console}
devices={device:vga,alias:video0,type:video}
devices={device:vnc,type:graphics}
vmName=HostedEngine
spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
smp=1
maxVCpus=8
cpuType=Opteron_G5
emulatedMachine=emulated_machine_list.json['values']['system_option_value'][0]['value'].replace('[','').replace(']','').split(', ')|first
devices={device:virtio,specParams:{source:urandom},model:virtio,type:rng}

You should be able to copy it to /root/myvm.conf.xml and start the engine VM with:

hosted-engine --vm-start --vm-conf=/root/myvm.conf

Also, I think this happened when I was upgrading ovirt1 (the last in the gluster cluster) from 4.3.0 to 4.3.1. The engine got restarted because I had forgotten to enable global maintenance.

> Sorry, I don't understand.
> Can you please explain what happened?

I updated the engine first -> all OK. Next was the arbiter -> again, no issues with it. Next was the empty host, ovirt2, and everything went OK. After that I migrated the engine to ovirt2 and tried to update ovirt1. The web UI showed that the installation failed, but "yum update" was working. During the update of ovirt1 via yum, the engine app crashed and restarted on ovirt2. After the reboot of ovirt1 I noticed the error about pinging the gateway, so I stopped the engine and stopped the following services on both hosts (global maintenance): ovirt-ha-agent, ovirt-ha-broker, vdsmd, supervdsmd, sanlock. Next was a reinitialization of the sanlock space via 'sanlock direct -s'. In the end I managed to power on the hosted engine and it was running for a while. As the errors did not stop, I decided to shut everything down, power it back up, heal gluster, and check what would happen. Currently I'm not able to power up the engine:

[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-status

!! Cluster is in GLOBAL MAINTENANCE mode !!

Please notice that in global maintenance mode nothing will try to start the engine VM for you.
I assume you tried to exit global maintenance mode, or at least you tried to manually start it with hosted-engine --vm-start, right?
--== Host ovirt1.localdomain (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 45e6772b
local_conf_timestamp               : 288
Host timestamp                     : 287
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=287 (Thu Mar  7 15:34:06 2019)
        host-id=1
        score=3400
        vm_conf_refresh_time=288 (Thu Mar  7 15:34:07 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

--== Host ovirt2.localdomain (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 2e9a0444
local_conf_timestamp               : 3886
Host timestamp                     : 3885
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3885 (Thu Mar  7 15:34:05 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=3886 (Thu Mar  7 15:34:06 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

!! Cluster is in GLOBAL MAINTENANCE mode !!

[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': '8474ae07-f172-4a20-b516-375c73903df7'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': u'8474ae07-f172-4a20-b516-375c73903df7'})

[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
VM exists and is down, cleaning up and restarting

[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-status

!! Cluster is in GLOBAL MAINTENANCE mode !!

--== Host ovirt1.localdomain (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "Down"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 6b086b7c
local_conf_timestamp               : 328
Host timestamp                     : 327
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=327 (Thu Mar  7 15:34:46 2019)
        host-id=1
        score=3400
        vm_conf_refresh_time=328 (Thu Mar  7 15:34:47 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

--== Host ovirt2.localdomain (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : c5890e9c
local_conf_timestamp               : 3926
Host timestamp                     : 3925
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3925 (Thu Mar  7 15:34:45 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=3926 (Thu Mar  7 15:34:45 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

!! Cluster is in GLOBAL MAINTENANCE mode !!

[root@ovirt1 ovirt-hosted-engine-ha]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     HostedEngine                   shut off

I am really puzzled why both volumes are wiped out.

This is really scary: can you please double-check the gluster logs for warnings and errors?

Best Regards,
Strahil Nikolov
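A side note on the status output above: nothing will auto-start the engine VM until global maintenance is exited with "hosted-engine --set-maintenance --mode=none". Also, since the "Engine status" field in the --vm-status output is a JSON blob, it can be pulled out for quick comparison across hosts. A small sketch, using sample lines copied from the output above (the /tmp path is arbitrary):

```shell
# Extract the JSON "Engine status" values from saved vm-status output.
# The two sample lines are taken from the status dumps above.
cat > /tmp/vm-status.txt <<'EOF'
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "Down"}
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
EOF

# Keep only the JSON part of each "Engine status" line:
sed -n 's/^Engine status[[:space:]]*: //p' /tmp/vm-status.txt
```

Each printed line can then be fed to any JSON tool to compare "vm", "health", and "reason" per host.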
[Attachment: failed-hosted-engine-vdsm.log (binary data)]
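On the empty OVF_STORE copies discussed earlier: the "Failed extracting VM OVF from the OVF_STORE volume" log line suggests the agent reads the volume as a tar archive holding one <vmId>.ovf entry per VM. That format is an assumption here, worth verifying against the ovirt-hosted-engine-ha sources. A runnable sketch of the check, using a stand-in archive since the real volume path (/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429/c3309fc0-8707-4de1-903d-8d4bbb024f81) only exists on the host:

```shell
# Build a stand-in OVF_STORE archive so the check runs anywhere; on the
# real host you would point tar at the volume file from the listing above.
VM_ID=8474ae07-f172-4a20-b516-375c73903df7   # vmId from vm.conf above
WORK=$(mktemp -d)
printf '<ovf:Envelope>...</ovf:Envelope>' > "$WORK/$VM_ID.ovf"
tar -cf "$WORK/store.tar" -C "$WORK" "$VM_ID.ovf"

# A healthy OVF_STORE copy lists a <vmId>.ovf entry; a 0-byte volume
# like the ones shown earlier obviously cannot contain one.
tar -tf "$WORK/store.tar"
```

If the listing on the real volume shows no <vmId>.ovf entry (or tar refuses the file outright), that copy is unusable and the agent's fall-back to the initial vm.conf is expected.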
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/TCNPAHJMT6PGJB6TGO3CUYQJOCGSH2EC/