On 2012-5-15 14:21, Haim Ateya wrote:

----- Original Message -----
From: "Shu Ming"<[email protected]>
To: "Haim Ateya"<[email protected]>
Cc: "[email protected]"<[email protected]>
Sent: Tuesday, May 15, 2012 9:03:42 AM
Subject: Re: [Users] The SPM host  node is in unresponsive mode

On 2012-5-15 12:19, Haim Ateya wrote:
----- Original Message -----
From: "Shu Ming"<[email protected]>
To: "[email protected]"<[email protected]>
Sent: Tuesday, May 15, 2012 4:56:36 AM
Subject: [Users] The SPM host  node is in unresponsive mode

Hi,
I attached one host node to my engine.  Because it is the only node,
it is automatically the SPM node, and it used to run well in my
engine.  Yesterday, some errors occurred in the host node's network,
which made the node become "unresponsive" in the engine.  I am sure
the network errors are fixed, and I want to bring the node back to
life now.  However, I found that the "confirm host has been rebooted"
action could not be applied to this single node, and the node could
not be put into maintenance mode.  The reason given is that there is
no active host in the datacenter and the SPM cannot enter maintenance
mode.  It seems to have fallen into a logic loop here.  Losing the
network can be quite common in a development environment, and even in
a production environment, so I think we should have a way to repair a
host node whose network has been down for a while.

Hi Shu,

First, for the manual fence to work ("confirm host has been
rebooted"), you will need another host in the cluster to act as a
proxy and send the actual manual fence command.
Second, you are absolutely right: loss of network is a common
scenario, and we should be able to recover from it. But let's try to
understand why your host remains unresponsive after the network
returned.
Please ssh to the host and try the following:

- vdsClient -s 0 getVdsCaps (a sanity check that the vdsm service is
up and running and answers on its network socket from localhost)
[root@ovirt-node1 ~]# vdsClient -s 0 getVdsCaps
Connection to 9.181.129.110:54321 refused
[root@ovirt-node1 ~]#

[root@ovirt-node1 ~]# ps -ef |grep vdsm
root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm
[root@ovirt-node1 ~]# service vdsmd start
Redirecting to /bin/systemctl  start vdsmd.service

[root@ovirt-node1 ~]# ps -ef |grep vdsm
root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm

It seems that the VDSM process was gone while the libvirtd spawned by
VDSM was still there.  I then tried to start the VDSM daemon, but
nothing happened.  In the vdsm.log file, the latest message was from
five hours ago and was not useful.  There was no useful message in
libvirtd.log either.
[HA] The problem is that systemctl doesn't show the real reason why the
service didn't start, so let's try the following:
- # cd /lib/systemd/
- # ./systemd-vdsmd restart



[root@ovirt-node1 systemd]# ./systemd-vdsmd start
WARNING: no socket to connect to
vdsm: libvirt already configured for vdsm                  [  OK  ]
Starting iscsid:
Starting libvirtd (via systemctl):                         [  OK  ]
Stopping network (via systemctl):                          [  OK  ]
Starting network (via systemctl): Job failed. See system logs and 'systemctl status' for details.
                                                           [FAILED]
Starting up vdsm daemon:
vdsm start                                                 [  OK  ]



I did further testing on this system. After I killed the lone libvirtd process, the vdsm process could be started without libvirtd. However, vdsm could not work that way either. After several rounds of "killall libvirtd", "service vdsmd start", and "vdsmd stop", both the vdsm and libvirtd processes now start. In summary:
1) The libvirtd started by the vdsm process may linger even after its parent vdsm process is gone.
2) The legacy libvirtd may block the startup of the vdsm service.
3) The vdsm service can sometimes work with the legacy libvirtd without creating a new one.
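
Point 1 in the summary above can be checked mechanically. The following is only a minimal sketch, not part of vdsm: a hypothetical shell function, find_legacy_libvirtd, that reads `ps -ef` output on stdin and prints the PID of any libvirtd whose start time (STIME) is earlier than that of the earliest process owned by the vdsm user. It assumes all processes started on the same day, so that STIME is in HH:MM form and compares correctly as a string.

```shell
#!/bin/sh
# Hypothetical helper (not part of vdsm): reads `ps -ef` output on stdin and
# prints the PID of every libvirtd that started before the earliest process
# owned by the vdsm user.
# Assumption: all STIME values are same-day HH:MM strings.
find_legacy_libvirtd() {
    awk '
        # Field 8 is the command, field 5 is STIME, field 2 is the PID.
        $8 ~ /\/libvirtd$/ { n++; pid[n] = $2; stime[n] = $5 "" }

        # Remember the earliest start time among processes run as user vdsm.
        $1 == "vdsm" { if (vt == "" || ($5 "") < vt) vt = $5 "" }

        END {
            if (vt == "") exit            # no vdsm process seen at all
            for (i = 1; i <= n; i++)
                if (stime[i] < vt) print pid[i]
        }
    '
}

# Typical use on the host:
#   ps -ef | find_legacy_libvirtd
```

An empty result means no libvirtd predates vdsm. A process started on an earlier day shows a date instead of HH:MM in STIME, which this sketch does not handle.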

Here are the service processes on my host node. Please notice that the libvirtd process started earlier than the vdsm process, which means this libvirtd was a legacy process, not one created by the vdsm process in this round. The problem still exists in the engine: I don't have a way to reactivate the host node.

[root@ovirt-node1 systemd]# ps -ef |grep vdsm
root      8738     1  0 14:33 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
vdsm      9900     1  0 14:35 ?        00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm      9903  9900  0 14:35 ?        00:00:01 /usr/bin/python /usr/share/vdsm/vdsm
root      9926  9903  0 14:35 ?        00:00:00 /usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py b0fcae59-a3cc-4591-93e1-4b9a0bdb93c5 9903
root      9927  9926  0 14:35 ?        00:00:00 /usr/bin/python /usr/share/vdsm/supervdsmServer.py b0fcae59-a3cc-4591-93e1-4b9a0bdb93c5 9903
root     10451  4652  0 14:38 pts/0    00:00:00 grep --color=auto vdsm
[root@ovirt-node1 systemd]# ps -ef |grep vdsm
root      8738     1  0 14:33 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
vdsm      9900     1  0 14:35 ?        00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm      9903  9900  0 14:35 ?        00:00:01 /usr/bin/python /usr/share/vdsm/vdsm
root      9926  9903  0 14:35 ?        00:00:00 /usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py b0fcae59-a3cc-4591-93e1-4b9a0bdb93c5 9903
root      9927  9926  0 14:35 ?        00:00:00 /usr/bin/python /usr/share/vdsm/supervdsmServer.py b0fcae59-a3cc-4591-93e1-4b9a0bdb93c5 9903
root     10463  4652  0 14:38 pts/0    00:00:00 grep --color=auto vdsm
[root@ovirt-node1 systemd]# vdsClient -s 0 getVdsCaps
HBAInventory = {'iSCSI': [{'InitiatorName': 'iqn.1994-05.com.redhat:f1b658ea7af8'}], 'FC': []}
        ISCSIInitiatorName = iqn.1994-05.com.redhat:f1b658ea7af8
bondings = {'bond4': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond0': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond1': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond2': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond3': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}}
        clusterLevels = ['3.0', '3.1']
        cpuCores = 12
cpuFlags = fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,pdpe1gb,rdtscp,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,nopl,xtopology,nonstop_tsc,aperfmperf,pni,pclmulqdq,dtes64,monitor,ds_cpl,vmx,smx,est,tm2,ssse3,cx16,xtpr,pdcm,pcid,dca,sse4_1,sse4_2,popcnt,aes,lahf_lm,arat,epb,dts,tpr_shadow,vnmi,flexpriority,ept,vpid,model_coreduo,model_Conroe
        cpuModel = Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
        cpuSockets = 2
        cpuSpeed = 1596.000
emulatedMachines = ['pc-0.14', 'pc', 'fedora-13', 'pc-0.13', 'pc-0.12', 'pc-0.11', 'pc-0.10', 'isapc', 'pc-0.14', 'pc', 'fedora-13', 'pc-0.13', 'pc-0.12', 'pc-0.11', 'pc-0.10', 'isapc']
        guestOverhead = 65
hooks = {'before_vm_migrate_destination': {'50_vhostmd': {'md5': '2aa9ac48ef07de3c94e3428975e9df1a'}}, 'after_vm_destroy': {'50_vhostmd': {'md5': '47f8d385859e4c3c96113d8ff446b261'}}, 'before_vm_dehibernate': {'50_vhostmd': {'md5': '2aa9ac48ef07de3c94e3428975e9df1a'}}, 'before_vm_start': {'50_vhostmd': {'md5': '2aa9ac48ef07de3c94e3428975e9df1a'}, '10_faqemu': {'md5': 'c899c5a7004c29ae2234bd409ddfa39b'}}}
        kvmEnabled = true
        lastClient = 9.181.129.153
        lastClientIface = ovirtmgmt
        management_ip =
        memSize = 72486
        networks = {'ovirtmgmt': {'addr': '9.181.129.110', 'cfg': {'IPADDR': '9.181.129.110', 'ONBOOT': 'yes', 'DELAY': '0', 'NETMASK': '255.255.255.0', 'BOOTPROTO': 'static', 'DEVICE': 'ovirtmgmt', 'TYPE': 'Bridge', 'GATEWAY': '9.181.129.1'}, 'mtu': '1500', 'netmask': '255.255.255.0', 'stp': 'off', 'bridged': True, 'gateway': '9.181.129.1', 'ports': ['eth0']}}
        nics = {'p4p1': {'hwaddr': '00:00:C9:E5:A1:36', 'netmask': '', 'speed': 0, 'addr': '', 'mtu': '1500'}, 'p4p2': {'hwaddr': '00:00:C9:E5:A1:3A', 'netmask': '', 'speed': 0, 'addr': '', 'mtu': '1500'}, 'eth1': {'hwaddr': '5C:F3:FC:E4:32:A2', 'netmask': '', 'speed': 0, 'addr': '', 'mtu': '1500'}, 'eth0': {'hwaddr': '5C:F3:FC:E4:32:A0', 'netmask': '', 'speed': 1000, 'addr': '', 'mtu': '1500'}}
        operatingSystem = {'release': '1', 'version': '16', 'name': 'oVirt Node'}
        packages2 = {'kernel': {'release': '4.fc16.x86_64', 'buildtime': 1332237940.0, 'version': '3.3.0'}, 'spice-server': {'release': '1.fc16', 'buildtime': '1327339129', 'version': '0.10.1'}, 'vdsm': {'release': '0.183.git107644d.fc16.shuming1336622293', 'buildtime': '1336622307', 'version': '4.9.6'}, 'qemu-kvm': {'release': '4.fc16', 'buildtime': '1327954752', 'version': '0.15.1'}, 'libvirt': {'release': '1.fc17', 'buildtime': '1333539009', 'version': '0.9.11'}, 'qemu-img': {'release': '4.fc16', 'buildtime': '1327954752', 'version': '0.15.1'}}
        reservedMem = 321
        software_revision = 0
        software_version = 4.9
        supportedProtocols = ['2.2', '2.3']
        supportedRHEVMs = ['3.0']
        uuid = 47D88E9A-FC0F-11E0-B09A-5CF3FCE432A0_00:00:C9:E5:A1:36
        version_name = Snow Man
        vlans = {}
        vmTypes = ['kvm']
[root@ovirt-node1 systemd]#






- Please ping between host and engine
   It works both ways.


- Please make sure there is no firewall blocking TCP 54321 (on
both host and engine)
No firewall.
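
The port check requested above can also be scripted rather than inferred from firewall settings. This is a minimal sketch, assuming bash (built with /dev/tcp support) and the coreutils timeout command are available; the function name port_open and the 3-second limit are illustrative choices, not part of vdsm or oVirt.

```shell
#!/bin/bash
# Sketch only: succeed if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp redirection and on coreutils `timeout`;
# the name port_open and the 3-second limit are illustrative.
port_open() {
    local host=$1 port=$2
    # The inner bash receives host as $0 and port as $1.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example, run from the engine against the host address seen in this thread:
#   port_open 9.181.129.110 54321 && echo reachable || echo "refused or filtered"
```

Run from the engine, a failure here while vdsmd is running on the host would point to filtering somewhere in between; a failure while vdsmd is down simply matches the "Connection ... refused" message shown earlier.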

Also, please provide vdsm.log (from the time the network issues began)
and spm-lock.log (both located in /var/log/vdsm/).

As for a mitigation, we can always manipulate the db and set it
correctly, but first, let's try the above.

Also, there is no useful message in spm-lock.log.  The latest message
was 24 hours ago.

--
Shu Ming<[email protected]>
IBM China Systems and Technology Laboratory


_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users

