On 07.09.2015 13:54, Dan Kenigsberg wrote:
> On Mon, Sep 07, 2015 at 11:47:48AM +0200, Patrick Hurrelmann wrote:
>> On 06.09.2015 11:30, Dan Kenigsberg wrote:
>>> On Fri, Sep 04, 2015 at 10:26:39AM +0200, Patrick Hurrelmann wrote:
>>>> Hi all,
>>>>
>>>> I just updated my existing oVirt 3.5.3 installation (iSCSI hosted-engine
>>>> on CentOS 7.1). The engine update went fine. Updating the hosts succeeds
>>>> until the first reboot. After a reboot the host does not come up again.
>>>> It is missing all network configuration. All network cfgs in
>>>> /etc/sysconfig/network-scripts are missing except ifcfg-lo. The host
>>>> boots up without working networking. Using IPMI and config backups, I
>>>> was able to restore the lost network configs. Once these are restored
>>>> and the host is rebooted again, all seems to be back to normal. This
>>>> has now happened to 2 updated hosts (this installation has a total of 4
>>>> hosts, so 2 more to debug/try). I'm happy to assist in further
>>>> debugging.
>>>>
>>>> Before updating the second host, I gathered some information. All these
>>>> hosts have 3 physical nics. One is used for the ovirtmgmt bridge and
>>>> the other 2 are used for iSCSI storage vlans.
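[Editor's note: since the recovery above relied on manual config backups, a pre-upgrade backup step is worth automating. A minimal Python sketch follows; the destination directory and function name are illustrative, not part of oVirt or vdsm.]

```python
import os
import shutil
import time

# Path from the thread; the backup destination is an assumption.
SCRIPTS_DIR = "/etc/sysconfig/network-scripts"


def backup_ifcfgs(scripts_dir=SCRIPTS_DIR, dest_root="/root"):
    """Copy every ifcfg-* file into a fresh timestamped directory.

    Returns the backup directory and the list of copied file names,
    so the result can be checked before proceeding with the upgrade.
    """
    dest = os.path.join(
        dest_root, "ifcfg-backup-%s" % time.strftime("%Y%m%d%H%M%S"))
    os.makedirs(dest)
    copied = []
    for name in sorted(os.listdir(scripts_dir)):
        if name.startswith("ifcfg-"):
            # copy2 preserves mtimes, which matter for the forensics below
            shutil.copy2(os.path.join(scripts_dir, name), dest)
            copied.append(name)
    return dest, copied
```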
>>>>
>>>> ifcfgs before update:
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-em1
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=em1
>>>> HWADDR=d0:67:e5:f0:e5:c6
>>>> BRIDGE=ovirtmgmt
>>>> ONBOOT=yes
>>>> NM_CONTROLLED=no
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-lo
>>>> DEVICE=lo
>>>> IPADDR=127.0.0.1
>>>> NETMASK=255.0.0.0
>>>> NETWORK=127.0.0.0
>>>> # If you're having problems with gated making 127.0.0.0/8 a martian,
>>>> # you can change this to something else (255.255.255.255, for example)
>>>> BROADCAST=127.255.255.255
>>>> ONBOOT=yes
>>>> NAME=loopback
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=ovirtmgmt
>>>> TYPE=Bridge
>>>> DELAY=0
>>>> STP=off
>>>> ONBOOT=yes
>>>> IPADDR=1.2.3.16
>>>> NETMASK=255.255.255.0
>>>> GATEWAY=1.2.3.11
>>>> BOOTPROTO=none
>>>> DEFROUTE=yes
>>>> NM_CONTROLLED=no
>>>> HOTPLUG=no
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-p4p1
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=p4p1
>>>> HWADDR=68:05:ca:01:bc:0c
>>>> ONBOOT=no
>>>> IPADDR=4.5.7.102
>>>> NETMASK=255.255.255.0
>>>> BOOTPROTO=none
>>>> MTU=9000
>>>> DEFROUTE=no
>>>> NM_CONTROLLED=no
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-p3p1
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=p3p1
>>>> HWADDR=68:05:ca:18:86:45
>>>> ONBOOT=no
>>>> IPADDR=4.5.6.102
>>>> NETMASK=255.255.255.0
>>>> BOOTPROTO=none
>>>> MTU=9000
>>>> DEFROUTE=no
>>>> NM_CONTROLLED=no
>>>>
>>>> ip link before update:
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
>>>>    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> 2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT
>>>>    link/ether 46:50:22:7a:f3:9d brd ff:ff:ff:ff:ff:ff
>>>> 3: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT qlen 1000
>>>>    link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>>>> 4: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>>>>    link/ether 68:05:ca:18:86:45 brd ff:ff:ff:ff:ff:ff
>>>> 5: p4p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>>>>    link/ether 68:05:ca:01:bc:0c brd ff:ff:ff:ff:ff:ff
>>>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
>>>>    link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>>>> 8: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT
>>>>    link/ether ce:0f:16:49:a7:da brd ff:ff:ff:ff:ff:ff
>>>>
>>>> vdsm files before update:
>>>> /var/lib/vdsm
>>>> /var/lib/vdsm/bonding-defaults.json
>>>> /var/lib/vdsm/netconfback
>>>> /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
>>>> /var/lib/vdsm/netconfback/ifcfg-em1
>>>> /var/lib/vdsm/netconfback/route-ovirtmgmt
>>>> /var/lib/vdsm/netconfback/rule-ovirtmgmt
>>>> /var/lib/vdsm/netconfback/ifcfg-p4p1
>>>> /var/lib/vdsm/netconfback/ifcfg-p3p1
>>>> /var/lib/vdsm/persistence
>>>> /var/lib/vdsm/persistence/netconf
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>>>> /var/lib/vdsm/upgrade
>>>> /var/lib/vdsm/upgrade/upgrade-unified-persistence
>>>> /var/lib/vdsm/transient
>>>>
>>>> The files in /var/lib/vdsm/netconfback each only contained a comment:
>>>> # original file did not exist
>>> This is quite peculiar. Do you know when these were created?
>>> Have you made any networking changes on 3.5.3 just before boot?
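[Editor's note: Dan's question about the placeholders can be answered quickly on the remaining hosts. A minimal sketch that separates real netconfback backups from "# original file did not exist" placeholders; the function name is illustrative, not vdsm API.]

```python
import os

# The exact placeholder comment vdsm leaves when the original file
# it is backing up did not exist (quoted verbatim from the thread).
PLACEHOLDER = "# original file did not exist"


def classify_netconfback(backup_dir):
    """Split netconfback entries into placeholders and real backups.

    A file whose entire content is the placeholder comment records
    "there was nothing to back up"; anything else is a genuine
    pre-change copy of an ifcfg/route/rule file.
    """
    placeholders, real = [], []
    for name in sorted(os.listdir(backup_dir)):
        with open(os.path.join(backup_dir, name)) as f:
            content = f.read().strip()
        (placeholders if content == PLACEHOLDER else real).append(name)
    return placeholders, real
```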
>>>
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>>>> {"nic": "em1", "netmask": "255.255.255.0", "bootproto": "none",
>>>>  "ipaddr": "1.2.3.16", "gateway": "1.2.3.11"}
>>>>
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>>>> {"nic": "p3p1", "netmask": "255.255.255.0", "ipaddr": "4.5.6.102",
>>>>  "bridged": "false", "mtu": "9000"}
>>>>
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>>>> {"nic": "p4p1", "netmask": "255.255.255.0", "ipaddr": "4.5.7.102",
>>>>  "bridged": "false", "mtu": "9000"}
>>>>
>>>> After update and reboot, no ifcfg scripts are left. Only interface lo
>>>> is up. Syslog does not seem to contain anything suspicious before the
>>>> reboot.
>>> Have you tweaked vdsm.conf in any way? In particular, did you set
>>> net_persistence?
>>>
>>>> Log excerpts from bootup:
>>>>
>>>> Sep 3 17:27:23 vhm-prd-02 network: Bringing up loopback interface: [ OK ]
>>>> Sep 3 17:27:23 vhm-prd-02 systemd-ovirt-ha-agent: Starting ovirt-ha-agent: [ OK ]
>>>> Sep 3 17:27:23 vhm-prd-02 systemd: Started oVirt Hosted Engine High Availability Monitoring Agent.
>>>> Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready
>>>> Sep 3 17:27:23 vhm-prd-02 kernel: device em1 entered promiscuous mode
>>>> Sep 3 17:27:23 vhm-prd-02 network: Bringing up interface em1: [ OK ]
>>>> Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
>>>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Joining mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>>>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: New relevant interface ovirtmgmt.IPv4 for mDNS.
>>>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Registering new address record for 1.2.3.16 on ovirtmgmt.IPv4.
>>>> Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Link is up at 1000 Mbps, full duplex
>>>> Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Flow control is off for TX and off for RX
>>>> Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
>>>> Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>>>> Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>>>> Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ovirtmgmt: link becomes ready
>>>> Sep 3 17:27:26 vhm-prd-02 network: Bringing up interface ovirtmgmt: [ OK ]
>>>> Sep 3 17:27:26 vhm-prd-02 systemd: Started LSB: Bring up/down networking.
>>>> Sep 3 17:27:26 vhm-prd-02 systemd: Starting Network.
>>>> Sep 3 17:27:26 vhm-prd-02 systemd: Reached target Network.
>>>>
>>>> So ovirtmgmt and em1 were restored and initialized just fine (p3p1 and
>>>> p4p1 should have been started, too, but the engine configured them as
>>>> ONBOOT=no).
>>>>
>>>> Further in messages (full log is attached):
>>> Would you also attach your post-boot supervdsm.log?
>>>
>>>> Sep 3 17:27:26 vhm-prd-02 systemd: Starting Virtual Desktop Server Manager network restoration...
>>>> Sep 3 17:27:26 vhm-prd-02 systemd: Started OSAD daemon.
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Terminate Plymouth Boot Screen.
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Wait for Plymouth Boot Screen to Quit.
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Serial Getty on ttyS1...
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Serial Getty on ttyS1.
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Getty on tty1...
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Getty on tty1.
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Login Prompts.
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Reached target Login Prompts.
>>>> Sep 3 17:27:27 vhm-prd-02 iscsid: iSCSI daemon with pid=1300 started!
>>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.*.
>>>> Sep 3 17:27:27 vhm-prd-02 kdumpctl: kexec: loaded kdump kernel
>>>> Sep 3 17:27:27 vhm-prd-02 kdumpctl: Starting kdump: [OK]
>>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Crash recovery kernel arming.
>>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on em1.*.
>>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for 1.2.3.16 on ovirtmgmt.
>>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Leaving mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Interface ovirtmgmt.IPv4 no longer relevant for mDNS.
>>>> Sep 3 17:27:27 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.
>>>> Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on em1.
>>>> Sep 3 17:27:28 vhm-prd-02 kernel: device em1 left promiscuous mode
>>>> Sep 3 17:27:28 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>>>> Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing workstation service for ovirtmgmt.
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 345, in <module>
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: restore(args)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 314, in restore
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: unified_restoration()
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 93, in unified_restoration
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: setupNetworks(nets, bonds, connectivityCheck=False, _inRollback=True)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 642, in setupNetworks
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: implicitBonding=False, _netinfo=_netinfo)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 213, in wrapped
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: ret = func(**attrs)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 429, in delNetwork
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: netEnt.remove()
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/models.py", line 100, in remove
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configurator.removeNic(self)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 215, in removeNic
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configApplier.removeNic(nic.name)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 657, in removeNic
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: with open(cf) as nicFile:
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: IOError: [Errno 2] No such file or directory: u'/etc/sysconfig/network-scripts/ifcfg-p4p1'
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: return tool_command[cmd]["command"](*args)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 40, in restore_command
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: exec_restore(cmd)
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 53, in exec_restore
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: raise EnvironmentError('Failed to restore the persisted networks')
>>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: EnvironmentError: Failed to restore the persisted networks
>>>> Sep 3 17:27:28 vhm-prd-02 systemd: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
>>>> Sep 3 17:27:28 vhm-prd-02 systemd: Failed to start Virtual Desktop Server Manager network restoration.
>>>> Sep 3 17:27:28 vhm-prd-02 systemd: Dependency failed for Virtual Desktop Server Manager.
>>>> Sep 3 17:27:28 vhm-prd-02 systemd:
>>>> Sep 3 17:27:28 vhm-prd-02 systemd: Unit vdsm-network.service entered failed state.
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Postfix Mail Transport Agent.
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Multi-User System.
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Reached target Multi-User System.
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Update UTMP about System Runlevel Changes...
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Update UTMP about System Runlevel Changes.
>>>> Sep 3 17:27:33 vhm-prd-02 systemd: Startup finished in 2.964s (kernel) + 2.507s (initrd) + 15.996s (userspace) = 21.468s.
>>>>
>>>> So, as I have two more hosts that need updating, I'm happy to assist in
>>>> bisecting and debugging this update issue. Suggestions and help are
>>>> very welcome.
>>> Thanks for this important report. I assume that calling
>>>
>>>     vdsClient -s 0 setSafeNetworkConfig
>>>
>>> on the host before upgrade would make your problems go away. Please do
>>> not do that yet - your assistance in debugging this further is
>>> important.
>> Hi Dan,
>>
>> From backups I could extract the pre-update timestamps of the files in
>> /var/lib/vdsm/netconfback:
>> ifcfg-em1        2015-08-10 16:40:19
>> ifcfg-ovirtmgmt  2015-08-10 16:40:19
>> ifcfg-p3p1       2015-08-10 16:40:25
>> ifcfg-p4p1       2015-08-10 16:40:22
>> route-ovirtmgmt  2015-08-10 16:40:20
>> rule-ovirtmgmt   2015-08-10 16:40:20
>>
>> The ifcfg scripts had the same corresponding timestamps:
>> ifcfg-em1        2015-08-10 16:40:19
>> ifcfg-lo         2015-01-15 09:57:03
>> ifcfg-ovirtmgmt  2015-08-10 16:40:19
>> ifcfg-p3p1       2015-08-10 16:40:25
>> ifcfg-p4p1       2015-08-10 16:40:22
> Do you recall what has been done on 2015-08-10?
> Was your 3.5.3 host ever rebooted since?

I just tried to reconstruct the happenings on 2015-08-10 and it seems
that, in fact, the network configuration was not touched. I was misled by
the dates. At that date/time an updated kernel and some more CentOS rpms
were installed (the whole cluster was updated one by one). A reboot on
this specific host was initiated after the update at 2015-08-10 16:40:04.
The timestamps from my previous email thus still seem to fall _within_
the bootup process. So yes, the host was rebooted after the update to
3.5.3 (which happened on 2015-06-15).
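[Editor's note: the timestamp forensics above - deciding whether the ifcfg files were written during the boot itself rather than by a manual change - can be scripted. A hedged sketch; the helper name is illustrative.]

```python
import datetime
import os


def files_written_after(directory, boot_time, prefix="ifcfg-"):
    """Return (name, mtime) pairs for files modified at or after boot_time.

    Comparing ifcfg mtimes against the boot timestamp (e.g. from
    `last reboot`) shows which configs were rewritten during bootup.
    """
    hits = []
    for name in sorted(os.listdir(directory)):
        if not name.startswith(prefix):
            continue
        mtime = datetime.datetime.fromtimestamp(
            os.path.getmtime(os.path.join(directory, name)))
        if mtime >= boot_time:
            hits.append((name, mtime))
    return hits
```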
Reboots since 2015-06-15:
reboot   system boot  3.10.0-229.11.1. Mon Aug 10 16:56 - 14:34 (27+21:37)
reboot   system boot  3.10.0-229.7.2.e Mon Jul 27 17:48 - 16:53 (13+23:05)
reboot   system boot  3.10.0-229.7.2.e Wed Jun 24 16:46 - 17:46 (33+00:59)
reboot   system boot  3.10.0-229.4.2.e Mon Jun 15 18:10 - 16:44 (8+22:34)

I checked the 2 remaining hosts (still 3.5.3) and neither has any
different content in /var/lib/vdsm/netconfback. Again, only single-line
comments:

# original file did not exist

My other productive oVirt 3.4 hosts don't even have these. The directory
/var/lib/vdsm/netconfback is empty on those.

What should/could I check on the remaining 2 hosts prior to the update?
Try syncing the network configuration and verifying the contents of
/var/lib/vdsm/netconfback?

> If the networks have been configured on the host back then, but never
> persisted, any reboot (regardless of upgrade) would cause their removal.
>
> Vdsm should be more robust in handling missing ifcfg; but that's a
> second-order bug:
>
>     1256252 Vdsm should recover ifcfg files in case they are no
>     longer exist and recover all networks on the server
>
> I'd like to first understand how come you have these placeholders left
> behind.
>
>> The attached supervdsm.log contains everything from the network
>> configuration done on 2015-08-10 till the vdsm update on 2015-09-03 at
>> 17:20 and the reboot performed afterwards.
> Thanks. Maybe Ido could find further hints inside it.

--
Lobster SCM GmbH, Hindenburgstraße 15, D-82343 Pöcking
HRB 178831, Amtsgericht München
Geschäftsführer: Dr. Martin Fischer, Rolf Henrich

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
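[Editor's note: the second-order bug Dan references (the IOError on the missing ifcfg-p4p1) comes from vdsm opening the ifcfg file unconditionally in removeNic. A defensive read like the sketch below would let the restore loop continue instead of aborting; this is illustrative only and not vdsm's actual fix for bug 1256252.]

```python
import logging
import os


def read_ifcfg(path):
    """Read an ifcfg file, treating a missing file as 'nothing to clean up'.

    The traceback in this thread shows `with open(cf)` raising
    IOError(ENOENT) and killing the whole network restoration;
    returning None here lets the caller skip the cleanup step instead.
    """
    if not os.path.exists(path):
        logging.warning("ifcfg file %s is missing; skipping cleanup", path)
        return None
    with open(path) as f:
        return f.read()
```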